Incremental Load by using ADF

Introduction

lot of my friend from different group is asking to provide a blog post related to incremental load. If you search Google related to this tropics, you can find bundle of blog post.

All are the same using water mark technique. But in my view point it is not a practical approach for incremental load. When we talk about incremental load, we assume that we have billions of records in source which is ever changing and we need to move on destination table.

Here in this demonstration we used ADF control flow but it can be done with any ETL or ELT tools like SSIS.

One last thing that I would like to mention if we try to load huge records from source to designation, it definitely gives us performance problem or stuck in between. So, we have to load the data in a smaller chunk or batch.

We have to keep in track that data that we already loaded in the destination is going to updated in source or not. If it is going too updated, we have to update the destination accordingly. You can modify this design pattern according to your situation.

Hope it will be interesting.

 

Case Study

We have a huge source table. The source table must have two columns. One is Created Date and others have Updated Date.

We have to prepare two control tables. First table contains the Batch Number, From Date and To Date columns.

The second table contains Batch Number, Source Record Count and Destination Records Count. The second table is used for reconciliation purpose after loading is completed.

 


 

Approach

First we look at the control table and pick the records from Source table according to the From Date and To Date. We are comparing From and To Date with Created Date of Source table and load the data to the Destination table.

 

After successful load, we have to update the Reconciliation table columns named Destination Count. By this way we can understand that for each batch all the data is loaded successfully or not. We also update the Control table IsLoaded flag to “D” to indicate the loading of this batch is done.

By comparing Created date and Updated Date in source we can easily understand which records is modified after load and we have to update those records in destination.

 

Stored Procedure Used

 


 

ADF Control Flow

 


 

 

Outer Lookup Activity

 


 

 

ForEach Activity

 


 

Copy Data Activity (Under ForEach)

 


 

Expression:

 

@concat('SELECT * FROM StudentRecordList WHERE createddate BETWEEN  ' , '''',formatDateTime(item().FromDate,'yyyy-MM-dd'),'''', ' AND ', '''',formatDateTime(item().ToDate,'yyyy-MM-dd'),'''')

 

 

Stored Procedure (Under ForEach)

 


 

 

OUTPUT

 

SELECT * FROM StudentRecordListDestination

SELECT * FROM ControlTable

SELECT * FROM Reconciliation

 


 

 

Hope you like it.


Comments

  1. Thanks for this post as other post in Google tend to show only watermark technique ,which is not useful if dataset is huge

    ReplyDelete
  2. Here I am not showing the updated portion. .Hope you can do that.

    ReplyDelete

Post a Comment

Popular Posts

Working with Python -- File Handling

Copying Multiple File from Blob Storage to Single SQL Table – Part-2