Incremental Load by using ADF
Introduction
lot of my friend from different group
is asking to provide a blog post related to incremental load. If you search Google
related to this tropics, you can find bundle of blog post.
All are the same using water mark technique. But in my view point it is not a practical approach for incremental load. When we talk about incremental load, we assume that we have billions of records in source which is ever changing and we need to move on destination table.
Here in this demonstration we used ADF control flow but it can be done with any ETL or ELT tools like SSIS.
One last thing that I would like to mention if we try to load huge records from source to designation, it definitely gives us performance problem or stuck in between. So, we have to load the data in a smaller chunk or batch.
We have to keep in track that data that we already loaded in the destination is going to updated in source or not. If it is going too updated, we have to update the destination accordingly. You can modify this design pattern according to your situation.
Hope it will be interesting.
Case
Study
We have a huge source table. The source table must have two
columns. One is Created Date and others have Updated Date.
We have to prepare two control tables. First table contains the Batch Number, From Date and To Date columns.
The second table contains Batch
Number, Source Record Count and Destination Records Count. The second
table is used for reconciliation purpose after loading is completed.
Approach
First we look at the control table and pick the records from
Source table according to the From Date
and To Date. We are comparing From
and To Date with Created Date of
Source table and load the data to the Destination table.
After successful load, we have to update the Reconciliation table
columns named Destination Count. By
this way we can understand that for each batch all the data is loaded
successfully or not. We also update the Control table IsLoaded flag to “D” to indicate the loading of this batch is done.
By comparing Created date and Updated Date in source we can easily
understand which records is modified after load and we have to update those
records in destination.
Stored
Procedure Used
ADF
Control Flow
Outer
Lookup Activity
ForEach
Activity
Copy
Data Activity (Under ForEach)
Expression:
@concat('SELECT * FROM StudentRecordList
WHERE createddate BETWEEN ' ,
'''',formatDateTime(item().FromDate,'yyyy-MM-dd'),'''',
' AND ', '''',formatDateTime(item().ToDate,'yyyy-MM-dd'),'''')
Stored
Procedure (Under ForEach)
OUTPUT
SELECT * FROM StudentRecordListDestination
SELECT * FROM ControlTable
SELECT * FROM Reconciliation
Hope you like it.
Thanks for this post as other post in Google tend to show only watermark technique ,which is not useful if dataset is huge
ReplyDeleteThanks
DeleteHere I am not showing the updated portion. .Hope you can do that.
ReplyDelete