CDC setup in Azure Data Factory for on-prem SQL Server to Blob Storage

Pds 46 Reputation points
2025-07-31T02:31:55.8933333+00:00

Hello,

We are looking to use Azure Data Factory's new CDC feature to load data from an on-prem SQL Server into Data Lake/Blob Storage.

We already have both a self-hosted and a managed integration runtime in place for other processes, and we have also created a brand-new self-hosted IR.

Our Data Lake/Blob storage account is also already available.

From the on-prem SQL Server we need to load data from a few tables into Azure Data Lake/Blob storage on a constant schedule (we can set the frequency to hourly, or even every 15-30 minutes).

I am looking for the process to set up CDC, as this will be our first time using it in ADF, and I believe the CDC feature is still in preview.

Could someone please guide us on how to set this up?

I searched on Google and found information suggesting I would need to create a VM in Azure, a load balancer, Private Link, and other resources, which confused me, since the on-prem side is already set up and is quite simple.

Appreciate your guidance!


1 answer

  1. Amira Bedhiafi 35,766 Reputation points Volunteer Moderator
    2025-07-31T11:32:47.2566667+00:00

    Hello!

    Thank you for posting on Microsoft Learn.

    Verify that the source tables in SQL Server have CDC enabled:

    -- Enable CDC at DB level
    EXEC sys.sp_cdc_enable_db;
    -- Enable CDC for specific table
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'MyTable',
        @role_name     = NULL;
    

    This creates the CDC change tables and the table-valued functions you can use to query deltas.
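
    If you want to double-check the setup before wiring up ADF, a quick T-SQL check like the following (a sketch, assuming the dbo.MyTable example from the script above) confirms what CDC created:

    -- Confirm CDC is enabled at the database level
    SELECT name, is_cdc_enabled
    FROM sys.databases
    WHERE name = DB_NAME();

    -- List the tables tracked by CDC and their capture instances
    SELECT s.name AS schema_name,
           t.name AS table_name,
           ct.capture_instance
    FROM cdc.change_tables AS ct
    JOIN sys.tables  AS t ON ct.source_object_id = t.object_id
    JOIN sys.schemas AS s ON t.schema_id = s.schema_id;

    The capture instance name returned here (by default schema_table, e.g. dbo_MyTable) is what ADF will show when you pick a CDC source.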

    Then create a linked service in ADF for your on-prem SQL Server using your SHIR, and another one for your Blob Storage (choose either ADLS Gen2 or regular Blob, depending on your setup).

    In ADF, go to Data Flows, create a new Mapping Data Flow, and add a CDC source:

    • Choose your SQL Server linked service.
    • Select "Change Data Capture" as the source type.
    • Select the correct capture instance (created by enabling CDC).
    • Choose between all changes (every insert/update/delete row) and net changes (deduplicated changes); see the query sketch below.
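
    As a rough illustration of the difference (plain T-SQL rather than ADF, assuming the default capture instance name dbo_MyTable from the enable script, and that CDC was enabled with @supports_net_changes = 1 for the net-changes query):

    -- Read the full available LSN range for this capture instance
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_MyTable');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    -- All changes: every insert/update/delete row in commit order
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');

    -- Net changes: one row per key with the final state in the range
    -- (only available if CDC was enabled with @supports_net_changes = 1)
    SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');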

    Then add a sink pointing to your Blob Storage, where you can write Parquet, JSON, or CSV and partition by date or primary key as needed.

    You can optionally use a derived column or filter to transform or enrich the data before writing.

    If you don't want a real-time stream, create a pipeline that runs your Data Flow on a Tumbling Window Trigger with a recurrence of 15 or 30 minutes. Enable the dependency on the previous window to avoid overlapping runs, and configure watermarking using a field like __$start_lsn or __$seqval.
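
    If you go the tumbling-window route, one way to scope each run is to map the window boundaries to LSNs on the SQL side. This is only a sketch; the @WindowStart / @WindowEnd values are assumed to come from the trigger's window start/end via pipeline parameters:

    -- Hypothetical window boundaries, normally passed in from the trigger
    DECLARE @WindowStart datetime = '2025-07-31T02:00:00';
    DECLARE @WindowEnd   datetime = '2025-07-31T02:30:00';

    -- Translate times to LSNs (either can be NULL if no changes fall in the window)
    DECLARE @from_lsn binary(10) =
        sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @WindowStart);
    DECLARE @to_lsn   binary(10) =
        sys.fn_cdc_map_time_to_lsn('largest less than or equal', @WindowEnd);

    -- Pull only the changes committed inside this window
    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');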

    To avoid re-reading data, use a parameterized watermark (__$start_lsn or a timestamp): store the last successfully processed value in a metadata table or file, and pass it to the pipeline on the next run via parameters.
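
    A minimal sketch of that watermark pattern on the SQL side, using a hypothetical dbo.cdc_watermark table (the name and shape are just an example, not part of ADF):

    -- One-time setup: small metadata table holding the last processed LSN per table
    CREATE TABLE dbo.cdc_watermark (
        table_name sysname    NOT NULL PRIMARY KEY,
        last_lsn   binary(10) NOT NULL,
        updated_at datetime2  NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Each run: start just after the stored LSN and read up to the current max
    DECLARE @last_lsn binary(10) =
        (SELECT last_lsn FROM dbo.cdc_watermark WHERE table_name = N'dbo.MyTable');
    DECLARE @from_lsn binary(10) = sys.fn_cdc_increment_lsn(@last_lsn);
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');

    -- After the copy succeeds, advance the watermark for the next run
    UPDATE dbo.cdc_watermark
    SET last_lsn = @to_lsn, updated_at = SYSUTCDATETIME()
    WHERE table_name = N'dbo.MyTable';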

