CDC setup in Azure Data Factory for on-prem SQL Server to Blob Storage

Pds 46 Reputation points
2025-07-31T02:31:55.8933333+00:00

Hello,

We are looking to use Azure Data Factory's new CDC feature to load data from an on-prem SQL Server into Data Lake/Blob Storage.

We already have both a self-hosted and a managed integration runtime in place for other processes, and we have also created a brand-new self-hosted IR.

Our Data Lake/Blob storage account is also already available.

From the on-prem SQL Server we need to load data from a few tables into Azure Data Lake/Blob storage on a constant schedule (we can set the frequency to hourly, or even every 15-30 minutes).

I am looking for the process to set up CDC, as this will be our first time using it in ADF, and I believe the CDC feature is still in preview.

Could someone please guide us on how to set this up?

I searched on Google and found information suggesting I would need to create a VM in Azure, a load balancer, Private Link, and other resources, which confused me, since the on-prem side is already set up and is quite simple.

Appreciate your guidance!


1 answer

  1. Amira Bedhiafi 35,766 Reputation points Volunteer Moderator
    2025-07-31T11:32:47.2566667+00:00

    Hello!

    Thank you for posting on Microsoft Learn.

    Verify that the source tables in SQL Server have CDC enabled:

    -- Enable CDC at DB level
    EXEC sys.sp_cdc_enable_db;
    -- Enable CDC for specific table
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'MyTable',
        @role_name     = NULL;
    

    This creates the CDC change tables and the table-valued functions you can use to query deltas.
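
    If you want to double-check the setup before wiring up ADF, a quick T-SQL check like the following (a sketch, assuming the dbo.MyTable example from the script above) confirms what CDC created:

    -- Confirm CDC is enabled at the database level
    SELECT name, is_cdc_enabled
    FROM sys.databases
    WHERE name = DB_NAME();

    -- List the tables tracked by CDC and their capture instances
    SELECT s.name AS schema_name,
           t.name AS table_name,
           ct.capture_instance
    FROM cdc.change_tables AS ct
    JOIN sys.tables  AS t ON ct.source_object_id = t.object_id
    JOIN sys.schemas AS s ON t.schema_id = s.schema_id;

    The capture instance name returned here (by default schema_table, e.g. dbo_MyTable) is what ADF will show when you pick a CDC source.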

    Then create a linked service in ADF for your on-prem SQL Server using your SHIR, and another one for your Blob Storage (choose either ADLS Gen2 or regular Blob, depending on your setup).

    In ADF, go to Data Flows, create a new Mapping Data Flow, and add a CDC source:

    • Choose your SQL Server linked service.
    • Select "Change Data Capture" as the source type.
    • Select the correct capture instance (created by enabling CDC).
    • Choose between all changes (every insert/update/delete row) and net changes (deduplicated changes); see the query sketch below.
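
    As a rough illustration of the difference (plain T-SQL rather than ADF, assuming the default capture instance name dbo_MyTable from the enable script, and that CDC was enabled with @supports_net_changes = 1 for the net-changes query):

    -- Read the full available LSN range for this capture instance
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_MyTable');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    -- All changes: every insert/update/delete row in commit order
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');

    -- Net changes: one row per key with the final state in the range
    -- (only available if CDC was enabled with @supports_net_changes = 1)
    SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');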

    Then add a sink pointing to your Blob Storage, where you can write Parquet, JSON, or CSV and partition by date or primary key as needed.

    You can optionally use a derived column or filter to transform or enrich the data before writing.

    If you don't want a real-time stream, create a pipeline that runs your Data Flow on a Tumbling Window Trigger with a recurrence of 15 or 30 minutes. Enable the dependency on the previous window to avoid overlapping runs, and configure watermarking using a field like __$start_lsn or __$seqval.
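
    If you go the tumbling-window route, one way to scope each run is to map the window boundaries to LSNs on the SQL side. This is only a sketch; the @WindowStart / @WindowEnd values are assumed to come from the trigger's window start/end via pipeline parameters:

    -- Hypothetical window boundaries, normally passed in from the trigger
    DECLARE @WindowStart datetime = '2025-07-31T02:00:00';
    DECLARE @WindowEnd   datetime = '2025-07-31T02:30:00';

    -- Translate times to LSNs (either can be NULL if no changes fall in the window)
    DECLARE @from_lsn binary(10) =
        sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @WindowStart);
    DECLARE @to_lsn   binary(10) =
        sys.fn_cdc_map_time_to_lsn('largest less than or equal', @WindowEnd);

    -- Pull only the changes committed inside this window
    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');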

    To avoid re-reading data, use a parameterized watermark (__$start_lsn or a timestamp): store the last successfully processed value in a metadata table or file, and pass it to the pipeline on the next run via parameters.
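
    A minimal sketch of that watermark pattern on the SQL side, using a hypothetical dbo.cdc_watermark table (the name and shape are just an example, not part of ADF):

    -- One-time setup: small metadata table holding the last processed LSN per table
    CREATE TABLE dbo.cdc_watermark (
        table_name sysname    NOT NULL PRIMARY KEY,
        last_lsn   binary(10) NOT NULL,
        updated_at datetime2  NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Each run: start just after the stored LSN and read up to the current max
    DECLARE @last_lsn binary(10) =
        (SELECT last_lsn FROM dbo.cdc_watermark WHERE table_name = N'dbo.MyTable');
    DECLARE @from_lsn binary(10) = sys.fn_cdc_increment_lsn(@last_lsn);
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');

    -- After the copy succeeds, advance the watermark for the next run
    UPDATE dbo.cdc_watermark
    SET last_lsn = @to_lsn, updated_at = SYSUTCDATETIME()
    WHERE table_name = N'dbo.MyTable';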

