Applying Merged CDC Delta Data from ADLS to Azure SQL Hyperscale at Scale

Question

Applying Merged CDC Delta Data from ADLS to Azure SQL Hyperscale at Scale

Janice Chi 580

As part of our enterprise-scale data migration project, we are migrating data from IBM DB2 (on-prem) to Azure SQL Hyperscale. The migration consists of three phases:

One-time Historical Load

CDC Catch-Up Phase (Batch)

CDC Streaming Phase (Real-Time)

We are currently focused on the Catch-Up and Real-Time phases and would like Microsoft’s guidance on how to best apply the already-merged CDC records (INSERT, UPDATE, DELETE) from ADLS (Delta format) into Azure SQL Hyperscale.

Current Design Highlights:

We are consuming bounded CDC data from Kafka and writing to the raw layer in ADLS (Delta format).

Separately, we have historical data stored in ADLS.

We perform the MERGE logic in Databricks and write the final merged output to a Bronze layer in ADLS.

Now, we need to apply this merged Bronze Delta data into Azure SQL Hyperscale.

Our Question:

What is the most efficient, scalable, and recommended approach to apply these already-merged CDC records (in ADLS Delta format) into the Azure SQL Hyperscale target tables,

or

How we should move INSERT , UPDATE AND DELETE operations , better to say how we should move Partitions from ADLS bronze layer inside which in some of the rows we have I/U/D operations correctly to Hyperscale main tables , please note that since we are doing MERGE in ADLS bronze layer we are not maintaining STAGING LAYER in HS in CATCHUP/CDC.

Design Trade-Offs and Concerns:

Our Delta files in ADLS allow us to leverage features like z-ordering, min/max pruning, and schema enforcement.

Does Azure SQL Hyperscale offer equivalent features or techniques like indexed lookups or other options to support key-based ingestion at this scale?

How do we maintain performance while doing key-based operations for very large tables in Hyperscale?

What best practices are recommended for ingesting merged CDC data into Hyperscale — especially when full MERGE logic is already completed upstream in Databricks?
- considering:
- Very large table sizes (some >20 TB),
- Over 200 partitions,
- Full CDC semantics (INSERT, UPDATE, DELETE),
- Primary-key–based merge requirements,
- The need for performance + correctness + simplicity — without requiring full-table comparisons in Hyperscale?

Venkat Reddy Navari 5,255 Reputation points Microsoft External Staff Moderator

2025-08-04T11:49:11.3633333+00:00
Hi Janice Chi Thanks for sharing the details of your migration plan. To apply your already-merged CDC data (INSERT, UPDATE, DELETE) from ADLS (Delta format) into Azure SQL Hyperscale efficiently.

Ingestion of Merged CDC Data:

PolyBase is the most efficient option for loading large data from ADLS into Hyperscale. It supports direct ingestion from Delta format in ADLS, minimizing overhead and handling large datasets (like your 20+ TB tables).

If you need more control over the data flow or transformations, Azure Data Factory (ADF) can help you build a pipeline that reads from ADLS and applies the necessary MERGE logic to your target tables in Hyperscale.

Handling CDC Operations (Insert, Update, Delete):

Since you're already merging CDC records upstream in Databricks, you can skip using a staging layer in Hyperscale. After loading data into Hyperscale, you can apply the MERGE logic directly:

Ensure that your tables are partitioned effectively. Partitioning by a relevant key (like time or an incremental key) ensures you only process the relevant data during the MERGE, which improves performance.

Use non-clustered indexes on the key columns involved in the MERGE to speed up lookups for INSERT, UPDATE, and DELETE operations.

Optimizing Performance at Scale:

Clustered Columnstore Indexes (CCI) in Hyperscale can help with large tables by optimizing both storage and query performance, similar to Delta's Z-ordering and min/max pruning.

For large-scale data loads, batch processing is critical. Consider breaking the ingestion process into smaller batches to optimize parallel processing and reduce load time.

Indexed lookups and efficient partitioning will minimize the need for full-table scans during the MERGE process, ensuring performance at scale.

Data Integration:

Given the size and complexity of your data (20+ TB and over 200 partitions), focus on incremental data loading and partitioning to ensure only relevant partitions are processed. This will help you avoid full-table comparisons and ensure fast key-based operations.

Use monitoring (via Azure Monitor or SQL Extended Events) to track the performance of your ingestion and MERGE processes, adjusting as needed for optimal performance.

Finally:

PolyBase or ADF should be your main tools for moving data from ADLS to Hyperscale.

Focus on partitioning and indexing for efficient key-based CDC operations.

Leverage clustered columnstore indexes and batching for performance, while ensuring your MERGE logic is applied only to the necessary partitions.

I hope this information helps. Please do let us know if you have any further queries.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Janice Chi 580 Reputation points

2025-08-04T15:15:31.9366667+00:00

PolyBase or ADF should be your main tools for moving data from ADLS to Hyperscale----------------------------Can't we use DBR instead of ADF ?

second and most important

how each one (INSERT, UPDATE, DELETE) from ADLS (Delta format) into Azure SQL Hyperscale will move to Hyperscale , please tell how I will be moved meaning code and logic for new inserted row from bronze to HS

please tell how UPDATE will be moved meaning code and logic for UPDATED row from bronze to HS , meaning this row is already there so what to do and how ?

please tell how DELETE will happen finally on HS Main table when same has been done already in BRONZE MERGE layer ?
Janice Chi 580 Reputation points

2025-08-05T06:29:46.8666667+00:00

waiting for reply on the basis of real experience please
Venkat Reddy Navari 5,255 Reputation points Microsoft External Staff Moderator

2025-08-05T12:50:15.37+00:00
Janice Chi Thanks for your follow-up

Can you use Databricks instead of ADF?

Yes! Databricks is actually an excellent choice here, especially since you’re already doing your CDC merge logic upstream in Delta Lake on Databricks. Using Databricks to orchestrate and push data directly into Hyperscale via JDBC provides flexibility and avoids additional tooling.

How to handle each CDC operation (INSERT, UPDATE, DELETE) when applying merged CDC data?

Since your Bronze Delta layer already has the final merged state (with operation flags like I, U, D), here’s a typical pattern you can implement in Databricks:

Read your Bronze Delta data with all CDC operations flagged.

Split the data into upserts (INSERT + UPDATE) and deletes.

Write upsert data into a staging table in Hyperscale using JDBC from Databricks.

Run a SQL MERGE statement in Hyperscale to atomically apply INSERTs and UPDATEs to your main table.

Execute batch DELETE statements based on keys for rows flagged as deletes.

This approach leverages Hyperscale’s ability to efficiently run MERGE operations and indexed lookups at scale.

Example SQL MERGE pattern to apply in Hyperscale

MERGE dbo.TargetTable AS T USING dbo.StagingTable AS S ON T.PrimaryKey = S.PrimaryKey WHEN MATCHED AND S.OperationType = 'U' THEN UPDATE SET T.Col1 = S.Col1, T.Col2 = S.Col2, ... WHEN NOT MATCHED BY TARGET AND S.OperationType = 'I' THEN INSERT (PrimaryKey, Col1, Col2, ...) VALUES (S.PrimaryKey, S.Col1, S.Col2, ...) WHEN MATCHED AND S.OperationType = 'D' THEN DELETE;

Why this approach works well at scale

Databricks handles the complex CDC merge upstream, so Hyperscale doesn’t need to reprocess raw CDC streams.

Batching and partitioning your data allows incremental processing and keeps MERGE operations performant.

Clustered columnstore indexes and proper non-clustered indexes on primary keys in Hyperscale enable fast key-based merges and deletes, similar in spirit to Delta Lake’s pruning and indexing features.

This approach avoids full table scans and leverages Hyperscale’s distributed query engine efficiently.

I hope this information helps. Please do let us know if you have any further queries.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Janice Chi 580 Reputation points

2025-08-05T16:57:22.8533333+00:00

please read carefully - we do not want to have SAGING LAYER IN HS as we already done MERGE in ADLS -How we should move INSERT , UPDATE AND DELETE operations , better to say how we should move Partitions from ADLS bronze layer inside which in some of the rows we have I/U/D operations correctly to Hyperscale main tables , please note that since we are doing MERGE in ADLS bronze layer we are not maintaining STAGING LAYER in HS in CATCHUP/CDC.
Venkat Reddy Navari 5,255 Reputation points Microsoft External Staff Moderator

2025-08-06T13:52:03.83+00:00
Hi Janice Chi thank you for the clarification and for reiterating the architectural constraint around not using a staging layer in Hyperscale.

You're absolutely right, if your merged CDC logic is already finalized in the Delta Bronze layer in ADLS, there’s no need to reintroduce a staging table in SQL Hyperscale. Apologies for not aligning with that earlier.

You can apply the INSERT, UPDATE, and DELETE operations directly from ADLS Bronze to your Hyperscale main tables using Databricks JDBC, partition-wise.

Here’s the updated approach, aligned with your design:

Read Partitioned Delta from Bronze (e.g., based on date/offset):

df = spark.read.format("delta").load(bronze_path).filter("partition_date = '2025-08-04'")

Split CDC Rows:

inserts_updates_df = df.filter("operation_type IN ('I', 'U')") deletes_df = df.filter("operation_type = 'D'")

For INSERT & UPDATE (Upserts):

Loop through in batches and run parameterized MERGE statements via JDBC directly on the main table (no staging).

MERGE INTO dbo.TargetTable AS T USING (SELECT ? AS PK, ? AS Col1, ? AS Col2) AS S ON T.PK = S.PK WHEN MATCHED THEN UPDATE SET T.Col1 = S.Col1, T.Col2 = S.Col2 WHEN NOT MATCHED THEN INSERT (PK, Col1, Col2) VALUES (S.PK, S.Col1, S.Col2);

For DELETEs:

Run direct DELETE statements by batching primary keys from the deletes_df:

DELETE FROM dbo.TargetTable WHERE PrimaryKey = ?

Performance Tips:

Use primary key indexes and optionally clustered columnstore indexes.

Apply data partition filters to only process changed partitions.

Use Databricks JDBC batching and connection pooling for throughput.

I hope this information helps. Please do let us know if you have any further queries.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Venkat Reddy Navari 5,255 Reputation points Microsoft External Staff Moderator

2025-08-07T14:53:06.5266667+00:00

Hi Janice Chi We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet. In case if you have any resolution please do share that same with the community as it can be helpful to others. Otherwise, will respond with more details and we will try to help

1 answer

Your answer

Janice Chi 580 Reputation points

2025-08-04T15:15:31.9366667+00:00

PolyBase or ADF should be your main tools for moving data from ADLS to Hyperscale----------------------------Can't we use DBR instead of ADF ?

second and most important

how each one (INSERT, UPDATE, DELETE) from ADLS (Delta format) into Azure SQL Hyperscale will move to Hyperscale , please tell how I will be moved meaning code and logic for new inserted row from bronze to HS

please tell how UPDATE will be moved meaning code and logic for UPDATED row from bronze to HS , meaning this row is already there so what to do and how ?

please tell how DELETE will happen finally on HS Main table when same has been done already in BRONZE MERGE layer ?
Janice Chi 580 Reputation points

2025-08-05T06:29:46.8666667+00:00

waiting for reply on the basis of real experience please
Janice Chi 580 Reputation points

2025-08-05T16:57:22.8533333+00:00

please read carefully - we do not want to have SAGING LAYER IN HS as we already done MERGE in ADLS -How we should move INSERT , UPDATE AND DELETE operations , better to say how we should move Partitions from ADLS bronze layer inside which in some of the rows we have I/U/D operations correctly to Hyperscale main tables , please note that since we are doing MERGE in ADLS bronze layer we are not maintaining STAGING LAYER in HS in CATCHUP/CDC.
Venkat Reddy Navari 5,255 Reputation points Microsoft External Staff Moderator

2025-08-06T13:52:03.83+00:00

Hi Janice Chi thank you for the clarification and for reiterating the architectural constraint around not using a staging layer in Hyperscale.

You're absolutely right, if your merged CDC logic is already finalized in the Delta Bronze layer in ADLS, there’s no need to reintroduce a staging table in SQL Hyperscale. Apologies for not aligning with that earlier.

You can apply the INSERT, UPDATE, and DELETE operations directly from ADLS Bronze to your Hyperscale main tables using Databricks JDBC, partition-wise.

Here’s the updated approach, aligned with your design:

Read Partitioned Delta from Bronze (e.g., based on date/offset):

df = spark.read.format("delta").load(bronze_path).filter("partition_date = '2025-08-04'")

Split CDC Rows:

inserts_updates_df = df.filter("operation_type IN ('I', 'U')") deletes_df = df.filter("operation_type = 'D'")

For INSERT & UPDATE (Upserts):

Loop through in batches and run parameterized MERGE statements via JDBC directly on the main table (no staging).

MERGE INTO dbo.TargetTable AS T USING (SELECT ? AS PK, ? AS Col1, ? AS Col2) AS S ON T.PK = S.PK WHEN MATCHED THEN UPDATE SET T.Col1 = S.Col1, T.Col2 = S.Col2 WHEN NOT MATCHED THEN INSERT (PK, Col1, Col2) VALUES (S.PK, S.Col1, S.Col2);

For DELETEs:

Run direct DELETE statements by batching primary keys from the deletes_df:

DELETE FROM dbo.TargetTable WHERE PrimaryKey = ?

Performance Tips:

Use primary key indexes and optionally clustered columnstore indexes.

Apply data partition filters to only process changed partitions.

Use Databricks JDBC batching and connection pooling for throughput.

I hope this information helps. Please do let us know if you have any further queries.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Venkat Reddy Navari 5,255 Reputation points Microsoft External Staff Moderator

2025-08-07T14:53:06.5266667+00:00

Hi Janice Chi We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet. In case if you have any resolution please do share that same with the community as it can be helpful to others. Otherwise, will respond with more details and we will try to help

Answer 1

Janice Chi 580

Let me check thanks

Share via

Applying Merged CDC Delta Data from ADLS to Azure SQL Hyperscale at Scale

1 answer

Your answer