Applying Merged CDC Delta Data from ADLS to Azure SQL Hyperscale at Scale

Janice Chi 580 Reputation points
2025-08-04T10:42:29.11+00:00

As part of our enterprise-scale data migration project, we are migrating data from IBM DB2 (on-prem) to Azure SQL Hyperscale. The migration consists of three phases:

  1. One-time Historical Load
  2. CDC Catch-Up Phase (Batch)
  3. CDC Streaming Phase (Real-Time)

We are currently focused on the Catch-Up and Real-Time phases and would like Microsoft's guidance on how best to apply the already-merged CDC records (INSERT, UPDATE, DELETE) from ADLS (Delta format) to Azure SQL Hyperscale.


Current Design Highlights:

  • We are consuming bounded CDC data from Kafka and writing it to the raw layer in ADLS (Delta format).
  • Separately, we have historical data stored in ADLS.
  • We perform the MERGE logic in Databricks and write the final merged output to a Bronze layer in ADLS (sketched below).
  • Now, we need to apply this merged Bronze Delta data to Azure SQL Hyperscale.
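
For context, the MERGE step we run in Databricks looks roughly like the sketch below. It is illustrative only: the storage paths, the `order_id` key, and the `_op`/`_commit_ts` metadata columns are placeholders rather than our actual schema, and each row keeps its I/U/D flag so the flag is still available when we push Bronze to Hyperscale.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bounded CDC batch that was landed from Kafka into the raw Delta layer
cdc = spark.read.format("delta").load("abfss://raw@<storage>.dfs.core.windows.net/orders_cdc")

# Keep only the latest change per primary key so the MERGE is deterministic
latest = (cdc
          .withColumn("rn", F.row_number().over(
              Window.partitionBy("order_id").orderBy(F.col("_commit_ts").desc())))
          .filter("rn = 1")
          .drop("rn"))

bronze = DeltaTable.forPath(spark, "abfss://bronze@<storage>.dfs.core.windows.net/orders")

(bronze.alias("t")
 .merge(latest.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()      # U and D rows overwrite the existing key, keeping the _op flag
 .whenNotMatchedInsertAll()   # unseen keys (typically I) are inserted
 .execute())
```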


Our Question:

What is the most efficient, scalable, and recommended approach to applying these already-merged CDC records (in ADLS Delta format) to the Azure SQL Hyperscale target tables?

Or, put differently: how should we move the INSERT, UPDATE, and DELETE operations? That is, how should we correctly move partitions from the ADLS Bronze layer, in which individual rows carry I/U/D operation flags, into the Hyperscale main tables? Please note that because the MERGE is already performed in the ADLS Bronze layer, we are not maintaining a staging layer in Hyperscale during the catch-up/CDC phases.
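
To make the second phrasing concrete, the shape of the apply loop we have in mind is sketched below: read only the Bronze partitions touched by a run and push the flagged rows to Hyperscale key by key, with no staging table on the Hyperscale side. Everything in the snippet is an assumption for illustration, including the `_op` flag, the `order_id`/`amount`/`status` columns, the partition column, the connection string, and the availability of pyodbc on the cluster.

```python
import pyodbc
from pyspark.sql import functions as F

CONN_STR = ("Driver={ODBC Driver 18 for SQL Server};"
            "Server=tcp:<server>.database.windows.net,1433;"
            "Database=<db>;Uid=<user>;Pwd=<password>;Encrypt=yes;")

# Read only the Bronze partitions that this run actually changed
changed = (spark.read.format("delta")
           .load("abfss://bronze@<storage>.dfs.core.windows.net/orders")
           .filter(F.col("partition_date") == "2025-08-04"))

def apply_partition(rows):
    conn = pyodbc.connect(CONN_STR)
    cur = conn.cursor()
    cur.fast_executemany = True
    deletes, upserts = [], []
    for r in rows:
        if r["_op"] == "D":
            deletes.append((r["order_id"],))
        else:
            upserts.append((r["order_id"], r["amount"], r["status"]))
    if deletes:
        # Key-based delete that resolves to a seek on the primary key
        cur.executemany("DELETE FROM dbo.orders WHERE order_id = ?", deletes)
    if upserts:
        # Key-based upsert; no full-table comparison on the Hyperscale side
        cur.executemany(
            """MERGE dbo.orders AS t
               USING (SELECT ? AS order_id, ? AS amount, ? AS status) AS s
                  ON t.order_id = s.order_id
               WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status
               WHEN NOT MATCHED THEN INSERT (order_id, amount, status)
                    VALUES (s.order_id, s.amount, s.status);""",
            upserts)
    conn.commit()
    conn.close()

changed.foreachPartition(apply_partition)
```

Is this partition-at-a-time pattern reasonable for tables of this size, or would Microsoft recommend a different mechanism?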


Design Trade-Offs and Concerns:

Our Delta files in ADLS allow us to leverage features like z-ordering, min/max pruning, and schema enforcement.
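
For reference, the pruning we rely on on the Delta side comes from statements along these lines (the path and key column are illustrative only):

```python
# Cluster the Bronze table on the merge key so min/max file statistics
# can prune files whenever we read or export changes by key
spark.sql("""
    OPTIMIZE delta.`abfss://bronze@<storage>.dfs.core.windows.net/orders`
    ZORDER BY (order_id)
""")
```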

Does Azure SQL Hyperscale offer equivalent features or techniques (for example, indexed lookups) to support key-based ingestion at this scale?

How do we maintain performance while doing key-based operations for very large tables in Hyperscale?

  • What best practices are recommended for ingesting merged CDC data into Hyperscale, especially when the full MERGE logic is already completed upstream in Databricks, considering:
    • Very large table sizes (some >20 TB),
    • Over 200 partitions,
    • Full CDC semantics (INSERT, UPDATE, DELETE),
    • Primary-key–based merge requirements,
    • The need for performance, correctness, and simplicity, without requiring full-table comparisons in Hyperscale?


Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. Janice Chi 580 Reputation points
    2025-08-04T15:08:29.1933333+00:00

    Let me check, thanks.

