Hi @Janice Chi
Your current pattern - ingesting CDC data via Kafka → Databricks → Delta → Azure SQL Hyperscale - is well-structured, and the questions on insert-only vs. insert/update handling are very valid for large-scale, high-throughput ingestion pipelines.
Let me address each of your questions, while also confirming the parts already covered earlier:
As mentioned before, a direct MERGE INTO between Delta and Azure SQL Hyperscale is not supported, since the two are based on different engines (a file-based lakehouse engine vs. a relational engine). Instead, you must process the Delta data in Spark and use JDBC to write into SQL.
Now building on that - here's how to handle upserts and inserts.
Do we need a staging table in Hyperscale before MERGE INTO?
Yes – your assumption is correct.
Using a staging table in Azure SQL Hyperscale is a recommended practice when:
- You want to run a native MERGE (MERGE INTO) statement in T-SQL.
- You need deterministic and performant reconciliation using SQL-side logic.
Typical pattern:
# Step 1: Write the Delta CDC batch to a staging table in Hyperscale via JDBC
# (sql_jdbc_url is the Hyperscale JDBC connection string, including credentials)
df.write \
    .format("jdbc") \
    .option("url", sql_jdbc_url) \
    .option("dbtable", "staging_schema.table_xyz_stg") \
    .mode("append") \
    .save()
-- Step 2: In Azure SQL, use a stored procedure (or post-load step) to MERGE
MERGE target_table AS T
USING staging_table AS S
ON T.primary_key = S.primary_key
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;  -- T-SQL requires MERGE to be terminated with a semicolon
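If you want the Databricks job itself to trigger that MERGE (rather than an external orchestrator), one common option is a lightweight driver-side call. A minimal sketch, assuming pyodbc and the Microsoft ODBC Driver for SQL Server are installed on the cluster - the connection string and procedure name below are illustrative placeholders:
import pyodbc

# Illustrative: call a stored procedure in Hyperscale that wraps the MERGE above
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-server>.database.windows.net;"
    "Database=<your-db>;UID=<user>;PWD=<password>"
)
cursor = conn.cursor()
cursor.execute("EXEC staging_schema.usp_merge_table_xyz")  # hypothetical proc name
conn.commit()
cursor.close()
conn.close()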
This pattern helps with:
- Set-based processing on the SQL side, which is far faster than row-by-row updates.
- Avoiding round-trips between Databricks and SQL during upsert evaluation.
- A clear separation between CDC landing and the final commit.
Can insert-only tables be directly written without staging or MERGE?
Yes.
For the 750 insert-only tables:
- You can write directly from Databricks into Azure SQL over JDBC using .mode("append") - see the sketch after this list.
- No staging or upsert logic is needed.
- Just ensure idempotency or handle duplicate prevention if applicable (for example, by de-duplicating the batch on the primary key before the write).
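A minimal sketch of that path - table and column names are illustrative, and the dropDuplicates step is only needed if your CDC feed can replay events:
# Insert-only path: optionally de-duplicate the CDC batch on its key,
# then append straight into the target table over JDBC (no staging, no MERGE)
insert_df = delta_df.dropDuplicates(["primary_key"])

insert_df.write \
    .format("jdbc") \
    .option("url", sql_jdbc_url) \
    .option("dbtable", "prod_schema.table_xyz") \
    .mode("append") \
    .save()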
Can we avoid staging and do upsert logic via PySpark by looking up into Hyperscale?
Technically yes, but this is not recommended - here's why:
Challenges:
- JDBC doesn’t scale well for high-volume lookups or updates from Spark.
- Each lookup or update becomes a network-bound operation, which breaks Spark’s distributed nature.
- The Spark JDBC writer only supports append/overwrite - UPDATE statements have to be issued row by row (for example, inside foreachPartition), which quickly becomes a bottleneck.
- This can result in slowness, locking issues, and brittle logic at scale (especially at your CDC volume: ~180K events/minute).
Anti-pattern Example (not recommended):
from pyspark.sql.functions import col

# Load existing keys from Hyperscale - this pulls the entire target table over JDBC
existing_keys_df = spark.read \
    .format("jdbc") \
    .option("url", sql_jdbc_url) \
    .option("dbtable", "prod_schema.table_xyz") \
    .load()

# Join with the CDC Delta data to classify each row as insert vs. update
final_df = delta_df.join(existing_keys_df, "primary_key", "left_outer") \
    .withColumn("is_insert", col("existing_col").isNull())

# Then split final_df and write inserts/updates separately via JDBC - HIGHLY inefficient!
This approach introduces complexity, scales poorly, and often leads to timeouts or transaction failures.
Recommended approach for upserts without relying fully on staging tables?
If you want to avoid staging, the only relatively efficient alternative is:
- Use stored procedures in Hyperscale that accept table-valued parameters (TVPs), or bulk-insert into a memory-optimized/intermediate table, followed by a native SQL MERGE (a driver-side sketch follows this list).
- Even then, you still need some form of temporary or intermediate write.
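To make that concrete, here is a rough driver-side sketch using pyodbc's batched inserts (fast_executemany) into an illustrative intermediate table, followed by the MERGE procedure call. TVP parameter passing has its own pyodbc-specific wiring, so it is not shown here; also note that collect() pulls the micro-batch to the driver, which only works for modest batch sizes:
import pyodbc

# Illustrative only: table, column, and procedure names are placeholders
rows = [(r["primary_key"], r["payload"]) for r in delta_df.collect()]  # driver-side pull

conn = pyodbc.connect(sql_odbc_connection_string)  # assumed ODBC connection string
cursor = conn.cursor()
cursor.fast_executemany = True  # batch the parameterized inserts
cursor.executemany(
    "INSERT INTO staging_schema.table_xyz_intermediate (primary_key, payload) VALUES (?, ?)",
    rows,
)
cursor.execute("EXEC staging_schema.usp_merge_table_xyz")
conn.commit()
cursor.close()
conn.close()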
For high-throughput, commit-consistent ingestion, having staging tables remains the most maintainable, observable, and performance-safe choice.
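Since the staging write then sits on the critical path, it is worth tuning the JDBC append itself per the performance guidance linked below. A rough sketch - the partition count and batch size are placeholders to benchmark against your own workload:
# Illustrative tuning of the JDBC append into the staging table
# repartition() caps the number of concurrent connections into Hyperscale;
# batchsize controls rows per JDBC batch insert (the Spark default is 1000)
df.repartition(8) \
    .write \
    .format("jdbc") \
    .option("url", sql_jdbc_url) \
    .option("dbtable", "staging_schema.table_xyz_stg") \
    .option("batchsize", 10000) \
    .mode("append") \
    .save()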
Official guidance on JDBC best practices for Databricks: https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/sql-databases#performance-tips
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.