Hi @Janice Chi
Your current pattern - ingesting CDC data via Kafka → Databricks → Delta → Azure SQL Hyperscale - is well-structured, and the questions on insert-only vs. insert/update handling are very valid for large-scale, high-throughput ingestion pipelines.
Let me address each of your questions, while also confirming the parts already covered earlier:
As mentioned before, a direct MERGE INTO between Delta and Azure SQL Hyperscale is not supported, since the two are based on different engines (a file-based lakehouse engine vs. a relational engine). Instead, you must process the Delta data in Spark and use JDBC to write into SQL.
Now building on that - here's how to handle upserts and inserts.
Do we need a staging table in Hyperscale before MERGE INTO?
Yes – your assumption is correct.
Using a staging table in Azure SQL Hyperscale is a recommended practice when:
- You want to run a native MERGE (MERGE INTO) statement in T-SQL.
- You need deterministic and performant reconciliation using SQL-side logic.
Typical pattern:
# Step 1: Write the Delta CDC batch to a staging table in Hyperscale via JDBC
# (sql_jdbc_url is the Hyperscale JDBC connection string, including credentials)
df.write \
    .format("jdbc") \
    .option("url", sql_jdbc_url) \
    .option("dbtable", "staging_schema.table_xyz_stg") \
    .mode("append") \
    .save()
-- Step 2: In Azure SQL, use a stored procedure (or post-load step) to MERGE
MERGE target_table AS T
USING staging_table AS S
ON T.primary_key = S.primary_key
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;  -- T-SQL requires MERGE to be terminated with a semicolon
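If you want the Databricks job itself to trigger that MERGE (rather than an external orchestrator), one common option is a lightweight driver-side call. A minimal sketch, assuming pyodbc and the Microsoft ODBC Driver for SQL Server are installed on the cluster - the connection string and procedure name below are illustrative placeholders:
import pyodbc

# Illustrative: call a stored procedure in Hyperscale that wraps the MERGE above
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-server>.database.windows.net;"
    "Database=<your-db>;UID=<user>;PWD=<password>"
)
cursor = conn.cursor()
cursor.execute("EXEC staging_schema.usp_merge_table_xyz")  # hypothetical proc name
conn.commit()
cursor.close()
conn.close()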
This pattern helps with:
- Set-based processing on the SQL side, which is far faster than row-by-row updates.
- Avoiding round-trips between Databricks and SQL during upsert evaluation.
- A clear separation between CDC landing and the final commit.
Can insert-only tables be directly written without staging or MERGE?
Yes.
For the 750 insert-only tables:
- You can write directly from Databricks into Azure SQL over JDBC using .mode("append") - see the sketch after this list.
- No staging or upsert logic is needed.
- Just ensure idempotency or handle duplicate prevention if applicable (for example, by de-duplicating the batch on the primary key before the write).
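A minimal sketch of that path - table and column names are illustrative, and the dropDuplicates step is only needed if your CDC feed can replay events:
# Insert-only path: optionally de-duplicate the CDC batch on its key,
# then append straight into the target table over JDBC (no staging, no MERGE)
insert_df = delta_df.dropDuplicates(["primary_key"])

insert_df.write \
    .format("jdbc") \
    .option("url", sql_jdbc_url) \
    .option("dbtable", "prod_schema.table_xyz") \
    .mode("append") \
    .save()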
Can we avoid staging and do upsert logic via PySpark by looking up into Hyperscale?
Technically yes, but this is not recommended - here's why:
Challenges:
- JDBC doesn’t scale well for high-volume lookups or updates from Spark.
- Each lookup or update becomes a network-bound operation, which breaks Spark’s distributed nature.
- The Spark JDBC writer only supports append/overwrite - UPDATE statements have to be issued row by row (for example, inside foreachPartition), which quickly becomes a bottleneck.
- This can result in slowness, locking issues, and brittle logic at scale (especially at your CDC volume: ~180K events/minute).
Anti-pattern Example (not recommended):
from pyspark.sql.functions import col

# Load existing keys from Hyperscale - this pulls the entire target table over JDBC
existing_keys_df = spark.read \
    .format("jdbc") \
    .option("url", sql_jdbc_url) \
    .option("dbtable", "prod_schema.table_xyz") \
    .load()

# Join with the CDC Delta data to classify each row as insert vs. update
final_df = delta_df.join(existing_keys_df, "primary_key", "left_outer") \
    .withColumn("is_insert", col("existing_col").isNull())

# Then split final_df and write inserts/updates separately via JDBC - HIGHLY inefficient!
This approach introduces complexity, scales poorly, and often leads to timeouts or transaction failures.
Recommended approach for upserts without relying fully on staging tables?
If you want to avoid staging, the only relatively efficient alternative is:
- Use stored procedures in Hyperscale that accept table-valued parameters (TVPs), or bulk-insert into a memory-optimized/intermediate table, followed by a native SQL MERGE (a driver-side sketch follows this list).
- Even then, you still need some form of temporary or intermediate write.
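To make that concrete, here is a rough driver-side sketch using pyodbc's batched inserts (fast_executemany) into an illustrative intermediate table, followed by the MERGE procedure call. TVP parameter passing has its own pyodbc-specific wiring, so it is not shown here; also note that collect() pulls the micro-batch to the driver, which only works for modest batch sizes:
import pyodbc

# Illustrative only: table, column, and procedure names are placeholders
rows = [(r["primary_key"], r["payload"]) for r in delta_df.collect()]  # driver-side pull

conn = pyodbc.connect(sql_odbc_connection_string)  # assumed ODBC connection string
cursor = conn.cursor()
cursor.fast_executemany = True  # batch the parameterized inserts
cursor.executemany(
    "INSERT INTO staging_schema.table_xyz_intermediate (primary_key, payload) VALUES (?, ?)",
    rows,
)
cursor.execute("EXEC staging_schema.usp_merge_table_xyz")
conn.commit()
cursor.close()
conn.close()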
For high-throughput, commit-consistent ingestion, having staging tables remains the most maintainable, observable, and performance-safe choice.
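Since the staging write then sits on the critical path, it is worth tuning the JDBC append itself per the performance guidance linked below. A rough sketch - the partition count and batch size are placeholders to benchmark against your own workload:
# Illustrative tuning of the JDBC append into the staging table
# repartition() caps the number of concurrent connections into Hyperscale;
# batchsize controls rows per JDBC batch insert (the Spark default is 1000)
df.repartition(8) \
    .write \
    .format("jdbc") \
    .option("url", sql_jdbc_url) \
    .option("dbtable", "staging_schema.table_xyz_stg") \
    .option("batchsize", 10000) \
    .mode("append") \
    .save()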
Official guidance on JDBC best practices for Databricks: https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/sql-databases#performance-tips
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.