Real-time Reconciliation Between Live DB2 and Azure SQL in CDC Streaming

Janice Chi 580 Reputation points
2025-07-31T08:06:28.55+00:00

Question:

We’re ingesting CDC data from IBM DB2 into Azure SQL Hyperscale via Kafka + Databricks Structured Streaming. Our challenge is reconciling live DB2 tables with their corresponding Hyperscale rows. Since DB2 is a live system, a row may be updated multiple times while its CDC event is still in flight to Hyperscale, so real-time value comparison is non-deterministic. Our questions are:

1. What is Microsoft’s recommended approach to accurate row-level or hash-based reconciliation in such real-time CDC use cases?
2. Is it advisable to use commit timestamp ranges (JRN_TIMESTAMP) from the CDC payload to snapshot-filter DB2 source rows for matching?
3. Are there any built-in tools or connectors (in ADF or Azure SQL) that support commit-aware reconciliation logic?
4. In the absence of full commit sync, is it acceptable to reconcile only the CDC payload vs. Hyperscale, skipping DB2 for real-time verification?
5. What are the known limitations of reconciling a live source like DB2 against a downstream system like Hyperscale, and how can they be mitigated?

Azure Event Hubs

1 answer

  Smaran Thoomu 28,310 Reputation points Microsoft External Staff Moderator
    2025-07-31T09:07:11.61+00:00

    Hi Janice Chi,
    Thanks for posting your query.
    Reconciling real-time CDC data against a live DB2 source can’t guarantee deterministic results, because a row can change after its event is captured but before it lands in Hyperscale. In most CDC implementations, real-time reconciliation is therefore done between the CDC payload and the target, not against the live source.

    A few points based on what you have shared:

    • Using JRN_TIMESTAMP for snapshot-based reconciliation is common practice. You can define soft windows (e.g., “up to T”) and compare only the merged data in Hyperscale vs. the staged CDC events in that window.
    • Since DB2 is live, comparing values against the source at runtime will give inconsistent results. Instead, validate:
      • CDC completeness (i.e., no drops or duplicates)
      • Merge correctness (row-count or hash comparison)
    • If your CDC stream doesn’t include a txn_id and commit timestamps are all you have, that’s fine; just make sure they’re used consistently to bracket the window.
    • There’s no out-of-the-box commit-aware reconciliation in ADF or Azure SQL today. This logic usually lives in your own framework, and from earlier threads it looks like you already have control tables tracking offsets and SHA checks, so you’re on the right path.
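    As a minimal sketch of the hash-based merge-correctness check described above: compute a deterministic hash per row over a fixed column order, then diff the latest CDC image per key against the Hyperscale rows for the same window. Column and key names here are illustrative, and the sketch assumes each CDC event carries the full after-image and that duplicates have already been collapsed to the latest image per key.

    ```python
    import hashlib

    def row_hash(row: dict, columns: list) -> str:
        """Deterministic SHA-256 over a fixed column order.

        Each value is length-prefixed so that NULLs and separator
        characters inside values cannot collide.
        """
        parts = []
        for col in columns:
            value = row.get(col)
            canon = "\x00" if value is None else str(value)
            parts.append(f"{len(canon)}:{canon}")
        return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

    def reconcile_window(cdc_rows, target_rows, key_cols, value_cols):
        """Diff the latest CDC image per key against target rows.

        Returns (missing_in_target, mismatched, unexpected_in_target),
        each as a list of key tuples.
        """
        def key_of(row):
            return tuple(row[c] for c in key_cols)

        cdc_hashes = {key_of(r): row_hash(r, value_cols) for r in cdc_rows}
        tgt_hashes = {key_of(r): row_hash(r, value_cols) for r in target_rows}

        missing = [k for k in cdc_hashes if k not in tgt_hashes]
        mismatched = [k for k in cdc_hashes
                      if k in tgt_hashes and cdc_hashes[k] != tgt_hashes[k]]
        unexpected = [k for k in tgt_hashes if k not in cdc_hashes]
        return missing, mismatched, unexpected
    ```

    In practice you would run the same hash expression on both sides (e.g., in Spark and in T-SQL) and compare only digests, so full rows never leave either system.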

    As for real-time vs. catch-up: most customers run strict reconciliation during catch-up (bounded) and limit real-time checks to Hyperscale ingestion consistency only.
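    For the bounded catch-up check, CDC completeness (no drops or duplicates) can be verified inside each window before any value comparison. The sketch below assumes a hypothetical dense, monotonically increasing journal sequence number (e.g., a JRN_SEQ field) for the table being reconciled; if your journal is shared across tables the sequence won’t be dense, and per-window row-count comparison is the fallback.

    ```python
    def check_cdc_completeness(seq_numbers, expected_start, expected_end):
        """Detect dropped or duplicated CDC events in a bounded window.

        seq_numbers: sequence values observed in the staged CDC events
        for one window; expected_start/expected_end bracket the window
        (both inclusive). Returns (gaps, duplicates) as sorted lists.
        """
        seen = list(seq_numbers)
        counts = {}
        for s in seen:
            counts[s] = counts.get(s, 0) + 1
        duplicates = sorted(s for s, n in counts.items() if n > 1)
        expected = set(range(expected_start, expected_end + 1))
        gaps = sorted(expected - set(seen))
        return gaps, duplicates
    ```

    Any window that reports gaps or duplicates is flagged in the control table and replayed, rather than letting a value-level mismatch surface later.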

    Also, we’d appreciate it if you could mark any earlier answers that helped as Accepted, so the threads stay clean and useful for others. And if future queries are unrelated, posting them as new threads helps us give more targeted replies.

    I hope this information helps. Please do let us know if you have any further queries.


    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

