Real-time Reconciliation Between Live DB2 and Azure SQL in CDC Streaming

Janice Chi 580 Reputation points
2025-07-31T08:06:28.55+00:00

Question:

We’re ingesting CDC data from IBM DB2 into Azure SQL Hyperscale via Kafka + Databricks Structured Streaming. Our challenge is reconciling live DB2 tables with their corresponding Hyperscale rows. Since DB2 is a live system, a row may be updated multiple times while its CDC event is still in flight to Hyperscale, so real-time value comparison is non-deterministic. Our questions are:

1. What is Microsoft’s recommended approach to accurate row-level or hash-based reconciliation in such real-time CDC use cases?
2. Is it advisable to use commit timestamp ranges (JRN_TIMESTAMP) from the CDC payload to snapshot-filter DB2 source rows for matching?
3. Are there any built-in tools or connectors (in ADF or Azure SQL) that support commit-aware reconciliation logic?
4. In the absence of full commit sync, is it acceptable to reconcile only the CDC payload vs. Hyperscale, skipping DB2 for real-time verification?
5. What are the known limitations of reconciling a live source like DB2 against a downstream system like Hyperscale, and how can they be mitigated?

Azure Event Hubs

1 answer

  Smaran Thoomu 28,310 Reputation points Microsoft External Staff Moderator
    2025-07-31T09:07:11.61+00:00

    Hi Janice Chi,
    Thanks for posting your query.
    Reconciling real-time CDC data against a live DB2 source can’t guarantee deterministic results, because a row can change after its event is captured but before it lands in Hyperscale. In most CDC implementations, real-time reconciliation is therefore done between the CDC payload and the target, not against the live source.

    A few points based on what you have shared:

    • Using JRN_TIMESTAMP for snapshot-based reconciliation is common practice. You can define soft windows (e.g., “up to T”) and compare only the merged data in Hyperscale vs. the staged CDC events in that window.
    • Since DB2 is live, comparing values against the source at runtime will give inconsistent results. Instead, validate:
      • CDC completeness (i.e., no drops or duplicates)
      • Merge correctness (row-count or hash comparison)
    • If your CDC stream doesn’t include a txn_id and commit timestamps are all you have, that’s fine; just make sure they’re used consistently to bracket the window.
    • There’s no out-of-the-box commit-aware reconciliation in ADF or Azure SQL today. This logic usually lives in your own framework, and from earlier threads it looks like you already have control tables tracking offsets and SHA checks, so you’re on the right path.
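    As a minimal sketch of the hash-based merge-correctness check described above: compute a deterministic hash per row over a fixed column order, then diff the latest CDC image per key against the Hyperscale rows for the same window. Column and key names here are illustrative, and the sketch assumes each CDC event carries the full after-image and that duplicates have already been collapsed to the latest image per key.

    ```python
    import hashlib

    def row_hash(row: dict, columns: list) -> str:
        """Deterministic SHA-256 over a fixed column order.

        Each value is length-prefixed so that NULLs and separator
        characters inside values cannot collide.
        """
        parts = []
        for col in columns:
            value = row.get(col)
            canon = "\x00" if value is None else str(value)
            parts.append(f"{len(canon)}:{canon}")
        return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

    def reconcile_window(cdc_rows, target_rows, key_cols, value_cols):
        """Diff the latest CDC image per key against target rows.

        Returns (missing_in_target, mismatched, unexpected_in_target),
        each as a list of key tuples.
        """
        def key_of(row):
            return tuple(row[c] for c in key_cols)

        cdc_hashes = {key_of(r): row_hash(r, value_cols) for r in cdc_rows}
        tgt_hashes = {key_of(r): row_hash(r, value_cols) for r in target_rows}

        missing = [k for k in cdc_hashes if k not in tgt_hashes]
        mismatched = [k for k in cdc_hashes
                      if k in tgt_hashes and cdc_hashes[k] != tgt_hashes[k]]
        unexpected = [k for k in tgt_hashes if k not in cdc_hashes]
        return missing, mismatched, unexpected
    ```

    In practice you would run the same hash expression on both sides (e.g., in Spark and in T-SQL) and compare only digests, so full rows never leave either system.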

    As for real-time vs. catch-up: most customers run strict reconciliation during catch-up (bounded) and limit real-time checks to Hyperscale ingestion consistency only.
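    For the bounded catch-up check, CDC completeness (no drops or duplicates) can be verified inside each window before any value comparison. The sketch below assumes a hypothetical dense, monotonically increasing journal sequence number (e.g., a JRN_SEQ field) for the table being reconciled; if your journal is shared across tables the sequence won’t be dense, and per-window row-count comparison is the fallback.

    ```python
    def check_cdc_completeness(seq_numbers, expected_start, expected_end):
        """Detect dropped or duplicated CDC events in a bounded window.

        seq_numbers: sequence values observed in the staged CDC events
        for one window; expected_start/expected_end bracket the window
        (both inclusive). Returns (gaps, duplicates) as sorted lists.
        """
        seen = list(seq_numbers)
        counts = {}
        for s in seen:
            counts[s] = counts.get(s, 0) + 1
        duplicates = sorted(s for s, n in counts.items() if n > 1)
        expected = set(range(expected_start, expected_end + 1))
        gaps = sorted(expected - set(seen))
        return gaps, duplicates
    ```

    Any window that reports gaps or duplicates is flagged in the control table and replayed, rather than letting a value-level mismatch surface later.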

    Also, we’d appreciate it if you could mark any earlier answers that helped as Accepted, so the threads stay clean and useful for others. And if future queries are unrelated, posting them as new threads helps us give more targeted replies.

    I hope this information helps. Please do let us know if you have any further queries.


    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

