Kafka offset window from DB2 snapshot timestamp for commit-aware CDC Catch-Up

Janice Chi 580 Reputation points
2025-07-31T14:34:36.3866667+00:00

Question: In our CDC ingestion pipeline, IBM CDC pushes change data from DB2 to Kafka, and we initiate the Catch-Up phase by aligning with the DB2 snapshot timestamp. To ensure accurate commit-aware reconciliation, we need to extract a precise Kafka offset window (start_offset to end_offset) based on commit timestamps (e.g., jrn_timestamp in CDC payload).

Our approach uses Databricks startingOffsetsByTimestamp API to determine the offset corresponding to the DB2 snapshot timestamp. We then extract CDC messages between this start_offset and an end_offset frozen just before the batch starts. Each Kafka message includes jrn_timestamp inside the payload for commit-aware filtering and reconciliation.

Could you please confirm:

Is using startingOffsetsByTimestamp the recommended way to derive Kafka start_offset aligned to a DB2 snapshot timestamp?

Is it acceptable to treat the CDC payload’s jrn_timestamp as the commit time anchor for reconciliation windows?

Are there best practices for freezing the end_offset boundary to maintain consistency in bounded Catch-Up batches?

Does Microsoft recommend reconciling against DB2 for these bounded windows, or only against the CDC payload?

Any limitations in Databricks or Kafka APIs we should be aware of for timestamp-to-offset lookups across many partitions?

Azure Event Hubs

1 answer

  1. Venkat Reddy Navari 5,255 Reputation points Microsoft External Staff Moderator
    2025-07-31T16:21:27.24+00:00

    Hi Janice Chi

    Is startingOffsetsByTimestamp the right way to align with a DB2 snapshot timestamp?

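    Yes, that’s a standard and practical approach. It uses Kafka’s offsetsForTimes under the hood to find, for each partition, the earliest offset whose record timestamp is at or after the given timestamp. The lookup compares against the Kafka record timestamp, not against fields inside the payload, so it marks your starting point reliably only as long as the DB2 snapshot timestamp lines up with the timestamps on the records.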
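
    As a minimal sketch of this step, the snippet below builds the JSON string that startingOffsetsByTimestamp expects (topic, then partition, then epoch milliseconds). The topic name, the partition list, and the assumption that the DB2 snapshot timestamp arrives as an ISO-8601 string in UTC are illustrative and not taken from your pipeline.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Iterable

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(snapshot_ts: str) -> int:
    """Convert an ISO-8601 timestamp string to epoch milliseconds."""
    dt = datetime.fromisoformat(snapshot_ts)
    if dt.tzinfo is None:
        # Assumption: a timestamp without a zone is already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer division avoids floating-point rounding of the milliseconds.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def starting_offsets_by_timestamp(topic: str, partitions: Iterable[int], snapshot_ts: str) -> str:
    """Build the per-partition JSON value for the startingOffsetsByTimestamp option."""
    millis = to_epoch_millis(snapshot_ts)
    return json.dumps({topic: {str(p): millis for p in partitions}})


# Hypothetical topic with two partitions.
option_value = starting_offsets_by_timestamp("cdc.topic", [0, 1], "1970-01-01T00:00:01")
print(option_value)
```

    The resulting string would be passed to the Kafka source as the value of the startingOffsetsByTimestamp option.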

    Can the jrn_timestamp from the CDC payload be used as the commit anchor?

    Absolutely. Using jrn_timestamp is a good way to drive commit-aware logic, especially if you need to reconcile or filter out-of-order records. Just make sure the field is consistently populated and reflects DB2 commit time.
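
    A rough sketch of commit-aware filtering is shown here. It assumes the payload is JSON with a top-level jrn_timestamp field holding an ISO-8601 string without a time zone; your actual IBM CDC payload layout and timestamp format may differ. Records with a missing field are set aside rather than silently dropped, which makes it easy to check that the field is consistently populated.

```python
import json
from datetime import datetime
from typing import Iterable, List, Tuple


def split_by_commit_window(
    payloads: Iterable[str], start: datetime, end: datetime
) -> Tuple[List[str], List[str], List[str]]:
    """Split payloads into (inside, outside, missing) for the window [start, end)."""
    inside, outside, missing = [], [], []
    for payload in payloads:
        raw = json.loads(payload).get("jrn_timestamp")
        if not raw:
            # Field absent or empty: keep the record for inspection.
            missing.append(payload)
            continue
        commit_ts = datetime.fromisoformat(raw)
        # Half-open window so adjacent batches never share a record.
        if start <= commit_ts < end:
            inside.append(payload)
        else:
            outside.append(payload)
    return inside, outside, missing


# Hypothetical sample payloads.
sample = [
    '{"jrn_timestamp": "2025-07-31T10:00:00"}',
    '{"jrn_timestamp": "2025-07-31T12:00:00"}',
    '{"op": "U"}',
]
inside, outside, missing = split_by_commit_window(
    sample, datetime(2025, 7, 31, 9), datetime(2025, 7, 31, 12)
)
print(len(inside), len(outside), len(missing))
```

    The half-open window is a design choice: a record whose commit time equals the upper bound belongs to the next batch.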

    Any best practices for freezing end_offset for bounded catch-up batches?

    A common pattern is to capture the latest offsets (or the wall-clock time) immediately before starting the batch, then filter records on jrn_timestamp to stay within that window. You can store the captured boundary as a metadata marker and use it as the start of the next batch to prevent overlap between batches.
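
    The pattern above can be sketched as follows. The snippet assumes the per-partition latest offsets have already been captured with a Kafka client of your choice, and it only builds the startingOffsets and endingOffsets option values for a batch read plus a metadata marker to persist. The function names, the marker layout, and the topic name are hypothetical.

```python
import json
from typing import Dict, Tuple


def offsets_json(topic: str, offsets: Dict[int, int]) -> str:
    """Serialize {partition: offset} as per-topic JSON for the Kafka source options."""
    return json.dumps({topic: {str(p): offsets[p] for p in sorted(offsets)}})


def freeze_batch_window(
    topic: str,
    start_offsets: Dict[int, int],
    latest_offsets: Dict[int, int],
    frozen_at: str,
) -> Tuple[Dict[str, str], dict]:
    """Return (batch read options, metadata marker) for one bounded batch."""
    options = {
        "subscribe": topic,
        "startingOffsets": offsets_json(topic, start_offsets),
        # Frozen once, before the batch starts; never recomputed mid-batch.
        "endingOffsets": offsets_json(topic, latest_offsets),
    }
    marker = {
        "topic": topic,
        "frozen_at": frozen_at,
        "start_offsets": dict(start_offsets),
        "end_offsets": dict(latest_offsets),
    }
    return options, marker


options, marker = freeze_batch_window(
    "cdc.topic", {0: 120, 1: 40}, {0: 200, 1: 90}, "2025-07-31T16:00:00Z"
)
# The next batch starts where this one ended, so batches do not overlap.
next_start_offsets = marker["end_offsets"]
print(options["endingOffsets"])
```

    The options dictionary would be applied to a batch (not streaming) Kafka read, and the marker written to whatever metadata store the pipeline already uses.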

    Should reconciliation happen against DB2 or just based on the CDC payload?

    Typically, reconciliation is done only against the CDC payload, especially in streaming pipelines. It avoids putting extra load on DB2 and keeps your pipeline loosely coupled. As long as jrn_timestamp is reliable, that should be sufficient.

    Any known limitations in timestamp-to-offset lookups across many partitions?

    Yes, there are a few things to be aware of:

    • offsetsForTimes is a blocking call that must resolve an offset for every requested partition, so it can be slow across a large number of partitions.
    • If a partition has no messages after the given timestamp, the result will be null, which your logic must handle.
    • Retention policies can cause older offsets to be unavailable (leading to OffsetOutOfRange errors).
    • Kafka timestamps may not match your jrn_timestamp field, depending on how they are assigned.
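
    To illustrate the null-result and retention points, here is a small sketch that validates per-partition start offsets before a batch is read. It assumes the lookup results and the earliest and latest offsets per partition were already fetched with a Kafka client; the function name and the decision to treat a null result as an empty range are assumptions, not a prescribed behavior.

```python
from typing import Dict, List, Optional, Tuple


def plan_partition_ranges(
    lookup: Dict[int, Optional[int]],
    earliest: Dict[int, int],
    latest: Dict[int, int],
) -> Tuple[Dict[int, Tuple[int, int]], List[int]]:
    """Return ({partition: (start, end)}, partitions whose start offset has aged out)."""
    ranges: Dict[int, Tuple[int, int]] = {}
    aged_out: List[int] = []
    for partition in sorted(latest):
        end = latest[partition]
        start = lookup.get(partition)
        if start is None:
            # No record at or after the timestamp: nothing to catch up here.
            ranges[partition] = (end, end)
        elif start < earliest[partition]:
            # Retention removed the requested offset; reading would lose data.
            aged_out.append(partition)
        else:
            ranges[partition] = (start, end)
    return ranges, aged_out


ranges, aged_out = plan_partition_ranges(
    lookup={0: 120, 1: None, 2: 5},
    earliest={0: 100, 1: 40, 2: 50},
    latest={0: 200, 1: 90, 2: 80},
)
print(ranges, aged_out)
```

    Partitions reported as aged out are the ones that would otherwise surface as out-of-range errors, so they are better handled explicitly (for example by re-snapshotting) than discovered mid-batch.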

    You can find more details in the official Microsoft docs here: https://learn.microsoft.com/en-us/azure/databricks/connect/streaming/kafka#startingoffsetsbytimestamp


    I hope this information helps. Please do let us know if you have any further queries.
