Offset Alignment Between DB2 Snapshot and Kafka Catch-Up Phase

Janice Chi 580 Reputation points
2025-07-31T08:37:47.2766667+00:00

In our current CDC-based migration project, we are moving data from on-prem IBM DB2 to Azure SQL Hyperscale using a hybrid approach that involves:

  • A FlashCopy-based historical snapshot

  • A bounded CDC Catch-Up phase (Kafka)

  • Real-time CDC streaming that follows (also via Kafka)

The snapshot process may take multiple hours, so we must precisely capture and extract only the Kafka CDC events that occurred after the snapshot started and before streaming begins, without data overlap or loss.

  1. **Offset Window Determination**:
     • What are the best practices to reliably determine the start_offset and, especially, the end_offset for Catch-Up? Please note that we have 800 Kafka topics for 800 tables, so how should we start?
     • Specifically, how can we derive the exact offset corresponding to the snapshot start time and freeze the offset range for Catch-Up CDC?

  2. **Kafka Timestamp vs. Offset Accuracy**:
     • Is it recommended to use Kafka timestamp-based lookups (e.g., startingOffsetsByTimestamp) to identify the first CDC message after snapshot initiation?
     • Are there known pitfalls when relying on timestamp-aligned offsets from Kafka, especially in the context of tools like IBM InfoSphere CDC?

  3. **CDC Event Capture Boundaries**:
     • What mechanism do you recommend to avoid overlapping ingestion between the historical load (snapshot), the Catch-Up phase (bounded CDC), and real-time streaming (unbounded)?
     • How do we ensure a seamless, commit-consistent handoff between the Catch-Up and Streaming phases?

  4. **Kafka Partition Management**:
     • In our case, Kafka topics are partitioned per DB2 table, and we maintain per-partition control in a metadata table.
     • Is there a recommended way to ensure **each Kafka partition’s offset range is aligned to the snapshot window**, across hundreds of topics?

  5. **Tools/Utilities for Offset Capture**:
     • Are there any recommended tools/utilities/APIs (e.g., the Kafka AdminClient or Spark Structured Streaming APIs) that can help us:
       • Programmatically derive timestamp-based start offsets
       • Freeze end offsets just before Catch-Up starts
       • Validate the offset range across retries?

💡 Additional Clarifications (Project-Specific):

  • We are using bounded offsets for Catch-Up (i.e., start and end offsets set in a control table)

  • We avoid Kafka Connect; the CDC engine (IBM InfoSphere) pushes changes directly to Kafka

  • From where, and how, will the streaming job resume?

  • All pipelines (Historical, Catch-Up, Streaming) are modular and Databricks-driven

  • No intermediate ADLS layer in Streaming; Catch-Up uses Raw + Bronze staging

1 answer

  1. Smaran Thoomu 28,225 Reputation points Microsoft External Staff Moderator
    2025-07-31T09:29:44.45+00:00

    Hi @Janice Chi
    Thanks again for the detailed context. This topic closely connects to your earlier queries on Catch-Up to Streaming transitions, checkpointing, and offset handoff - so I’ll keep this scoped to avoid contradictions.

    Offset Alignment Strategy (Snapshot → Catch-Up → Streaming)

    Start Offset (Catch-Up):

    • The safest approach is to capture the Kafka offsets by timestamp just after the FlashCopy snapshot begins.
    • You can use the Kafka consumer API (offsetsForTimes) or Spark’s Structured Streaming Kafka source with startingOffsetsByTimestamp.
    • This gives you the start_offset for each partition/topic in a programmatic, timestamp-aligned way (see the lookup sketch below).
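
    A minimal sketch of that timestamp lookup using kafka-python; the broker address, topic names, and snapshot epoch are placeholders, and in practice the loop would run over the full 800-topic list from your metadata table:

    ```python
    # Derive per-partition start offsets from the FlashCopy snapshot start time.
    from kafka import KafkaConsumer, TopicPartition

    SNAPSHOT_START_MS = 1722415067000  # epoch millis of snapshot start (example value)

    consumer = KafkaConsumer(bootstrap_servers="broker:9092")

    def start_offsets_for_topic(topic, ts_ms):
        partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
        # offsets_for_times returns the earliest offset whose timestamp >= ts_ms,
        # or None when no message at/after that time exists on the partition yet
        result = consumer.offsets_for_times({tp: ts_ms for tp in partitions})
        return {
            # None means "nothing to catch up yet": fall back to the log end offset
            tp: (oat.offset if oat is not None else consumer.end_offsets([tp])[tp])
            for tp, oat in result.items()
        }

    for topic in ["cdc.table_001", "cdc.table_002"]:  # extend to all 800 topics
        offsets = start_offsets_for_topic(topic, SNAPSHOT_START_MS)
        print(topic, offsets)  # persist these to your control table instead
    ```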

    End Offset (Catch-Up):

    • End offset must be frozen immediately before starting real-time streaming to avoid overlap.
    • This should also be timestamp-based: query offsetsForTimes using the Streaming job start time.
    • Capture and persist the end offsets to your control table per topic/partition - as you’re already doing.

    Note: Don’t rely on Kafka committed offsets alone; always freeze offset ranges explicitly (see the sketch below).
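
    One way to freeze that boundary, sketched with kafka-python: taking end_offsets() at the freeze moment is a simpler capture than an offsetsForTimes lookup at the streaming start time, since it directly returns the next offset to be produced per partition. The topic list and the print stand in for your metadata table and control-table upsert:

    ```python
    # Freeze Catch-Up end offsets immediately before the streaming job starts.
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="broker:9092")

    def freeze_end_offsets(topic):
        # end_offsets() returns the next offset to be produced per partition,
        # i.e. an exclusive upper bound for the bounded Catch-Up read
        partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
        return consumer.end_offsets(partitions)

    for topic in ["cdc.table_001", "cdc.table_002"]:  # extend to all 800 topics
        for tp, end_offset in freeze_end_offsets(topic).items():
            # persist per topic/partition; replace print with your control-table upsert
            print(tp.topic, tp.partition, end_offset)
    ```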

    Catch-Up to Streaming Handoff (Bounded to Unbounded)

    Set your Streaming job startingOffsets explicitly to the frozen end_offset of Catch-Up.

    • That ensures continuity without overlap or loss - assuming Kafka retention hasn't expired.
    • You’re already staging Catch-Up CDC to Bronze and streaming directly - this hybrid works well if the boundary is precise and commit-aware (see the wiring sketch below).
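
    A wiring sketch for that handoff: the frozen end offsets are read back from the control table (the inline dict here is illustrative) and passed to the streaming read as the startingOffsets JSON that Spark’s Kafka source expects:

    ```python
    # Start the unbounded streaming read exactly at the frozen Catch-Up boundary.
    import json

    # frozen end offsets loaded from the control table (illustrative values;
    # partition ids are JSON string keys, offsets are numbers)
    frozen = {"cdc.table_001": {"0": 10542, "1": 9871}}

    stream_df = (
        spark.readStream  # `spark` is the ambient SparkSession on Databricks
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", ",".join(frozen))
        .option("startingOffsets", json.dumps(frozen))  # resume exactly at the boundary
        .load()
    )
    ```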

    Timestamp accuracy considerations (IBM CDC + Kafka)

    • Kafka timestamps are typically ingestion-time, not DB commit-time. That’s fine as long as you use them consistently across both ends (Catch-Up start and Streaming start).
    • With IBM CDC (non-Kafka Connect), make sure events are serialized in order and that the Kafka ingestion timestamp roughly aligns with commit time. If not, prefer a commit-time metadata field carried in the payload, such as JRN_TIMESTAMP (a drift spot-check follows).
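
    A hedged spot-check for that drift, assuming the CDC payload is JSON and JRN_TIMESTAMP is carried as epoch millis; adjust the parsing to your actual record schema:

    ```python
    # Compare Kafka ingestion time against the DB2 commit time in the payload.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "cdc.table_001",
        bootstrap_servers="broker:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating after 5s of silence
    )

    for record in consumer:
        payload = json.loads(record.value)
        # JRN_TIMESTAMP is assumed to be epoch millis here; parse accordingly
        # if your CDC engine emits it as a formatted string instead
        commit_ms = payload.get("JRN_TIMESTAMP")
        if commit_ms is not None:
            drift_ms = record.timestamp - int(commit_ms)  # ingestion minus commit time
            print(record.partition, record.offset, drift_ms)
    ```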

    Recommended Tools

    • Kafka Consumer APIs: Use KafkaConsumer.offsetsForTimes() to derive offsets by timestamp (for each topic-partition).
    • Databricks Structured Streaming (Manual Bounded Read):
      • Use .option("startingOffsets", <your_controlled_start>)
      • Use .option("endingOffsets", <your_frozen_end>)
    • Validation: Add checks to compare record counts or hash summaries from snapshot → Catch-Up → Streaming, using your control table (a bounded-read example follows).
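
    Putting these together, a sketch of the bounded Catch-Up read plus a count check; the offset dicts are illustrative, and the count-equals-span assertion assumes no log compaction or transactional control records inside the range:

    ```python
    # Bounded Catch-Up read between the frozen start and end offsets.
    import json

    start = {"cdc.table_001": {"0": 10000}}  # frozen start offsets (illustrative)
    end = {"cdc.table_001": {"0": 10542}}    # frozen end offsets (exclusive)

    catchup_df = (
        spark.read  # batch read; `spark` is the ambient SparkSession on Databricks
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "cdc.table_001")
        .option("startingOffsets", json.dumps(start))
        .option("endingOffsets", json.dumps(end))
        .load()
    )

    # The bounded read is deterministic: re-running with the same frozen offsets
    # after a retry should reproduce exactly the same count.
    expected = sum(end[t][p] - start[t][p] for t in end for p in end[t])
    assert catchup_df.count() == expected
    ```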

    This model gives you seamless, partition-aligned, commit-consistent CDC flow across all 800 topics - and is in line with your modular Databricks-driven orchestration.

    If any of the previous responses helped clarify your questions, please consider marking them as Accepted. It helps us keep track and supports the broader community too.

