Hi @Janice Chi
Thanks again for the detailed context. This topic closely connects to your earlier queries on Catch-Up to Streaming transitions, checkpointing, and offset handoff - so I’ll keep this scoped to avoid contradictions.
Offset Alignment Strategy (Snapshot → Catch-Up → Streaming)
Start Offset (Catch-Up):
- The safest approach is to capture the Kafka offsets by timestamp just after the FlashCopy snapshot begins.
- You can use the Kafka consumer API (`offsetsForTimes`) or Spark's Structured Streaming Kafka source with `startingOffsetsByTimestamp`.
- This gives you the `start_offset` for each topic/partition in a programmatic, timestamp-aligned way.
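As an illustration of the timestamp-based lookup, here is a minimal sketch assuming the kafka-python client; the helper names, topic names, and JSON shape for the control table are assumptions, not your existing code:

```python
# Hypothetical sketch: resolve per-partition start offsets from the
# FlashCopy snapshot timestamp via offsets_for_times (kafka-python).
import json

def build_offsets_json(offsets: dict) -> str:
    """Format {topic: {partition: offset}} as the JSON string that
    Spark's Kafka source expects for startingOffsets/endingOffsets."""
    return json.dumps({t: {str(p): o for p, o in parts.items()}
                       for t, parts in offsets.items()})

def start_offsets_for_snapshot(consumer, topics, snapshot_ts_ms):
    """Query Kafka for the first offset at or after snapshot_ts_ms,
    per partition. `consumer` is a kafka-python KafkaConsumer."""
    from kafka import TopicPartition  # kafka-python client (assumed)
    request = {}
    for topic in topics:
        for p in consumer.partitions_for_topic(topic) or []:
            request[TopicPartition(topic, p)] = snapshot_ts_ms
    result = consumer.offsets_for_times(request)
    offsets = {}
    for tp, ot in result.items():
        if ot is not None:  # None: no message at/after the timestamp
            offsets.setdefault(tp.topic, {})[tp.partition] = ot.offset
    return offsets

# Example of the JSON shape Spark expects:
# build_offsets_json({"cdc.orders": {0: 1500, 1: 1421}})
```

The same pattern works for freezing the end offsets later, just with the Streaming job's start timestamp instead of the snapshot timestamp.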
End Offset (Catch-Up):
- End offset must be frozen immediately before starting real-time streaming to avoid overlap.
- This should also be timestamp-based: query `offsetsForTimes` using the Streaming job's start time.
- Capture and persist the end offsets to your control table per topic/partition, as you're already doing.
Note: Don’t rely on Kafka committed offsets alone - always freeze offset ranges explicitly.
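To make the explicit freeze concrete, the frozen ranges could be flattened into control-table rows before upserting. This is only a sketch; the row schema and helper names are assumptions, since your actual control table layout may differ:

```python
# Hypothetical control-table row for one frozen topic/partition range.
from dataclasses import dataclass

@dataclass
class OffsetRange:
    topic: str
    partition: int
    start_offset: int  # frozen at snapshot time
    end_offset: int    # frozen just before Streaming starts

def to_control_rows(start: dict, end: dict) -> list:
    """Zip frozen start/end offset maps ({topic: {partition: offset}})
    into per-partition rows ready to persist in the control table."""
    rows = []
    for topic, parts in start.items():
        for p, so in sorted(parts.items()):
            rows.append(OffsetRange(topic, p, so, end[topic][p]))
    return rows
```

Persisting these rows (rather than relying on consumer-group commits) is what makes the Catch-Up boundary reproducible and auditable.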
Catch-Up to Streaming Handoff (Bounded to Unbounded)
- Set your Streaming job's `startingOffsets` explicitly to the frozen `end_offset` of Catch-Up.
- That ensures continuity without overlap or loss - assuming Kafka retention hasn't expired.
- You’re already staging Catch-Up CDC to Bronze and streaming directly - this hybrid works well if the boundary is precise and commit-aware.
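A minimal sketch of the handoff, assuming PySpark on Databricks; the function names are hypothetical, and the offset JSON strings would come from your control table:

```python
# Hypothetical handoff sketch: the bounded Catch-Up read ends exactly
# where the unbounded streaming read begins. Offset JSON strings use the
# shape Spark's Kafka source expects, e.g. '{"topic": {"0": 1234}}'.

def catchup_read_options(bootstrap, topics, start_json, end_json):
    # Bounded batch read over the frozen Catch-Up offset range.
    return {
        "kafka.bootstrap.servers": bootstrap,
        "subscribe": ",".join(topics),
        "startingOffsets": start_json,
        "endingOffsets": end_json,
    }

def streaming_read_options(bootstrap, topics, end_json):
    # Unbounded streaming read resuming at the frozen Catch-Up end.
    return {
        "kafka.bootstrap.servers": bootstrap,
        "subscribe": ",".join(topics),
        "startingOffsets": end_json,  # frozen end_offset of Catch-Up
        "failOnDataLoss": "true",     # fail loudly if retention expired
    }

# On Databricks (assumed usage):
# catchup_df = spark.read.format("kafka") \
#     .options(**catchup_read_options(...)).load()
# stream_df = spark.readStream.format("kafka") \
#     .options(**streaming_read_options(...)).load()
```

The invariant to enforce is that the Catch-Up `endingOffsets` string and the Streaming `startingOffsets` string are byte-for-byte the same frozen value from the control table.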
Timestamp accuracy considerations (IBM CDC + Kafka)
- Kafka timestamps are typically ingestion-time, not DB commit-time. That’s fine as long as you use them consistently across both ends (Catch-Up start and Streaming start).
- With IBM CDC (non-Kafka Connect), make sure events are serialized in order and that the Kafka ingestion timestamp roughly aligns with commit time. If not, prefer a custom metadata field such as `JRN_TIMESTAMP`.
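If you do fall back to a payload field, the boundary check happens per record rather than via `offsetsForTimes`. A sketch, assuming JSON-serialized CDC events with an ISO-8601 `JRN_TIMESTAMP` field in UTC (both assumptions about your payload format):

```python
# Hypothetical filter: bound records by the CDC payload's own commit
# timestamp instead of the Kafka ingestion timestamp.
import json
from datetime import datetime, timezone

def within_catchup_window(record_value: bytes,
                          start_ms: int, end_ms: int) -> bool:
    """True if the event's JRN_TIMESTAMP (assumed ISO-8601, UTC) falls
    inside the half-open Catch-Up window [start_ms, end_ms)."""
    evt = json.loads(record_value)
    ts = datetime.fromisoformat(evt["JRN_TIMESTAMP"]) \
                 .replace(tzinfo=timezone.utc)
    ts_ms = int(ts.timestamp() * 1000)
    return start_ms <= ts_ms < end_ms
```

In that case you would still read a slightly wider offset range from Kafka and trim by commit time, accepting a small amount of re-read at the boundary.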
Recommended Tools
- Kafka Consumer APIs: use `KafkaConsumer.offsetsForTimes()` to derive offsets by timestamp for each topic-partition.
- Databricks Structured Streaming (manual bounded read):
  - Use `.option("startingOffsets", <your_controlled_start>)`
  - Use `.option("endingOffsets", <your_frozen_end>)`
- Validation: Add checks to compare record counts or hash summaries from snapshot → Catch-Up → Streaming, using your control table.
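One way to sketch that validation step; the helper names are assumptions, and the counts or fingerprints would come from your control table per phase (Snapshot, Catch-Up, Streaming):

```python
# Hypothetical reconciliation helpers for per-partition validation.
import hashlib

def fingerprint(keys) -> int:
    """Order-insensitive XOR fingerprint over record keys, so partition
    summaries can be compared across phases regardless of read order."""
    h = 0
    for k in keys:
        h ^= int.from_bytes(hashlib.sha256(k.encode()).digest()[:8], "big")
    return h

def diverging_partitions(expected: dict, actual: dict) -> list:
    """Return (topic, partition) keys whose (count, fingerprint) pairs
    differ between two phases; empty list means the phases reconcile."""
    return sorted(k for k in expected if expected[k] != actual.get(k))
```

An XOR fingerprint is cheap to aggregate incrementally per micro-batch, though a full hash summary gives stronger guarantees if you need them.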
This model gives you seamless, partition-aligned, commit-consistent CDC flow across all 800 topics - and is in line with your modular Databricks-driven orchestration.
If any of the previous responses helped clarify your questions, please consider marking them as Accepted. It helps us keep track and supports the broader community too.