Kafka CDC Topic Does Not Include Primary Key or Full Row

Janice Chi 580 Reputation points
2025-07-27T16:00:22.3566667+00:00

In our current Azure migration project, we are processing CDC events from IBM DB2 into Kafka using the IBM CDC engine. However, for certain DB2 tables:

  • There is no primary key defined in the source table.

  • Kafka CDC topics do not include a full-row image; they contain only the changed columns.

  • Kafka payloads do not include "before" values (no before-after image structure).

This poses a challenge in our Bronze layer (Databricks) during the MERGE INTO operation, since:

  • We cannot identify which row in the target the change belongs to.

  • Row-level hash logic fails because the incoming CDC record is partial (only 2–3 columns out of 25) and does not match existing full rows.

Given these limitations:

  1. Is there any supported approach in the Databricks/Azure ecosystem to handle CDC merges in the absence of both a PK and a full-row image?

  2. Can we reconstruct the row using metadata or any helper logic in Azure (e.g., from Hyperscale or Delta history)?

  3. If not, is append-only ingestion the only valid fallback (with timestamp partitioning + data expiry)?

  4. Does Microsoft recommend enabling full-row image or before-after image in the upstream CDC engine for better compatibility with Delta Lake MERGE?

We want to ensure correctness while avoiding unnecessary duplicate records or lost updates.

Azure Event Hubs

1 answer

  1. Venkat Reddy Navari 5,255 Reputation points Microsoft External Staff Moderator
    2025-07-28T03:34:42.09+00:00

    Hi Janice Chi, it looks like you're running into a common issue with CDC (Change Data Capture) events, especially when you don't have a primary key or full-row images. Here are a few suggestions and resources that might help you handle CDC merges more effectively in the Azure Databricks environment.

    Handling CDC Merges without PK or Full Row Image in Databricks/Azure:

    When there's no primary key or full-row image, Delta Lake's MERGE INTO functionality can still work, but you'll have to rely on some additional logic to handle the partial columns in the CDC records. One possible approach is to build a composite (surrogate) key from a combination of columns that, taken together, uniquely identifies a row, use that combination as the MERGE condition, and update only the columns that actually arrive in the CDC record (see the sketch below).
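
    A minimal sketch, assuming hypothetical table and column names (bronze.customer as the target, account_no + branch_code as a combination that identifies a row, and a micro-batch of partial CDC records registered as cdc_updates). On Databricks the spark session already exists; the builder lines are only needed with the open-source delta-spark package.

        # Minimal sketch (hypothetical names): MERGE keyed on a composite of columns
        # that together identify a row, updating only the columns the CDC event carries.
        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .appName("cdc-composite-key-merge")
                 .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                 .config("spark.sql.catalog.spark_catalog",
                         "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                 .getOrCreate())

        # Incoming partial CDC records for this micro-batch (hypothetical path/format)
        cdc_batch = spark.read.format("json").load("/mnt/raw/customer_cdc_batch/")
        cdc_batch.createOrReplaceTempView("cdc_updates")

        spark.sql("""
          MERGE INTO bronze.customer AS t
          USING cdc_updates AS s
          ON  t.account_no  = s.account_no
          AND t.branch_code = s.branch_code
          WHEN MATCHED THEN UPDATE SET
              t.balance     = s.balance,
              t.last_update = s.last_update
          WHEN NOT MATCHED THEN INSERT (account_no, branch_code, balance, last_update)
              VALUES (s.account_no, s.branch_code, s.balance, s.last_update)
        """)

    Whether this is safe depends entirely on whether such a column combination really is unique and present in every CDC event; if it is not, the append-only fallback discussed below is the more defensible option.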

    Can We Reconstruct Rows Using Metadata or Helper Logic in Azure?

    Azure doesn't provide built-in tooling to reconstruct rows in the absence of primary keys or full-row images. However, you can use partitioning and timestamp-based strategies to approximate the current state of a row from the partial events. For instance, Azure Data Factory with appropriate transformation logic could help, but you still won't get the accuracy that primary keys or row images would provide.
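
    For example, if the partial events are landed append-only with a commit timestamp, you can approximate the current state of a row by folding the events per assumed business key. A rough sketch, with hypothetical table and column names (bronze.customer_cdc_events, account_no/branch_code as the identifying combination, _commit_ts as the CDC commit timestamp) and assuming an existing spark session:

        # Rough sketch (hypothetical names): fold partial CDC events into the latest
        # known value per column, ordered by the CDC commit timestamp.
        from pyspark.sql import functions as F, Window

        events = spark.table("bronze.customer_cdc_events")   # append-only partial events

        fill = (Window.partitionBy("account_no", "branch_code")
                      .orderBy("_commit_ts")
                      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
        latest = (Window.partitionBy("account_no", "branch_code")
                        .orderBy(F.col("_commit_ts").desc()))

        reconstructed = (events
            .withColumn("balance",     F.last("balance",     ignorenulls=True).over(fill))
            .withColumn("status",      F.last("status",      ignorenulls=True).over(fill))
            .withColumn("last_update", F.last("last_update", ignorenulls=True).over(fill))
            .withColumn("_rn", F.row_number().over(latest))   # keep one row per key
            .filter("_rn = 1")
            .drop("_rn"))

    The result is only as good as the chosen key columns and timestamps, so treat it as a best-effort reconstruction rather than a source of truth.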

    Partitioning in Delta Lake: partition the ingested CDC data (for example, by event date or timestamp) so it can be processed incrementally. This is particularly useful for append-only ingestion.
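
    A sketch of what that could look like, assuming a hypothetical Kafka topic db2.customer.cdc and a Bronze Delta table bronze.customer_cdc_events partitioned by event date (broker address and paths are placeholders):

        # Sketch (hypothetical names): land every CDC event as-is, partitioned by
        # event date, so downstream jobs can process increments and expire old data.
        from pyspark.sql import functions as F

        raw_cdc = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "<broker>:9093")   # placeholder
            .option("subscribe", "db2.customer.cdc")              # hypothetical topic
            .load()
            .select(F.col("value").cast("string").alias("payload"),
                    F.col("timestamp").alias("_kafka_ts"))
            .withColumn("event_date", F.to_date("_kafka_ts")))

        (raw_cdc.writeStream
            .format("delta")
            .outputMode("append")
            .option("checkpointLocation", "/mnt/bronze/_checkpoints/customer_cdc")
            .partitionBy("event_date")
            .toTable("bronze.customer_cdc_events"))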

    Is Append-Only Ingestion the Valid Fallback?

    If you're not able to get full-row images or PKs, append-only ingestion is a reasonable fallback. This involves partitioning your data (e.g., by timestamp) and using data expiry to manage older records. Be mindful of potential duplicates, though; you'll need some kind of deduplication logic downstream.
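
    As a sketch of both pieces, assuming the append-only table from above, a raw payload column to fingerprint, and a 90-day retention policy (all hypothetical):

        # Sketch (hypothetical names/policy): drop exact duplicate events using a
        # fingerprint of the raw payload, then expire data past the retention window.
        from pyspark.sql import functions as F

        events = spark.table("bronze.customer_cdc_events")

        deduped = (events
            .withColumn("_event_hash", F.sha2(F.col("payload"), 256))
            .dropDuplicates(["_event_hash"]))

        # Data expiry: delete rows older than 90 days (assumed policy); run VACUUM
        # later to physically remove files once the Delta retention period allows it.
        spark.sql("""
          DELETE FROM bronze.customer_cdc_events
          WHERE event_date < date_sub(current_date(), 90)
        """)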

    For best practices around handling Delta tables in these scenarios, you can refer to Delta Lake Best Practices.

    Does Microsoft Recommend Enabling Full-Row or Before-After Image for Compatibility with Delta Lake MERGE?

    Yes, enabling a full-row image or before-after image in the upstream CDC engine would significantly improve compatibility with Delta Lake's MERGE INTO operation. This would ensure that your upsert logic correctly identifies target rows and applies the necessary changes. If feasible, this is highly recommended to simplify your CDC pipeline and avoid issues with partial updates.
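
    To illustrate why, here is a hedged sketch of what the MERGE can look like once each event carries a full after-image, the before-image key values, and an operation code (I/U/D). All table, column, and field names here are hypothetical:

        # Sketch (hypothetical payload shape): with a full after-image plus an op code,
        # the MERGE can key on the before-image values and apply inserts, updates,
        # and deletes in one statement.
        full_image = spark.table("bronze.customer_cdc_full")   # parsed full-image events
        full_image.createOrReplaceTempView("cdc_full")

        spark.sql("""
          MERGE INTO bronze.customer AS t
          USING cdc_full AS s
          ON  t.account_no  = s.before_account_no
          AND t.branch_code = s.before_branch_code
          WHEN MATCHED AND s.op = 'D' THEN DELETE
          WHEN MATCHED AND s.op = 'U' THEN UPDATE SET
              t.balance     = s.balance,
              t.status      = s.status,
              t.last_update = s.last_update
          WHEN NOT MATCHED AND s.op IN ('I', 'U') THEN INSERT
              (account_no, branch_code, balance, status, last_update)
              VALUES (s.account_no, s.branch_code, s.balance, s.status, s.last_update)
        """)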


    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

