Kafka CDC Topic Does Not Include Primary Key or Full Row

Question

Janice Chi 580

In our current Azure migration project, we are processing CDC events from IBM DB2 into Kafka using the IBM CDC engine. However, for certain DB2 tables:

There is no primary key defined in the source table.

Kafka CDC topics do not include a full-row image — they only contain the changed columns.

Kafka payloads also do not include "before" values (no before-after image structure).

This poses a challenge in our Bronze layer (Databricks) during the MERGE INTO operation, since:

We cannot identify which row in the target the change belongs to.

Row-level hash logic fails, because the incoming CDC record is partial (only 2–3 columns out of 25), and does not match with existing full rows.

Given these limitations:

Is there any supported approach in Databricks/Azure ecosystem to handle CDC merges in the absence of both PK and full-row image?

Can we reconstruct the row using metadata or any helper logic in Azure (e.g., from Hyperscale or Delta history)?

If not, is append-only ingestion the only valid fallback (with timestamp partitioning + data expiry)?

Does Microsoft recommend enabling full-row image or before-after image in upstream CDC engine for better compatibility with Delta Lake MERGE?

We want to ensure correctness while avoiding unnecessary duplicate records or lost updates.

Venkat Reddy Navari 5,255 Reputation points Microsoft External Staff Moderator

2025-07-29T16:02:53.3366667+00:00

Janice Chi, we haven’t heard back from you on the last response and are checking whether you have found a resolution yet. If you have, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.

Answer 1

Hi Janice Chi, it looks like you’re running into a common issue with CDC (Change Data Capture) events, especially when you don’t have a primary key or full-row images. Here are a few suggestions and resources that might help you handle CDC merges more effectively in the Azure Databricks environment.

Handling CDC Merges without PK or Full Row Image in Databricks/Azure:

When there’s no primary key or full-row image, Delta Lake's MERGE INTO can still work, but it needs a match condition that identifies the target row, and you’ll need additional logic to handle the partial columns in the CDC records. One possible approach is to:

Use Delta Lake’s MERGE INTO for the upsert. Because the events don't carry full rows, combine each partial event with the existing target row (or another source of the full row, such as historical data or metadata), updating only the columns present in the event.
If your CDC events are partitioned by timestamp, you could also use Delta's table history (time travel) to look up earlier versions of the target and stitch the partial updates together in order, although history alone cannot tell you which row a key-less event belongs to.
You can get more details on using MERGE with Delta Lake here: https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/delta-merge-into?source=recommendations
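
As a rough illustration of the "combine partial data with the existing row" idea, here is a minimal plain-Python sketch, not Databricks code. It assumes you can agree on some candidate key with the source team (the `key_columns` argument); the function name `apply_partial_changes` and the dictionary-based row format are hypothetical and only for illustration. In Databricks the same logic would typically live in the WHEN MATCHED THEN UPDATE clause of MERGE INTO.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Key = Tuple[Any, ...]


def apply_partial_changes(
    target: Dict[Key, Row],
    events: Iterable[Row],
    key_columns: Tuple[str, ...],
) -> Tuple[Dict[Key, Row], List[Row]]:
    """Overlay changed columns from each event onto the matching full row.

    Assumption: key_columns is a candidate key that every usable event carries.
    Events missing any key column cannot be matched, so they are returned
    separately instead of being applied to a guessed row.
    """
    merged = {key: dict(row) for key, row in target.items()}  # leave input untouched
    unmatched: List[Row] = []
    for event in events:  # events are assumed to arrive in commit order
        if any(col not in event for col in key_columns):
            unmatched.append(event)
            continue
        key = tuple(event[col] for col in key_columns)
        if key in merged:
            # Only columns present in the event are overwritten. An explicit
            # None still overwrites, so "changed to NULL" is not lost.
            merged[key].update(event)
        else:
            # Assumption: an event for an unknown key is treated as an insert.
            merged[key] = dict(event)
    return merged, unmatched
```

For example, applying `{"id": 1, "city": "y"}` to a stored row `{"id": 1, "name": "a", "city": "x"}` changes only `city`, while an event without `id` ends up in the unmatched list for manual handling. Checking for column presence rather than using COALESCE is a deliberate choice here, since COALESCE cannot tell "column not changed" apart from "column set to NULL".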

Can We Reconstruct Rows Using Metadata or Helper Logic in Azure?

Azure doesn’t directly provide built-in tools to reconstruct rows without primary keys or full-row images. However, you can make use of partitioning and timestamp-based strategies to help with reconstructing data. For instance, using Azure Data Factory with proper transformation logic could help, but you still won’t have the full accuracy that primary keys or row images would provide.

Partitioning in Delta Lake: Use partitioning strategies to ingest and process CDC data incrementally. This is particularly useful for append-only ingestion.

Is Append-Only Ingestion the Valid Fallback?

If you’re not able to get full-row images or PKs, append-only ingestion could be your fallback. This would involve partitioning your data (e.g., by timestamp) and using data expiry to manage older records. However, be mindful of potential duplicates, so you’ll need some kind of deduplication logic.
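
The append-only fallback with deduplication and expiry could look roughly like the sketch below. It is plain Python rather than Spark code, and it assumes each ingested record keeps its Kafka coordinates (topic, partition, offset) plus an ingestion timestamp; the field names `topic`, `partition`, `offset`, and `ingested_at`, and both function names, are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def append_deduplicated(log: List[Record], batch: Iterable[Record]) -> List[Record]:
    """Append a batch, skipping records whose Kafka coordinates were already stored.

    (topic, partition, offset) identifies a Kafka record, so a redelivered
    message is dropped even though the table has no primary key.
    """
    seen = {(r["topic"], r["partition"], r["offset"]) for r in log}
    result = list(log)
    for record in batch:
        coords = (record["topic"], record["partition"], record["offset"])
        if coords not in seen:
            seen.add(coords)
            result.append(record)
    return result


def expire(log: List[Record], now: datetime, retention: timedelta) -> List[Record]:
    """Keep only records ingested within the retention window (data expiry)."""
    cutoff = now - retention
    return [r for r in log if r["ingested_at"] >= cutoff]
```

This only removes re-delivered Kafka messages. Two distinct events that happen to carry the same changed values are both kept, which is usually what you want in an append-only change log.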

For best practices around handling Delta tables in these scenarios, you can refer to Delta Lake Best Practices.

Does Microsoft Recommend Enabling Full-Row or Before-After Image for Compatibility with Delta Lake MERGE?

Yes, enabling a full-row image or before-after image in the upstream CDC engine would significantly improve compatibility with Delta Lake's MERGE INTO operation. This would ensure that your upsert logic correctly identifies target rows and applies the necessary changes. If feasible, this is highly recommended to simplify your CDC pipeline and avoid issues with partial updates.

I hope this information helps. Please do let us know if you have any further queries.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
