Hi,
Thanks for reaching out to Microsoft Q&A.
Yes, your understanding is correct, and this feature significantly reduces the amount of data scanned for deduplication, especially for large datasets with historical backfill scenarios.
- Only a small time range (the lookback) around the custom datetime of each ingested record is scanned for deduplication.
- This avoids scanning billions of rows and keeps deduplication efficient even for massive datasets.
- Proper configuration of the custom datetime column and the hash function is essential.
Core Idea of Lookback Column in Deduplication MVs:
When you define a materialized view in ADX with deduplication (i.e., using a `DuplicateDetectionHash`), the engine must check for duplicates to avoid materializing the same row again.
By default, without a lookback column, the system looks back over a fixed time window (for example, the last 5 minutes) based on ingestion time.
When you define a custom lookback column, such as `custom_event_datetime`, the deduplication scope shifts from ingestion time to your chosen datetime column.
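For reference, here is a minimal sketch of such a view definition. The table name `Events`, the view name `DedupEvents`, and the exact property names (`lookback`, `lookback_column`) are assumptions; please verify them against the current `.create materialized-view` documentation and adapt them to your schema:

```kusto
// Minimal sketch (assumed names): a dedup materialized view that keeps one row
// per hash and restricts the dedup scan to a 2-minute lookback measured on the
// custom datetime column instead of ingestion time.
.create materialized-view with (
    lookback = time(2m),                          // how far back to scan for duplicates
    lookback_column = "custom_event_datetime"     // measure the lookback on this column
) DedupEvents on table Events
{
    Events
    | summarize take_any(*) by DuplicateDetectionHash, custom_event_datetime
}
```

The deduplication itself comes from `take_any(*)` grouped by the dedup key; the lookback settings only bound how far back existing materialized rows are compared during materialization.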
What Happens at Ingestion?
- For each incoming record:
  - ADX reads the value in the lookback column (for example, `custom_event_datetime`).
  - It then searches the existing MV data for duplicates only within the lookback period before that custom datetime.
So if the lookback column value for a record is `custom_event_datetime = 2023-07-11 13:00:00` and the lookback is 2 minutes, then ADX will check for duplicates only between 12:58:00 and 13:00:00, not across the entire MV dataset.
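As a quick standalone illustration of that window arithmetic (an ordinary query, not part of the view definition):

```kusto
// The dedup scan window for a record with custom_event_datetime = 2023-07-11 13:00:00
// and a 2-minute lookback: only rows between 12:58:00 and 13:00:00 are compared.
print record_time  = datetime(2023-07-11 13:00:00),
      window_start = datetime(2023-07-11 13:00:00) - 2m,
      window_end   = datetime(2023-07-11 13:00:00)
```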
Implications for Your Use Case
You have:
- Historical minute-level data spanning 4 years
- A custom datetime on each record
- A data ingestion rate of around 10 GB per day
- An MV with deduplication and a lookback of 2 minutes on the custom datetime
In this scenario:
- If you ingest a batch of records from 2023-07-11 13:00:00, deduplication will only scan the 2-minute window around those timestamps (i.e., a small subset of the data).
- It will not scan the entire 4-year materialized dataset.
- This drastically reduces the compute load during ingestion, making it very efficient for late-arriving data or historical backfills.
Key Considerations
- Lookback Accuracy: Ensure the lookback column has sufficient granularity and accurate values. If it is skewed or inconsistent, deduplication can miss duplicates or become inefficient.
- Hash Quality: Choose the right `DuplicateDetectionHash` to reduce collisions and false positives. A good hash keeps deduplication accurate.
- MV Refresh Strategy: For long backfills, batch your ingestion by the data's custom datetime to avoid unnecessary refreshes.
- Testing: To avoid re-ingesting all data for testing:
  - Ingest a representative sample of backfilled data (about 1 to 2 days' worth).
  - Enable diagnostics or query the MV using `.show materialized-view` and ingestion metadata to observe deduplication behavior; see the sketch after this list.
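For example, a few checks along those lines (`DedupEvents` and `Events` are the placeholder names from the sketch above; adapt them to your environment and run each statement separately):

```kusto
// Health and last-materialization details of the view.
.show materialized-view DedupEvents

// Recent materialization failures, if any.
.show materialized-view DedupEvents failures

// Compare raw vs. deduplicated row counts for the sampled window.
let sampleStart = datetime(2023-07-11);
let sampleEnd   = datetime(2023-07-13);
union
    (Events      | where custom_event_datetime between (sampleStart .. sampleEnd) | summarize Rows = count() | extend Source = "raw"),
    (DedupEvents | where custom_event_datetime between (sampleStart .. sampleEnd) | summarize Rows = count() | extend Source = "deduplicated")
```

If the deduplicated count for the sampled window is noticeably lower than the raw count, the lookback-based deduplication is working on that slice.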
Please 'Upvote' (thumbs up) and 'Accept' the answer if the reply was helpful. This will benefit other community members who face the same issue.