Hi,
Thanks for reaching out to Microsoft Q&A.
Yes, your understanding is correct, and this feature significantly reduces the amount of data scanned for deduplication, especially for large datasets with historical backfill scenarios.
- Only a small time range (the lookback) around the custom datetime of each ingested record is scanned for deduplication.
- This avoids scanning billions of rows and keeps deduplication efficient even for massive datasets.
- Proper configuration of the custom datetime column and the hash function is essential.
Core Idea of Lookback Column in Deduplication MVs:
When you define a materialized view in ADX with deduplication (i.e., using a `DuplicateDetectionHash`), the engine must check for duplicates to avoid materializing the same row again.
By default, without a lookback column, the system looks back over a fixed time window (for example, the last 5 minutes) based on ingestion time.
When you define a custom lookback column, such as `custom_event_datetime`, the deduplication scope shifts from ingestion time to your chosen datetime column.
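For reference, here is a minimal sketch of such a view definition. The table name `Events`, the view name `DedupEvents`, and the exact property names (`lookback`, `lookback_column`) are assumptions; please verify them against the current `.create materialized-view` documentation and adapt them to your schema:

```kusto
// Minimal sketch (assumed names): a dedup materialized view that keeps one row
// per hash and restricts the dedup scan to a 2-minute lookback measured on the
// custom datetime column instead of ingestion time.
.create materialized-view with (
    lookback = time(2m),                          // how far back to scan for duplicates
    lookback_column = "custom_event_datetime"     // measure the lookback on this column
) DedupEvents on table Events
{
    Events
    | summarize take_any(*) by DuplicateDetectionHash, custom_event_datetime
}
```

The deduplication itself comes from `take_any(*)` grouped by the dedup key; the lookback settings only bound how far back existing materialized rows are compared during materialization.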
What Happens at Ingestion?
- For each incoming record:
  - ADX reads the value in the lookback column (for example, `custom_event_datetime`).
  - It then searches the existing MV data for duplicates only within the lookback period before that custom datetime.
So if the lookback column value for a record is `custom_event_datetime = 2023-07-11 13:00:00` and the lookback is 2 minutes, then ADX will check for duplicates only between 12:58:00 and 13:00:00, not across the entire MV dataset.
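As a quick standalone illustration of that window arithmetic (an ordinary query, not part of the view definition):

```kusto
// The dedup scan window for a record with custom_event_datetime = 2023-07-11 13:00:00
// and a 2-minute lookback: only rows between 12:58:00 and 13:00:00 are compared.
print record_time  = datetime(2023-07-11 13:00:00),
      window_start = datetime(2023-07-11 13:00:00) - 2m,
      window_end   = datetime(2023-07-11 13:00:00)
```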
Implications for Your Use Case
You have:
- Historical minute-level data spanning 4 years
- A custom datetime on each record
- A data ingestion rate of around 10 GB per day
- An MV with deduplication and a lookback of 2 minutes on the custom datetime
In this scenario:
- If you ingest a batch of records from 2023-07-11 13:00:00, deduplication will only scan the 2-minute window around those timestamps (i.e., a small subset of the data).
- It will not scan the entire 4-year materialized dataset.
- This drastically reduces the compute load during ingestion, making it very efficient for late-arriving data or historical backfills.
Key Considerations
- Lookback Accuracy: Ensure the lookback column has sufficient granularity and accurate values. If it is skewed or inconsistent, deduplication can miss duplicates or become inefficient.
- Hash Quality: Choose the right `DuplicateDetectionHash` to reduce collisions and false positives. A good hash keeps deduplication accurate.
- MV Refresh Strategy: For long backfills, batch your ingestion by the data's custom datetime to avoid unnecessary refreshes.
- Testing: To avoid re-ingesting all data for testing:
  - Ingest a representative sample of backfilled data (about 1 to 2 days' worth).
  - Enable diagnostics or query the MV using `.show materialized-view` and ingestion metadata to observe deduplication behavior; see the sketch after this list.
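For example, a few checks along those lines (`DedupEvents` and `Events` are the placeholder names from the sketch above; adapt them to your environment and run each statement separately):

```kusto
// Health and last-materialization details of the view.
.show materialized-view DedupEvents

// Recent materialization failures, if any.
.show materialized-view DedupEvents failures

// Compare raw vs. deduplicated row counts for the sampled window.
let sampleStart = datetime(2023-07-11);
let sampleEnd   = datetime(2023-07-13);
union
    (Events      | where custom_event_datetime between (sampleStart .. sampleEnd) | summarize Rows = count() | extend Source = "raw"),
    (DedupEvents | where custom_event_datetime between (sampleStart .. sampleEnd) | summarize Rows = count() | extend Source = "deduplicated")
```

If the deduplicated count for the sampled window is noticeably lower than the raw count, the lookback-based deduplication is working on that slice.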
Please 'Upvote' (thumbs up) and 'Accept' the answer if the reply was helpful. This will benefit other community members who face the same issue.