How do custom lookback columns in Azure Data Explorer deduplication Materialized Views work?

01725609 85 Reputation points
2025-07-12T08:03:21.56+00:00

I was wondering if anybody has experience using lookback columns within deduplication Materialized Views in ADX.

Let's say we have minute-based records from the past 4 years (which means a lot of data). Each of those records has a custom datetime column.

Let's say I configure my custom datetime column as the lookback column, and set my lookback period to 2 minutes.
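
For context, I mean a view defined roughly like this (the table, view, and column names are placeholders, and I may have the exact property names slightly off):

.create async materialized-view with (
    backfill = true,
    lookback = 2m,
    lookback_column = "CustomDatetime"
) DedupMV on table RawEvents
{
    RawEvents
    | summarize take_any(*) by DuplicateDetectionHash, CustomDatetime
}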

When newly ingested records are added to the MV (e.g. records from last year):

Will my materialized view only check for duplicates (via the DuplicateDetectionHash) around the specific custom datetime values of the newly ingested records? In other words, will it avoid scanning the complete materialized part of the view for deduplication and instead only look at, e.g., the range from a record's custom datetime minus 2 minutes up to that datetime?

This would drastically reduce the compute needed for deduplication, because there would be billions of rows and an ingestion rate of around 10 GB per day.

Deduplication over a 2-minute range around the newly ingested records would only touch a small fraction of what is actually present in the database.

The documentation itself is unclear to me, and I wonder whether anybody has actual experience with this on large datasets.

To really test performance myself I would have to re-ingest a lot of data, which I would rather avoid if it is not necessary.

Thanks!

Azure Data Explorer
An Azure data analytics service for real-time analysis on large volumes of data streaming from sources including applications, websites, and internet of things devices.

Accepted answer
  1. Vinodh247 36,031 Reputation points MVP Volunteer Moderator
    2025-07-12T13:44:38.6566667+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    Yes, your understanding is correct, and this feature significantly reduces the amount of data scanned for deduplication, especially on large datasets and in historical backfill scenarios.

    1. Only a small time range (the lookback window) around the custom datetime of each ingested record is scanned for deduplication.
    2. This avoids scanning billions of rows and supports efficient deduplication even for massive datasets.
    3. Proper configuration of the custom datetime column and the hash function is essential.

    Core Idea of Lookback Column in Deduplication MVs:

    When you define a materialized view in ADX with deduplication (for example, keyed on a DuplicateDetectionHash column), the engine must check for duplicates to avoid materializing the same row again.

    By default, without a custom lookback column, any lookback window is measured against the records' ingestion time (ingestion_time()) rather than against a datetime value inside the data; and with no lookback at all, materialization has to compare new records against the entire existing view.

    When you define a custom lookback column, such as custom_event_datetime, the deduplication scope is shifted from ingestion time to your chosen datetime column.

    What Happens at Ingestion?

    • For each incoming record:
      • ADX reads the value in the lookback column (for example, custom_event_datetime).
      • It then searches the existing MV data for duplicates only within the lookback period before that custom datetime.

    So if your lookback column value for a record is:

    custom_event_datetime = 2023-07-11 13:00:00
    lookback = 2 minutes

    Then ADX will check only for duplicates between 12:58:00 and 13:00:00, not the entire MV dataset.
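
    Conceptually, the duplicate check for that one record only needs to consider a slice like the following (an illustration only, not what the engine literally runs internally; DedupMV is a placeholder view name):

    // Illustration: the slice of the materialized view that matters for this one record
    let incomingHash = "placeholder-for-the-new-record-hash";
    materialized_view("DedupMV")
    | where custom_event_datetime between (datetime(2023-07-11 12:58:00) .. datetime(2023-07-11 13:00:00))
    | where DuplicateDetectionHash == incomingHash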

    Implications for Your Use Case

    You have:

    • Historical minute-level data spanning 4 years

    • A custom datetime column on each record

    • A data ingestion rate of around 10 GB per day

    • An MV with deduplication and lookback = 2 minutes on the custom datetime column

    In this scenario:

    • If you ingest a batch of records from around 2023-07-11 13:00:00, deduplication will only scan the 2-minute window around those timestamps (i.e., a small subset of the data).

    It will not scan the entire 4-year materialized dataset.

    This drastically reduces the compute load during ingestion, making it very efficient for late-arriving data or historical backfills.


    Key Considerations

    Lookback Accuracy: Ensure the lookback column has sufficient granularity and accurate values. If it is skewed or inconsistent, deduplication can miss duplicates or become inefficient.

    Hash Quality: Choose the right DuplicateDetectionHash to reduce collisions and false positives. A good hash ensures deduplication is accurate.
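
    If the hash is not already computed at the source, one common approach (a sketch only; DeviceId and Value are assumed column names) is to derive it from the fields that define uniqueness, for example in an update policy or in the ingestion pipeline:

    // Sketch: derive a stable dedup hash from the fields that define a unique record
    RawEvents
    | extend DuplicateDetectionHash = hash_sha256(strcat(
        tostring(DeviceId), "|",
        tostring(custom_event_datetime), "|",
        tostring(Value)))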

    MV Refresh Strategy: When backfilling long historical ranges, batch your ingestion by the data's custom datetime so that each materialization cycle only touches a narrow time window, as shown in the sketch below.
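
    A minimal sketch of that idea, assuming the history sits in a staging table (StagingEvents, RawEvents, and the column name are placeholders):

    // Move one day of history at a time so each cycle's dedup window stays narrow
    .set-or-append RawEvents <|
        StagingEvents
        | where custom_event_datetime between (datetime(2023-07-11) .. datetime(2023-07-12))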

    Testing: To avoid re-ingesting all data for testing:
      • Ingest a representative sample of backfilled data (roughly 1 to 2 days' worth).
      • Inspect the view with the .show materialized-view commands and query the materialized data to observe deduplication behavior, for example:
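
    (DedupMV is a placeholder view name; the duplicate-count query is just one way to spot-check the result.)

    // Health and definition of the view
    .show materialized-view DedupMV

    // Any materialization failures (useful while experimenting with backfill)
    .show materialized-view DedupMV failures

    // Spot-check: within the sample window, no hash should appear more than once
    materialized_view("DedupMV")
    | where custom_event_datetime between (datetime(2023-07-11) .. datetime(2023-07-13))
    | summarize cnt = count() by DuplicateDetectionHash
    | where cnt > 1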

    Please 'Upvote' (thumbs-up) and 'Accept' as answer if the reply was helpful. This will benefit other community members who face the same issue.

    1 person found this answer helpful.
