How to handle an identifier being repeated on each page when training a custom extraction model?

Bogdan Pechounov 85 Reputation points
2025-08-07T12:45:23.4433333+00:00

We have documents that contain an identifier/number at the bottom right of each page. If we annotate it only on the first page, I am concerned that the model will treat the other pages as negative examples.

My understanding is that each word is classified and that there is a hidden "_other"/"none" class. Therefore, the model will classify "ID008" as "document_id" on the first page, but as "_other" on the other pages, which can cause confusion.


Is the order of pages important, or is each page relatively independent? (The pages might be out of order, but that is an edge case.)

Azure AI Document Intelligence

1 answer

  1. Pavankumar Purilla 10,350 Reputation points Microsoft External Staff Moderator
    2025-08-08T08:23:09.98+00:00

    Hi Bogdan Pechounov,
    In a custom extraction model, each page is processed independently during training, meaning the model does not retain context or memory from previous pages. If an identifier appears on every page but is annotated only on the first page, the model will learn to classify it as the target field on page one but as “_other” on subsequent pages, creating conflicting training signals and reducing accuracy. To avoid this, the identifier should be annotated consistently on all pages where it appears, even if it is repeated. This ensures the model always learns the correct label for that field, and any deduplication of repeated values can be handled later during post-processing. Page order is generally not important for training in this scenario, so even if pages are out of order, consistent labeling will still produce reliable results.
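The post-processing deduplication mentioned in the answer could look roughly like the sketch below. It assumes the per-page occurrences of the field have already been pulled out of the analysis result into a simple list; the `PageValue` container and the `consolidate` helper are hypothetical names for illustration, not part of the Document Intelligence SDK. The idea is a majority vote over normalized values, with summed confidence as the tie-breaker, so the outcome does not depend on page order.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageValue:
    """One extracted occurrence of the field (hypothetical container)."""
    page: int
    text: str
    confidence: float


def consolidate(values: List[PageValue]) -> Optional[str]:
    """Collapse repeated per-page values into a single value.

    Picks the most frequent normalized value; ties are broken by the
    higher summed confidence. Returns None if nothing was extracted.
    """
    if not values:
        return None
    counts = Counter()  # how many pages agree on each value
    scores = Counter()  # summed confidence per value
    for v in values:
        key = v.text.strip().upper()  # assumed normalization; adjust to the ID format
        counts[key] += 1
        scores[key] += v.confidence
    return max(counts, key=lambda k: (counts[k], scores[k]))


# Example: the same identifier read on three pages, one with an OCR misread.
pages = [
    PageValue(page=1, text="ID008", confidence=0.98),
    PageValue(page=2, text="ID008", confidence=0.95),
    PageValue(page=3, text="ID0O8", confidence=0.41),
]
print(consolidate(pages))  # ID008
```

A side benefit of labeling every page is that the repeated reads act as a cross-check: a single misread page gets outvoted instead of becoming the only value available.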

