Hi Bogdan Pechounov,
In a custom extraction model, each page is processed independently during training, so the model retains no context from previous pages. If an identifier appears on every page but is annotated only on the first, the model learns to label it as the target field on page one and as “_other” on every other page. Those conflicting training signals reduce accuracy.
To avoid this, annotate the identifier on every page where it appears, even though the value is repeated. The model then always sees the correct label for that field, and repeated values can be deduplicated later in post-processing. Page order is generally not important for training in this scenario, so consistent labeling should still produce reliable results even if pages are out of order.
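As a rough illustration of the post-processing deduplication mentioned above, the sketch below collapses per-page identifier values into a single document-level value. `PageField` and `dedupe_identifier` are hypothetical names, and the (page, value, confidence) shape is an assumption, not the service's actual response format. Picking the value with the highest summed confidence is only one possible rule.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional, Tuple


class PageField(NamedTuple):
    """One extracted identifier on one page (hypothetical shape, not an SDK type)."""
    page: int
    value: str
    confidence: float


def dedupe_identifier(fields: List[PageField]) -> Optional[Tuple[str, List[int]]]:
    """Collapse repeated per-page values into one document-level identifier.

    Returns (value, sorted pages where that value was seen), or None when
    nothing was extracted. The winner is the value with the highest summed
    confidence, so a single misread page does not override the others.
    """
    if not fields:
        return None

    totals = defaultdict(float)  # value -> summed confidence
    pages = defaultdict(list)    # value -> pages where it was seen
    for field in fields:
        key = field.value.strip()
        totals[key] += field.confidence
        pages[key].append(field.page)

    best = max(totals, key=totals.get)
    return best, sorted(pages[best])


# Pages are deliberately out of order; the result does not depend on order.
extracted = [
    PageField(page=1, value="ID008", confidence=0.98),
    PageField(page=3, value="ID008", confidence=0.95),
    PageField(page=2, value="ID0O8", confidence=0.40),  # simulated OCR misread
]
print(dedupe_identifier(extracted))  # ('ID008', [1, 3])
```

Returning the page list alongside the value makes it easy to flag documents where the identifier was missing or misread on some pages.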
How to handle an identifier being repeated on each page when training a custom extraction model?
Bogdan Pechounov 85 Reputation points
We have documents that contain an identifier/number at the bottom right of each page. If we annotate it only on the first page, I am concerned that the model will treat the other pages as negative examples.
My understanding is that each word is classified and that there is a hidden "_other"/"none" class. Therefore, the model will classify "ID008" as "document_id" on the first page, but as "_other" on the other pages, which can cause confusion.
Is the order of pages important, or is each page relatively independent? (The pages might not be in order, but this is an edge case.)
Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.
1 answer
Pavankumar Purilla 10,350 Reputation points Microsoft External Staff Moderator
2025-08-08T08:23:09.98+00:00