How to handle an identifier being repeated on each page when training a custom extraction model?

Question

Bogdan Pechounov 85

We have documents that contain an identifier/number at the bottom right of each page. If we annotate it only on the first page, I am concerned that the model will treat the other pages as negative examples.

My understanding is that each word is classified and that there is a hidden "_other"/"none" class. Therefore, the model will classify "ID008" as "document_id" on the first page, but as "_other" on the other pages, which can cause confusion.

Is the order of pages important or is each page relatively independent? (the pages might not be in order, but this is an edge case)

Answer 1

Pavankumar Purilla 10,350 Microsoft External Staff Moderator

Hi Bogdan Pechounov,
In a custom extraction model, each page is processed independently during training, meaning the model does not retain context or memory from previous pages. If an identifier appears on every page but is annotated only on the first page, the model will learn to classify it as the target field on page one but as “_other” on subsequent pages, creating conflicting training signals and reducing accuracy. To avoid this, the identifier should be annotated consistently on all pages where it appears, even if it is repeated. This ensures the model always learns the correct label for that field, and any deduplication of repeated values can be handled later during post-processing. Page order is generally not important for training in this scenario, so even if pages are out of order, consistent labeling will still produce reliable results.
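
The post-processing deduplication mentioned above could look roughly like the sketch below. It assumes you have already collected one extracted value per page for the identifier field; the `PageValue` record, the `consolidate_identifier` helper, and the 0.5 confidence threshold are hypothetical names and values chosen for illustration, not part of the Document Intelligence SDK. How the trained model actually returns a field that was labeled on several pages is not covered in this thread, so treat the input shape as an assumption.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class PageValue(NamedTuple):
    """One extracted identifier candidate (hypothetical record, not an SDK type)."""
    page: int          # page number the value was read from
    text: str          # extracted identifier text, e.g. "ID008"
    confidence: float  # model confidence in [0, 1]


def consolidate_identifier(
    values: Iterable[PageValue], min_confidence: float = 0.5
) -> Optional[str]:
    """Collapse per-page identifier values into a single document-level value.

    Uses a majority vote over pages; ties are broken by the highest
    confidence observed for each candidate. Returns None if no value
    passes the confidence threshold.
    """
    # Drop empty strings and low-confidence reads before voting.
    candidates = [
        v for v in values if v.text.strip() and v.confidence >= min_confidence
    ]
    if not candidates:
        return None

    counts = Counter(v.text.strip() for v in candidates)

    # Track the best confidence seen for each distinct value (tie-breaker).
    best_conf: dict[str, float] = {}
    for v in candidates:
        key = v.text.strip()
        best_conf[key] = max(best_conf.get(key, 0.0), v.confidence)

    return max(counts, key=lambda k: (counts[k], best_conf[k]))


if __name__ == "__main__":
    per_page = [
        PageValue(page=1, text="ID008", confidence=0.98),
        PageValue(page=2, text="ID008", confidence=0.95),
        PageValue(page=3, text="ID0O8", confidence=0.61),  # OCR misread on one page
    ]
    print(consolidate_identifier(per_page))
```

A majority vote is only one option. If the pages of different documents can be mixed together, grouping by identifier instead of collapsing to a single value may be more appropriate.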

Bogdan Pechounov 85 Reputation points

2025-08-08T13:09:54.3566667+00:00

@Pavankumar Purilla Thank you for your response. Is it possible to annotate a field on different pages? I seem to be getting an error: "Sorry, we don't support cross-page labeling with the same field. You have label regions with same field name testing across 2 pages."

Pavankumar Purilla 10,350 Reputation points Microsoft External Staff Moderator

2025-08-11T05:05:03.4866667+00:00

Hi Bogdan Pechounov,
It’s not possible to create a single annotation that spans multiple pages in the Document Intelligence labeling tool; each page is treated as a separate, self-contained unit for labeling. If a field appears on multiple pages, you’ll need to annotate it individually on each page using the same field name. This way, the model still learns that the same type of field can appear on different pages, but each annotation is tied to the page it’s on. The error you’re seeing occurs when trying to create one continuous region for a field that crosses a page boundary, which isn’t supported. Instead, add separate label regions on each page for that field; they can share the same field name without issue, as long as each region stays within a single page.

Pavankumar Purilla 10,350 Reputation points Microsoft External Staff Moderator

2025-08-12T04:03:43.0433333+00:00

Hi Bogdan Pechounov,
Did you get a chance to check the response? Thank you!
