Training a Custom Extraction Model to recognize special characters
I am trying to extract lexical information from scanned dictionary definitions such as:
Azure AI Document Intelligence's OCR works pretty well out of the box, but it has trouble with uncommon characters that appear frequently in my text, especially bullet points (often mistaken for periods or dashes) and pronunciation symbols (e.g. schwas and stress marks).
I labeled the keyword, pronunciation, and part of speech for each definition, and carefully edited the .ocr.json and .labels.json files for accuracy.
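To rule out an encoding problem on my side, I spot-check the edited .labels.json files with a small script that lists every non-ASCII character actually stored in the label text. (The file fragment below is a simplified, hypothetical stand-in for my real files, and I'm assuming the `labels` → `value` → `text` nesting here matches the real schema.)

```python
import json
import unicodedata

def check_labels(labels_json_text):
    """Count the non-ASCII characters in the text values of a
    .labels.json document and report their Unicode names."""
    data = json.loads(labels_json_text)
    found = {}
    for label in data.get("labels", []):
        for value in label.get("value", []):
            for ch in value.get("text", ""):
                if not ch.isascii():
                    found[ch] = found.get(ch, 0) + 1
    return {ch: (unicodedata.name(ch, "UNKNOWN"), n)
            for ch, n in found.items()}

# Simplified, hypothetical labels.json fragment
sample = json.dumps({
    "labels": [
        {"label": "pronunciation",
         "value": [{"text": "ə-ˈbīd"}, {"text": "kəm-ˈplēt′"}]}
    ]
})
print(check_labels(sample))
```

In my files the schwa shows up as `LATIN SMALL LETTER SCHWA` (U+0259) and the stress mark as `PRIME` (U+2032), so the training data itself seems to be encoded correctly.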
Here is the resulting training data set: https://seattlephysicstutor.com/documents/training-data.zip
The words appear correctly in the Document Intelligence Studio sidebar.
I click "Train" and am told that the process completed successfully. However, no matter how many variations I include in my training set (I've tried full pages, dictionary entries like the one above, and keyword/pronunciation pairs), the result is never any better than the default (untrained) OCR.
For example, if I test another dictionary entry, I might get something like this:
It seems like the model has completely ignored the training data. In particular, it never recognizes a single schwa (ə) or stress mark (′).
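As a stopgap I've been post-processing the recognized pronunciation strings to map the ASCII look-alikes the OCR emits back to the phonetic symbols. This is only a rough sketch, and the substitution rules are my own guesses about which characters get confused:

```python
import re

def clean_pronunciation(s: str) -> str:
    """Map ASCII stand-ins in OCR output back to phonetic symbols.
    The rules below are guesses based on the errors I've seen."""
    # Apostrophe or backtick standing in for the prime stress mark
    s = re.sub(r"['`]", "\u2032", s)
    # A lone leading period or hyphen that was actually a bullet point
    s = re.sub(r"^[.\-]\s", "\u2022 ", s)
    return s

print(clean_pronunciation("kəm-'plēt"))  # apostrophe becomes a prime
```

This fixes the stress marks, but there's no reliable rule for recovering a schwa from whatever character the OCR substituted for it, which is why I'd much rather get the recognition itself to improve.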
It isn't especially important to me whether the labeling correctly identifies the parts of each entry (I can parse that on my own later if necessary), but I do need better accuracy on the individual characters. Am I doing something wrong? Or are special characters something that custom models are not designed to handle?