Training a Custom Extraction Model to recognize special characters
I am trying to extract lexical information from scanned dictionary definitions such as:
Azure AI Document Intelligence's OCR works pretty well out of the box, but it has trouble with uncommon characters that appear frequently in my text, especially bullet points (often mistaken for periods or dashes) and pronunciation symbols (e.g. schwas and stress marks).
I labeled the keyword, pronunciation, and part of speech for each definition, and carefully edited the .ocr.json and .labels.json files for accuracy.
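To rule out an encoding problem on my side, I spot-check the edited .labels.json files with a small script that lists every non-ASCII character actually stored in the label text. (The file fragment below is a simplified, hypothetical stand-in for my real files, and I'm assuming the `labels` → `value` → `text` nesting here matches the real schema.)

```python
import json
import unicodedata

def check_labels(labels_json_text):
    """Count the non-ASCII characters in the text values of a
    .labels.json document and report their Unicode names."""
    data = json.loads(labels_json_text)
    found = {}
    for label in data.get("labels", []):
        for value in label.get("value", []):
            for ch in value.get("text", ""):
                if not ch.isascii():
                    found[ch] = found.get(ch, 0) + 1
    return {ch: (unicodedata.name(ch, "UNKNOWN"), n)
            for ch, n in found.items()}

# Simplified, hypothetical labels.json fragment
sample = json.dumps({
    "labels": [
        {"label": "pronunciation",
         "value": [{"text": "ə-ˈbīd"}, {"text": "kəm-ˈplēt′"}]}
    ]
})
print(check_labels(sample))
```

In my files the schwa shows up as `LATIN SMALL LETTER SCHWA` (U+0259) and the stress mark as `PRIME` (U+2032), so the training data itself seems to be encoded correctly.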
Here is the resulting training data set: https://seattlephysicstutor.com/documents/training-data.zip
The words appear correctly in the Document Intelligence Studio sidebar.
I click "Train" and am told that the process completed successfully. However, no matter how many variations I include in my training set (I've tried full pages, dictionary entries like the one above, and keyword/pronunciation pairs), the result is never any better than the default (untrained) OCR.
For example, if I test another dictionary entry, I might get something like this:
It seems like the model has completely ignored the training data. In particular, it never recognizes a single schwa (ə) or stress mark (′).
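As a stopgap I've been post-processing the recognized pronunciation strings to map the ASCII look-alikes the OCR emits back to the phonetic symbols. This is only a rough sketch, and the substitution rules are my own guesses about which characters get confused:

```python
import re

def clean_pronunciation(s: str) -> str:
    """Map ASCII stand-ins in OCR output back to phonetic symbols.
    The rules below are guesses based on the errors I've seen."""
    # Apostrophe or backtick standing in for the prime stress mark
    s = re.sub(r"['`]", "\u2032", s)
    # A lone leading period or hyphen that was actually a bullet point
    s = re.sub(r"^[.\-]\s", "\u2022 ", s)
    return s

print(clean_pronunciation("kəm-'plēt"))  # apostrophe becomes a prime
```

This fixes the stress marks, but there's no reliable rule for recovering a schwa from whatever character the OCR substituted for it, which is why I'd much rather get the recognition itself to improve.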
It isn't especially important to me whether the labeling correctly identifies the parts of each entry (I can parse that on my own later if necessary), but I do need better accuracy on the individual characters. Am I doing something wrong? Or are special characters something that custom models are not designed to handle?