How to differentiate text and diagrams/images from a scanned PDF and crop the images?

Question

Riddhi Patel 30

I'm working with scanned PDF files and want to process them using computer vision techniques to:

Differentiate between text and diagrams/images.

Accurately detect and crop the images/diagrams from the PDF.

Optionally, keep track of the position of each image so that I can later replace it in the text with a reference or URL.

I want to do this using vision-based approaches, not just OCR like Tesseract (which only gives me the text). Are there any proven methods, models, or open-source tools (in Python or any language) that can help identify and extract visual (non-text) elements from a scanned PDF?

Any insights or code samples would be really helpful!

3 answers

Answer 1

Hi there,

Great question! Differentiating text vs diagrams/images in scanned PDFs using Azure AI Document Intelligence (formerly Form Recognizer) depends on the approach you use.

🧠 Option 1: Layout Model (Prebuilt)

The Layout model can analyze scanned PDFs and images to extract:

Text (lines, words, tables)

Bounding box coordinates for each line or word

Information about selection marks and reading order

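However, older API versions do not directly tag images or diagrams (newer ones, such as v4.0 "2024-11-30", add a "figures" collection with bounding regions, so check the version you target). You can still infer non-text regions (i.e., diagrams/images) by looking for large page areas that contain no extracted text. Blank margins and whitespace match this test too, so verify each candidate against the page pixels before cropping.

👉 Use this API (a POST request that also needs an "api-version" query parameter):

https://<endpoint>/formrecognizer/documentModels/prebuilt-layout:analyze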
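A minimal sketch of calling this endpoint from Python with only the standard library, in case it helps. The API version "2023-07-31", the function names, and the polling interval are assumptions for illustration; the endpoint and key are placeholders you would supply. The service replies to the POST with an Operation-Location header, which is then polled until the analysis finishes.

```python
import json
import time
import urllib.request

# Assumption: a GA API version that serves the /formrecognizer path.
API_VERSION = "2023-07-31"


def build_analyze_url(endpoint, model_id="prebuilt-layout", api_version=API_VERSION):
    """Build the analyze URL for a resource endpoint (hypothetical helper)."""
    return (
        f"{endpoint.rstrip('/')}/formrecognizer/documentModels/"
        f"{model_id}:analyze?api-version={api_version}"
    )


def analyze_layout(endpoint, key, pdf_bytes, poll_seconds=2.0):
    """Submit a PDF to the Layout model and poll until the operation ends."""
    submit = urllib.request.Request(
        build_analyze_url(endpoint),
        data=pdf_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/pdf",
        },
        method="POST",
    )
    with urllib.request.urlopen(submit) as response:
        # The service accepts the job and tells us where to fetch the result.
        operation_url = response.headers["Operation-Location"]

    while True:
        poll = urllib.request.Request(
            operation_url, headers={"Ocp-Apim-Subscription-Key": key}
        )
        with urllib.request.urlopen(poll) as response:
            body = json.load(response)
        if body["status"] in ("succeeded", "failed"):
            return body  # on success, pages and lines are under body["analyzeResult"]
        time.sleep(poll_seconds)
```

The text line polygons used to infer non-text regions would then be read from the pages in the "analyzeResult" part of the returned JSON.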

🧠 Option 2: Custom Neural Model with Image Classifier (Hybrid Approach)

If you want to go further:

Use Azure Document Intelligence for text extraction, and

Use Azure Computer Vision or Custom Vision to classify the remaining image areas (diagrams, logos, illustrations, etc.)

For example:

Use Document Intelligence to get page layout and text bounding boxes.

Use that info to crop the non-text zones.

Send those cropped areas to Azure Vision APIs to classify them as diagrams or images.
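The first two steps above can be sketched roughly as follows, assuming the text line polygons arrive as flat [x1, y1, x2, y2, ...] lists in inches (the unit Document Intelligence reports for PDF input). All names, the 1.5 inch gap threshold, and the 200 DPI render resolution are hypothetical choices for illustration. The sketch only finds full-width bands between text lines, so a blank margin also qualifies and should be checked against the page pixels before it is sent for classification.

```python
import math
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def polygon_to_box(polygon):
    """Collapse a flat [x1, y1, x2, y2, ...] polygon into its bounding box."""
    xs, ys = polygon[0::2], polygon[1::2]
    return Box(min(xs), min(ys), max(xs), max(ys))


def candidate_bands(text_boxes, page_width, page_height, min_gap):
    """Return full-width bands with no text that are at least min_gap tall."""
    bands = []
    cursor = 0.0  # lowest bottom edge of the text seen so far
    for box in sorted(text_boxes, key=lambda b: b.top):
        if box.top - cursor >= min_gap:
            bands.append(Box(0.0, cursor, page_width, box.top))
        cursor = max(cursor, box.bottom)
    if page_height - cursor >= min_gap:
        bands.append(Box(0.0, cursor, page_width, page_height))
    return bands


def to_pixels(box, dpi):
    """Convert a box in inches to an integer (left, top, right, bottom) pixel box."""
    return (
        math.floor(box.left * dpi),
        math.floor(box.top * dpi),
        math.ceil(box.right * dpi),
        math.ceil(box.bottom * dpi),
    )


def crop_plan(page_number, line_polygons, page_width, page_height, dpi=200, min_gap=1.5):
    """List candidate regions with an id and position for later replacement."""
    text_boxes = [polygon_to_box(p) for p in line_polygons]
    bands = candidate_bands(text_boxes, page_width, page_height, min_gap)
    return [
        {
            "id": f"page{page_number}-region{index}",
            "page": page_number,
            "box_inches": band,
            "box_pixels": to_pixels(band, dpi),
        }
        for index, band in enumerate(bands, start=1)
    ]


# Made-up example: two text lines near the top, one line lower down.
lines = [
    [1.0, 1.0, 7.5, 1.0, 7.5, 1.2, 1.0, 1.2],
    [1.0, 1.3, 7.5, 1.3, 7.5, 1.5, 1.0, 1.5],
    [1.0, 5.0, 7.5, 5.0, 7.5, 5.2, 1.0, 5.2],
]
plan = crop_plan(1, lines, page_width=8.5, page_height=11.0)
```

Each "box_pixels" tuple is in the (left, upper, right, lower) order that Pillow's Image.crop expects, provided the page was rendered to an image at the same DPI. The "id" and "page" fields keep track of where each crop came from, so the region can later be replaced in the text with a reference or URL.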

🧠 Option 3: Use Page Content Tags (if using PDF SDK or AI Indexing)

Some advanced pipelines (like Azure AI Search + Cognitive Skills) allow "image content detection" by chaining:

OCR Skill

Layout Skill

Image Analysis Skill

This may help tag and extract diagram zones, especially in scanned engineering or academic documents.

🧪 Tip:

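When working with scanned PDFs, make sure the scan is legible (good resolution, minimal skew). The Layout model runs its own OCR, so the PDF does not need an embedded text layer. Where the API version supports it, "readingOrder": "natural" only changes the order in which text lines are returned.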

Let me know if you'd like a working Python or REST sample showing how to extract and infer these regions. And if this helps, please click “Accept Answer” so others can benefit too 😊

Best Regards,

Jerald Felix
