How to differentiate text and diagrams/images from a scanned PDF and crop the images?

Riddhi Patel 30 Reputation points
2025-07-06T06:38:51.5133333+00:00

I'm working with scanned PDF files and want to process them using computer vision techniques to:

Differentiate between text and diagrams/images.

Accurately detect and crop the images/diagrams from the PDF.

Optionally, keep track of the position of each image so that I can later replace it in the text with a reference or URL.

I want to do this using vision-based approaches, not just OCR like Tesseract (which only gives me the text). Are there any proven methods, models, or open-source tools (in Python or any language) that can help identify and extract visual (non-text) elements from a scanned PDF?

Any insights or code samples would be really helpful!

Computer Vision

3 answers

  1. Jerald Felix 4,450 Reputation points
    2025-07-06T14:09:57.7+00:00

    Hi there,

    Great question! You can separate text from diagrams/images in scanned PDFs with Azure AI Document Intelligence (formerly Form Recognizer), and how you do it depends on the approach you choose.

    🧠 Option 1: Layout Model (Prebuilt)

    The Layout model can analyze scanned PDFs and images to extract:

    Text (lines, words, tables)

    Bounding box coordinates for each line or word

    Information about selection marks and reading order

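    However, in older API versions the Layout model does not directly tag images or diagrams. There you can infer non-text regions (i.e., diagrams/images) by looking for large page areas that contain visible content but no recognized text. Newer API versions (v4.0, `2024-11-30`) additionally return a `figures` collection with bounding regions for detected figures.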
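
    The inference idea above (treat ink that falls outside every text bounding box as a candidate diagram) can be sketched roughly as follows. This is a minimal pure-Python illustration, not a production method: the binarized `ink` grid, the pixel-space `text_boxes`, and the `min_area` threshold are all assumptions made for the example.

```python
from collections import deque


def find_non_text_regions(ink, text_boxes, min_area=4):
    """Return (left, top, right, bottom) boxes of ink lying outside text boxes.

    ink        : 2D list of 0/1 values, 1 = dark pixel (assumed binarized page)
    text_boxes : (left, top, right, bottom) pixel boxes, right/bottom exclusive
    min_area   : drop components whose bounding box covers fewer pixels
    """
    height, width = len(ink), len(ink[0])
    # Copy the page and erase everything OCR already explained as text.
    remaining = [row[:] for row in ink]
    for left, top, right, bottom in text_boxes:
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                remaining[y][x] = 0

    regions = []
    seen = [[False] * width for _ in range(height)]
    for y0 in range(height):
        for x0 in range(width):
            if not remaining[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over 8-connected ink pixels.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            left, top, right, bottom = x0, y0, x0 + 1, y0 + 1
            while queue:
                x, y = queue.popleft()
                left, top = min(left, x), min(top, y)
                right, bottom = max(right, x + 1), max(bottom, y + 1)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and remaining[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if (right - left) * (bottom - top) >= min_area:
                regions.append((left, top, right, bottom))
    return regions


# Toy 6x8 page: a line of "text" on row 0 and a small "diagram" lower down.
page = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
regions = find_non_text_regions(page, text_boxes=[(0, 0, 4, 1)])
print(regions)
```

    On a real scan you would likely replace the flood fill with OpenCV (threshold, dilate to merge nearby strokes, then `cv2.findContours`), but the masking step stays the same.
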

    👉 Use this API (a POST request; it also needs an `api-version` query parameter):

    https://<endpoint>/formrecognizer/documentModels/prebuilt-layout:analyze

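
    As a rough sketch of calling that endpoint, the snippet below only builds the request and does not send it. The `api_version` value `2023-07-31`, the example endpoint host, the key placeholder, and the helper name `build_layout_request` are assumptions for illustration; check them against the API version your resource supports.

```python
import json
import urllib.request


def build_layout_request(endpoint, key, document_url, api_version="2023-07-31"):
    """Build (but do not send) a POST request for the prebuilt Layout model.

    api_version is an assumed value; use the one your resource supports.
    """
    url = (
        f"{endpoint.rstrip('/')}/formrecognizer/documentModels/"
        f"prebuilt-layout:analyze?api-version={api_version}"
    )
    # The document to analyze is passed by URL in a small JSON body.
    body = json.dumps({"urlSource": document_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
    )


# Hypothetical endpoint, key, and document URL, for illustration only.
request = build_layout_request(
    "https://example.cognitiveservices.azure.com",
    "<your-key>",
    "https://example.com/scan.pdf",
)
print(request.get_method(), request.full_url)
```

    The analysis is asynchronous: the service should answer with `202 Accepted` and an `Operation-Location` header, which you then poll with GET until the result is ready.
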
    🧠 Option 2: Custom Neural Model with Image Classifier (Hybrid Approach)

    If you want to go further:

    Combine Azure Document Intelligence for text extraction with Azure Computer Vision or Custom Vision to classify image areas (diagrams, logos, illustrations, etc.).

    For example:

    Use Document Intelligence to get page layout and text bounding boxes.

    Use that info to crop the non-text zones.

    Send those cropped areas to Azure Vision APIs to classify them as diagrams or images.
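
    The cropping and position-tracking part of those steps could look roughly like this. It assumes the region polygon is a flat `[x1, y1, x2, y2, ...]` list in inches (as Document Intelligence reports for PDFs), that you know the DPI the page was rendered at, and that text lines carry a top coordinate in pixels. The helper names and the `figures/page1_fig1.png` reference are hypothetical, and the classification call is left out.

```python
import math


def polygon_to_pixel_box(polygon, dpi):
    """Convert a flat [x1, y1, x2, y2, ...] polygon in inches to a pixel box."""
    xs, ys = polygon[0::2], polygon[1::2]
    return (
        math.floor(min(xs) * dpi),
        math.floor(min(ys) * dpi),
        math.ceil(max(xs) * dpi),
        math.ceil(max(ys) * dpi),
    )


def crop(pixels, box):
    """Cut a (left, top, right, bottom) box out of a 2D list of pixels."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


def interleave(text_lines, figures):
    """Merge text lines and figure placeholders by vertical position.

    text_lines : list of (top_px, text)
    figures    : list of (pixel_box, reference), reference = path or URL
    """
    items = [(top, 0, text) for top, text in text_lines]
    items += [(box[1], 1, f"[figure: {ref}]") for box, ref in figures]
    return [content for _, _, content in sorted(items)]


dpi = 100  # assumed rendering resolution of the page image
page_pixels = [[0] * 400 for _ in range(500)]  # stand-in for a real page image

# Hypothetical non-text region reported in inches.
figure_polygon = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 1.0, 4.0]
box = polygon_to_pixel_box(figure_polygon, dpi)
cropped = crop(page_pixels, box)

text_lines = [(50, "Figure 1 shows the circuit."), (450, "The next section covers testing.")]
merged = interleave(text_lines, [(box, "figures/page1_fig1.png")])
print(box, len(cropped), len(cropped[0]))
print(merged)
```

    With a real page image loaded through Pillow, `Image.crop((left, top, right, bottom))` takes the same box. Keeping the box next to each saved crop is what lets you substitute a reference or URL at the right place in the text later.
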

    🧠 Option 3: Use Page Content Tags (if using PDF SDK or AI Indexing)

    Some indexing pipelines, such as an Azure AI Search skillset, can detect image content by chaining these skills:

    OCR Skill

    Layout Skill

    Image Analysis Skill

    This may help tag and extract diagram zones, especially in scanned engineering or academic documents.

    🧪 Tip:

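    When working with scanned PDFs, make sure the scan is legible (adequate resolution and contrast). Document Intelligence runs its own OCR, so the PDF does not need an embedded text layer. Note that "readingOrder": "natural" in the layout API only affects the order in which text lines are returned; it does not enable OCR.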

    Let me know if you'd like a working Python or REST sample showing how to extract and infer these regions. And if this helps, please click “Accept Answer” so others can benefit too 😊

    Best Regards,

    Jerald Felix


