November 11, 2024
Answered

Service to detect if a PDF is scanned?

  • November 11, 2024
  • 1 reply
  • 1077 views

Hi,

Is there a service that will detect whether a PDF is scanned? I'd like to determine whether a PDF is text-based or a scanned image before OCRing it. I don't see what I'm looking for in PDF Properties.

Thanks,

Jeff

    1 reply

    Joel Geraci (Correct answer)
    Community Expert
    November 13, 2024

    In the output from PDF Properties API, look in the "pages" property. For each page you'll see something like the code below.

    Be sure to verify the "is_scanned" boolean by checking that the page has only one image and that "only_images" is true. If the file has been OCRed, "has_text" will be true.

    {
        "page_number": 0,
        "is_scanned": true,
        "width": 630,
        "has_structure": false,
        "content": {
            "number_of_images": 1,
            "only_images": true,
            "has_text": false,
            "has_images": true,
            "is_empty": false
        },
        "height": 810
    }
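
The verification described above could be sketched in Python roughly as follows. This is only an illustration that operates on a single page entry shaped like the JSON above; the function name `page_looks_scanned` is hypothetical, and treating `"has_text": false` as part of the check is an assumption based on the note that OCRed files report `"has_text": true`.

```python
import json


def page_looks_scanned(page: dict) -> bool:
    """Return True if a page entry looks like an un-OCRed scan.

    Hypothetical helper: cross-checks "is_scanned" against the
    "content" fields instead of trusting the boolean alone.
    """
    content = page.get("content", {})
    return (
        page.get("is_scanned") is True
        and content.get("only_images") is True
        and content.get("number_of_images") == 1
        # Assumption: a page that already has text was OCRed earlier.
        and content.get("has_text") is False
    )


# The sample page entry from the answer above.
sample = json.loads("""
{
    "page_number": 0,
    "is_scanned": true,
    "width": 630,
    "has_structure": false,
    "content": {
        "number_of_images": 1,
        "only_images": true,
        "has_text": false,
        "has_images": true,
        "is_empty": false
    },
    "height": 810
}
""")

print(page_looks_scanned(sample))  # True
```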
    Inspiring
    November 13, 2024

    Great, thanks Joel.

    Inspiring
    February 20, 2025

    @Joel Geraci We are implementing this now. As you pointed out, is_scanned is a per-page property. We plan to check whether the first page is scanned and, if it is, OCR the whole PDF. Does that make sense? The OCR service only works at the document level; there are no page parameters.
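
For comparison, here is a minimal Python sketch of the two possible document-level decisions: checking only the first page, or checking every page. It is not an answer from the thread. The function `needs_ocr`, its `first_page_only` flag, and the two sample page entries are all hypothetical; the input is assumed to be the list found under the "pages" property of the PDF Properties output. One case a first-page-only check might miss is a mixed document, for example a text cover page followed by scanned pages.

```python
def needs_ocr(pages: list, first_page_only: bool = False) -> bool:
    """Decide whether to send the whole document to OCR.

    Hypothetical helper: returns True if any inspected page is flagged
    as scanned and does not already contain text.
    """
    candidates = pages[:1] if first_page_only else pages
    return any(
        page.get("is_scanned") is True
        and page.get("content", {}).get("has_text") is False
        for page in candidates
    )


# Made-up example: a text cover page followed by a scanned page.
cover = {"page_number": 0, "is_scanned": False, "content": {"has_text": True}}
scan = {"page_number": 1, "is_scanned": True, "content": {"has_text": False}}

print(needs_ocr([cover, scan], first_page_only=True))  # False
print(needs_ocr([cover, scan]))                        # True
```

Since OCR runs on the whole document either way, checking every page costs only a loop over data that is already in the response.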