Service to detect if a PDF is scanned?

November 11, 2024

Hi,

Is there a service that will detect if a PDF is scanned? I'd like to determine whether a PDF is text-based or a scanned image before OCRing it. I don't see what I'm looking for in PDF Properties.

Thanks,

Jeff

Correct answer by Joel Geraci (Community Expert)

In the output from PDF Properties API, look in the "pages" property. For each page you'll see something like the code below.

Be sure to verify the "is_scanned" boolean by also checking that the page has only one image ("number_of_images" is 1) and that "only_images" is true. If the file has been OCRed, "has_text" will be true.

{
    "page_number": 0,
    "is_scanned": true,
    "width": 630,
    "has_structure": false,
    "content": {
        "number_of_images": 1,
        "only_images": true,
        "has_text": false,
        "has_images": true,
        "is_empty": false
    },
    "height": 810
}
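
As a rough illustration of that check, here is a minimal Python sketch. It assumes the PDF Properties JSON has already been retrieved and parsed, and it only reads the fields shown in the sample above. The helper name page_looks_scanned is hypothetical, not part of any Adobe SDK.

```python
import json


def page_looks_scanned(page: dict) -> bool:
    """Return True when a page entry looks like a scan that has not been OCRed.

    Hypothetical helper: combines "is_scanned" with the extra checks
    suggested in the answer (one image, only images, no text).
    """
    content = page.get("content", {})
    return (
        page.get("is_scanned") is True
        and content.get("only_images") is True
        and content.get("number_of_images") == 1
        and not content.get("has_text", False)
    )


# The page entry from the sample output above.
sample_page = json.loads("""
{
    "page_number": 0,
    "is_scanned": true,
    "width": 630,
    "has_structure": false,
    "content": {
        "number_of_images": 1,
        "only_images": true,
        "has_text": false,
        "has_images": true,
        "is_empty": false
    },
    "height": 810
}
""")

print(page_looks_scanned(sample_page))
```

Treating "has_text": true as "already OCRed" follows the note above; whether you want to skip such pages is a choice for your own workflow.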

Aftia_Jeff (Author)

Great, thanks Joel.

Aftia_Jeff (Author)

@Joel Geraci We are working on implementing code for this now. As you pointed out, is_scanned is a property of a page. We plan to check whether the first page is scanned and, if it is, OCR the PDF. Does that make sense? The OCR service only works at the document level; there are no page parameters.
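
To make the plan concrete, here is a small Python sketch of the document-level decision. It is only an illustration: the function name should_ocr and the first_page_only switch are hypothetical, and the input is assumed to be the parsed "pages" list from the PDF Properties output. The thread does not confirm whether checking only the first page is enough, so the sketch supports both policies.

```python
def should_ocr(pages: list[dict], first_page_only: bool = False) -> bool:
    """Decide whether to send the whole document to OCR.

    Hypothetical helper: a page counts as needing OCR when it is flagged
    "is_scanned" and its content reports no text.
    """
    candidates = pages[:1] if first_page_only else pages
    return any(
        page.get("is_scanned") and not page.get("content", {}).get("has_text", False)
        for page in candidates
    )


# Made-up example: a text-based cover page followed by a scanned page.
mixed_pages = [
    {"page_number": 0, "is_scanned": False, "content": {"has_text": True}},
    {"page_number": 1, "is_scanned": True, "content": {"has_text": False}},
]

print(should_ocr(mixed_pages, first_page_only=True))
print(should_ocr(mixed_pages))
```

With this made-up input, the first-page-only policy would skip OCR while the all-pages policy would request it, which is the trade-off to weigh for mixed documents.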
