
Service to detect if a PDF is scanned?

Explorer, Nov 11, 2024

Hi,

 

Is there a service that will detect whether a PDF is scanned? I'd like to determine if a PDF is text-based or a scanned image before OCRing it. I don't see what I'm looking for in PDF Properties.

 

Thanks,

 

Jeff


1 Correct answer

Community Expert, Nov 13, 2024

In the output from the PDF Properties API, look in the "pages" property. For each page you'll see something like the code below.

Be sure to verify the "is_scanned" boolean by checking that the page has only one image ("number_of_images" is 1) and that "only_images" is true. If the file has been OCRed, "has_text" will be true.

 

{
    "page_number": 0,
    "is_scanned": true,
    "width": 630,
    "has_structure": false,
    "content": {
        "number_of_images": 1,
        "only_images": true,
        "has_text": false,
        "has_images": true,
        "is_empty": false
    },
    "height": 810
}
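
The check described above can be sketched in Python. This is only an illustration, assuming each entry in "pages" has already been parsed into a dict shaped like the sample above; the function name page_needs_ocr and the exact combination of conditions are assumptions for this sketch, not part of the PDF Properties API.

```python
from typing import Any, Dict


def page_needs_ocr(page: Dict[str, Any]) -> bool:
    """Return True if a page entry looks like an un-OCRed scan.

    Hypothetical helper: it combines the fields mentioned in this thread
    ("is_scanned", "only_images", "number_of_images", "has_text").
    """
    content = page.get("content", {})
    looks_scanned = (
        page.get("is_scanned") is True
        and content.get("only_images") is True
        and content.get("number_of_images") == 1
    )
    # A scanned page that has already been OCRed reports has_text == True.
    return looks_scanned and not content.get("has_text", False)


# The sample page entry from the answer above, as a Python dict.
sample_page = {
    "page_number": 0,
    "is_scanned": True,
    "width": 630,
    "has_structure": False,
    "content": {
        "number_of_images": 1,
        "only_images": True,
        "has_text": False,
        "has_images": True,
        "is_empty": False,
    },
    "height": 810,
}

print(page_needs_ocr(sample_page))  # True
```

Treating "has_text" as the signal that OCR has already run follows the note above; adjust the conditions if your documents behave differently.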
Explorer, Nov 13, 2024

Great, thanks Joel.

Explorer, Feb 20, 2025

@Joel Geraci We are working on implementing code for this now. As you pointed out, is_scanned is a property of each page. We plan to check whether the first page is scanned and, if it is, OCR the PDF. Does that make sense? The OCR service only works at the document level; there are no page parameters.
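
For comparison, here is a rough Python sketch of the first-page check next to an any-page check. It assumes the "pages" list has been parsed into dicts with the per-page fields shown earlier in this thread; the function names are hypothetical, and which rule fits depends on your documents.

```python
from typing import Any, Dict, List


def page_needs_ocr(page: Dict[str, Any]) -> bool:
    # Hypothetical rule: flagged as scanned and no text layer yet.
    content = page.get("content", {})
    return page.get("is_scanned") is True and not content.get("has_text", False)


def first_page_needs_ocr(pages: List[Dict[str, Any]]) -> bool:
    # The plan described above: decide from the first page only.
    return bool(pages) and page_needs_ocr(pages[0])


def any_page_needs_ocr(pages: List[Dict[str, Any]]) -> bool:
    # Alternative: OCR the whole document if any page is an un-OCRed scan.
    return any(page_needs_ocr(p) for p in pages)


# Made-up example: a text cover page followed by a scanned page.
mixed_pages = [
    {"page_number": 0, "is_scanned": False, "content": {"has_text": True}},
    {"page_number": 1, "is_scanned": True, "content": {"has_text": False}},
]

print(first_page_needs_ocr(mixed_pages))  # False
print(any_page_needs_ocr(mixed_pages))    # True
```

With a mixed document like this, the first-page rule would skip OCR while the any-page rule would trigger it for the whole document.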
