Hi,
Is there a service that will detect whether a PDF is scanned? I'd like to determine if a PDF is text-based or a scanned image before OCRing it. I don't see what I'm looking for in PDF Properties.
Thanks,
Jeff
In the output from the PDF Properties API, look in the "pages" property. For each page you'll see something like the JSON below.
Don't rely on the "is_scanned" boolean alone: also check that the page has exactly one image ("number_of_images" is 1) and that "only_images" is true. If the page has already been OCRed, "has_text" will be true.
{
  "page_number": 0,
  "is_scanned": true,
  "width": 630,
  "has_structure": false,
  "content": {
    "number_of_images": 1,
    "only_images": true,
    "has_text": false,
    "has_images": true,
    "is_empty": false
  },
  "height": 810
}
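If it helps, here is a minimal Python sketch of that check, run against the sample page above. The function names `page_needs_ocr` and `pages_needing_ocr` are made up for illustration and are not part of the API. It assumes you have already parsed the PDF Properties JSON and pulled out the list of page objects yourself.

```python
import json


def page_needs_ocr(page: dict) -> bool:
    """Return True if a page entry looks like an un-OCRed scan.

    Hypothetical helper: it applies the checks described above to one
    entry from the "pages" property of the PDF Properties output.
    """
    content = page.get("content", {})
    looks_scanned = (
        page.get("is_scanned", False)
        and content.get("number_of_images") == 1
        and content.get("only_images", False)
    )
    # A scanned page that has already been OCRed reports has_text = true.
    return bool(looks_scanned and not content.get("has_text", False))


def pages_needing_ocr(pages: list) -> list:
    """Return the page_number of every page that still needs OCR."""
    return [p.get("page_number") for p in pages if page_needs_ocr(p)]


# The sample page entry from the reply above.
sample_page = json.loads("""
{
  "page_number": 0,
  "is_scanned": true,
  "width": 630,
  "has_structure": false,
  "content": {
    "number_of_images": 1,
    "only_images": true,
    "has_text": false,
    "has_images": true,
    "is_empty": false
  },
  "height": 810
}
""")

print(pages_needing_ocr([sample_page]))  # [0]
```

The "exactly one image" test is deliberately strict, so a scanned page stored as several image strips would be missed. Loosen that condition if your documents look like that.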
Great, thanks Joel.