Hi,
Is there a service that will detect whether a PDF is scanned? I'd like to determine if a PDF is text-based or a scanned image before OCRing it. I don't see what I'm looking for in PDF Properties.
Thanks,
Jeff
In the output from the PDF Properties API, look in the "pages" property. For each page you'll see something like the JSON below.
Be sure to verify the "is_scanned" boolean by checking that the page has only one image and that "only_images" is true. If the file has been OCRed, "has_text" will be true.
{
  "page_number": 0,
  "is_scanned": true,
  "width": 630,
  "has_structure": false,
  "content": {
    "number_of_images": 1,
    "only_images": true,
    "has_text": false,
    "has_images": true,
    "is_empty": false
  },
  "height": 810
}
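As a rough illustration of the check described above, here is a minimal Python sketch. It assumes you have already parsed the PDF Properties JSON and pulled out one entry of "pages" as a dict. The function name page_needs_ocr is hypothetical, and treating "has_text" being true as "already OCRed, skip it" is my reading of the note above rather than documented behavior.

```python
def page_needs_ocr(page: dict) -> bool:
    """Return True if a page entry looks like an un-OCRed scan.

    `page` is assumed to be one element of the "pages" list in the
    PDF Properties output, shaped like the sample in this thread.
    """
    content = page.get("content", {})
    return (
        page.get("is_scanned") is True              # the API's own flag
        and content.get("number_of_images") == 1    # exactly one image on the page
        and content.get("only_images") is True      # nothing but images
        and not content.get("has_text", False)      # no text layer yet
    )


# The sample page from the reply above.
sample_page = {
    "page_number": 0,
    "is_scanned": True,
    "width": 630,
    "has_structure": False,
    "content": {
        "number_of_images": 1,
        "only_images": True,
        "has_text": False,
        "has_images": True,
        "is_empty": False,
    },
    "height": 810,
}

print(page_needs_ocr(sample_page))
```

For the sample page this prints True. A page that already has a text layer ("has_text" is true) would come back False under this sketch.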
Great, thanks Joel.
@Joel Geraci We are working on implementing code for this now. As you pointed out, is_scanned is a property on a page. We plan to check whether the first page is scanned and, if it is, OCR the PDF. Does that make sense? The OCR service works only at the document level; there are no page parameters.
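Since OCR runs on the whole document, the decision comes down to a single yes/no per file. The sketch below is only one way to frame that, not an answer from Adobe: it compares the first-page-only plan with checking every page. The function name should_ocr and the first_page_only parameter are hypothetical, and pages is assumed to be the already-parsed "pages" list from the PDF Properties output.

```python
from typing import List


def should_ocr(pages: List[dict], first_page_only: bool = False) -> bool:
    """Decide whether to send the whole document to OCR.

    pages: the parsed "pages" list from the PDF Properties output (assumed shape).
    first_page_only: if True, look only at the first page, as planned above;
                     otherwise OCR when any page is flagged as scanned.
    """
    if not pages:
        return False  # nothing to inspect, so nothing to OCR
    candidates = pages[:1] if first_page_only else pages
    return any(page.get("is_scanned") is True for page in candidates)


# Hypothetical mixed document: a text cover page followed by a scanned page.
mixed = [
    {"page_number": 0, "is_scanned": False},
    {"page_number": 1, "is_scanned": True},
]

print(should_ocr(mixed, first_page_only=True))
print(should_ocr(mixed))
```

With this made-up input, the first-page-only check prints False and the all-pages check prints True, so a document with a text-based cover sheet would be skipped by the first-page plan. Whether that matters depends on the kinds of PDFs you expect.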