Does the PDF Extract API use OCR?

New Here, Dec 08, 2023

Hello, I am using the Extract API to analyze my PDF files, and I have a question: does this API use OCR to extract PDF text? Accuracy is really important in my workflow, but when I checked in the demo, the text in some PDFs could not be recognized. The output looks like this: Text:"□H�Q!i ". If the API is using OCR, I will have to find another way to extract the PDF text.

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Expert, Dec 11, 2023

OCR is used only when the entire page is an image. Otherwise, we extract the text directly from the PDF page. It's possible that the font encoding of your PDF is bad, which would explain the results you are seeing. Can you share the PDF in question?
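For anyone who wants to flag affected files automatically, here is a minimal sketch of one way to spot output like "□H�Q!i". It assumes the extracted text is already available as a Java String (for example, read from a Text value in the Extract output). The class name, method names, and the idea of using a ratio threshold are hypothetical and are not part of any Adobe SDK. The sketch simply counts characters that usually indicate a failed decode: U+FFFD, private-use code points, control characters, and unassigned code points.

```java
// Hypothetical helper, not part of the PDF Services SDK.
final class GarbledTextCheck {

    // Returns the share of non-whitespace code points that look like decode failures.
    static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces and line breaks are not evidence either way
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD                       // Unicode replacement character
                    || type == Character.PRIVATE_USE
                    || type == Character.CONTROL
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE) { // unpaired surrogate
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    // The threshold is an assumption; tune it against your own documents.
    static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

Calling GarbledTextCheck.looksGarbled(extractedText, 0.05) on each result would let you set suspicious files aside for manual review. A check like this cannot tell a bad font encoding from an OCR failure; it only reports that the output contains characters that rarely appear in clean text.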

New Here, Dec 11, 2023

Sure, I will post my PDF. I am using ExtractTextTableInfoWithTableStructureFromPdf.java. In a DeveloperLive video, they said the APIs use OCR and Adobe Sensei to increase accuracy. So I ran the code to extract text from both an image PDF and a non-image PDF. Can you tell me whether I have this right?
The APIs use OCR only when the PDF file is an image, and if the PDF is not an image, the APIs do not use OCR?

New Here, Dec 11, 2023

Also, the PDF file is in Korean, but the API recognizes it as English. That is why the text comes out as □H�Q!

I have attached my PDF file. Thanks for the help.

Community Expert, Dec 12, 2023

Yes. At this time, Extract is tuned for English. Other languages based on the Roman alphabet should also work, but Korean is not one of them.

New Here, Dec 12, 2023

But I have extracted about 200 Korean PDFs, and this is the only one that causes the error.

Adobe Employee, Dec 13, 2023

Right, so what you're seeing is that it will sometimes work for non-Roman languages, but not consistently.
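Since results for non-Roman languages can vary from file to file, one option is to sanity-check each result against the language you expect. The sketch below is a hypothetical helper (the class and method names are made up for illustration and are not an Adobe API) that measures what share of the letters in the extracted text belong to the Hangul script, using the standard Character.UnicodeScript class.

```java
// Hypothetical helper, not part of the PDF Services SDK.
final class ScriptShare {

    // Returns the fraction of letters in the text that are Hangul (0.0 if there are no letters).
    static double hangulShare(String text) {
        if (text == null) {
            return 0.0;
        }
        int letters = 0;
        int hangul = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, symbols, and whitespace
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HANGUL) {
                hangul++;
            }
        }
        return letters == 0 ? 0.0 : (double) hangul / letters;
    }
}
```

For a document known to be Korean, a very low value from ScriptShare.hangulShare(extractedText) suggests the extraction did not recover the original text for that file. What counts as "very low" is an assumption you would need to calibrate on files that extract correctly.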
