
Why is some text in a PDF extracted as a figure (image) and not as text when using Adobe DC SDK Extract?

New Here, May 27, 2022


Some text in PDFs is extracted as figures and not as text. I was wondering why this happens and whether there are any settings that would get around it.

I am using the ExtractTextInfoFromPDFWithCustomTimeouts Python function from the adobe-dc-pdf-services-sdk-extract-python-sample package.

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Adobe Employee, Jun 03, 2022


Hi there,

You can read about the different element types that are extracted in this demo (Look under Summary of Element Types): 
https://documentcloud.adobe.com/dc-visualizer-app/index.html

According to the document in the demo, "Figures" are "Non-reflowable constructs like graphs, images, flowcharts".

If a graph or image contains text, the whole thing is extracted as a Figure, even though part of it looks like text. The following image from the aforementioned Text Extract demo is a good example:

[Image: KyleJul_0-1654273027813.png]
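
To check which parts of your own document came back as figures, you can inspect the `structuredData.json` file in the Extract output. The sketch below is only illustrative. It assumes each entry in `elements` has a `Path` string ending in the element type, that text elements carry a `Text` key, and that figure elements list their rendered images under `filePaths`; please verify these key names against your own output. The helper name `split_elements` is hypothetical and not part of the SDK.

```python
import re

# Matches a final path segment such as "Figure" or "Figure[2]".
# Assumption: a figure element has a "Path" like "//Document/Sect/Figure[2]".
_FIGURE_SEGMENT = re.compile(r"^Figure(\[\d+\])?$")


def split_elements(structured_data):
    """Split parsed Extract output into (text elements, figure elements)."""
    text_elements, figure_elements = [], []
    for element in structured_data.get("elements", []):
        # The last path segment names the element's own type.
        last_segment = element.get("Path", "").split("/")[-1]
        if _FIGURE_SEGMENT.match(last_segment):
            figure_elements.append(element)
        elif "Text" in element:
            text_elements.append(element)
    return text_elements, figure_elements


# Hand-written sample for illustration, not real Extract output.
sample = {
    "elements": [
        {"Path": "//Document/P", "Text": "An ordinary paragraph."},
        {"Path": "//Document/Figure[2]", "filePaths": ["figures/fileoutpart1.png"]},
    ]
}

texts, figures = split_elements(sample)
for figure in figures:
    print(figure["Path"], figure.get("filePaths"))
```

With real output you would load the file with `json.load` and pass the resulting dictionary to `split_elements`.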

Hope that helps. Let us know if you have any questions.


New Here, Jul 19, 2023


So is there still no way to extract text from the figure output images? I love this API, but this is a major limitation, as a lot of documents have valuable text inside figures. Are you possibly working on computer vision to extract that text?
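
One possible workaround, outside the Extract API itself, is to run your own OCR over the figure images that Extract writes out. The sketch below is a hypothetical helper, `ocr_figures`, not an Adobe feature. It assumes each figure element lists its image files under `filePaths`, relative to the unzipped output folder, and it takes the OCR step as a callable you supply, so any OCR library can be plugged in.

```python
from pathlib import Path


def ocr_figures(figure_elements, output_dir, ocr):
    """Return {image path: recognized text} for every figure image.

    `ocr` is any callable that takes a file path and returns a string.
    Assumption: each figure element has a "filePaths" list whose entries
    are relative to `output_dir`.
    """
    results = {}
    for element in figure_elements:
        for relative_path in element.get("filePaths", []):
            image_path = Path(output_dir) / relative_path
            results[str(image_path)] = ocr(image_path)
    return results


# Stand-in OCR function for demonstration only; swap in a real OCR call.
def fake_ocr(image_path):
    return f"text from {image_path.name}"


demo = ocr_figures(
    [{"Path": "//Document/Figure", "filePaths": ["figures/fileoutpart0.png"]}],
    "output",
    fake_ocr,
)
print(demo)
```

Keeping the OCR engine as a parameter means the loop over figure files can be tested without any OCR software installed.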

Community Expert, Jun 09, 2022


Extract uses AI to deconstruct the page, so there are no settings for this. Frankly, because we just train the AI, we don't really know how it decides. All we can do is add more documents to the training set so it can learn. Would you be comfortable sharing the PDF files with us?
