Programmatically Save PDF to Plain Text but maintain structure like PDF - File - Save As

New Here, Jun 21, 2022

Hi all, we have a cloud-based NodeJS application that ingests insurance schedules (policies). We extract the plain text using PDFJS and then use Natural Language Understanding to perform entity extraction (named entity recognition), etc. Is there an Adobe solution, or another cloud or on-prem solution, that could extract the plain text from a PDF while keeping its structure, much as Adobe Acrobat Pro DC does with File -> Save As -> Plain Text? Our current solution loses all structure, and we just get long lines or a blob of text. If we could maintain the structure to some degree, we believe it would provide additional "metadata" to the natural language understanding... at least that is what we are hoping for.

 

I have attached an example image to explain...
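
To make the problem concrete, here is a rough sketch of one way to recover some layout from positioned text items before handing the text to NLU. It assumes each item looks like the objects PDFJS returns from `page.getTextContent()`, that is, a `str` plus a `transform` array whose fifth and sixth entries are the x and y position. The function name `itemsToLines` and the `yTolerance` and `charWidth` parameters are hypothetical, and the values are guesses that would need tuning per document. This is not how Acrobat's Save As works, just an approximation of the idea.

```javascript
// Sketch: rebuild line structure from pdf.js-style text items.
// Assumption: each item is { str, transform }, where transform[4] is the
// x position and transform[5] is the baseline y position (PDF units).
function itemsToLines(items, { yTolerance = 2, charWidth = 5 } = {}) {
  // Group items whose baselines are within yTolerance of each other.
  const rows = [];
  for (const item of items) {
    if (!item.str || !item.str.trim()) continue; // skip whitespace-only items
    const x = item.transform[4];
    const y = item.transform[5];
    let row = rows.find((r) => Math.abs(r.y - y) <= yTolerance);
    if (!row) {
      row = { y, cells: [] };
      rows.push(row);
    }
    row.cells.push({ x, str: item.str });
  }

  // PDF y grows upward, so sort descending for top-to-bottom reading order.
  rows.sort((a, b) => b.y - a.y);

  return rows.map((row) => {
    row.cells.sort((a, b) => a.x - b.x);
    let line = "";
    for (const cell of row.cells) {
      // Pad with spaces so each cell starts near its horizontal position.
      const column = Math.round(cell.x / charWidth);
      if (line.length > 0 && column <= line.length) line += " ";
      line += " ".repeat(Math.max(0, column - line.length));
      line += cell.str;
    }
    return line;
  });
}
```

With PDFJS, the input would come from something like `(await page.getTextContent()).items`, and joining the returned lines with `"\n"` gives text in which columns stay roughly aligned instead of collapsing into one long line.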

 

Thank you

Alessandro

TOPICS
How to , PDF Extract API , PDF Services API

Adobe Employee, Jun 21, 2022

Have you looked at our Extract API? I've done exactly what you describe (extract text, use NLP) - you can see a blog post on it here: https://medium.com/adobetech/natural-language-processing-adobe-pdf-extract-and-deep-pdf-intelligence...
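
As a rough illustration of what working with that output might look like, the sketch below turns a list of extracted elements into indented plain text. It assumes the Extract JSON has an `elements` array whose entries carry a `Path` string (for example `//Document/H1` or `//Document/Table/TR/TD/P`) and an optional `Text` string; treat those field names as an assumption to check against the actual output. The function name `elementsToText` and the indentation scheme are hypothetical.

```javascript
// Sketch: turn Extract-style elements into indented plain text.
// Assumption: each element has a `Path` such as "//Document/H1" or
// "//Document/Table/TR/TD/P" and, for text nodes, a `Text` string.
function elementsToText(elements) {
  const lines = [];
  for (const el of elements) {
    if (typeof el.Text !== "string" || !el.Text.trim()) continue;

    // "//Document/Table/TR/TD/P" -> ["Document", "Table", "TR", "TD", "P"]
    const segments = el.Path.split("/").filter(Boolean);
    // Drop a trailing index such as "[2]" to get the bare tag name.
    const tag = segments[segments.length - 1].replace(/\[\d+\]$/, "");
    // Depth below the Document root decides the indentation.
    const depth = Math.max(0, segments.length - 2);

    // Put a blank line before each heading except the very first line.
    if (/^H\d$/.test(tag) && lines.length > 0) lines.push("");
    lines.push("  ".repeat(depth) + el.Text.trim());
  }
  return lines.join("\n");
}
```

The tag and depth derived from `Path` could also be passed to the NLU step directly as metadata, rather than being flattened into whitespace.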

 
