Participant
June 21, 2022
Question

Programmatically Save PDF to Plain Text but maintain structure like PDF - File - Save As


Hi all, we have a cloud-based Node.js application that ingests insurance schedules (policies). We extract the plain text with PDF.js and then use Natural Language Understanding to perform entity extraction (named entity recognition) and so on. Is there an Adobe or other cloud or on-prem solution that can extract the plain text from a PDF while keeping its structure, roughly the way Adobe Acrobat Pro DC does when you choose File -> Save As -> Plain Text? Our current solution loses all structure, and we just get long lines or a blob of text. If we could maintain the structure to some degree, we believe it would provide additional "metadata" to the natural language understanding... at least that is what we are hoping for.
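For illustration only, here is a minimal sketch of one way to keep some layout when working from PDF.js output. It assumes the text items returned by `getTextContent()` expose `str` and a six-number `transform` whose last two entries are the x and y position. The names `layoutText`, `charWidth`, and `lineTolerance` are hypothetical, and the default values are guesses that would need tuning per document. This is not how Acrobat produces its plain-text export.

```typescript
// Only the fields this sketch uses; PDF.js text items carry more than this.
interface PositionedText {
  str: string;         // the text run
  transform: number[]; // [a, b, c, d, x, y], with x and y in PDF points
}

const xOf = (t: PositionedText): number => t.transform[4] ?? 0;
const yOf = (t: PositionedText): number => t.transform[5] ?? 0;

// Rebuild a page as fixed-width lines. charWidth and lineTolerance are
// illustrative guesses, not values taken from PDF.js.
function layoutText(
  items: PositionedText[],
  charWidth = 6,
  lineTolerance = 2,
): string {
  // Group runs whose baselines lie within lineTolerance points of each other.
  const rows: { y: number; runs: PositionedText[] }[] = [];
  for (const item of items) {
    if (item.str.trim() === "") continue; // skip whitespace-only runs
    const y = yOf(item);
    const row = rows.find((r) => Math.abs(r.y - y) <= lineTolerance);
    if (row) row.runs.push(item);
    else rows.push({ y, runs: [item] });
  }
  // PDF y grows upward, so the largest y is the top of the page.
  rows.sort((a, b) => b.y - a.y);

  return rows
    .map(({ runs }) => {
      runs.sort((a, b) => xOf(a) - xOf(b));
      let line = "";
      for (const run of runs) {
        const column = Math.round(xOf(run) / charWidth);
        // Pad out to the run's column, keeping at least one space between runs.
        const pad = Math.max(column - line.length, line.length > 0 ? 1 : 0);
        line += " ".repeat(pad) + run.str;
      }
      return line;
    })
    .join("\n");
}
```

Called with the items of one page, this returns text in which values that share a horizontal position end up in the same character column, which is roughly the kind of structure we are hoping to keep.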

 

I have attached an example image to explain...

 

Thank you

Alessandro

This topic has been closed for replies.

1 reply

Raymond Camden
Community Manager
June 21, 2022

Have you looked at our Extract API? I've done exactly what you describe (extract text, use NLP) - you can see a blog post on it here: https://medium.com/adobetech/natural-language-processing-adobe-pdf-extract-and-deep-pdf-intelligence-31ae07139b66
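As a rough sketch of how the Extract output could feed an NLP step: the API returns a JSON document with an `elements` array, where each entry has a `Path` describing its place in the document structure and, for text elements, a `Text` value. The code below assumes paths shaped like `//Document/H1` and `//Document/Table/TR[2]/TD[1]/P`. The helper name `toStructuredText` and its output conventions (blank line before headings, tab-separated table rows) are hypothetical choices for illustration, not part of the API.

```typescript
// Only the fields this sketch reads from each Extract element.
interface ExtractElement {
  Path: string;  // e.g. "//Document/H1" or "//Document/Table/TR[2]/TD[1]/P"
  Text?: string; // absent on non-text elements such as figures
}

// Strip an index suffix such as "[2]" from a path segment.
const baseName = (segment: string): string => segment.replace(/\[\d+\]$/, "");

// Turn Extract elements into plain text that keeps headings and table rows apart.
function toStructuredText(elements: ExtractElement[]): string {
  const lines: string[][] = []; // each line is a list of tab-separated cells
  let rowKey = "";              // path of the table row currently being filled
  let rowCells: string[] = [];

  for (const el of elements) {
    if (!el.Text) continue;
    const text = el.Text.trim();
    const segments = el.Path.split("/").filter((s) => s.length > 0);
    const names = segments.map(baseName);

    const trIndex = names.indexOf("TR");
    if (trIndex >= 0) {
      // Table cell: cells that share a row path go on the same line.
      const key = segments.slice(0, trIndex + 1).join("/");
      if (key !== rowKey) {
        rowCells = [];
        lines.push(rowCells);
        rowKey = key;
      }
      rowCells.push(text);
      continue;
    }

    rowKey = "";
    const isHeading = names.some((n) => /^(Title|H\d)$/.test(n));
    // Separate a heading from whatever came before it with a blank line.
    if (isHeading && lines.length > 0) lines.push([]);
    lines.push([text]);
  }
  return lines.map((cells) => cells.join("\t")).join("\n");
}
```

You would pass it the `elements` array parsed from the JSON file in the Extract result, then hand the returned string to your NLU service in place of the flat PDF.js text.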