Programmatically Save PDF to Plain Text but maintain structure like PDF - File - Save As

Question

Hi all, we have a cloud-based Node.js application that ingests insurance schedules (policies). We extract the plain text with PDF.js and then run Natural Language Understanding on it for entity extraction (named entity recognition) etc. Is there an Adobe or other cloud or on-prem solution we could leverage that extracts the plain text from a PDF but keeps its structure, roughly the way Adobe Acrobat Pro DC does when you use File -> Save As -> Plain Text? Our current solution loses all structure and we just get long lines / a blob of text. If we could maintain the structure somewhat, we believe it would provide additional "metadata" to the natural language understanding... at least that is what we are hoping for.

I have attached an example image to explain...

Thank you

Alessandro

Raymond Camden · Answer

Have you looked at our Extract API? I've done exactly what you describe (extract text, use NLP) - you can see a blog post on it here: https://medium.com/adobetech/natural-language-processing-adobe-pdf-extract-and-deep-pdf-intelligence-31ae07139b66
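
For a rough idea of what that could look like on the Node.js side, here is a minimal sketch that turns the Extract API's JSON output back into plain text while keeping headings and table rows apart. It assumes the result has already been downloaded and parsed, and that it contains an `elements` array whose entries carry a `Path` (for example `//Document/H1` or `//Document/Table/TR[2]/TD/P`) and, for text elements, a `Text` field; check those names against your actual output. The function name `elementsToText` and the ` | ` cell separator are just illustrative choices, not part of any Adobe SDK.

```javascript
// Sketch: rebuild lightly structured plain text from PDF Extract output.
// Assumption: `extractResult` is the parsed JSON result, with an `elements`
// array whose entries have `Path` and, for text elements, `Text`.
function elementsToText(extractResult) {
  const lines = [];
  let currentRow = null; // path of the table row being accumulated
  let cells = [];        // cell texts collected for that row

  // Emit the pending table row (if any) as a single line.
  const flushRow = () => {
    if (cells.length > 0) lines.push(cells.join(" | "));
    cells = [];
    currentRow = null;
  };

  for (const el of extractResult.elements ?? []) {
    if (typeof el.Text !== "string") continue; // skip non-text elements
    const text = el.Text.trim();
    if (text === "") continue;
    const path = el.Path ?? "";

    // Anything under .../TR[n]/TH or .../TR[n]/TD is treated as a table cell.
    const rowMatch = /^(.*\/TR(?:\[\d+\])?)\/T[HD]/.exec(path);
    if (rowMatch) {
      if (rowMatch[1] !== currentRow) {
        flushRow();
        currentRow = rowMatch[1];
      }
      cells.push(text);
      continue;
    }

    flushRow();
    // Set headings off with blank lines so they stay visible to the NLU step.
    if (/\/(Title|H[1-6])(\[\d+\])?$/.test(path)) {
      lines.push("", text, "");
    } else {
      lines.push(text);
    }
  }
  flushRow();

  // Collapse runs of blank lines and trim the ends.
  return lines.join("\n").replace(/\n{3,}/g, "\n\n").trim();
}
```

Each paragraph inside a table cell ends up as its own cell in this sketch, so multi-paragraph cells would need extra handling. You could also keep the path next to each line if the NLU step can make use of it as metadata.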
