Libraries or scripts to help parse PDF Extract API JSON

Question

Hi, I am looking to use the PDF Extract API to extract text from PDFs. The JSON output that is returned is very granular. That is helpful because it allows us to remove noisy text that doesn't have much meaning, but it also means that text that should be grouped together for reading is not. For example, a list of items is separated into one element per item. I am wondering if parsing this into meaningful text is something others have already solved or if I need to start from scratch. Are there any libraries that do this, or scripts that folks have written? Thanks!

Joel Geraci · Answer

Funny you should ask. As a personal project, I'm working on a "normalizer" for Extract. It's far from ready for prime time though. It's a few months away.
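In the meantime, for anyone starting from scratch, here is a minimal sketch of one way to regroup list items. It is not the normalizer mentioned above. It assumes the Extract JSON has a top-level `elements` array whose entries carry a `Path` string (for example `//Document/L/LI[2]/LBody`) and an optional `Text` string; check those field names and path shapes against your own output. The helper names `list_prefix` and `group_elements` are made up for this example.

```python
import re
from itertools import groupby

# Assumption: a list container appears in Path as the segment "L" or "L[n]".
LIST_NODE = re.compile(r"L(\[\d+\])?$")


def list_prefix(path, depth):
    """Return the path through the first list node plus `depth` more
    segments, or None if the path contains no list node."""
    parts = path.split("/")
    for i, part in enumerate(parts):
        if LIST_NODE.match(part):
            return "/".join(parts[: i + 1 + depth])
    return None


def group_elements(elements):
    """Merge consecutive elements of the same list into one text block.

    Elements outside a list are returned one block per element.
    Elements without a Text field (figures, etc.) are skipped.
    """
    texts = [e for e in elements if e.get("Text")]
    blocks = []

    def block_key(e):
        return list_prefix(e["Path"], 0) or e["Path"]

    for _, run in groupby(texts, key=block_key):
        run = list(run)
        if list_prefix(run[0]["Path"], 0) is None:
            # Not part of a list: keep each element as its own block.
            blocks.extend(e["Text"].strip() for e in run)
        else:
            # Part of a list: one line per list item (label + body),
            # all lines joined into a single block.
            items = groupby(run, key=lambda e: list_prefix(e["Path"], 1))
            lines = [" ".join(e["Text"].strip() for e in item) for _, item in items]
            blocks.append("\n".join(lines))
    return blocks
```

To use it, load the JSON file that Extract returns with `json.load` and pass its `elements` list to `group_elements`. The same idea of grouping on a shared `Path` prefix could be extended to tables or nested sections, but that is left out here.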
