Libraries or scripts to help parse PDF Extract API JSON

New Here, Nov 03, 2023


Hi,

I am looking to use the PDF Extract API to extract text from PDFs. The JSON output it returns is very granular. That is helpful because it lets us remove noisy text that doesn't carry much meaning, but it also means that text that should be grouped together for reading is not. For example, a list of items is separated into one element per item. I am wondering whether parsing this into meaningful text is something others have already solved or whether I need to start from scratch. Are there any libraries that do this, or scripts that folks have written?

 

Thanks!  

Community Expert, Nov 03, 2023


Funny you should ask. As a personal project, I'm working on a "normalizer" for Extract. It's far from ready for prime time though. It's a few months away.

New Here, Nov 16, 2023


Will be interested to see it when it is ready! For my initial purposes I just combined all the text into a single doc and then worked with it from there.
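For anyone landing here with the same question, below is a minimal sketch of that "combine all the text" approach (not the poster's actual script). It assumes the Extract output has a top-level `elements` list whose entries carry a `Path` such as `//Document/L/LI[2]/LBody` and, for text elements, a `Text` string. The helper names `split_list_path` and `combine_text` are made up for illustration, and the path pattern is a guess at how list items are tagged, so check it against your own output before relying on it.

```python
import re

# Assumed path shape for list content: ".../L/LI/...", ".../L[2]/LI[3]/...".
LIST_ITEM = re.compile(r"^(.*?/L(?:\[\d+\])?)/(LI(?:\[\d+\])?)(?:/|$)")


def split_list_path(path):
    """Return (list_path, item_tag) for list content, else (None, None)."""
    match = LIST_ITEM.match(path)
    return (match.group(1), match.group(2)) if match else (None, None)


def combine_text(data):
    """Join element text into blocks, keeping each list together as one block."""
    blocks = []       # each entry: (list_path or None, [lines])
    last_item = None  # item tag of the previous text element, if it was in a list

    for element in data.get("elements", []):
        text = element.get("Text", "").strip()
        if not text:
            continue  # elements without text (e.g. figures) are skipped
        list_path, item = split_list_path(element.get("Path", ""))
        if list_path is None:
            blocks.append((None, [text]))
        elif blocks and blocks[-1][0] == list_path:
            lines = blocks[-1][1]
            if item == last_item:
                lines[-1] += " " + text  # label and body of the same item
            else:
                lines.append(text)       # next item of the same list
        else:
            blocks.append((list_path, [text]))
        last_item = item

    # Items of one list are separated by newlines, blocks by blank lines.
    return "\n\n".join("\n".join(lines) for _, lines in blocks)


# Hand-written sample in the assumed shape, not real API output.
sample = {
    "elements": [
        {"Path": "//Document/H1", "Text": "Shopping "},
        {"Path": "//Document/L/LI/Lbl", "Text": "• "},
        {"Path": "//Document/L/LI/LBody", "Text": "Apples "},
        {"Path": "//Document/L/LI[2]/Lbl", "Text": "• "},
        {"Path": "//Document/L/LI[2]/LBody", "Text": "Pears "},
        {"Path": "//Document/P", "Text": "Done. "},
    ]
}
print(combine_text(sample))
```

With real output you would load the JSON file from the Extract result with `json.load` and pass the resulting dict to `combine_text`. Grouping by the list's path keeps the items of a list in one block; nested lists are folded into their outermost list, which may or may not be what you want.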
