
Does Extract API meet my needs?

Participant, Jun 05, 2023


I want to extract data from PDFs in a form that a Python application can act on. I have played around with Python modules like PyPDF2 enough to know that I can extract the text of a PDF, but the challenge is to associate that text with the meaning the PDF conveys VISUALLY.

 

Let me explain. When one looks at, say, a PDF medical record, we instinctively gather meaning about the text from its organization: the position, the presentation, the fonts, the colors. So, for example, my eye tells me that the date at the top right of the first page of a medical treatment note is the date of treatment, even though no label such as "Treatment Date:" or "Visit Date:" says so. I know perhaps just because of its position, or its font, or its color. The same applies to the sections: I know that the text below the heading "Medical History" is the patient's medical history, at least until we reach "Medications", which is set in the same distinguishing font and text size as "Medical History".

My goals for medical texts are to (i) identify and extract the treatment date, and (ii) chunk the text by its "sections", which, as far as I can tell, can only be discerned visually.

 

Would the Extract API be the right tool?

Explorer, Jun 07, 2023


The Extract API would give you the text, its position on the page, and its style. From there, a developer could write an algorithm that looks for the kinds of "markers" you describe. The more consistent the page layout is across your documents, the easier that will be, but Extract can definitely give you enough of a description of what's on the page to accomplish your goal. It just won't do the hard work of parsing the content for the parts that you want.
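To make the "write an algorithm on top of text, position, and style" idea concrete, here is a minimal sketch in Python. It does not call the Extract API and does not use its real output schema: the element records (`text`, `font_size`, `bounds`, `page`), the page dimensions, the date pattern, and the 1.2 heading ratio are all assumptions made for illustration, and you would need to map the actual Extract output onto something like this first. It also assumes the elements arrive in reading order and that `bounds` is `(left, bottom, right, top)` with the origin at the bottom-left of the page.

```python
import re
from statistics import median

# Hypothetical, simplified records; NOT the real Extract API schema.
# bounds = (left, bottom, right, top), origin assumed at bottom-left.
elements = [
    {"text": "06/05/2023", "font_size": 10, "bounds": (450, 740, 560, 752), "page": 0},
    {"text": "Medical History", "font_size": 14, "bounds": (72, 680, 250, 696), "page": 0},
    {"text": "No known allergies.", "font_size": 10, "bounds": (72, 660, 400, 672), "page": 0},
    {"text": "Prior knee surgery.", "font_size": 10, "bounds": (72, 645, 400, 657), "page": 0},
    {"text": "Medications", "font_size": 14, "bounds": (72, 610, 250, 626), "page": 0},
    {"text": "Ibuprofen 200 mg.", "font_size": 10, "bounds": (72, 590, 400, 602), "page": 0},
]

# Assumed date format (MM/DD/YYYY); real records may need other patterns.
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")


def find_treatment_date(elements, page_width=612.0, page_height=792.0):
    """Return the first date found in the top-right area of the first page."""
    for el in elements:
        if el["page"] != 0:
            continue
        left, bottom, _right, _top = el["bounds"]
        # "Top right" here means the right half and the top quarter of the page.
        in_top_right = left >= page_width / 2 and bottom >= page_height * 0.75
        match = DATE_RE.search(el["text"])
        if in_top_right and match:
            return match.group(1)
    return None


def chunk_by_headings(elements, ratio=1.2):
    """Group text under headings, where a heading is any element whose
    font size is at least `ratio` times the median font size."""
    body_size = median(el["font_size"] for el in elements)
    sections = {}
    current = None
    for el in elements:
        if el["font_size"] >= body_size * ratio:
            current = el["text"].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(el["text"].strip())
    return sections


treatment_date = find_treatment_date(elements)
sections = chunk_by_headings(elements)
```

Text that appears before the first heading (such as the date itself) is deliberately left out of the sections. A heuristic this simple will only hold up when the documents share a layout; font name or color could be added as extra signals in the same way if size alone does not separate headings from body text.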
