I want to extract data from PDFs in a form that a Python application can act on. I have played around with Python modules like PyPDF2 enough to know that I can extract the text of a PDF, but the challenge is to associate that text with the meaning the PDF conveys VISUALLY.
Let me explain. When one looks at, say, a PDF medical record, we instinctively gather meaning about the text from its organization: the position, the presentation, the fonts, the colors. So, for example, my eye tells me that the date in the upper right-hand corner of the first page of a medical treatment note is the date of treatment, even though there is no label such as "Treatment Date:" or "Visit Date:" to indicate it. I know this perhaps just because of its position, or its font, or its color. The same applies to the sections: I know that the text below the section heading "Medical History" is the patient's medical history, at least until we reach "Medications", which appears in the same distinguishing font and text size as "Medical History".
My goals regarding medical texts are to (i) identify and extract the treatment date, and (ii) chunk the text by "section", where, as far as I can tell, the sections can only be discerned visually.
Would the Extract API be the right tool?
The Extract API would give you the text, its position on the page, and its style. From there, a developer could write an algorithm that looks for the kinds of "markers" that you describe. The more consistent the page layout is across the various documents, the easier that would be, but Extract can definitely give you enough of a description of what's on the page to accomplish your goal. It just won't do the hard work of parsing the content for the parts that you want.
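To make the "markers" idea concrete, here is a minimal sketch of what such an algorithm might look like. It does not call the Extract API. It assumes the output has already been reduced to a list of simplified records with hypothetical keys (`text`, `font`, `size`, `page`, `bounds`), that `bounds` is `[left, bottom, right, top]` with the origin at the bottom-left of the page, and that the heading font and size are known from inspecting a few sample documents. The date pattern and the function names are likewise illustrative, so adjust all of these to your actual data.

```python
import re

# Assumed date format (e.g. 03/14/2023); real records may need other patterns.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def find_treatment_date(elements, page_width, page_height):
    """Return the first date-like string in the top-right quadrant of page 0."""
    for el in elements:
        left, bottom, _right, _top = el["bounds"]
        # Bottom-left origin assumed, so a larger y means higher on the page.
        in_top_right = (
            el["page"] == 0
            and left > page_width / 2
            and bottom > page_height / 2
        )
        match = DATE_RE.search(el["text"])
        if in_top_right and match:
            return match.group(0)
    return None


def chunk_by_sections(elements, heading_style):
    """Group text under the most recent element whose (font, size) is heading_style."""
    sections = {}
    current = "(preamble)"  # hypothetical bucket for text before the first heading
    for el in elements:
        if (el["font"], el["size"]) == heading_style:
            current = el["text"].strip()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(el["text"])
    return sections


# Made-up sample records standing in for parsed extraction output.
elements = [
    {"text": "03/14/2023", "font": "Helvetica", "size": 10,
     "page": 0, "bounds": [450, 740, 560, 752]},
    {"text": "Medical History", "font": "Helvetica-Bold", "size": 14,
     "page": 0, "bounds": [50, 680, 200, 696]},
    {"text": "Patient reports seasonal allergies.", "font": "Helvetica", "size": 10,
     "page": 0, "bounds": [50, 660, 400, 672]},
    {"text": "Medications", "font": "Helvetica-Bold", "size": 14,
     "page": 0, "bounds": [50, 620, 180, 636]},
    {"text": "Loratadine 10 mg daily.", "font": "Helvetica", "size": 10,
     "page": 0, "bounds": [50, 600, 300, 612]},
]

treatment_date = find_treatment_date(elements, page_width=612, page_height=792)
sections = chunk_by_sections(elements, heading_style=("Helvetica-Bold", 14))
```

Passing the heading style in explicitly keeps the sketch simple. If the layouts vary, one option would be to infer the heading style instead, for instance by treating the most common font and size as body text and anything larger or bolder as a heading candidate.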