As of now, anything in the headers and footers of PDFs gets ignored when the data is extracted. This is a problem when there is valuable information there.
Is there a plan to add this option to ExtractPdfOptions? Is there a work-around?
I've already added this to the feature request list. But out of curiosity, how would you want header/footer content to be represented? My idea was to take a cue from how accessibility tags are added to a PDF where the header/footer are considered "artifacts" so that when a paragraph or table spans pages, the footer doesn't interrupt the reading order but you'd still be able to easily access the header/footer on a page by page basis if you needed to.
What are your thoughts?
That is a great question. I like your idea of treating them as separate artifacts.
On one side, you might need the data contained in a page-specific footer, for example the page number when it doesn't correspond to the PDF page, or the current chapter. On the other hand, most headers and footers repeat the same information on every page, yet that information may not appear anywhere else, for example the name of the product in the spec sheet you're parsing, which is essential.
Perhaps it could still retain flexibility and cater to both cases by having options, like:
{headers:true, footers:true, inline: false, firstOnly: true}
So you can have only headers or only footers, and you can choose whether to have them inline, so that a paragraph that spans two pages will be interrupted by a footer and then a header element. If you set inline to false, it will put all the header and footer elements at the end. Whether you get just the first header or footer it encounters or all of them depends on the firstOnly option.
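To make the proposal concrete, here is a rough Python sketch of how those four options could behave. None of this exists in ExtractPdfOptions today: the `Element` class, the `arrange` function, and the option names are hypothetical, and the sketch assumes each header or footer is a single element and that elements arrive in page order.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical extracted element; not the real Extract API schema."""
    page: int
    kind: str  # "body", "header", or "footer"
    text: str


def arrange(elements, headers=True, footers=True, inline=False, first_only=True):
    """Order elements according to the proposed (hypothetical) options.

    Assumes `elements` is already in page order, with each page laid out
    as header, body, footer.
    """
    wanted = set()
    if headers:
        wanted.add("header")
    if footers:
        wanted.add("footer")

    seen_kinds = set()          # artifact kinds already emitted (for first_only)
    main, trailing = [], []     # reading order, and artifacts moved to the end
    for el in elements:
        if el.kind == "body":
            main.append(el)
            continue
        if el.kind not in wanted:
            continue            # this artifact type was switched off
        if first_only and el.kind in seen_kinds:
            continue            # keep only the first header / first footer
        seen_kinds.add(el.kind)
        # inline=True interrupts the body; inline=False appends at the end
        (main if inline else trailing).append(el)
    return main + trailing
```

With the defaults, a paragraph spanning two pages stays contiguous and the first header and first footer follow the body; with `inline=True, first_only=False` every header and footer appears where it occurs on the page.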
The other tricky situation is where content near the top and bottom of the page gets identified as a header/footer but isn't; there aren't any actual headers or footers, it's all body content. Even with the "artifact" concept, we'd at least know that the first artifact was the first piece of content on the page and the last artifact was the last item on the page.
Yes, definitely a lot of important information is getting lost.
Is there anywhere we can track whether this is going to be done at all?
This is a huge issue for me. I think it would be much better to process the complete document and let people filter out content after the fact. It should be pretty easy to filter content using the y coordinates. The API is pretty useless if the PDF has important information outside the main body of the page.
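As an illustration of that after-the-fact filtering, here is a minimal Python sketch. It only helps if the extraction returns the margin content in the first place, which is the request in this thread. It assumes each element is a dict with a `Page` index and a `Bounds` list of `[left, bottom, right, top]` in points, measured from the bottom-left corner of the page; those field names and the coordinate origin are assumptions to check against your own output. `split_by_margin`, `page_heights`, and the 72-point margin are made up for the example.

```python
def split_by_margin(elements, page_heights, margin=72.0):
    """Split elements into (body, margin_content) using y coordinates.

    Assumptions (verify against your own JSON output):
      - el["Bounds"] is [left, bottom, right, top] in points
      - the origin is the bottom-left corner of the page
      - page_heights maps a page index to that page's height in points
    """
    body, margin_content = [], []
    for el in elements:
        _left, bottom, _right, top = el["Bounds"]
        height = page_heights[el["Page"]]
        in_footer = top <= margin               # entirely inside the bottom band
        in_header = bottom >= height - margin   # entirely inside the top band
        (margin_content if in_header or in_footer else body).append(el)
    return body, margin_content
```

A fixed margin is the simplest possible rule. Documents with tall headers or body text that runs close to the page edge would need a per-document threshold.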