Extracting Page Header and Footer

April 17, 2023

As of now, anything in the headers and footers of PDFs gets ignored when the data is extracted. This is a problem when valuable information lives there.

Is there a plan to add this option to ExtractPdfOptions? Is there a work-around?


Joel Geraci

Community Expert

I've already added this to the feature request list. But out of curiosity, how would you want header/footer content to be represented? My idea was to take a cue from how accessibility tags are added to a PDF, where headers and footers are considered "artifacts". That way, when a paragraph or table spans pages, the footer doesn't interrupt the reading order, but you'd still be able to easily access the header/footer on a page-by-page basis if you needed to.
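
To make the "artifact" idea concrete, here is a minimal Python sketch. It is not the Extract API's actual output or schema; the element shape (`kind`, `page`, `text`) and the function name `split_artifacts` are assumptions made up for illustration. It only shows body content kept in one uninterrupted reading order, with headers and footers set aside per page.

```python
from collections import defaultdict

def split_artifacts(elements):
    """Separate body text from header/footer artifacts, keyed by page.

    `elements` is a hypothetical list of dicts with the keys
    "kind" ("body", "header" or "footer"), "page" and "text".
    """
    body = []
    artifacts = defaultdict(lambda: {"header": [], "footer": []})
    for el in elements:
        if el["kind"] in ("header", "footer"):
            # Artifacts are stored per page, outside the reading order.
            artifacts[el["page"]][el["kind"]].append(el["text"])
        else:
            # Body content keeps its original order across page breaks.
            body.append(el["text"])
    return body, dict(artifacts)

# Made-up sample: one paragraph that spans two pages.
elements = [
    {"kind": "header", "page": 1, "text": "Product X Spec Sheet"},
    {"kind": "body", "page": 1, "text": "The table begins"},
    {"kind": "footer", "page": 1, "text": "Page 1"},
    {"kind": "header", "page": 2, "text": "Product X Spec Sheet"},
    {"kind": "body", "page": 2, "text": "and continues."},
    {"kind": "footer", "page": 2, "text": "Page 2"},
]
body, artifacts = split_artifacts(elements)
```

With this layout, `body` reads straight through the page break, while `artifacts[2]["footer"]` still gives the footer of page 2 when it is needed.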

What are your thoughts?

Reuven27686304lwio (Author)

Participant

That is a great question. I like your idea of treating them as separate artifacts.

On the one hand, you might need the data contained in a page-specific footer, for example the page number when it doesn't correspond to the PDF page, or the current chapter. On the other hand, most headers and footers repeat the same information again and again, yet that information may not appear anywhere else, like the name of the product in the spec sheet you're parsing, which is essential.

Perhaps it could still retain flexibility and cater to both cases by having options, like:

{headers:true, footers:true, inline: false, firstOnly: true}

So you can have only headers or only footers, and you can choose whether to have them inline, so that a paragraph that spans two pages will be interrupted by a footer and then a header element. If you set inline to false, it will put all the header and footer elements at the end. Whether you get just the first header or footer it encounters or all of them depends on the firstOnly option.
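
As a rough illustration of how these four options could behave, here is a Python sketch. None of this exists in ExtractPdfOptions today; the function `arrange_elements`, the snake_case parameter names, and the element shape (`kind`, `page`, `text`) are all hypothetical and only mirror the proposal above.

```python
def arrange_elements(elements, headers=True, footers=True,
                     inline=False, first_only=True):
    """Apply the proposed (hypothetical) header/footer options.

    `elements` is a list of dicts in document order, each with
    "kind" ("body", "header" or "footer"), "page" and "text".
    """
    wanted = {"header": headers, "footer": footers}
    seen = set()      # artifact kinds already emitted, for first_only
    ordered = []      # body content, plus artifacts when inline=True
    trailing = []     # artifacts moved to the end when inline=False
    for el in elements:
        kind = el["kind"]
        if kind not in wanted:
            ordered.append(el)        # body content is always kept
            continue
        if not wanted[kind]:
            continue                  # this artifact kind is switched off
        if first_only and kind in seen:
            continue                  # keep only the first header/footer
        seen.add(kind)
        (ordered if inline else trailing).append(el)
    return ordered + trailing

# Made-up sample: one paragraph that spans two pages.
sample = [
    {"kind": "header", "page": 1, "text": "Product X"},
    {"kind": "body", "page": 1, "text": "A paragraph that starts"},
    {"kind": "footer", "page": 1, "text": "Page i"},
    {"kind": "header", "page": 2, "text": "Product X"},
    {"kind": "body", "page": 2, "text": "and ends here."},
    {"kind": "footer", "page": 2, "text": "Page ii"},
]
default_texts = [el["text"] for el in arrange_elements(sample)]
```

With the defaults from the example options (inline off, firstOnly on), the paragraph stays in one piece and a single header and a single footer follow it at the end.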

Joel Geraci

Community Expert

The other tricky situation is where content near the top and bottom of the page gets identified as a header/footer but isn't; there aren't any actual headers or footers, it's all body content. Even with the "artifact" concept, we'd at least know that the first artifact was the first piece of content on the page and the last artifact was the last item on the page.
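
A small Python sketch of that fallback, under the assumption that artifacts are stored per page as header and footer lists. The data layout and the name `restore_page_order` are hypothetical, not anything the API provides. If the detection turns out to be wrong, the page can be put back together by position alone.

```python
def restore_page_order(body_by_page, artifacts_by_page):
    """Rebuild each page as header artifacts, then body, then footer artifacts.

    Both arguments are hypothetical structures: `body_by_page` maps a page
    number to a list of body texts, and `artifacts_by_page` maps a page
    number to {"header": [...], "footer": [...]}.
    """
    pages = sorted(set(body_by_page) | set(artifacts_by_page))
    restored = {}
    for page in pages:
        art = artifacts_by_page.get(page, {})
        restored[page] = (
            list(art.get("header", []))        # first content on the page
            + list(body_by_page.get(page, []))
            + list(art.get("footer", []))      # last content on the page
        )
    return restored

# Made-up sample: body sentences wrongly flagged as header and footer.
body_by_page = {1: ["Middle paragraph."]}
artifacts_by_page = {
    1: {
        "header": ["Opening sentence flagged as a header."],
        "footer": ["Closing sentence flagged as a footer."],
    }
}
restored = restore_page_order(body_by_page, artifacts_by_page)
```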
