
Extracting Page Header and Footer

Community Beginner, Apr 17, 2023


Currently, anything in the headers and footers of a PDF is ignored when the data is extracted. This is a problem when they contain very valuable information.

Is there a plan to add this option to ExtractPdfOptions? Is there a workaround?

Community Expert, Apr 17, 2023


I've already added this to the feature request list. Out of curiosity, though, how would you want header/footer content to be represented? My idea was to take a cue from how accessibility tags are added to a PDF, where headers and footers are considered "artifacts". That way, when a paragraph or table spans pages, the footer doesn't interrupt the reading order, but you'd still be able to easily access the header/footer on a page-by-page basis if you needed to.

What are your thoughts?
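
To make the "artifact" idea concrete, here is a minimal Python sketch. It is not the PDF Extract API's output format: the `Element` class, its `role` labels, and both helper functions are hypothetical names invented for illustration. It only shows body text being read continuously across a page break while the header and footer stay reachable page by page.

```python
from dataclasses import dataclass


@dataclass
class Element:
    page: int            # 1-based page number
    text: str
    role: str = "body"   # hypothetical labels: "body", "header", "footer"


def reading_order(elements):
    """Return body text in document order, skipping header/footer artifacts."""
    return [e.text for e in elements if e.role == "body"]


def artifacts_on_page(elements, page):
    """Return the header/footer artifacts recorded for a single page."""
    return [e for e in elements if e.page == page and e.role != "body"]


# Made-up sample: one paragraph that spans a page break.
elements = [
    Element(1, "Spec Sheet: Widget X", "header"),
    Element(1, "This paragraph starts on page 1"),
    Element(1, "Page i", "footer"),
    Element(2, "Spec Sheet: Widget X", "header"),
    Element(2, "and continues on page 2."),
    Element(2, "Page ii", "footer"),
]

# The paragraph reads straight through the page break.
print(" ".join(reading_order(elements)))
# The artifacts are still available for any given page.
print([a.text for a in artifacts_on_page(elements, 2)])
```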

Community Beginner, Apr 17, 2023


That is a great question. I like your idea of treating them as separate artifacts.

On one hand, you might need the data contained in a page-specific footer, for example the page number when it doesn't correspond to the PDF page, or the current chapter. On the other hand, most headers and footers repeat the same information again and again, yet that information may not appear anywhere else, for example the name of the product on the spec sheet you're parsing, which is essential.

Perhaps it could retain flexibility and cater to both cases by having options, like:

{headers:true, footers:true, inline: false, firstOnly: true}

So you can have only headers or only footers, and you can choose whether to have them inline, in which case a paragraph that spans two pages will be interrupted by a footer and then a header element. If you set inline to false, all the header and footer elements are put at the end. Whether you get just the first header or footer encountered, or all of them, depends on the firstOnly option.
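
A rough Python sketch of how these proposed options might behave. None of this exists in ExtractPdfOptions: the `apply_options` function and the `(page, role, text)` tuples are hypothetical, and `first_only` stands in for `firstOnly`. It also assumes that "first only" means the first header and the first footer each, which is one possible reading of the proposal.

```python
def apply_options(elements, headers=True, footers=True, inline=False, first_only=True):
    """Filter and order (page, role, text) tuples per the proposed options.

    Hypothetical helper: role is "body", "header", or "footer".
    """
    wanted = {role for role, on in (("header", headers), ("footer", footers)) if on}
    seen = set()                 # artifact roles already emitted (for first_only)
    stream, trailing = [], []
    for page, role, text in elements:
        if role == "body":
            stream.append((page, role, text))
            continue
        if role not in wanted:
            continue             # this kind of artifact was switched off
        if first_only and role in seen:
            continue             # keep only the first header / first footer
        seen.add(role)
        # inline=True keeps the artifact where it occurs; otherwise defer it.
        (stream if inline else trailing).append((page, role, text))
    return stream + trailing


# Made-up sample: a paragraph spanning two pages of a spec sheet.
elements = [
    (1, "header", "Widget X"),
    (1, "body", "Paragraph start"),
    (1, "footer", "Page i"),
    (2, "header", "Widget X"),
    (2, "body", "paragraph end."),
    (2, "footer", "Page ii"),
]

# Defaults mirror the example above: body first, then one header and one footer.
print([text for _, _, text in apply_options(elements)])
# Headers only, kept inline, all occurrences: the paragraph is interrupted.
print([text for _, _, text in apply_options(elements, footers=False, inline=True, first_only=False)])
```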

 

 

Community Expert, Apr 17, 2023


The other tricky situation is where content near the top and bottom of the page gets identified as a header/footer but isn't; there aren't any actual headers or footers, it's all body content. Even with the "artifact" concept, we'd at least know that the first artifact was the first piece of content on the page and the last artifact was the last item on the page.
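
A small Python sketch of that recovery path, assuming a hypothetical per-page layout with separate `header`, `body`, and `footer` lists (not an actual API output). Because the header artifacts are known to come first on the page and the footer artifacts last, a misclassified page can be put back together in its original order.

```python
def page_text(page, artifacts_are_body=False):
    """Return a page's content as a list of strings.

    `page` is a hypothetical dict with "header", "body", and "footer" lists.
    With artifacts_are_body=True, the artifacts are folded back into the
    body in top-to-bottom order: header first, footer last.
    """
    if artifacts_are_body:
        return page["header"] + page["body"] + page["footer"]
    return list(page["body"])


# Made-up page with no real header or footer: the first and last lines of
# body content were wrongly detected as artifacts.
misdetected_page = {
    "header": ["1. Introduction"],
    "body": ["This document describes the widget."],
    "footer": ["See Table 2 for details."],
}

print(page_text(misdetected_page))                           # body only
print(page_text(misdetected_page, artifacts_are_body=True))  # full page restored
```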

Community Beginner, Apr 26, 2023


Yes, definitely a lot of important information is getting lost.

Is there anywhere we can track whether this is going to be done at all?
