Removing hidden text from document - Acrobat standard tool is not finding it.

New Here, May 04, 2023

Hi,

I have an odd issue with some PDFs that I have to extract data from, and I'm hoping for some support to unpick the problem.

The files appear to contain text that is hidden in some way other than what might be considered normal. If the attached PDF is viewed in Acrobat, it appears OK, but when I run a parser over it to extract data from particular parts of the document (using docparser.com, but I'm confident the issue lies in the PDF, not the parser), additional text appears in parts of the document.

An example problem area is the small table on the right hand side of page 2 titled 'Cott' with rows of 'Odd', 'Medium' and 'Heavy'. I'm extracting that particular area - just the table data, not the vertical title - with the parser, and there is text appearing in it that cannot be seen when viewing the doc in Acrobat.

 

I've tried the following in Acrobat:

Removing form fields

Running the 'Remove hidden text' operation.

Neither of these operations finds the extra text, and I'm out of ideas. I'm hoping somebody can help me remove anything that can't be seen, without flattening the document to an image (which would require OCR afterwards, and that is not an option in this case).

TOPICS
Edit and convert PDFs, General troubleshooting, PDF

1 ACCEPTED SOLUTION
LEGEND, May 08, 2023

The problem is in fact not hidden text. None of the extra text is hidden. Rather, it is on the page but in a different place. Take a look at this pic. (No pic in email replies.)

TestScreenName_3-1683541625376.png

Things extracted from the "Cott" box include parts of vertical captions, like "pe bre", which is made up of fragments of "type" ... "break" just below the box. So this is one to give back to the text extraction people, or a reason to seek a different tool.
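
One way to trace stray fragments such as "pe bre" back to their source is to check which words elsewhere on the page contain each fragment. The sketch below is only an illustration: `trace_fragments` and the sample word list are hypothetical, and in practice the word list would come from a full-page extraction by whatever tool is in use.

```python
def trace_fragments(stray_text, page_words):
    """Map each stray token to the page words that contain it as a partial match."""
    sources = {}
    for token in stray_text.split():
        needle = token.lower()
        # Keep only words that contain the token but are longer than it,
        # i.e. the token looks like a clipped piece of the word.
        sources[token] = [
            word for word in page_words
            if needle in word.lower() and len(word) > len(needle)
        ]
    return sources


# Hypothetical sample: words from a full-page extraction, including the
# vertical captions near the "Cott" box.
page_words = ["Cott", "Odd", "Medium", "Heavy", "type", "break"]
print(trace_fragments("pe bre", page_words))
# {'pe': ['type'], 'bre': ['break']}
```

If every stray token maps to a visible word near the selection area, that points to clipped neighbouring text rather than hidden content.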

Community Expert, May 05, 2023

" I'm extracting that particular area - just the table data, not the vertical title - with the parser, and there is text appearing in it that cannot be seen when viewing the doc in Acrobat."

 

What text do you get there?

New Here, May 07, 2023

parser.jpg

New Here, May 07, 2023

Apologies, ignore the previous reply; the wrong screenshot was attached. The text displayed from the document that's attached to this thread is as follows:

parser.jpg

LEGEND, May 05, 2023

To analyse this - since Acrobat doesn't extract any extra text - we'd need the exact text you extract and the exact options you used in your non-Adobe extractor (for example, do you extract by coordinate area?).

New Here, May 07, 2023

The parser requires a bounding box to be drawn on the file and extracts the text that it finds there. This is exactly what is extracted from the document attached to this thread:

parser.jpg
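
How a tool decides which text is "in" a drawn box matters here. As a rough illustration (not a description of how docparser.com actually works, which this thread does not establish), the sketch below compares two selection rules on hypothetical glyph boxes: keeping any glyph that overlaps the box, versus keeping only glyphs whose centre lies inside it. All names and coordinates are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # True when the two rectangles share any area.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)

    def contains_center_of(self, other: "Box") -> bool:
        # True when the centre point of `other` lies inside this box.
        cx = (other.x0 + other.x1) / 2
        cy = (other.y0 + other.y1) / 2
        return self.x0 <= cx <= self.x1 and self.y0 <= cy <= self.y1


# Hypothetical glyphs: "O", "d", "d" sit inside the selection, while "p" and
# "e" belong to a neighbouring caption whose boxes just poke into it.
glyphs = [
    ("O", Box(10, 10, 16, 20)),
    ("d", Box(16, 10, 22, 20)),
    ("d", Box(22, 10, 28, 20)),
    ("p", Box(48, 12, 56, 22)),
    ("e", Box(49, 22, 57, 32)),
]
selection = Box(0, 0, 50, 30)

by_overlap = "".join(ch for ch, box in glyphs if selection.overlaps(box))
by_center = "".join(ch for ch, box in glyphs if selection.contains_center_of(box))
print(by_overlap)  # Oddpe
print(by_center)   # Odd
```

Under the overlap rule, a caption that only grazes the selection contributes stray characters even though nothing looks wrong on screen.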

New Here, May 09, 2023

That makes a lot of sense. Thank you! The bounding box of the selection area does not cover the vertical parts, but I assume it may have something to do with the way the text is rotated when being read, causing it to fall invisibly (to the eye) into the bounding box.

I'll go back to the parser team.
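
The rotation guess above can be sketched numerically. This is only one conceivable explanation, not something the thread confirms: if an extractor computed the position of a vertical caption without applying its rotation, the caption's box would extend sideways instead of upwards and could land inside the selection. The helper names, sizes and coordinates below are hypothetical.

```python
import math


def run_bbox(origin, width, height, angle_deg):
    """Axis-aligned bounding box of a text run rotated about its origin."""
    ox, oy = origin
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Corners of the unrotated run, relative to the origin.
    corners = [(0, 0), (width, 0), (width, height), (0, height)]
    xs = [ox + x * cos_t - y * sin_t for x, y in corners]
    ys = [oy + x * sin_t + y * cos_t for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a, b):
    """True when two (x0, y0, x1, y1) rectangles share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


selection = (100, 100, 200, 160)  # hypothetical table area
caption_origin, caption_w, caption_h = (90, 100), 60, 10

rotated = run_bbox(caption_origin, caption_w, caption_h, 90)   # as drawn on the page
unrotated = run_bbox(caption_origin, caption_w, caption_h, 0)  # rotation ignored

print(intersects(rotated, selection))    # False
print(intersects(unrotated, selection))  # True
```

With the rotation applied, the caption stays to the left of the selection; with it ignored, the same run reaches into the selected area.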
