Participating Frequently
March 8, 2017
Question

Is PDF/A format useful for scanned documents?


Hi,

I work at an academic library, where I currently supervise a project to scan a collection of printed theses into PDF files.

To guarantee long-term accessibility of these files, I initially chose the PDF/A format.

The theses are scanned with a FUJITSU production scanner, and then we run OCR in Acrobat Pro DC to add full-text search.

But I have run into two problems:

1/ It is much more time-consuming to generate PDF/A.

2/ Some of these theses are put online; I then open the PDF/A file to produce a second version, to which I add a start page with a disclaimer and in which I redact some sensitive information (such as the author's place and date of birth). I noticed that when I save this new version as PDF/A, the new file is 2 or 3 times bigger than the original one!

Can you tell me why?

Do you think that the PDF/A format is relevant in this context?

Thanks for your help,

JH Morneau


2 replies

JHMorneau (Author)
Participating Frequently
March 17, 2017

Any ideas about this file size difference?

I can upload a couple of PDF/A files if you want to take a look at them.

Maybe it could help you find the cause.

Dov Isaacs
Legend
March 9, 2017

You have a few questions here ...  

Yes, it does take time to generate PDF/A. Why? Because part of the process is to analyze the OCR'ed text and create the tags required by the PDF/A specification. This is a processor-intensive, time-consuming process.

One thing you might not be aware of is that if you open the file already converted to PDF/A and add a start page and redact information, the changes you make may in fact invalidate the file's PDF/A conformance. (I've seen this happen before. Run the PDF/A validation after you add the page and redact, and you will likely find that Acrobat no longer considers this a valid PDF/A file!!! And it's right!!)

My recommendation is that, after scanning the document and running OCR, you make a copy of the PDF file, to which you add the start page and apply the redaction. Then do the PDF/A conversion on the original and on the copy separately. Yes, this appears to be double work, but you will end up with two valid PDF/A files!

Yes, PDF/A is relevant in this context. It is the ISO standard recognized around the world as the PDF subset that is safe in terms of features used (and not used) for long-term document archiving (the ‘A’ in PDF/A is for “Archiving”). As a librarian in an academic library, this should be important to you.

In terms of file size, when modifying PDF files (such as when you added a page and redacted a PDF file), always use Save As instead of Save. Save simply appends the changes to the end of the existing PDF file (an incremental update), so the file keeps growing. Save As totally rewrites the PDF file.
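If you want a rough check outside Acrobat for whether a file has been saved incrementally, one approach is to count the `%%EOF` markers in it, since each appended update normally ends with its own marker. The sketch below is only a heuristic, and the function names are made up for illustration: linearized ("Fast Web View") PDFs can contain two markers without any appended update, and the byte sequence could in principle also occur inside uncompressed stream data.

```python
from pathlib import Path


def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count '%%EOF' markers in the raw bytes of a PDF.

    Heuristic only: an appended (incremental) save normally adds one more
    marker, but linearized PDFs may contain two markers on their own.
    """
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_saved(pdf_bytes: bytes) -> bool:
    """Return True when more than one '%%EOF' marker is present."""
    return count_eof_markers(pdf_bytes) > 1


def report(path: str) -> str:
    """Summarize size and marker count for the PDF at a given path."""
    data = Path(path).read_bytes()
    return f"{path}: {len(data)} bytes, {count_eof_markers(data)} %%EOF marker(s)"


# Example usage (hypothetical file name):
# print(report("thesis_online_version.pdf"))
```

A count above one on the modified file, compared with the original, would suggest that edits were appended rather than the file being rewritten.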

          - Dov

- Dov Isaacs, former Adobe Principal Scientist (April 30, 1990 - May 30, 2021)
JHMorneau (Author)
Participating Frequently
March 10, 2017

Thank you for your answer. I understand the PDF/A process better now.

I'm still puzzled by the huge size difference between:

1/ scanned thesis + OCR in PDF/A format

2/ the very same file, modified (with the addition of a start page + some redaction) then saved into PDF/A format again.

Basically, the content is practically the same, yet file #2 is 2 or 3 times bigger than file #1!

Since the start page is a Word document converted to PDF/A, I understand that it carries its embedded fonts with it, but it weighs only 4 KB.

As for your suggestion to make two distinct files, I'll give it a try to check whether it solves this mysterious size "explosion".

The problem is that only a tiny proportion of our scanned theses are put online (and thus need a second version with a start page and redaction), because we need to obtain written authorization from the author beforehand.

Legend
March 10, 2017

Redaction may do some very special things to obfuscate the original information. I don't know exactly what. You can use Audit Space Usage (under Save As Optimized PDF) as a useful step in your analysis.