
Content conversion and content verification possibilities

New Here,
Jan 28, 2019

Hi All,

I want to convert .tiff files of scanned documents (each TIFF file has close to 30 pages) into PDF files.

I also want to validate each .tiff file against the converted PDF file to check the accuracy of the conversion. I need image comparison software that can check the scanned document images pixel by pixel and report what has changed during conversion.

Can you please suggest some tools with which I can achieve the requirements mentioned above?

Thanks,

Ravi

TOPICS
Create PDFs
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Expert,
Jan 28, 2019

Discussion successfully moved from Forum comments to Creating PDFs

Converting Tiffs to PDF: Drop them into Acrobat...

Checking: I do not see the use of such a tool, and I do not think that one exists. If you avoid destructive compression and data modification (such as a colour model change or a size change), the TIFF data in a PDF will not differ from the TIFF data in the TIFF file.

ABAMBO | Hard- and Software Engineer | Photographer
Community Expert,
Jan 28, 2019

Hi Ravisan,

To expand upon what Abambo has said, there are things you can do during the original scan that can help in the OCR process. You are absolutely correct to be saving them as TIF documents as opposed to JPGs. Image degradation caused by JPG creation can significantly affect the final product of document scans.

In addition, higher resolution scans are also worth the effort. A minimum of 300 ppi is always good, and if your scanner can do it, 600 is even better. Also, if your document has smaller fonts, then 600 ppi is almost mandatory. The reason for this is that certain letter combinations can confuse the OCR process if (1) the scan is low resolution, (2) the scan is not clean, or (3) the font in the document is very small. An example of a letter combination that can cause issues is "ri," which OCR can read as an "n."

Another big issue arises if you place your pages on a scanner, hit the scan button, and say you are done. You are probably not done if you want a clean scan. What I mean by this is more easily explained by a blog post I wrote for Adobe here:

https://forums.adobe.com/community/creativepipeline/blog/2018/01/22/scanning-clean-search-able-pdfs

One last comment/warning concerns hyphenated words at the end of a text line. OCR only recognizes shapes, not words. Thus, if you have a text line where the last partial word is "par-" and the next line continues with "tial," you now have two "misspelled" words where in fact there is no misspelled word.
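If you export the OCR text and want to spell-check it outside Acrobat, one way to reduce these false alarms is to rejoin line-end hyphens first. The following is only a minimal sketch in Python; the function name `rejoin_hyphenated` is made up for illustration, and it assumes the exported text uses plain "\n" line breaks.

```python
import re

# A word fragment, a hyphen at the very end of a line, then the rest of the word.
_LINE_END_HYPHEN = re.compile(r"(\w+)-\n(\w+)")


def rejoin_hyphenated(text: str) -> str:
    """Join 'par-' at a line end with 'tial' on the next line into 'partial'.

    Hypothetical helper for illustration only. It also removes the line
    break between the two halves.
    """
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


sample = "This is a par-\ntial example."
print(rejoin_hyphenated(sample))
```

Be aware that this also merges genuinely hyphenated compounds that happen to break at a line end ("well-" / "known" becomes "wellknown"), so the result still needs a human check.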

If you do not currently have Acrobat Pro DC, I suggest you download a copy and try it during the trial period and see if it satisfies your needs.

As far as your desire to have a pixel by pixel comparison of the before and after documents, about the only thing I can think of at that level is to open the before and after into Photoshop, and place them as layers into the same document. Then set the top layer's blending mode to "Difference" and any altered pixels will then stand out. Then you have to zoom into each document looking for potential light-shaded pixels.
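For anyone who would rather script this than eyeball it, the "Difference" idea boils down to taking the absolute difference of each pixel pair. Below is a rough sketch in plain Python, not a finished tool. It assumes both pages have already been decoded into rows of 8-bit grayscale values by some image library (that decoding step is not shown), and the names `Page`, `difference`, and `summarize` are my own.

```python
from typing import List, Optional, Tuple

# Assumption: a page is a list of rows, each row a list of 0-255 gray values.
Page = List[List[int]]


def difference(before: Page, after: Page) -> Page:
    """Per-pixel absolute difference, like a 'Difference' blend on gray values."""
    if len(before) != len(after) or any(
        len(a) != len(b) for a, b in zip(before, after)
    ):
        raise ValueError("pages must have identical dimensions")
    return [
        [abs(a - b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(before, after)
    ]


def summarize(
    diff: Page, threshold: int = 0
) -> Tuple[int, Optional[Tuple[int, int, int, int]]]:
    """Return the changed-pixel count and bounding box (left, top, right, bottom)."""
    changed = [
        (x, y)
        for y, row in enumerate(diff)
        for x, value in enumerate(row)
        if value > threshold
    ]
    if not changed:
        return 0, None
    xs = [x for x, _ in changed]
    ys = [y for _, y in changed]
    return len(changed), (min(xs), min(ys), max(xs), max(ys))


before = [[255, 255, 255], [255, 0, 255]]
after = [[255, 255, 255], [255, 10, 250]]
print(summarize(difference(before, after)))  # (2, (1, 1, 2, 1))
```

A non-zero `threshold` lets you ignore tiny differences, which is the scripted equivalent of hunting only for the lighter-shaded pixels.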

There are two ways you can look for errors. One is to save each document out as Word files and let Word underline any spelling errors. While this lets you see the entire document very easily, it also prevents you from making any corrections IN the document.

Alternatively, in Acrobat Pro DC, select the Enhance Scans tab and, at the top, under the Recognize Text dropdown menu, select Correct Recognized Text.

Acrobat Pro DCSc-001.png

This will highlight all of the words that are questionable. Please note that the only two actual mistakes in the section shown were the word "instruction" and, lower down, after the text All-Natural Hand Cleaner, a small graphic of "USA" that was also seen as an error. All of the other selected text items were actually spelled correctly.

Acrobat Pro DCSc-003.png

Important Note: this will not show how the word was misinterpreted, as the OCRed text is invisible. Fortunately, at the top of this screen you do see a closeup of the scanned text and how the text was interpreted, as seen below.

Acrobat Pro DCSc-002.png

This was an advertising flyer I happened to have on my desk, which I grabbed to demo this. The font was small and the scan was done at 300 ppi.

Let me also add that I redid the same scan but teased the Levels (or Histogram, same thing, different name) and improved it (I enhanced the contrast between the page and the text) and this improved the accuracy considerably: there were no actual errors at the same 300 ppi (although Acrobat did want verification on some of the items).

Acrobat Pro DCSc-004.png
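To give an idea of what a Levels adjustment does to the numbers, here is a simplified sketch in Python. It is a plain linear stretch between a chosen black point and white point, not Acrobat's or Photoshop's actual implementation, and the function name `apply_levels` and the sample values are made up.

```python
from typing import List


def apply_levels(pixels: List[int], black: int = 0, white: int = 255) -> List[int]:
    """Linearly stretch 8-bit gray values so black -> 0 and white -> 255.

    Values at or below `black` become 0 and values at or above `white`
    become 255, which increases the contrast between text and page.
    """
    if not 0 <= black < white <= 255:
        raise ValueError("need 0 <= black < white <= 255")
    scale = 255 / (white - black)
    return [min(255, max(0, round((v - black) * scale))) for v in pixels]


# Grayish text (40) on an off-white page (220), plus one mid-tone value.
print(apply_levels([40, 128, 220], black=40, white=220))  # [0, 125, 255]
```

After the stretch, the text is pure black and the page is pure white, which is the kind of cleaner input that helped the OCR in my second scan.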

Please let us know if this helps your decisions.

New Here,
Jan 29, 2019

Hi Gary,

Thank you very much for the detailed explanation.

I already have the scanned TIFF files (no pictures or photos, e.g. of accidents). These file types were created due to some past issues with the printer/scanner, which is why the documents have these image file extensions. Each TIFF file has close to 30 pages, and we need to convert these TIFF files to PDF. We also need to validate the TIFF files against the converted PDF files and get a report of what has changed during the conversion.

Will Photoshop or Adobe Acrobat DC compare the images with the PDF files and provide the validation results?

Please let me know if any more information is required.

Thanks & Regards,

Ravi

Community Expert,
Jan 29, 2019

Hi Ravisankar,

I'm sorry but I'm not aware of any validation capability from any software.

What exactly are you looking for or expecting? Maybe I'm misunderstanding that aspect of your needs.

New Here,
Jan 29, 2019

Hi Gary,

The requirement steps are given below.

1. The existing source file (tiff, pcx, png, jpg) will be picked up and converted into PDF/A format.

2. The original file (tiff, pcx, png, jpg) will be compared with the converted file (PDF/A), and the comparison results (any differences introduced during conversion from TIFF to PDF) will be logged in a document.

I am trying to find a tool or software that can provide a solution for this requirement.

Please let me know if any more information is required.
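To make step 2 concrete, this is roughly what a per-page comparison log could look like if scripted in Python. It is only a sketch under assumptions: each page from the source file and from the converted PDF/A is assumed to be available as decoded raw pixel bytes (extracting them is not shown and needs an imaging/PDF library), and the name `compare_and_log` and the CSV layout are my own invention.

```python
import csv
import hashlib
import io
from itertools import zip_longest
from typing import Optional, Sequence


def _digest(page: Optional[bytes]) -> str:
    """SHA-256 of a page's raw pixel bytes, or an empty string if the page is missing."""
    return hashlib.sha256(page).hexdigest() if page is not None else ""


def compare_and_log(
    source_pages: Sequence[bytes], converted_pages: Sequence[bytes]
) -> str:
    """Compare pages pairwise and return a CSV report as text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["page", "status", "source_sha256", "converted_sha256"])
    pairs = zip_longest(source_pages, converted_pages)
    for number, (src, conv) in enumerate(pairs, start=1):
        if src is None:
            status = "missing in source"
        elif conv is None:
            status = "missing in converted"
        elif src == conv:
            status = "identical"
        else:
            status = "different"
        writer.writerow([number, status, _digest(src), _digest(conv)])
    return out.getvalue()


# Toy data standing in for decoded page pixels.
print(compare_and_log([b"\x00\xff", b"\x10"], [b"\x00\xff", b"\x11", b"\x00"]))
```

This only answers "did anything change on page N?". Pages reported as different would still need a pixel-level look to see what changed.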

Thanks,

Ravi

Community Expert,
Jan 29, 2019

gary_sc  wrote

To expand upon what Abambo has said, there are things you can do during the original scan that can help in the OCR process. You are absolutely correct to be saving them as TIF documents as opposed to JPGs. Image degradation caused by JPG creation can significantly affect the final product of document scans.

To my knowledge, there are no multi-page JPEGs. TIFF is the most effective format for bitmaps.

ABAMBO | Hard- and Software Engineer | Photographer