Digitizer Dave
Participant
January 25, 2022
Answered

Tagging and Accessibility woes.


I work in a university library. Part of my job is to digitize old materials so we can remove the physical copies from the shelves and make room for newer materials. I recently started a rather large project digitizing some very old material that is about as far from ideal as you can get for OCR and accessibility purposes: think old magazine layouts with lots of border images, atypical reading orders, unfriendly fonts, and plenty of errant markings from past filing and organizing methods (these items circulated through our collection for decades, across multiple cataloging systems). At some point years ago they were removed from the system and scanned into PDF files. I do NOT have permission to edit the source files, e.g. switching to a friendlier font, removing troublesome border graphics, or running the graphics through filters. I can't use auto-tagging, so I'm having to tag manually, and I'm running into several problems in the process.

First, the software often won't let me add a tag at all, or it applies the wrong tag, which seems to be permanent unless I go back, clear the entire page structure, and start over, and even that often ends in the same problem. I'm told this might be caused by leftover encodings from Word or similar programs. Is there software or a process to clean these old artifacts out of a file before I start editing?

The largest problem is that many of the images on the pages are identified as text when they are not. This has been a massive problem for screen readers: most of the images, seals, and borders around text boxes are recognized as text, and the screen readers read out entire paragraphs of arbitrary special characters, even when I've tagged them as background artifacts or images and provided alternate text (the screen reader reads the alternate text and then reads off the paragraphs of special characters as well). I'm aware that unsighted users are accustomed to certain screen reader quirks and can skip ahead when necessary, but because of the poor quality of the starting materials this issue is extreme and drastically reduces the accessibility of the material. I desperately need a method to make the readers skip over this perceived text.

Furthermore, with these files and with complex tables, the readers often read the data out of order, top to bottom and right to left, and even try to read parts of the border as text, all within the same table, in spite of the summary I've provided. Any workaround would be welcome. Or perhaps there is a non-invasive process that could better prepare the files before I begin the OCR pass; keep in mind I can't change the visual content in any way. I'm desperate. At the rate I'm going now, I'd make more progress editing these files in their source code.
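The closest I've come up with on my own is a bulk version of Acrobat's "clear page structure," along the lines of the sketch below. It uses the open-source pikepdf library; the file names are illustrative, and I haven't verified it on these collections. It deletes only the invisible leftover tag tree, nothing visible on the page, but I don't know whether that's safe or sufficient:

```python
# Sketch: strip a PDF's leftover tag tree so manual tagging starts clean.
# Uses the open-source pikepdf library; file names are illustrative.
# Only invisible structure data is removed -- the visual pages are untouched.
import pikepdf

with pikepdf.open("scan.pdf") as pdf:
    root = pdf.Root  # the document catalog
    # Drop the structure tree left behind by Word or an old OCR pass
    if "/StructTreeRoot" in root:
        del root["/StructTreeRoot"]
    # Drop the "tagged PDF" flag so viewers don't expect the old structure
    if "/MarkInfo" in root:
        del root["/MarkInfo"]
    pdf.save("scan-clean.pdf")
```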

This topic has been closed for replies.
Correct answer: Bevi Chagnon - PubCom.com (see the reply below)


1 reply

Bevi Chagnon - PubCom.com
Legend
January 26, 2022

@Digitizer Dave, the problems you describe are very common when using any OCR (optical character recognition) software to recover legacy documents. And you're trying to make them accessible, too. Yeowzah!

 

Your success depends upon several factors:

  • The quality of the original printed document. Is it faded? Handwritten notes on it? Coffee stains?
  • The complexity of the page design. Graphics? Items superimposed or overlapping each other? Graphics of text? Background tints?
  • Tables, lists, footnotes, tables of contents... all are difficult to capture, and even more difficult to tag for accessibility, which must include accessible hyperlinks.

 

There is no software that can do this in a reasonable period of time. Our workflow for these types of documents is to capture the text and graphics, export to MS Word, and rebuild a better source document with everything that's needed (a sketch after this list shows how to audit the tags already in a file):

  • <P> body text
  • <Hx> headings
  • <L> lists
  • <Table> tables
  • <Ref> / <Note> footnotes
  • <TOC> / <TOCI> tables of contents (required for documents 10 pages or longer)
  • <Figure> with Alt-text
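
If you want to see what's already in a file before you start, a short script can print its existing tag tree. Here's a minimal read-only sketch using the open-source pikepdf library (just illustrative, not part of our standard workflow):

```python
# Sketch: print a PDF's existing tag tree (e.g. /P, /H1, /Table, /Figure).
# Uses the open-source pikepdf library; read-only, so it can't harm the file.
import pikepdf

def walk(elem, depth=0):
    if "/S" in elem:  # /S holds the structure type, i.e. the tag name
        print("  " * depth + str(elem["/S"]))
    kids = elem.get("/K")
    if kids is None:
        return
    # /K may be one child or an array of children and marked-content IDs
    if isinstance(kids, pikepdf.Array):
        for kid in kids:
            if isinstance(kid, pikepdf.Dictionary):
                walk(kid, depth + 1)
    elif isinstance(kids, pikepdf.Dictionary):
        walk(kids, depth + 1)

with pikepdf.open("scan.pdf") as pdf:
    if "/StructTreeRoot" in pdf.Root:
        walk(pdf.Root.StructTreeRoot)
    else:
        print("No tag tree at all -- the file was never tagged.")
```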

 

One other problem with OCR scans is that they often miss the spaces between words, jamming two or more words together in a way that is not accessible to screen reader users. You can't see this problem until you extract the text into Word and view the hidden spaces, returns, and other formatting marks, which you can then correct in Word.
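
You can also pre-flag the worst of these jammed words before the Word pass. A rough sketch using the pdfminer.six package; the length threshold is a guess you'd tune to your material:

```python
# Sketch: flag likely "jammed" words (missing OCR spaces) in a scanned PDF.
# Uses the pdfminer.six package; the length threshold is a guess to tune.
import re
from pdfminer.high_level import extract_text

text = extract_text("scan.pdf")  # illustrative file name
suspects = set()
for token in text.split():
    word = token.strip(".,;:!?()[]\"'")
    # Very long tokens, or a lowercase letter running straight into an
    # uppercase one ("theEnd"), usually mean OCR dropped a space.
    if len(word) > 20 or re.search(r"[a-z][A-Z]", word):
        suspects.add(word)
print("\n".join(sorted(suspects)))
```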

 

Our workflow at our accessibility shop:

  1. Scan/OCR in Acrobat Pro DC and test to see how well it interpreted and captured the text content.
  2. Scan/OCR in ABBYY FineReader, a competing program that often does a better job on legacy materials.
  3. Export the content to a Word .docx file. Both programs have this utility.
  4. In Word, you can now see missing spaces, paragraph returns in the wrong place (they should appear only at the end of paragraphs, never mid-paragraph), and typos; the sketch after this list can pre-flag the stray returns. Graphics should be there, too.
  5. Use industry techniques to create an accessible Word document with headers/footers, styles to designate lists, headings, hyperlinks, TOCs, and anything else needed, such as Alt-text.
  6. Re-export to PDF, and the result will be a fairly accessible, compliant PDF ... for much less time and money.
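
For step 4, a quick pre-flight script over the exported .docx can surface the stray mid-paragraph returns before you proofread every page. A minimal sketch using the python-docx package; the punctuation heuristic is deliberately crude:

```python
# Sketch: flag paragraphs in the exported .docx that likely end with a
# stray mid-paragraph return: no sentence-ending punctuation, and the
# next paragraph starts with a lowercase letter.
# Uses the python-docx package; the heuristic is deliberately crude.
from docx import Document

doc = Document("scan.docx")  # illustrative file name
paras = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
for cur, nxt in zip(paras, paras[1:]):
    if not cur.endswith((".", "!", "?", ":")) and nxt[:1].islower():
        print(f"Possible broken paragraph: ...{cur[-40:]} | {nxt[:40]}...")
```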

 

Remember, there is no magic wand to do this quickly.

 

| Bevi Chagnon | Designer, Trainer, & Technologist for Accessible Documents |
| PubCom | Classes & Books for Accessible InDesign, PDFs & MS Office |
Digitizer Dave
Participant
January 26, 2022

Thank you for the reply. I've already been testing these procedures this morning, and I think they will make a huge difference in the quality of our work.