Tagging and accessibility woes
I work in a university library. Part of my job is digitizing old materials so we can remove the physical copies from the shelves and make room for newer acquisitions. I recently started a rather large project digitizing some very old material that is about as far from ideal as you can get for OCR and accessibility: think old magazine layouts full of border images, atypical reading orders, unfriendly fonts, and errant markings from past filing and organizing methods (these items circulated through our collection for decades, across multiple cataloging systems). At some point years ago they were removed from the system and scanned into PDF files.

I do NOT have permission to edit the source files, e.g. switching to a friendlier font, removing troublesome border graphics, or running the graphics through filters. Auto-tagging isn't an option, so I'm tagging manually, and I'm running into several problems.

First, the software often won't let me add a tag, or it applies the wrong tag, which seems to be permanent unless I go back, clear the entire page structure, and start over, and even that often ends with the same problem. I'm told this might be caused by leftover encodings from Word or similar programs. Is there software, or a process, to clean these old artifacts from the file before I start editing?

The biggest problem is that many of the images on the pages get identified as text when they are not. This has been a massive problem for screen readers: most of the images, seals, and annoying borders around text boxes are recognized as text, and the screen readers read out entire paragraphs of arbitrary special characters even when I've tagged them as background artifacts or images and provided alternate text. (The reader reads the alternate text and then reads off the paragraphs of special characters as well.)
I am aware that screen-reader users are accustomed to certain quirks and can skip content when necessary, but because the starting material is so poor, this issue is extreme and drastically reduces the accessibility of the material. I desperately need a method to make the readers skip over this perceived text.

There are similar problems with complex tables: the readers will often read the data out of order (top to bottom, right to left) and even try to read parts of the border as text, all while reading the same table, despite my providing a summary. Any workaround for this would be welcome. So would a non-invasive process to better prepare the files before I begin OCR; keep in mind I can't change the visual content in any way. I'm desperate. At the rate I'm going now, I'd make more progress editing the files in their source code.
