We're having an issue with redacting our files where other information, not marked up for redaction, is removed from the file after the redactions have been applied. The other information is normally a duplicate or copy in appearance, but it doesn't affect every copy. Looking at the PDF content, the issue only occurs where there is a Form XObject (#) or an Image (#) where the # number is a duplicate (is this the object ID? I wasn't sure). For Form XObjects, the information just disappears; for images, the image is replaced with a blank colour the same as the applied redactions.
Currently we can fix the issue by refrying the PDF, either the whole file or by converting individual pages to TIFF, but this isn't ideal, so we would like to try to work out what's causing the issue.
My question is, has anyone seen this or know about it? Is it a bug in Acrobat, or is it to do with how the PDF was created?
Have you tried flattening before redaction? Redaction will remove any object outside the page content (field or annotation) if it touches on a redaction area.
Yes, issue still occurs after files are flattened. To clarify, the redacted content, and the content removed unintentionally, can be many pages apart.
You've stated this happens on XObjects (images are also XObjects). If you're looking in the Content Navigation Panel, then the number you've described is the object ID number. It makes sense, in a way, that if the original XObject was removed from the PDF, then blanks would show at the other locations where that object is referenced.
I just tested this out, and I'm seeing the same thing. At the original redaction site the image is replaced completely with a path and text, which is what should happen. At the other location where the image is referenced, a new image is referenced that is the same size, but the pixels are all white, which is the redaction fill color.
I can see the justification for doing this. However, it ought to be a setting. You could complain to Adobe, but I think this is just the way Redaction has been set up to work.
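To make the shared-reference idea concrete, here is a minimal sketch that lists which object IDs are used on more than one page. It works on a toy model, not a real PDF: `pages` is a hypothetical list with one dict per page that maps XObject resource names to object ID numbers, roughly what a PDF library would give you from each page's /Resources /XObject dictionary. The function name and sample data are made up for illustration.

```python
from collections import defaultdict

def find_shared_xobjects(pages):
    """Return {object_id: [page numbers]} for objects used on more than one page.

    `pages` is a toy stand-in for a parsed PDF: one dict per page, mapping
    XObject resource names (e.g. "Im0") to object ID numbers.
    """
    used_on = defaultdict(list)
    for page_number, xobjects in enumerate(pages, start=1):
        # set() so an object referenced twice on one page is counted once
        for object_id in set(xobjects.values()):
            used_on[object_id].append(page_number)
    return {oid: nums for oid, nums in used_on.items() if len(nums) > 1}

# Hypothetical document: object 12 is drawn on pages 1 and 3.
pages = [
    {"Im0": 12, "Fm0": 7},
    {"Im0": 15},
    {"Im1": 12},
]
print(find_shared_xobjects(pages))  # {12: [1, 3]}
```

This only looks at page-level resources. A Form XObject can carry its own resources and reference further XObjects, so a real check would probably need to walk those as well.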
Thanks for taking the time to investigate, much appreciated.
One solution would be to write a tool to replace all duplicate references to an XObject with unique copies. This could potentially blow up the size of the PDF.
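A rough sketch of what such a tool would do, again on a toy model rather than a real PDF. Here `pages` maps resource names to object IDs per page, and `objects` maps object IDs to their data; both structures and the function name are assumptions for illustration only. The first reference to an object keeps the original, and every later reference gets its own copy under a new ID. Whether this is wise before redacting is a separate question.

```python
def unshare_xobjects(pages, objects):
    """Give every repeated XObject reference its own private copy.

    pages:   list of dicts, resource name -> object ID (toy /Resources /XObject)
    objects: dict, object ID -> object data (bytes in this sketch)
    Returns new (pages, objects); the inputs are not modified.
    """
    new_objects = dict(objects)
    next_id = max(objects, default=0) + 1
    seen = set()
    new_pages = []
    for xobjects in pages:
        new_page = {}
        for name, object_id in xobjects.items():
            if object_id in seen:
                # Later reference: point it at a fresh copy instead.
                new_objects[next_id] = objects[object_id]
                object_id, next_id = next_id, next_id + 1
            else:
                seen.add(object_id)
            new_page[name] = object_id
        new_pages.append(new_page)
    return new_pages, new_objects

# Hypothetical document: a 1000-byte image (object 12) used on pages 1 and 3.
pages = [{"Im0": 12}, {"Im0": 15}, {"Im1": 12}]
objects = {12: b"x" * 1000, 15: b"y" * 500}

new_pages, new_objects = unshare_xobjects(pages, objects)
size_before = sum(len(data) for data in objects.values())
size_after = sum(len(data) for data in new_objects.values())
print(new_pages)                # [{'Im0': 12}, {'Im0': 15}, {'Im1': 16}]
print(size_before, size_after)  # 1500 2500
```

The size growth is visible even in this small example: each extra use of a shared object adds a full copy of its data.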
This is an interesting discussion. Having thought about it, it's entirely correct that the info is removed on other pages, but it would be nice to get a warning about it. The important thing is to remember the #1 job is redaction - removal of sensitive info - rather than convenient and selective text deletion.
I should first observe that the job of form XObjects is to allow shared content to be reused without duplication. A common use is to allow the same background on each page, without having to make the file huge with endless copies of the graphics. It is normal for these to be made automatically; for example, the PDF Optimizer will do this. The form XObject is about appearance, and isn't a convenient collection of organized graphics.
Scenario 1: a graphic contains personal info (shall we say, the address of a person hiding from an abusive partner, just for an example). The graphic is used on pages 1 and 3. The address on page 1 is covered up. The address on page 3, being part of the same graphic, is also removed. This seems correct. You can argue that it's the responsibility of the redactor to see and select the other address, though.
Scenario 2: a page is split into 9 smaller pages by making the page into a form XObject, duplicating it 9 times, and cropping each one. (This is a real thing). You redact an address on page 1. But you don't realise the address exists on 8 other pages, cropped out, easily accessed by the attacker. Happily, the XObject takes care of this, and it disappears in the invisible parts too.
Anyway, I feel that this is very much the right thing for Acrobat to do, and using techniques to split out/separate duplicate form XObjects is actually very dangerous in a redaction context.