Participating Frequently
August 5, 2018
Question

Sanitizing PDFs manually from WebHelp input (or worse)


Long story short, right now (soon to change) I have zero control over the development of PDFs sent to my group. Sometimes they’ll download them from our output and edit them... and we have had folks paste PDFs from output into input. It sucks; the process is changing, and we are migrating to version control and RH Server and switching to HTML5. For now I need what little I have to make it to the finish line.

How can I manually remove coding and text searching from a PDF (we have Adobe Acrobat X)? I don’t exactly know what RoboHelp does to a PDF when it generates, so I’m lost on what to change. Basically, I just want to make sure that before anything goes into the project files it is washed... because I have no clue where the PDF came from, the only place I have control is right before we import. Do I turn off text searching? Is there a way to see code and delete it?

This topic has been closed for replies.

3 replies

Participating Frequently
August 6, 2018

Hi Rick & Peter - It diverted me to the main site - I called the handling page a 404 as I do custom 404s that forward in a similar fashion.

My apologies, I feel like everything I’m saying makes zero sense. I’ve been on 70hr work weeks and exhausted.

I used the index to look up reverse engineer, which did 404 me (to a non-custom 404; I can’t grab the address right now, just go through index > reverse engineer) - could just be my iPhone. At this point I hadn’t thought to read the URL closer or I’d have realized what to look for under the head and month, so I went to WBM.

Participating Frequently
August 6, 2018

Hi Amebr, thank you!

They are baggage files - so we get a file from someone and they want us to import it into the project and link to it (for example, a PDF of a newsletter that they want linked on a newsletter page for folks to open or download).

A PDF is different on input than output - to my knowledge, when it is generating WebHelp, RoboHelp will make the PDF searchable. Which is what I want.

So, imagine a scenario where someone has downloaded the newsletter from my output, changes something about it in Acrobat, and sends it back to my developer, who blindly pastes it on top of the original input. So two things have gone wrong: the newsletter writer didn’t update their original newsletter, and my developer brought in, or overwrote an input file with, what is now an edited output file.

What I am wondering is, if I had two files in front of me that look exactly the same visually except one is from the input and the other is from the output, how would I tell them apart? Similarly, let’s say my project source files are on a computer burned in a fire, and now all I have is output and I need to reconstruct my project - what, if anything, would need to happen to those PDFs?

I don’t know enough about what RoboHelp does that makes a source version of the file a searchable version.

No, it’s not normal, and no, it’s not a good process - there is a myriad of problems with the how that has caused this to happen. I just want to explain how my people can look out for it, and how to fix it if it happens.

Adobe Expert
August 6, 2018

I don't think RoboHelp does anything to pdf baggage files, except to copy them to the output folder when you generate the project. The only thing you can control is whether searching in WebHelp or HTML5 help will show pdf files in the WebHelp/HTML5 search results. That's a setting in your SSL/Output settings (Search tab). You might also find the "Exclude Unreferenced Baggage Files from Search" option useful - it will make sure baggage files that aren't explicitly linked (in the TOC or a topic) aren't included in the search.

For comparing two pdf files, you would need to ask that question on the Acrobat forums.
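In the meantime, if RoboHelp really does just copy baggage files to the output folder unchanged (my understanding, not something I have verified), a byte-for-byte comparison is one rough way to tell whether a pdf taken from the output still matches the copy in the source project. A minimal Python sketch under that assumption; the function names are made up for illustration:

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 64 KiB chunks so large pdfs are not loaded whole.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(source_pdf, output_pdf):
    """True if the two files are byte-for-byte identical."""
    return sha256_of(source_pdf) == sha256_of(output_pdf)
```

If `same_bytes` returns False for a file that should be an untouched copy, someone has saved over it at some point. It cannot say what changed or which copy is the "right" one, only that they differ.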

If you manually change the security settings in each pdf you receive, you could at least prevent people from updating the pdf that you have published. But it would have to be for each pdf, as you are not generating the pdf files from content within RH. And from the sound of your workflow, there is no way to stop developers from updating their original copy. Again, more specific information about PDF files would be better asked on the Acrobat forum.

Another idea is to add a prefix to the pdf files you receive. e.g. You receive acrobat_file.pdf, rename it RH_acrobat_file.pdf and import it into RoboHelp. Then you could tell developers not to copy content from files called RH_, but rather follow whatever process they should be following. Use any naming convention that makes sense for your situation, and tell them not to use that naming for their original files.
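If renaming by hand gets tedious, the prefix step could be scripted before import. This is only a sketch of the idea: the `RH_` prefix is the example from above, and the function names and folder arguments are hypothetical, not anything RoboHelp provides.

```python
import shutil
from pathlib import Path

PREFIX = "RH_"  # example marker; use whatever convention suits your team


def prefixed_name(filename, prefix=PREFIX):
    """Add the prefix unless the name already carries it."""
    if filename.startswith(prefix):
        return filename
    return prefix + filename


def copy_with_prefix(received_pdf, project_dir, prefix=PREFIX):
    """Copy a received pdf into the project folder under its prefixed name."""
    source = Path(received_pdf)
    target = Path(project_dir) / prefixed_name(source.name, prefix)
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

A file that arrives already named `RH_...` is a hint that it came back from the project or output rather than from the author's original, so it may be worth flagging instead of importing.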

If you need to rebuild your project from Output, Peter Grainge (www.grainge.org/index.htm) has a section about reverse engineering on his website. If you just needed the pdf files and not the RH files (e.g. you decided to start the project from scratch), you could just copy the pdfs from your output and use them as normal.

Participating Frequently
August 6, 2018

Amebr, thank you again! Sadly all this baggage is linked - any that isn’t should be shipped off to the graveyard. They’re good at removing the link... we have that part! I’m going to ask the Acrobat folks about some ways to remove OCR/recognize text... From what I’ve read, w/o Pro I can’t do batches... my project is dragging itself with one arm toward me asking to die LOL. We do have specific file names and prefixes... hasn’t stopped anyone - they will save over their own with their old name. The folks writing the docs aren’t technical writers... I’m working on a version control setup whereby we will have more involvement in their doc flow and change management. Until then... it is Wild West.

Thank you again - it sounds like no changes are made to the docs - which might explain why I couldn’t conceive of what they were. Something happened to our output... after someone carried in a bunch of .js files (not kidding), having no clue that that would be an issue. I’ve been sitting and exhaustively watching it compile, and it hangs when generating WebHelp (creating contents), and it is because it is making all the .htm files that go with the PDFs... there are 1600 files of baggage. It was taking those and making triple the .htm (those rhA21, rhC32D, etc. type files) - before I fixed some of it and removed about 800 files, it was taking a 1600-file input and spitting 9000 files out.
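For anyone wanting to keep an eye on that kind of file-count blowup, a small tally by extension over the project folder and the output folder makes the difference easy to see. A rough Python sketch; the function name is made up, and it only counts files, it knows nothing about what RoboHelp generates or why:

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count files under root (recursively), grouped by lowercase extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Running it on both folders and comparing the `.pdf`, `.htm` and `.js` counts before and after a cleanup shows whether the extra files are shrinking the way you expect.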

Adobe Expert
August 6, 2018

I think we need more information about your workflow.

Are the pdfs included in your webhelp as pdf files, or do you import and convert them to html topics? If included as pdfs, do you generate the pdf files from RH to include in your webhelp or do you import them as baggage files and edit within Acrobat?

I'm not sure why you would want to prevent people searching in a pdf, but I suspect you'll have to post on the Acrobat forum. If you're generating the pdfs from RH, the best I could suggest is poking around the settings you can configure for PDFs in your layout.

I'm not sure what 'coding' is, so I can't offer any suggestions.

I know there's a way to prevent editing pdfs, so you could check the pdf security settings - if you're generating the pdfs from RH you can set this information in your output settings. Otherwise, if you just edit the pdfs in Acrobat, there should be somewhere in the settings for the pdf to configure this.