Extract PDF Page by Form Field Name and Value

Report · Jun 29, 2016

This project started with JavaScript for me, but I've since explored other languages and am open to any viable solution.

Disclaimer: 7 months ago I had 0 programming knowledge, but since then have made great strides and learned a lot. Once I complete this project I intend to go back and start from scratch and learn the foundation. I'm also working with a Windows computer.

What I'm looking to do is extract pages of a master PDF file. The pages in question that I want to extract all have 5 questions with "Yes" or "No" checkboxes. I want to extract only those pages that have "Yes" checked.

I've pored over FDF, XML and CSV versions of the master PDF to find trends and view the internal structure of the PDF. I have the form field names of all the actual checkboxes (both the "Yes" and "No" fields) on the pages I need. What I think needs to be done is:

Parse PDF

Read Field Names

If Field Name Value is "Yes" ("Off" is the value for being checked "No")

Then Extract the whole page
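The four steps above can be sketched in Python as a simple filter. This is only an illustration under stated assumptions: `pages` is a hypothetical, already-parsed structure (one dict of field name to value per page), the field name `Q1_Yes` is made up, and the leading slash is stripped because PDF checkbox values are often stored as names such as `/Yes` and `/Off`.

```python
def normalize(value):
    """Strip the leading slash that PDF name objects carry, e.g. '/Yes' -> 'Yes'."""
    return None if value is None else str(value).lstrip("/")


def pages_with_yes(pages, watched_fields):
    """Return the 0-based indices of pages where any watched field is 'Yes'.

    `pages` is a hypothetical pre-parsed structure: one dict per page,
    mapping field name -> field value.
    """
    hits = []
    for index, fields in enumerate(pages):
        if any(normalize(fields.get(name)) == "Yes" for name in watched_fields):
            hits.append(index)
    return hits


# Made-up sample data: three pages, one watched checkbox field each.
pages = [
    {"Q1_Yes": "Off"},
    {"Q1_Yes": "Yes"},
    {"Q1_Yes": "/Yes"},
]
print(pages_with_yes(pages, ["Q1_Yes"]))  # [1, 2]
```

The hard part, which this sketch skips, is building the per-page field mapping in the first place.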

I have way more detail and specifics that I can get into if necessary. But if anyone can help show me how to get the whole page based on Field Names with Values of "Yes" it would be much appreciated!

P.S. I'm not married to any specific programming language. JavaScript seemed like a good starting point for me when I began this project due to its compatibility with Acrobat. I'm open to exploring other languages if necessary.

Thanks again in advance!

Report · Jun 29, 2016

The first thing I would be curious about is what you intend to do with the results. Whatever language you pick will need to be able to do that part too. JavaScript has no filesystem access without something like a Node server, so to give you the best advice we really need to know whether you're writing the results somewhere and in what format, or whether you haven't figured that out yet. So what are you doing with the results of your parse, and will those results need to be publicly visible or is it just information for you privately?

Report · Jun 29, 2016

Thanks for the response. So the master PDF file is sometimes upwards of 1000 pages in length. Currently I have to print all of them, and then manually sort through the stack to get the pages I need. Those pages are then handed off to another member of my team for review. All that can be eliminated if I can extract the pages with Field Names with the "Yes" value from the master PDF, and then simply email them over to the review team.

So the object isn't to write new data, but more so to append the extracted pages as a new PDF.

I've bounced back and forth between JavaScript and Python in trying to work this out. The trouble I'm having (with both languages) is finding a way to associate the form field name and value (e.g. '/T': 'Field_Name', '/V': 'Yes') with the page it's on. What I need is to be able to say (for lack of better words) "If Page(X) has ('/T': 'Field_Name', '/V': 'Yes') .... Then extract that Page and Append etc....."
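One possible way to make that association, sketched in Python: in the PDF format, each page lists its annotations in an /Annots array, and form field widgets are annotations with /Subtype /Widget, so a field can be tied to the page whose /Annots contains its widget. The sketch below assumes the third-party pypdf library for the file handling (imported only inside `extract_yes_pages`); the helper names and the field name `Q1_Yes` are hypothetical, and /T and /V are looked up on the widget first and then on its /Parent, since either layout can occur.

```python
def resolve(obj):
    """Follow an indirect reference when the object offers get_object()."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def field_name_and_value(widget):
    """Read /T and /V from a widget, walking up /Parent if they are missing."""
    name = value = None
    node = resolve(widget)
    while node is not None and (name is None or value is None):
        if name is None:
            name = resolve(node.get("/T"))
        if value is None:
            value = resolve(node.get("/V"))
        node = resolve(node.get("/Parent"))
    return name, value


def page_has_yes(page, watched_fields):
    """True if the page carries a watched field whose value is Yes."""
    annots = resolve(page.get("/Annots")) or []
    for annot in annots:
        widget = resolve(annot)
        if widget.get("/Subtype") != "/Widget":
            continue  # not a form field widget
        name, value = field_name_and_value(widget)
        if name is None or str(name) not in watched_fields:
            continue
        if str(value).lstrip("/") == "Yes":
            return True
    return False


def extract_yes_pages(src_path, dst_path, watched_fields):
    """Copy every matching page of src_path into a new PDF at dst_path."""
    from pypdf import PdfReader, PdfWriter  # third-party, assumed installed

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        if page_has_yes(page, watched_fields):
            writer.add_page(page)
    with open(dst_path, "wb") as handle:
        writer.write(handle)


# Plain dicts standing in for parsed page objects (made-up data).
checked = {"/Annots": [{"/Subtype": "/Widget", "/T": "Q1_Yes", "/V": "/Yes"}]}
unchecked = {"/Annots": [{"/Subtype": "/Widget",
                          "/Parent": {"/T": "Q1_Yes", "/V": "/Off"}}]}
print(page_has_yes(checked, {"Q1_Yes"}))    # True
print(page_has_yes(unchecked, {"Q1_Yes"}))  # False
```

A call such as `extract_yes_pages("master.pdf", "review.pdf", {"Q1_Yes"})` would then write only the matching pages; both file names are placeholders. Whether the copied checkboxes still display correctly in the new file may depend on document-level form data that this sketch does not copy, so it is worth checking the output in a viewer.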

Report · Jun 29, 2016

Eventually you want to create something new and save it (unless you have a habit of sending hundreds/thousands of pages to a printer). That's what JavaScript won't help you with in any sensible way.

You've mentioned you're willing to consider learning a new language to get the job done. Languages and platforms like Java, C#, C++, the .NET Framework, Objective-C, etc. are all quite large. If you're willing to learn a language you should choose one that has other practical uses for you. I personally would recommend either Java, for its platform (OS) agnostic approach and the giant amount of available resources, or C# on .NET for the same reasons. Although I have a vested interest in Java for editing games with my kids.

There have been recent posts in these pages that deal with reading and constructing PDFs. For example, here's one that covers the reading side in C# on .NET that you might want to take a look at yourself:

How do I embed PDF in RTF file using c#?

Pre-written code you can just grab and use for yourself is the quickest way from point A to point B, but it usually involves trying a few options to find the right mix.
