I know how to develop a Windows application in Visual Studio to control Adobe Acrobat and PDF documents using OLE Automation.
I am referring to page 21 in this guide:
https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/iac_developer_guide.pdf
I have done that many times in the past and the result was 100% successful.
I need to do the same but using Java and Eclipse.
My ultimate objective is to extract text from a flattened PDF that is an appraisal form. The form has fields and values in a flattened PDF, and it follows a strict, fixed layout.
So I want to write a Windows desktop application that opens a flattened PDF, finds the field caption, jumps to the field value, and extracts the text. I've done some research, and so far I have found that I have to use the Doc method "getPageNthWord()" through the OLE JSObject in Java.
I was able to use this code sample in the console window to extract the text of the current page:
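// Collect every word on the current page into one string.
var len = this.getPageNumWords(this.pageNum);
var txt = "";
for (var i = 0; i < len; i++) {
    var w = this.getPageNthWord(this.pageNum, i);
    txt += w + " ";
}
txt; // the console prints the value of the last expression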
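The caption lookup described above could be built on the same two Doc methods. What follows is only a minimal sketch, not a tested solution for real appraisal forms: the function name `findValueAfterCaption` and its parameters are hypothetical, and it assumes the value always follows its caption directly in word order and has a known number of words, which only a strict, fixed layout would guarantee.

```javascript
// Sketch: find a caption on a page and return the words that follow it.
// `doc` is an Acrobat Doc object (or any object with the same two methods).
// findValueAfterCaption, captionWords and valueWordCount are hypothetical names.
function findValueAfterCaption(doc, pageNum, captionWords, valueWordCount) {
    var total = doc.getPageNumWords(pageNum);

    // Read every word on the page once.
    var words = [];
    for (var i = 0; i < total; i++) {
        words.push(String(doc.getPageNthWord(pageNum, i)));
    }

    // Slide over the page looking for the caption sequence (case-insensitive).
    for (var start = 0; start + captionWords.length <= total; start++) {
        var matched = true;
        for (var j = 0; j < captionWords.length; j++) {
            if (words[start + j].toLowerCase() !== captionWords[j].toLowerCase()) {
                matched = false;
                break;
            }
        }
        if (matched) {
            // Assumption: the value is the next valueWordCount words.
            var from = start + captionWords.length;
            return words.slice(from, from + valueWordCount).join(" ");
        }
    }
    return null; // caption not found on this page
}
```

In the Acrobat console this might be called as `findValueAfterCaption(this, this.pageNum, ["Borrower", "Name"], 2)`, where the caption is made up for illustration. By default getPageNthWord() strips punctuation, so a caption printed as "Name:" would be matched as "Name".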
Questions:
- How can I add a reference to the Acrobat library in a Java project in Eclipse?
- Is there any method other than "getPageNthWord()" that I can use to extract the text from a PDF? I was expecting to find a method that extracts a paragraph or the complete text of a given page.
Any help would be greatly appreciated.
Tarek
1 Correct answer
Suggestion: forget Java. Use VB.
The Adobe PDF Library has a Java interface:
From Bernd’s reply it may not be clear, but the Adobe PDF Library is a separate product with a separate price tag. You can license it via DataLogics:
https://forums.adobe.com/people/Bernd+Alheit wrote
The Adobe PDF Library has a Java interface:
Thanks a lot. That is all the information I need, except for Java, which I probably don't need to consider anyway.
Tarek
There are several Java libraries for processing PDF files. If you only need the entire page contents that shouldn't be too difficult.
If you need to access specific words in specific locations it becomes much (much) more complicated, though.
I have developed tools that can do it using PDFBox (a free, open-source Java PDF library), so if you're interested in purchasing something like that, feel free to contact me privately (via try6767 at gmail.com).
If you just need the full page contents I'm happy to direct you to an example of how to do it using PDFBox.
try67 wrote
There are several Java libraries for processing PDF files. If you only need the entire page contents that shouldn't be too difficult.
If you need to access specific words in specific locations it becomes much (much) more complicated, though.
I have developed tools that can do it using PDFBox (a free, open-source Java PDF library), so if you're interested in purchasing something like that, feel free to contact me privately (via try6767 at gmail.com).
If you just need the full page contents I'm happy to direct you to an example of how to do it using PDFBox.
If you have a flattened PDF that represents an application form, will the method you mentioned (the advanced tools) help find the fields on the form and get each field's data?
Remember that a field can be a checkbox, radio button, drop-down list, or multi-selection list.
I am assuming that with the tool you mentioned, we need to configure the scraping process to indicate which parts of the form have fields and what each field's type is.
Can you provide some more details?
See the example of a form that we need to scrape.
No, it won't work with anything but text, if the fields have been flattened.
If you want you can send me a sample file, though, and I'll see what I can extract from it, but I'm not very hopeful, based on what you shared...
Thanks anyway. I will discuss and come back if needed.
All text extraction in Adobe interfaces starts with words. Paragraphs only exist in our perception, so you need to use guesswork and fuzzy logic.
If you want to use JSObject, I recommend you use VB. Converting this to another platform will take a lot of your time.
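As an illustration of the kind of guesswork involved, here is a rough sketch that groups a page's words into lines by comparing their vertical positions, using the Doc methods getPageNthWord() and getPageNthWordQuads(). The function name `groupWordsIntoLines` and the `yTolerance` parameter are hypothetical, and the sketch assumes words are returned in reading order and that two words belong to the same line when their quads sit at nearly the same height. Real forms with columns or tables would need more logic than this.

```javascript
// Sketch: group a page's words into lines by vertical position.
// groupWordsIntoLines and yTolerance are hypothetical names.
function groupWordsIntoLines(doc, pageNum, yTolerance) {
    var lines = []; // each entry: { y: number, words: [string, ...] }
    var total = doc.getPageNumWords(pageNum);

    for (var i = 0; i < total; i++) {
        var quads = doc.getPageNthWordQuads(pageNum, i);
        var q = quads[0]; // first quad: eight numbers, four (x, y) corners

        // Average the four y coordinates so corner order does not matter.
        var y = (q[1] + q[3] + q[5] + q[7]) / 4;
        var word = doc.getPageNthWord(pageNum, i);

        // Assumption: words arrive in reading order, so only the most
        // recent line needs to be checked.
        var last = lines.length > 0 ? lines[lines.length - 1] : null;
        if (last && Math.abs(last.y - y) <= yTolerance) {
            last.words.push(word);
        } else {
            lines.push({ y: y, words: [word] });
        }
    }

    return lines.map(function (line) { return line.words.join(" "); });
}
```

In the Acrobat console, `groupWordsIntoLines(this, this.pageNum, 2)` would return an array of line strings. The tolerance of 2 points is an arbitrary starting value to tune against the actual form.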
What is your solution or recommendation?
Please provide details.
Tarek
Suggestion: forget Java. Use VB.

