Detecting new Line with Acrobat SDK
I'm using the JavaScript plugin with the following code:
var j = 3;
for (var i = 0; i < 10; i++) {
    var line = this.getPageNthWord(0, j + i, false);
    var final2 = line.slice(-2);
    if (final2 == "") {
        console.println("I AM A NEW LINE");
    }
    //console.println(this.getPageNthWord(0, j+i,false));
    console.println(line);
    //console.println(line.slice(-2));
}
The output of this shows, for example:
word1
word2
word3
word4
word5
word6
word7
word8
word9
word10
This is as expected; however, I want to find out which word is the last one on each line. The spaces in the console output are showing correctly, but I've tried if line == "" and "\n" etc., and nothing tells me that it's the space. Any suggestions?
Sometimes the app that created the PDF will leave a \r or \n at the end of a line, but this is not guaranteed. It's also not guaranteed that the words will appear in the order in which you see them on the page. The only way to know for sure is to get the bounding boxes of all the words and sort them into lines. Of course you have to be aware that not all lines of text are across the entire page. Text can appear in blocks, as well as columns.
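As a rough illustration of the first point, the check below looks for a trailing \r or \n in the raw word text. This is only a sketch: the helper names endsWithLineBreak and showInvisibles are made up for this example, and since a trailing line break is not guaranteed, a negative result proves nothing.

```javascript
// Hypothetical helpers (not part of the Acrobat API).

// True when the raw word text ends with a carriage return or line feed,
// optionally followed by more whitespace.
function endsWithLineBreak(rawWord) {
    return /[\r\n]\s*$/.test(rawWord);
}

// Makes invisible characters visible so they can be inspected in the console:
// \r and \n are shown as escape sequences, spaces as a middle dot.
function showInvisibles(rawWord) {
    return rawWord
        .replace(/\r/g, "\\r")
        .replace(/\n/g, "\\n")
        .replace(/ /g, "\u00B7");
}
```

In Acrobat, the raw text would come from this.getPageNthWord(page, n, false); passing false for the third argument keeps the trailing whitespace and punctuation that the default setting strips.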
Use the Acrobat JavaScript Reference early and often
Can you provide some of the methods that are used for sorting them into lines?
There are two skills necessary to solve this issue.
1) An understanding of 2D geometry.
2) JavaScript programming skills.
The idea is quite simple. Create an array where each entry is another array representing a line. The line array contains objects, where each object contains the word and the word rectangle. Then write a loop just like the one you have above, only save the word and its rectangle to a line array based on the rectangle. The meat of this method is a function that determines whether or not a word rectangle is on the same line as another rectangle, i.e., do the vertical limits of the rectangle overlap the vertical limits of another rectangle. A 50% overlap is enough to say they are on the same line. If a word doesn't match any existing line, then it is the first entry on a new line.
The last word on a line is the one with the right-most coordinate.
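The approach described above could be sketched roughly as follows. Everything here is illustrative rather than a definitive implementation: the function names, the {text, rect} word shape, and the [left, top, right, bottom] rectangle layout are assumptions made for this example, and the overlap is measured against the shorter of the two rectangles, which is only one possible reading of "50% overlap". For simplicity, each word is compared with the first word of every existing line.

```javascript
// Assumed word shape: {text: String, rect: [left, top, right, bottom]},
// in PDF user space, so top > bottom.

// Converts an 8-number quad into a rectangle, whatever the corner order is.
function quadToRect(q) {
    var xs = [q[0], q[2], q[4], q[6]];
    var ys = [q[1], q[3], q[5], q[7]];
    return [Math.min.apply(null, xs), Math.max.apply(null, ys),
            Math.max.apply(null, xs), Math.min.apply(null, ys)];
}

// Fraction of the shorter rectangle's height that the two rectangles share.
function verticalOverlap(a, b) {
    var top = Math.min(a[1], b[1]);
    var bottom = Math.max(a[3], b[3]);
    var shorter = Math.min(a[1] - a[3], b[1] - b[3]);
    if (shorter <= 0) { return 0; }
    return Math.max(0, top - bottom) / shorter;
}

// Groups words into lines: a word joins the first line whose first word it
// overlaps by at least minOverlap (default 0.5); otherwise it starts a new line.
function groupWordsIntoLines(words, minOverlap) {
    if (minOverlap === undefined) { minOverlap = 0.5; }
    var lines = [];
    for (var i = 0; i < words.length; i++) {
        var placed = false;
        for (var k = 0; k < lines.length && !placed; k++) {
            if (verticalOverlap(words[i].rect, lines[k][0].rect) >= minOverlap) {
                lines[k].push(words[i]);
                placed = true;
            }
        }
        if (!placed) { lines.push([words[i]]); }
    }
    return lines;
}

// The last word on a line is the one with the right-most coordinate.
function lastWordOfLine(line) {
    var last = line[0];
    for (var i = 1; i < line.length; i++) {
        if (line[i].rect[2] > last.rect[2]) { last = line[i]; }
    }
    return last;
}
```

In Acrobat, the words array could be filled in a loop over this.getPageNumWords(p), pairing this.getPageNthWord(p, i, false) with quadToRect(this.getPageNthWordQuads(p, i)[0]), assuming the first quad covers the whole word. Rotated pages, columns, and text blocks would need extra handling that this sketch leaves out.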
Hi,
Sorry if I'm late! I only caught your post today.
Try this script and let me know if it suits you.
var thePage=0;
var theLine="";
var nbLines=0;
for (var p=0; p<this.numPages; p++) {
    var aRect=this.getPageBox("Crop",p);
    var bottomWord=aRect[1];
    for (var i=0; i<this.getPageNumWords(p); i++) {
        var theWord=this.getPageNthWord(p,i,false);
        var q=this.getPageNthWordQuads(p,i);
        m=(new Matrix2D).fromRotated(this,p);
        mInv=m.invert();
        r=mInv.transform(q);
        r=r.toString();
        r=r.split(",");
        if (thePage==p && bottomWord!=Number(r[5])) {
            var theLine=theLine.replace(/\s+$/,"");
            if (theLine.length) {
                nbLines++;
                console.println("\r***** I AM A NEW LINE *****");
                console.println(theLine);
            }
            var theLine=theWord;
        } else {
            theLine+=theWord;
        }
        bottomWord=Number(r[5]);
        var thePage=p;
    }
    var theLine=theLine.replace(/\s+$/,"");
    if (theLine.length) {
        nbLines++;
        console.println("\r***** I AM A NEW LINE *****");
        console.println(theLine);
    }
}
console.println("\r***** THERE ARE "+nbLines+" LINES IN THIS DOCUMENT. *****");
@+
@bebarth, what exactly is the idea with your algorithm?
I also think you made some errors. "theLine" is redeclared twice.
The page coordinate conversion matrix should be created above the inner loop.
It looks like your assumption is that the page words will be arranged in the order that they appear on the page. This may not be correct, but will probably work in most situations.
What exactly is the idea with your algorithm?
The script checks the vertical position of the bottom of each word. If the next word has the same position, it is on the same line; if the position is different, it starts a new line.
"theLine" is redeclared twice.
I don't think so! It's only declared on line #2.
The page coordinate conversion matrix should be created above the inner loop.
That's correct if all pages have the same size, but I'm in the habit of checking the size of each page because I've been fooled several times before.
It looks like your assumption is that the page words will be arranged in the order that they appear on the page. This may not be correct, but will probably work in most situations.
I know that the order of words on a page is not always top to bottom and left to right, but as you say, it works in most cases. In the very few cases where it didn't work for me, I used the "Auto tag" action, which solved my problem. I am not a PDF specialist, but with my few years of experience I have understood that the word order is not what it should be when the text has been modified or some text has been added. For me, the "Auto tag" action corrected the problem each time, but you can certainly find a case where it doesn't work...
@+
What exactly is the idea with your algorithm?
The script checks the vertical position of the bottom of each word. If the next word has the same position, it is on the same line; if the position is different, it starts a new line.
This methodology only works in the ideal scenario. I've written many different scripts and plug-ins for analyzing PDFs and have examined thousands of documents generated by a wide variety of tools. There are multiple ways this strategy could fail. The simplest are the use of different fonts in different words, and superscripts and subscripts. There are also many other cases where the bottom coordinates of words on the same line don't match up, which is why I suggested testing for overlap. This is a direct result of my experience.
"theLine" is redeclared twice.
I don't think so! It's only declared on line #2.
var theLine=theWord
is a re-declaration. While JS is loose enough to correct for this error, it presents both a memory management and a scoping issue that could cause problems. Best to write scripts using proper form for robustness.
The page coordinate conversion matrix should be created above the inner loop.
That's correct if all pages have the same size, but I'm in the habit of checking the size of each page because I've been fooled several times before.
You have this code inside the loop that iterates over the page words. The page isn't changing, so there is no reason to recreate the page coordinate conversion matrix.
On a related note, the thePage variable is unnecessary; it serves no purpose.
BTW: Kudos on the auto-tagging trick for reordering content. That's a good one!
I've definitely been very lucky that it's worked out for me every time.
var theLine=theWord
is a re-declaration. While JS is loose enough to correct for this error, it presents both a memory management and a scoping issue that could cause problems. Best to write scripts using proper form for robustness.
I don't understand. I first declare the variable with an empty value:
var theLine="";
because I need it for line #16, otherwise it doesn't work:
var theLine=theLine.replace(/\s+$/,"");
Then, in line #22, if the condition is true, I assign the value of theWord to this variable.
For me, the declaration happens only the first time (and could simply be var theLine)... It is possible to change the value of a variable, isn't it?
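For reference, here is a small sketch of how standard JavaScript treats a repeated var for the same name. The function names are invented for this illustration. Within one function scope, a second var does not create a second variable; only the assignment takes effect, so both versions below behave the same, and the plain-assignment form simply makes the intent clearer.

```javascript
// Hypothetical demo functions, not taken from the script in this thread.

// Repeats "var" on every assignment, as in the script being discussed.
function joinWithRepeatedVar(words) {
    var theLine = "";
    for (var i = 0; i < words.length; i++) {
        var theLine = theLine + words[i]; // same variable; only the assignment matters
    }
    return theLine;
}

// Declares once, then assigns without "var".
function joinWithSingleVar(words) {
    var theLine = "";
    for (var i = 0; i < words.length; i++) {
        theLine = theLine + words[i];
    }
    return theLine;
}
```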
You have this code inside the loop that iterates over the page words. The page isn't changing, so there is no reason to recreate the page coordinate conversion matrix.
On a related note, the thePage variable is unnecessary; it serves no purpose.
Correct. This script comes from another one where I had to check whether the next line was on the same page or the next page.
@+
Try this:
console.println(line.toSource());

