How do I count words in an English PDF document?
There's no built-in word count tool. In Adobe Acrobat you can run this JavaScript from the console:
// Page indices are zero-based; "this" is the currently open document.
var cnt=0;
for (var p = 0; p < this.numPages; p++) cnt += getPageNumWords(p);
console.println("There are " + cnt + " words in this file.");
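If it helps to see where the words are, here is a rough sketch of the same loop wrapped in a function that also keeps a per-page count. The helper name countWordsPerPage and the mock document are made up for illustration; the only Acrobat members it relies on are numPages and getPageNumWords. In Acrobat you would pass `this` instead of the mock and print with console.println rather than console.log.

```javascript
// Hypothetical helper: works on any object that exposes numPages and
// getPageNumWords(pageIndex), as Acrobat's Doc object does.
function countWordsPerPage(doc) {
  var perPage = [];
  var total = 0;
  for (var p = 0; p < doc.numPages; p++) {
    var n = doc.getPageNumWords(p); // zero-based page index
    perPage.push(n);
    total += n;
  }
  return { perPage: perPage, total: total };
}

// Stand-in document with made-up numbers, so the sketch runs outside Acrobat.
var mockDoc = {
  numPages: 3,
  getPageNumWords: function (p) { return [120, 0, 85][p]; }
};

var result = countWordsPerPage(mockDoc);
console.log("Total: " + result.total);               // Total: 205
console.log("Per page: " + result.perPage.join(", ")); // Per page: 120, 0, 85
```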
When I use the script:
var cnt=0;
for (var p = 0; p < this.numPages; p++) cnt += getPageNumWords(p);
app.alert("There are " + cnt + " words in this file.");
I keep getting a word count of 0 on all PDFs.
Any ideas?
The document is probably scanned and was not OCR-ed, so it doesn't contain
any actual words in it, just images with text on them...
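One way to check for this is to list the pages that report zero words. This is only a sketch: pagesWithoutWords is a made-up helper name and the mock document stands in for a real file. A zero count can also just mean a blank page, so treat the result as a hint rather than proof that a page is a scan. In Acrobat you would call it with `this`.

```javascript
// Hypothetical helper: returns the 1-based numbers of pages that contain no
// words according to getPageNumWords().
function pagesWithoutWords(doc) {
  var empty = [];
  for (var p = 0; p < doc.numPages; p++) {
    if (doc.getPageNumWords(p) === 0) {
      empty.push(p + 1); // report page numbers the way the viewer shows them
    }
  }
  return empty;
}

// Stand-in document with made-up counts: pages 1 and 3 have no text.
var scannedMock = {
  numPages: 4,
  getPageNumWords: function (p) { return [0, 310, 0, 42][p]; }
};

var emptyPages = pagesWithoutWords(scannedMock);
console.log("Pages with no words: " + emptyPages.join(", ")); // Pages with no words: 1, 3
```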
The PDF is full text. When I copy and paste it into Word, the count is 1,052 words. I'm just wondering whether I need to edit the script at all?
The script I am using:
var cnt=0;
for (var p = 0; p < this.numPages; p++) cnt += getPageNumWords(p);
app.alert("There are " + cnt + " words in this file.");
No, that script comes from the JS API Reference and should work. Can you
share the file?
On Tue, Nov 19, 2013 at 10:15 AM, tobywilmington
Hi Gilad,
It is happening with any PDF. I've just tried the iPad manual - http://manuals.info.apple.com/MANUALS/1000/MA1595/en_US/ipad_user_guide.pdf
I now have this set up so that when I open a PDF it runs the JavaScript automatically, but it just doesn't seem to pick up any words.
Cheers
Toby
What do you mean, exactly? From where are you running this code?
In Adobe Acrobat Pro. I'm using this JS as a saved script; I think it runs through the debugger?
So you're running it from the console directly?
Also, are you selecting all of the code when you run it?
I'm attempting to use the same JS code in Acrobat Pro version 10 and I get the "no words found" response. How did you finally resolve your issue, if you don't mind sharing?
Thanks
That usually means that your file contains only images, no real text...
I ran it on the iPad Manual file and the result was:
When I run it directly in the console, the console replies as below
Sorry to be a pain!
Yes, that's what I thought... You have to select all of the code (with the mouse) before running it.
So sorry! Rookie!
Massive thanks
Out of interest, does this script have the potential to count images? That would be a huge help to me.
It's a common mistake to make...
No, JS has no access to the images in the file, at all. Only to the textual content, and even that's limited.
Does anyone know why the Acrobat console reports such wildly different word counts from other tools (e.g. Word)?
What is this script counting in addition to word breaks? I get differences of a couple hundred to over a thousand extra "words."
For example, Acrobat splits hyphenated words into two, so "right-handed" will count as two words, while Word counts it as one. If you can share a sample file that demonstrates this issue we can look more closely into it, but if Word behaves the way you're looking for, just export the PDF to Word and do the count there. That's the easiest solution.
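To illustrate the hyphen effect in isolation, here is a small plain-JavaScript sketch. It is not how either Acrobat or Word actually tokenizes text; both functions are made-up approximations that differ only in whether a hyphen counts as a word break.

```javascript
// Approximation of a "Word-like" count: every run of non-whitespace
// characters is one word, so "right-handed" counts once.
function countWhitespaceWords(text) {
  var matches = text.match(/\S+/g);
  return matches ? matches.length : 0;
}

// Approximation of a count that also breaks on hyphens, so "right-handed"
// counts as two words.
function countHyphenSplitWords(text) {
  var matches = text.match(/[^\s-]+/g);
  return matches ? matches.length : 0;
}

var sample = "A right-handed, well-known author";
console.log(countWhitespaceWords(sample));  // 4
console.log(countHyphenSplitWords(sample)); // 6
```

In a long text with many hyphenated compounds, this one rule alone can add up to a noticeable gap between the two counts.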
Initially I thought it might be the metadata. The metadata of the above PDF, saved from Acrobat, comes to 37 "words" in Word. It's 157 when copied and pasted from Word into Acrobat.
Big difference, but nowhere near the total difference.
Metadata info is not included in the word count.
Not the easiest file to work with... More than 12K words just on page 1! Can you find a smaller file that demonstrates this issue? Also, notice that the text is cut off at the end of the page. The last line is duplicated on both pages, which might help explain the differences you're getting in the counts.
We work with full-length novels. The word count is not unusual. I grabbed that file because it's in the public domain. I cannot share the full text of the books we work with.
I obtained that file from the Project Gutenberg website. It was HTML text that I saved as a PDF. I took off the headers and footers as they weren't part of the actual text. For whatever text is visible in the PDF, I did Select All > Copy, then Paste into Word, so the text is the same (or should be; let me know if you want the Word file that was created).
To your point, you could take _any_ PDF upwards of 1,000 words, run the word count in Acrobat, then paste the text into Word and run a word count there.
In my two examples (1,000 words vs 46,000 words), the longer title had a larger number of "extra" words but a smaller percentage of the total, while the shorter one had fewer "extra" words (naturally) but a higher percentage difference.
Specifically: 1,000 words in Word vs 1,200 in Acrobat is 200 extra words, a 20% variation; 46,000 vs 47,000 is 1,000 extra words, only about a 2% variation.
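For anyone who wants to check the arithmetic, here is a quick sketch using the rounded figures above. countDifference is just an illustrative helper name; the percentage is taken relative to the Word count.

```javascript
// Extra words reported by Acrobat, and that excess as a percentage of the
// Word count.
function countDifference(wordCount, acrobatCount) {
  var extra = acrobatCount - wordCount;
  return { extra: extra, percent: (extra * 100) / wordCount };
}

var shortText = countDifference(1000, 1200);
var longText = countDifference(46000, 47000);

console.log(shortText.extra + " extra, " + shortText.percent.toFixed(1) + "%"); // 200 extra, 20.0%
console.log(longText.extra + " extra, " + longText.percent.toFixed(1) + "%");   // 1000 extra, 2.2%
```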
The only reason I included the stats about the metadata is because the results are so wildly different: 37 vs 157! That's not a lot of text for such a large variation.
Perhaps that could be a clue as to what it is about the text that's causing it to be read so differently.
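One way to look for that clue might be to dump the individual "words" Acrobat reports for a page and compare them with the text in Word. The sketch below assumes the Doc methods getPageNumWords(nPage) and getPageNthWord(nPage, nWord, bStrip) from the Acrobat JavaScript API; listPageWords and the mock document are made up so the example runs outside Acrobat. In Acrobat you would call listPageWords(this, 0) and print with console.println.

```javascript
// Hypothetical helper: collects every word Acrobat reports on one page.
// The third argument to getPageNthWord asks for punctuation and whitespace
// to be stripped from the returned word.
function listPageWords(doc, page) {
  var words = [];
  var n = doc.getPageNumWords(page);
  for (var i = 0; i < n; i++) {
    words.push(doc.getPageNthWord(page, i, true));
  }
  return words;
}

// Stand-in document: one page whose words are given explicitly (made up).
var wordMock = {
  numPages: 1,
  pageWords: [["right", "handed", "author"]],
  getPageNumWords: function (p) { return this.pageWords[p].length; },
  getPageNthWord: function (p, i, strip) { return this.pageWords[p][i]; }
};

var found = listPageWords(wordMock, 0);
console.log(found.length + " words: " + found.join(" | ")); // 3 words: right | handed | author
```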