
Identifying blank body on a page

Participant, Feb 24, 2016

Hello fellows,

I wonder if it's possible to check whether a page contains text in the body area. As far as I can see, JavaScript cannot distinguish between the header, footer, and body areas. Is this true?

Thank you for your response in advance!

TOPICS
Acrobat SDK and JavaScript , Windows

Correct answer: LEGEND, Feb 28, 2016 (the flag-variable pseudocode reply near the end of this thread)

Community Expert, Feb 24, 2016

It is possible, but it's not easy. You can use the getPageNthWordQuads method to get the exact location of each word in the page. Then you need to compare it to the area you're interested in and see if they overlap. If no words match this area then you can conclude that there's no text in it.
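
A minimal sketch of that comparison, assuming each quad returned by getPageNthWordQuads is an array of eight numbers (four x, y corner pairs) and that you already know the body area in page coordinates. The helper names quadBounds, rectsOverlap and bodyHasText, the bodyArea rectangle, and the mockDoc stand-in are all hypothetical; inside Acrobat you would pass the real document (this) instead of mockDoc.

```javascript
// Bounding box of one quad: [x1, y1, x2, y2, x3, y3, x4, y4].
function quadBounds(quad) {
  var xs = [quad[0], quad[2], quad[4], quad[6]];
  var ys = [quad[1], quad[3], quad[5], quad[7]];
  return {
    left: Math.min.apply(null, xs),
    right: Math.max.apply(null, xs),
    bottom: Math.min.apply(null, ys),
    top: Math.max.apply(null, ys)
  };
}

// True if the two rectangles share any area.
function rectsOverlap(a, b) {
  return a.left < b.right && a.right > b.left &&
         a.bottom < b.top && a.top > b.bottom;
}

// True as soon as one word on the page overlaps the body rectangle.
function bodyHasText(doc, pageNum, body) {
  var numWords = doc.getPageNumWords(pageNum);
  for (var j = 0; j < numWords; j++) {
    var quads = doc.getPageNthWordQuads(pageNum, j);
    for (var q = 0; q < quads.length; q++) {
      if (rectsOverlap(quadBounds(quads[q]), body)) {
        return true; // no need to look at the remaining words
      }
    }
  }
  return false;
}

// Hypothetical stand-in for an Acrobat document (coordinates in points).
var mockDoc = {
  pages: [
    // page 0: a single word in the header
    [[[72, 760, 140, 760, 72, 748, 140, 748]]],
    // page 1: a header word and a word in the body
    [[[72, 760, 140, 760, 72, 748, 140, 748]],
     [[72, 400, 120, 400, 72, 388, 120, 388]]]
  ],
  getPageNumWords: function (p) { return this.pages[p].length; },
  getPageNthWordQuads: function (p, n) { return this.pages[p][n]; }
};

// Assumed body area: everything between the header and footer bands.
var bodyArea = { left: 54, bottom: 90, right: 558, top: 720 };
var page0HasBody = bodyHasText(mockDoc, 0, bodyArea);
var page1HasBody = bodyHasText(mockDoc, 1, bodyArea);
```

The early return matters on large documents, since a page can be ruled out as blank-bodied only after every word has been checked, but ruled in after the first match.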

Participant, Feb 24, 2016

Hi try67,

Thank you for your prompt response! I checked the definition of this method and I don't see how it can be used in this case.

The method params are 0-based indices - how can they help me identify location on a page?

In addition, if you would like to test if a specific word (text string) is present on a page, how would you do that?

Thanks again!

Community Expert, Feb 24, 2016

As I said, it's a complex task. You'll basically need to iterate over all of the words in all of the pages to be able to determine it (although you can stop as soon as a match is made, since then you know that there's text in that area).

You can search for a specific word using a similar method and the getPageNthWord method.

Participant, Feb 24, 2016

Thanks for your response!

I tried using the following code to find pages that contain fewer than 42 words and do not contain "Part n", where n is a digit. However, Acrobat freezes and I can't figure out why. Any suggestions? Thanks!

var FilterWord, numWords;

for (var i = 0; i < this.numPages -2; i++)
{
  numWords = this.getPageNumWords(i);
  for (var j = 0; j < numWords; j++)
  {
    var FilterWord = this.getPageNthWord(i, j);
    if (numWords < 42 && FilterWord !== ("Part " + [0-9]))
    {
      DO SMTH HERE
    }
  }
}

Community Expert, Feb 24, 2016

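There are several issues with your code. The main one is this expression: ("Part " + [0-9])

It does not do what you intend: [0-9] is an array literal containing the number -9, so the whole expression is just the string "Part -9". You might want to look into regular expressions for this comparison.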
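
A quick check in plain JavaScript, offered only as an illustration, shows what the expression from the question produces and how a regular expression could express "a digit" instead. The variable names are made up for this example.

```javascript
// [0-9] is an array literal holding the single number -9 (0 minus 9),
// so the concatenation yields a fixed string, not a pattern.
var literalResult = "Part " + [0-9];

// A regular expression can express "one or more digits".
var digitsOnly = /^[0-9]+$/;
var threeIsNumber = digitsOnly.test("3");
var partIsNumber = digitsOnly.test("Part");
```

Here literalResult is the string "Part -9", threeIsNumber is true, and partIsNumber is false.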

Also, keep in mind that the getPageNthWord method only returns one word at a time, so it can return "Part" and then "0" or "1", etc., for the next word, but not together.
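
Because the words arrive one at a time, one possible approach is to test each word together with the word that follows it. The sketch below assumes exactly that; pageHasPartNumber and the sampleDoc stand-in are hypothetical names, and in Acrobat you would pass the real document (this) instead.

```javascript
// True if the page contains the word "Part" immediately followed by a number.
function pageHasPartNumber(doc, pageNum) {
  var numWords = doc.getPageNumWords(pageNum);
  // Stop one word early so that j + 1 is always a valid index.
  for (var j = 0; j < numWords - 1; j++) {
    if (doc.getPageNthWord(pageNum, j) === "Part" &&
        /^[0-9]+$/.test(doc.getPageNthWord(pageNum, j + 1))) {
      return true;
    }
  }
  return false;
}

// Hypothetical stand-in for an Acrobat document: one array of words per page.
var sampleDoc = {
  pages: [
    ["Part", "3", "Overview"],   // cover page: "Part" followed by a digit
    ["A", "part", "of", "it"],   // lower-case "part" only
    ["Part"]                     // "Part" with nothing after it
  ],
  getPageNumWords: function (p) { return this.pages[p].length; },
  getPageNthWord: function (p, n) { return this.pages[p][n]; }
};

var coverMatches = pageHasPartNumber(sampleDoc, 0);
var lowerCaseMatches = pageHasPartNumber(sampleDoc, 1);
var loneWordMatches = pageHasPartNumber(sampleDoc, 2);
```

The strict comparison with "Part" is case-sensitive, so a lower-case "part" in running text is not treated as a cover-page heading.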

Participant, Feb 24, 2016

Hi try67,

Thank you for your input! I'll check the regex. However, even if I remove the digits, Acrobat still dies on me.

BTW, how can I search for the word "part" with the first letter capitalized only?

Thank you!

Community Expert, Feb 24, 2016

Your comparison already does that, since it's case-sensitive.

Participant, Feb 24, 2016

OK, I see. Thank you!

Participant, Feb 25, 2016

When I remove the [0-9] wildcard, the script takes ages to run (on a 2500-page document). However, it seems to ignore the instruction to search for the word "Part" on a page (and skip the page if it is found), and still executes the action that comes after the if condition even if the number of words is < 42. Any ideas why this happens?

Thank you!

Community Expert, Feb 25, 2016

It executes the action when numWords < 42 and the word is not "Part".

Why do you look at every word on the pages with 42 or more words?

Participant, Feb 25, 2016

Hi Bernd,

Thank you for your response!

First of all, for some reason, running the script causes Acrobat to hang for a while (may take a couple of hours) and in the end, it ignores the instruction to find pages with less than 42 words and not containing the word Part.

I created this if statement because I was looking for a way to find pages that only contain headers and footers (in total containing less than 42 words) and nothing in the body area. In addition, I don't want the action to be applied to cover pages containing the word "Part".

Do you have any suggestion on how to improve the script? Thank you!

Community Expert, Feb 25, 2016

You can execute the inner for loop only when numWords < 42.

LEGEND, Feb 25, 2016

You may have changed the script, but:

1. Bernd's point is an excellent one: if you know the page has 42 or more words, clearly there is no point checking each word.

2. The time for thousands of pages is not unreasonable if you are running JavaScript for every word on every page. This can hardly be a quick test.

3. Your DO SMTH HERE code in

if (numWords < 42 && FilterWord !== ("Part " + [0-9]))
{
  DO SMTH HERE
}

will be executed perhaps thousands of times: on every page with fewer than 42 words, it runs once for every word EXCEPT the target word. I thought you wanted to detect this case.

Participant, Feb 28, 2016

Hi guys,

Thank you for your response!

What I am trying to do is filter out the pages that contain the word "Part". The script is not supposed to run the action on these pages.

As you said, my test for the word "Part" is not a good solution, because the action is applied whenever any other word is detected. I guess the solution is to create an array of all the words on the page and check whether it contains the word "Part". Am I right?

Thanks!

LEGEND, Feb 28, 2016

You can do that, but an intermediate step like copying to an array is just more overhead in an already slow task.

In pseudocode, for each page:

set a flag variable to 0

if there are fewer than 42 words,

  step through each word

    if the word starts with Part (or whatever), set the flag to 1

  when the loop is finished, if the flag is still 0, do your action.
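
One way the flag idea might look in JavaScript is sketched below. It assumes, following the goal stated earlier in the thread, that the action should run only on pages with fewer than 42 words where no word starting with "Part" was found. It also loops over every page, whereas the script in the question stops two pages short of the end. The function findNearlyEmptyPages and the testDoc stand-in are hypothetical names; in Acrobat you would pass the real document (this) and replace the push with your own action.

```javascript
// Returns the 0-based numbers of pages that have fewer than maxWords words
// and contain no word starting with "Part".
function findNearlyEmptyPages(doc, maxWords) {
  var hits = [];
  for (var i = 0; i < doc.numPages; i++) {
    var numWords = doc.getPageNumWords(i);
    if (numWords >= maxWords) {
      continue; // too much text: skip the per-word loop entirely
    }
    var foundPart = false; // the flag
    for (var j = 0; j < numWords; j++) {
      if (doc.getPageNthWord(i, j).indexOf("Part") === 0) {
        foundPart = true;
        break; // one match is enough
      }
    }
    if (!foundPart) {
      hits.push(i); // placeholder for the real action
    }
  }
  return hits;
}

// Hypothetical stand-in for an Acrobat document: one array of words per page.
var longPage = [];
for (var k = 0; k < 50; k++) {
  longPage.push("word" + k);
}
var testDoc = {
  pages: [
    ["Header", "Footer"],        // only header and footer text
    ["Part", "1", "Header"],     // cover page
    longPage                     // normal page with 50 words
  ],
  getPageNumWords: function (p) { return this.pages[p].length; },
  getPageNthWord: function (p, n) { return this.pages[p][n]; }
};
testDoc.numPages = testDoc.pages.length;

var nearlyEmpty = findNearlyEmptyPages(testDoc, 42);
```

With this sample data nearlyEmpty holds only page 0. Checking the word count before entering the inner loop means pages with plenty of text cost a single call each, which is where most of the running time was going in the original script.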

Participant, Feb 28, 2016

Hi Test Screen Name,

Thank you very much for your input! The script seems to be working properly now.

Participant, Feb 28, 2016

BTW, what is counted as a word by JS? Is a punctuation mark or any other char counted?

Thank you!

LEGEND, Feb 28, 2016

You could read the documentation for getPageNthWord, or add a statement that displays the method's return value in the JavaScript console.

By default it returns the word without the white space or punctuation characters between words; pass false as the third argument (bStrip) to keep them.
