May 10, 2021
Question

It takes about 15 minutes to execute jso.getPageNumWords

Hi.

I have written a macro (code below) to get the list of words in a PDF file, together with the quads of each word.
But the code doesn't work well.

Sub GetPDFWdList()
  Dim acroApp As Object
  Dim acroPDDoc As Object
  
  Set acroApp = CreateObject("AcroExch.App")
  Set acroPDDoc = CreateObject("AcroExch.PDDoc")
  Call Prc_1(acroPDDoc)
  Call Prc_2(acroPDDoc)
  acroApp.Hide
  acroApp.Exit
  Set acroPDDoc = Nothing
  Set acroApp = Nothing
  MsgBox "Done"
End Sub

Private Sub Prc_1(acroPDDoc As Object)
  Call GetWdList_EachPDF(acroPDDoc, path, 1)
End Sub

Private Sub Prc_2(acroPDDoc As Object)
  Dim fileArr As Variant
  Dim i As Long
  
  fileArr = Array()
  Call GetFileList(folderPath, fileArr, "pdf", False)
  For i = LBound(fileArr) To UBound(fileArr)
    Call GetWdList_EachPDF(acroPDDoc, CStr(fileArr(i)), i + 1)
  Next
  
End Sub

Private Sub GetWdList_EachPDF(acroPDDoc As Object, PDFPath As String, docNum As Integer)

  Dim jso As Object
  Dim TotalPage As Long
  Dim TotalWds As Long
  Dim wdList As Variant
  Dim wdCnt As Long
  Dim quads As Variant
  Dim lRet As Long
  Dim i As Long, j As Long
  
  lRet = acroPDDoc.Open(PDFPath)
  Set jso = acroPDDoc.GetJSObject
  TotalPage = jso.numPages
  wdList = Array()
  quads = Array()
  wdCnt = 0
  For i = 0 To TotalPage - 1
    Application.StatusBar = "Getting PDF word list at page " & i + 1 & "/" & TotalPage & " on PDF file " & docNum
    DoEvents
    TotalWds = jso.getPageNumWords(i)
    For j = 0 To TotalWds - 1
      ReDim Preserve wdList(wdCnt)
      wdList(wdCnt) = jso.getPageNthWord(i, j, False)
      ReDim Preserve quads(wdCnt)
      quads(wdCnt) = jso.getPageNthWordQuads(i, j)
      wdCnt = wdCnt + 1
    Next
  Next
  acroPDDoc.Close
  Set jso = Nothing
End Sub


My PC runs Windows 10.
In Prc_1, the input is a single PDF file of about 300 pages, and each page has 100-200 words.
In Prc_2, there are about 300 PDF files in the folder, each PDF file has 1 or 2 pages, and each page has 100-200 words.

(1) In Acrobat DC, no error occurs in Prc_1 or Prc_2.
But in Prc_2, it always takes about 15 minutes to execute jso.getPageNumWords(i) when i = 0,
and there is no problem from i = 1 onward.
This did not occur in version 2021.001.20149; it started after updating to 2021.001.20150.

(2) In Acrobat XI (version 11.0.23), Prc_1 works, but Prc_2 fails partway through.
The error "Automation error: The remote procedure call failed" occurs.
It always occurs when the jso object is used to call getPageNumWords or getPageNthWordQuads.
It seems that Acrobat XI closes for some reason during execution, which causes this error.

Is there something wrong with my code?
How can I solve these problems?
Please give me some advice.

1 reply

BarlaeDC
Community Expert
May 10, 2021

Hi,

I was just looking through the docs and found this :

"Open

Opens a file. A new instance of AcroExch.PDDoc must be created for each open PDF file."

From the code above, it doesn't look like you are creating a new AcroExch.PDDoc for each document you open.
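
A minimal sketch of that suggestion, assuming the loop in Prc_2 stays as posted and only the per-file routine changes. The routine name GetWdList_OnePDF and the variable pdDoc are hypothetical; the Acrobat calls are the ones already used in the question. It is not confirmed to fix the delay, it only shows what "one PDDoc per file" could look like:

```vba
' Hypothetical variant of GetWdList_EachPDF: one AcroExch.PDDoc per file.
Private Sub GetWdList_OnePDF(PDFPath As String)
  Dim pdDoc As Object
  Dim jso As Object
  Dim totalWds As Long
  Dim i As Long

  ' Create a fresh instance for this file only.
  Set pdDoc = CreateObject("AcroExch.PDDoc")

  ' Open returns True on success; skip the file otherwise.
  If pdDoc.Open(PDFPath) Then
    Set jso = pdDoc.GetJSObject
    For i = 0 To jso.numPages - 1
      totalWds = jso.getPageNumWords(i)
      ' ... collect words and quads as in the original inner loop ...
    Next
    Set jso = Nothing   ' release the JavaScript bridge before closing
    pdDoc.Close
  End If

  Set pdDoc = Nothing   ' drop the instance before the next file
End Sub
```

Prc_2 would then call GetWdList_OnePDF(CStr(fileArr(i))) instead of passing a shared acroPDDoc.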

erieru103 (Author)
May 11, 2021

Hi.
I have rewritten my code as follows.

But the problem did not improve in either Acrobat DC or Acrobat XI.

If only Prc_1 or Prc_2 is executed, the problem will disappear.

BarlaeDC
Community Expert
May 11, 2021

Hi,

Based on "If only Prc_1 or Prc_2 is executed, the problem will disappear.", would it be prudent to move the creation of AcroExch.App inside each Prc, so that each run gets a new object that can be cleaned up before the next run starts?

I believe this might be an actual issue with Acrobat; I am just suggesting workarounds.
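
A rough sketch of that workaround, assuming Prc_1 and Prc_2 keep the signatures from the question. The wrapper name RunWithOwnApp and its whichPrc parameter are hypothetical. Whether Acrobat actually starts a fresh process for the second run is an assumption, not something verified here:

```vba
' Hypothetical wrapper: create and shut down the Acrobat objects around one run.
Private Sub RunWithOwnApp(whichPrc As Long)
  Dim acroApp As Object
  Dim acroPDDoc As Object

  Set acroApp = CreateObject("AcroExch.App")
  Set acroPDDoc = CreateObject("AcroExch.PDDoc")

  If whichPrc = 1 Then
    Call Prc_1(acroPDDoc)
  Else
    Call Prc_2(acroPDDoc)
  End If

  ' Release the document object first, then ask Acrobat to exit.
  Set acroPDDoc = Nothing
  acroApp.Hide
  acroApp.Exit
  Set acroApp = Nothing
End Sub

Sub GetPDFWdList()
  Call RunWithOwnApp(1)   ' Prc_1 with its own App object
  Call RunWithOwnApp(2)   ' Prc_2 with its own App object
  MsgBox "Done"
End Sub
```

If the second run still stalls, that would point further toward an Acrobat-side issue rather than object reuse in the macro.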