Extract complete hyphenated word from .pdf using acrobat.tlb in .NET

Question

I posted this question on Stackoverflow but did not receive any usable answers.

I have just joined the Adobe forums in the hope someone knowledgeable here will be able to answer this specific Adobe Acrobat SDK automation question.

*******************************************************************

I am parsing a .pdf using the acrobat.tlb library.

Hyphenated words are being split across new lines with the hyphens removed.

e.g. ABC-123-XXX-987

Parses as:
ABC
123
XXX
987

If I parse the text using iTextSharp, it returns the whole string as displayed in the file, which is the behaviour I want.

However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp does not place the highlight in the correct location, hence acrobat.tlb.

I am using this code, from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf

' filey = "*your full file name including directory here*"
AcroExchApp = CreateObject("AcroExch.App")
AcroExchAVDoc = CreateObject("AcroExch.AVDoc")

' Open the pdf file named in filey
AcroExchAVDoc.Open(filey, "")

' Get the PDDoc associated with the open AVDoc
AcroExchPDDoc = AcroExchAVDoc.GetPDDoc

sustext = "accessorizes"
suktext = "accessorises"

' get JavaScript Object
' note jso is related to PDDoc of a PDF,
jso = AcroExchPDDoc.GetJSObject

' counters
nCount = 0
nCount1 = 0
gbStop = False
bUSCnt = False
bUKCnt = False

' search for the text
If Not jso Is Nothing Then
    ' total number of pages
    nPages = jso.numpages

    ' Go through pages
    For i = 0 To nPages - 1
        ' check each word in a page
        nWords = jso.getPageNumWords(i)

        For j = 0 To nWords - 1
            ' get a word
            word = Trim(CStr(jso.getPageNthWord(i, j)))

            'If VarType(word) = VariantType.String Then
            If word <> "" Then
                ' compare the word with the US spelling
                If Trim(sustext) <> "" Then
                    result = StrComp(word, sustext, vbTextCompare)
                    ' if same
                    If result = 0 Then
                        nCount = nCount + 1
                        If bUSCnt = False Then
                            iUSCnt = iUSCnt + 1
                            bUSCnt = True
                        End If
                    End If
                End If

                ' compare the word with the UK spelling
                If suktext <> "" Then
                    result1 = StrComp(word, suktext, vbTextCompare)
                    ' if same
                    If result1 = 0 Then
                        nCount1 = nCount1 + 1
                        If bUKCnt = False Then
                            iUKCnt = iUKCnt + 1
                            bUKCnt = True
                        End If
                    End If
                End If
            End If
        Next j
    Next i

    jso = Nothing
End If

The code does the job of highlighting the text, but the For loop that fills the 'word' variable splits the hyphenated string into its component parts, which prevents me from highlighting the complete string.

For i = 0 To nPages - 1
    ' check each word in a page
    nWords = jso.getPageNumWords(i)
    For j = 0 To nWords - 1
        ' get a word
        word = Trim(CStr(jso.getPageNthWord(i, j)))
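One avenue that may be worth testing: in the Acrobat JavaScript API, getPageNthWord accepts an optional third argument, bStrip, which defaults to true and controls whether punctuation and white space are stripped from the returned word. The sketch below assumes that calling jso.getPageNthWord(i, j, False) returns each fragment with its trailing hyphen intact (for example "ABC-", "123-", "XXX-", "987 "). That assumption should be checked against the actual PDF. The names HyphenJoinSketch, JoinedWord and JoinHyphenated are hypothetical helpers, not part of acrobat.tlb. The function only stitches the fragments back together and records which word indices they came from.

```vbnet
Imports System
Imports System.Collections.Generic

Module HyphenJoinSketch

    ' One logical token rebuilt from one or more Acrobat "words" (hypothetical helper type).
    Public Class JoinedWord
        Public Property Text As String          ' the rebuilt string
        Public Property FirstIndex As Integer   ' page index of the first component word
        Public Property Count As Integer        ' number of component words
    End Class

    ' rawWords: the words of one page fetched WITHOUT stripping, i.e. what
    ' jso.getPageNthWord(page, j, False) is ASSUMED to return.
    Public Function JoinHyphenated(rawWords As IList(Of String)) As List(Of JoinedWord)
        Dim result As New List(Of JoinedWord)
        Dim current As JoinedWord = Nothing

        For j As Integer = 0 To rawWords.Count - 1
            Dim raw As String = If(rawWords(j), "")

            If current Is Nothing Then
                current = New JoinedWord With {.Text = "", .FirstIndex = j, .Count = 0}
            End If

            current.Text &= raw
            current.Count += 1

            ' A fragment ending in "-" with no trailing space continues into the next word.
            If Not raw.EndsWith("-", StringComparison.Ordinal) Then
                current.Text = current.Text.Trim()
                result.Add(current)
                current = Nothing
            End If
        Next

        ' Flush a dangling fragment such as "ABC-" at the end of the page.
        If current IsNot Nothing Then
            current.Text = current.Text.Trim()
            result.Add(current)
        End If

        Return result
    End Function

    Sub Main()
        ' Sample data standing in for the unstripped words of one page.
        Dim raw As New List(Of String) From {"Serial ", "ABC-", "123-", "XXX-", "987 ", "shipped."}
        For Each w As JoinedWord In JoinHyphenated(raw)
            Console.WriteLine("{0} (words {1}..{2})", w.Text, w.FirstIndex, w.FirstIndex + w.Count - 1)
        Next
    End Sub

End Module
```

With the sample list, the four fragments are rebuilt into ABC-123-XXX-987 spanning words 1 to 4. Inside the existing loop, the page's raw words would be collected first, then each JoinedWord.Text compared against the serial number. Because FirstIndex and Count identify the component words, each one could then be highlighted separately, for instance by requesting its quads with getPageNthWordQuads(i, index). Other trailing punctuation (a comma or full stop after the serial number) is kept in Text, so it may need trimming before the comparison.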

Does anyone know how to maintain the whole string using acrobat.tlb? My quite extensive searches have drawn a blank.

Many thanks...

Test Screen Name · Answer

First but important question: is this a server app? A background app? An app for a single user who has Acrobat and will run your app manually? Something else?
