October 2, 2018
Question

Extract complete hyphenated word from .pdf using acrobat.tlb in .NET


I posted this question on Stackoverflow but did not receive any usable answers.

I have just joined the Adobe forums in the hope someone knowledgeable here will be able to answer this specific Adobe Acrobat SDK automation question.

*******************************************************************

I am parsing a .pdf using the acrobat.tlb library.

Hyphenated words are being split across new lines with the hyphens removed.

e.g. ABC-123-XXX-987

Parses as:
ABC
123
XXX
987

If I parse the text using iTextSharp, it returns the whole string as displayed in the file, which is the behaviour I want.

However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp is not placing the highlight in the correct location, hence acrobat.tlb.

I am using this code, from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf

    ' filey = "*your full file name including directory here*"
    AcroExchApp = CreateObject("AcroExch.App")
    AcroExchAVDoc = CreateObject("AcroExch.AVDoc")

    ' Open the [strfiley] pdf file
    AcroExchAVDoc.Open(filey, "")

    ' Get the PDDoc associated with the open AVDoc
    AcroExchPDDoc = AcroExchAVDoc.GetPDDoc

    sustext = "accessorizes"
    suktext = "accessorises"

    ' get JavaScript Object
    ' note jso is related to PDDoc of a PDF,
    jso = AcroExchPDDoc.GetJSObject

    ' count
    nCount = 0
    nCount1 = 0
    gbStop = False
    bUSCnt = False
    bUKCnt = False

    ' search for the text
    If Not jso Is Nothing Then
        ' total number of pages (numPages is the documented casing of the property)
        nPages = jso.numPages

        ' Go through pages
        For i = 0 To nPages - 1
            ' check each word in a page
            nWords = jso.getPageNumWords(i)
            For j = 0 To nWords - 1
                ' get a word
                word = Trim(CStr(jso.getPageNthWord(i, j)))
                'If VarType(word) = VariantType.String Then
                If word <> "" Then
                    ' compare the word with what the user wants
                    If Trim(sustext) <> "" Then
                        result = StrComp(word, sustext, vbTextCompare)
                        ' if same
                        If result = 0 Then
                            nCount = nCount + 1
                            If bUSCnt = False Then
                                iUSCnt = iUSCnt + 1
                                bUSCnt = True
                            End If
                        End If
                    End If
                    If suktext <> "" Then
                        result1 = StrComp(word, suktext, vbTextCompare)
                        ' if same
                        If result1 = 0 Then
                            nCount1 = nCount1 + 1
                            If bUKCnt = False Then
                                iUKCnt = iUKCnt + 1
                                bUKCnt = True
                            End If
                        End If
                    End If
                End If
            Next j
        Next i

        jso = Nothing
    End If

The code does the job of highlighting the text, but the For loop that fills the 'word' variable splits the hyphenated string into its component parts, which prevents me from highlighting the complete string.

    For i = 0 To nPages - 1
        ' check each word in a page
        nWords = jso.getPageNumWords(i)
        For j = 0 To nWords - 1
            ' get a word
            word = Trim(CStr(jso.getPageNthWord(i, j)))

Does anyone know how to maintain the whole string using acrobat.tlb? My quite extensive searches have drawn a blank.

Many thanks...
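One possible direction, offered as a sketch rather than a confirmed fix: the Acrobat JavaScript API documents an optional third argument, bStrip, on getPageNthWord, which defaults to true and removes punctuation and white space from the returned word. Passing False should leave the trailing hyphen on each fragment, so consecutive fragments can be stitched back together. The names JoinHyphenated and ReadPageWords below are hypothetical helpers, and the sample data in Main is an assumption about what the unstripped fragments look like; it has not been checked against Acrobat X Pro.

```vbnet
Option Strict Off   ' late-bound COM calls, as in the code above

Imports System
Imports System.Collections.Generic

Module HyphenJoinSketch

    ' Hypothetical helper: merges fragments such as "ABC-", "123-", "XXX-", "987"
    ' into "ABC-123-XXX-987". A fragment is joined to the next one when, after
    ' trimming white space, it ends with a hyphen.
    Function JoinHyphenated(rawWords As IEnumerable(Of String)) As List(Of String)
        Dim joined As New List(Of String)
        Dim pending As String = ""
        For Each raw As String In rawWords
            Dim piece As String = raw.Trim()
            If piece = "" Then Continue For
            pending &= piece
            If Not piece.EndsWith("-", StringComparison.Ordinal) Then
                joined.Add(pending)
                pending = ""
            End If
        Next
        ' A dangling fragment (the input ended on a hyphen) is kept as-is.
        If pending <> "" Then joined.Add(pending)
        Return joined
    End Function

    ' Hypothetical helper: reads one page's words with bStrip = False so that
    ' punctuation is preserved. jso is the object returned by
    ' AcroExchPDDoc.GetJSObject. Not called from Main, since it needs Acrobat.
    Function ReadPageWords(jso As Object, pageIndex As Integer) As List(Of String)
        Dim words As New List(Of String)
        Dim nWords As Integer = jso.getPageNumWords(pageIndex)
        For j As Integer = 0 To nWords - 1
            words.Add(CStr(jso.getPageNthWord(pageIndex, j, False)))
        Next
        Return words
    End Function

    Sub Main()
        ' Stand-in data: an assumed shape for the unstripped fragments.
        Dim sample As String() = {"Serial ", "ABC-", "123-", "XXX-", "987 ", "shipped."}
        For Each w As String In JoinHyphenated(sample)
            Console.WriteLine(w)
        Next
    End Sub

End Module
```

With the stand-in data, Main prints Serial, ABC-123-XXX-987 and shipped. on separate lines. Because bStrip = False also keeps other punctuation (the full stop on "shipped." for example), a comparison against a known serial number would probably need trailing punctuation trimmed first. To highlight the rejoined string, the word index j of each fragment would also have to be recorded, so that the quads for every fragment can be fetched with getPageNthWordQuads.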


1 reply

Legend
October 2, 2018

First but important question: is this a server app? A background app? An app for a single user who has Acrobat and will run your app manually? Something else?

Inspiring
October 2, 2018

Hi, it's just on my local machine.  I have Adobe Acrobat X Pro installed.