Known Participant
July 11, 2017
Question

Retrieve words that are space separated as one word (pdwordfinderrec parameters)


Hi there

I am seeing strange behaviour with word finding.

I want the word finder to ignore spaces between letters, along with punctuation marks.

On some texts it does, and I get a whole 'sentence' like "list of authorized codes".

On some constructs it doesn't work and returns individual words.

Example: "A/CODE 134 FAILS" is returned as separate words. I can try to reconstruct the phrase myself, but it's a lot of work.

I have played with the character types table, with some success, but the result is incomplete.

Any clue? If I use PDFEdit (and if so, how), will I get a better result?

Thanks

Christian


1 reply

Legend
July 11, 2017

PDFEdit will take you closer to the internal representation of the text. It probably won't make your task easier. I'm surprised you can persuade Acrobat to ignore spaces; I doubt it was expected. But the thing to understand is that text in a PDF is not a simple flow, from which Acrobat is pulling words in an inconvenient way. Rather, text is a collection of distinct graphical objects. Some will contain text with spaces. Some will be parts of words. Sometimes they are out of order. The miracle is that Acrobat can make it seem there is a simple flow of words from the disparate stuff it finds...
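
Since the words come back as separate objects, one way to rebuild a phrase such as "A/CODE 134 FAILS" is to regroup them by position yourself. The following is only a rough sketch of that idea in Python, not Acrobat SDK code: the `Word` record, the `join_words` function, and the `max_gap` and `y_tol` thresholds are all hypothetical, and it assumes you have already copied each word's text and bounding box out of the word finder. It groups words that share roughly the same baseline, orders each group left to right, and joins neighbours whose horizontal gap is small.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical record: one word plus its position on the page."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline height


def join_words(words, max_gap=6.0, y_tol=2.0):
    """Merge words on the same baseline that sit close together.

    max_gap and y_tol are assumed thresholds in page units; they
    would need tuning for real documents and font sizes.
    """
    # Group words into lines by baseline, top of page first.
    # Content order in the PDF is not trusted, so sort by position.
    lines = []
    for w in sorted(words, key=lambda w: -w.y):
        if lines and abs(lines[-1][-1].y - w.y) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])

    phrases = []
    for line in lines:
        # Within a line, read left to right.
        line.sort(key=lambda w: w.x0)
        current = [line[0]]
        for w in line[1:]:
            if w.x0 - current[-1].x1 <= max_gap:
                current.append(w)  # small gap: same phrase
            else:
                phrases.append(" ".join(p.text for p in current))
                current = [w]      # large gap: start a new phrase
        phrases.append(" ".join(p.text for p in current))
    return phrases


if __name__ == "__main__":
    # Made-up coordinates, deliberately given out of reading order.
    sample = [
        Word("FAILS", 152, 190, 700.0),
        Word("A/CODE", 72, 120, 700.4),
        Word("134", 124, 148, 700.0),
    ]
    print(join_words(sample))
```

The thresholds are the weak point: a gap that separates two table columns in one document may be an ordinary word space in another, so expect to tune them per layout.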