Indexing certain PDFs fails

Jul 10, 2015

Hey Group.

I have about 36,000 files on a Windows 2008 server running CF11 Enterprise, Update 5.

I love SOLR indexing for its speed, but I am having issues with some of the docs, PDFs especially. These are documents of a legal nature, so I cannot share them, but the problem is pretty straightforward.

I get the error: "Could not index the file [path here] .pdf in SOLR. Check the exception for more details: An error occurred during the extracttext operation of the cfpdf tag."

When I run cfpdf extracttext on the file myself, I get "invalid document [path] specified for source or directory":

<cfpdf action="extracttext" source="http://localhost/[path]" name="mypdf">

When I run the same with useStructure="false":

<cfpdf action="extracttext" useStructure="false" source="http://localhost/[path]" name="mypdf">

and dump the variable, I get all of the text along with what looks like malformed XML (closing tags are missing).

I don't really care if that is how I get the data, as it is only used to let the user know which document contains the subject of their search.

Things I know:

it opens in Acrobat

was created with Acrobat PDFMaker 10.1 for Word

all dates are present

claims to be PDF Version 1.4 (Acrobat 4.x)

it is 3 MB

Is there a way to tell CF11 to retry on failure of that document, ignoring the structure?
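
To make the question concrete, here is roughly the fallback I have in mind, as a sketch only. The collection name "legaldocs", the variable names pdfPath and pdfText, and the idea of feeding the extracted text to cfindex as a custom entry are all placeholders of mine, not something I have confirmed is the recommended approach. I am also assuming the structured extraction throws an exception that cftry can catch.

```cfml
<!--- Sketch only: pdfPath and the "legaldocs" collection are placeholder names. --->
<cfset pdfPath = "[path]">

<cftry>
    <!--- First attempt: default extraction, which uses the document structure. --->
    <cfpdf action="extracttext" source="#pdfPath#" name="pdfText">
    <cfcatch type="any">
        <!--- Retry the same file, this time ignoring the structure. --->
        <cfpdf action="extracttext" useStructure="false" source="#pdfPath#" name="pdfText">
    </cfcatch>
</cftry>

<!--- Index the extracted text myself as a custom entry keyed by the file path. --->
<cfindex action="update" collection="legaldocs" type="custom"
         key="#pdfPath#" title="#getFileFromPath(pdfPath)#" body="#toString(pdfText)#">
```

That would mean looping over the files myself rather than pointing cfindex at the directory, which is why I would prefer a built-in retry option if one exists.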

Thanks

Adobe Community
