Indexing certain PDFs fails

Jul 10, 2015

Hey Group.

I have about 36,000 files on a Windows 2008 server running CF11 Enterprise Update 5.

I love Solr indexing for its speed, but I'm having issues with some of the docs, PDFs especially. These are documents of a legal nature, so I cannot share them, but the problem is pretty straightforward.

I get this error: "Could not index the file [path here] .pdf in SOLR. Check the exception for more details: An error occurred during the extracttext operation of the cfpdf tag."

When I run cfpdf extracttext on the file directly, I get an "invalid document [path] specified for source or directory" error.

<cfpdf action="extracttext" source="http://localhost/[path]" name="mypdf">

When I run the same call with useStructure="false"

<cfpdf action="extracttext" useStructure="false" source="http://localhost/[path]" name="mypdf">

and dump the variable, I get all of the text along with what looks like poorly formed XML (closing tags are missing).

I don't really care if that is how I get the data, as it is only used to let the user know which document contains the subject of their search.


Things I know:

It opens in Acrobat

It was created with Acrobat PDFMaker 10.1 for Word

All dates are present

It claims to be PDF Version 1.4 (Acrobat 4.x)

It is 3 MB


Is there a way to tell CF11 to retry on failure of that document, ignoring the structure?
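
To make the question concrete, here is a rough sketch of the fallback I would otherwise have to write by hand. It is untested. The variable pdfPath and the collection name "legaldocs" are placeholders I made up, and I am assuming cfindex with type="custom" will accept the extracted text as the body.

```cfml
<!--- Hypothetical placeholder: the real path is elided here --->
<cfset pdfPath = "[path]">

<cftry>
    <!--- First attempt: the same call that currently fails for some PDFs --->
    <cfpdf action="extracttext" source="#pdfPath#" name="pdfText">
    <cfcatch type="any">
        <!--- Retry the same file, ignoring the PDF structure --->
        <cfpdf action="extracttext" useStructure="false" source="#pdfPath#" name="pdfText">
    </cfcatch>
</cftry>

<!--- Index the extracted text manually; "legaldocs" is an assumed collection name --->
<cfindex action="update"
         collection="legaldocs"
         type="custom"
         key="#pdfPath#"
         title="#getFileFromPath(pdfPath)#"
         body="#toString(pdfText)#">
```

That would mean looping over the files myself instead of letting cfindex walk the directory, so I would prefer a built-in option if one exists.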


Thanks
