August 28, 2009
Question

CF 7 Verity and pdf files - problem


I have a directory of PDF files that I have indexed with Verity for several years so that full-text searches can be run against all of the documents. This has worked well for some time, but recently we noticed that text search no longer works correctly for newer PDF documents. They are indexed in the collection and show up when I search with a blank string, which returns every document in the collection, but any search for text that appears in a document returns no results.

To test this, I created a new collection containing several PDF files that I know work and several that I know do not. I used a simple form to supply keywords to my cfsearch tag and then dumped the results to see what happens. Searching for strings of text from the PDFs works for some files and not for others. I examined the files in question, and the only difference I can see is the creator version. The working files list the creator as Adobe InDesign CS3 (5.0.2), and the non-working files list it as Adobe InDesign CS3 (5.0.4). Has anyone else noticed this issue and found a solution or workaround? I really don't want to migrate to another search tool right now, as this was an unexpected problem.

I found a few threads on the internet suggesting problems with Acrobat 9 files, but they describe those as a version 5 / version 7 issue. I used cffile with action="read" to look at the beginning of both a working and a non-working file, and both list PDF-1.4 at the start of the file. I hope someone else has found a way around this issue.
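
For anyone who wants to repeat the header check outside ColdFusion, here is a minimal Python sketch of the same idea: read the first bytes of a file and extract the `%PDF-x.y` version marker. The function names `pdf_header_version` and `pdf_file_version` are made up for illustration. Keep in mind that the header only states the PDF specification version, not which tool or fonts produced the file, so two files can both report 1.4 and still differ internally.

```python
import re
from typing import Optional

# The header normally sits on the first line of the file, e.g. b"%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version from a PDF header such as '1.4', or None if absent."""
    # Some files carry a little junk before the header, so search the first 1 KB.
    match = _HEADER.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def pdf_file_version(path: str) -> Optional[str]:
    """Read only the start of a file on disk and report its header version."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(1024))
```

Calling `pdf_file_version` on a working and a non-working file gives a quick side-by-side comparison of the header versions.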


    1 reply

    Inspiring
    September 1, 2009

    I'm wondering whether there are any other items I should look at in these PDF files or in the log entries. I do not see errors in the Verity log when indexing or optimizing these files, and they do show up in the results if a blank search string is passed to the cfsearch tag, so they are in the index. No combination of words from the text is found, though. It seems that all of the files created in the last year fall into the non-working category. I don't mind recreating the PDF files if I understood what needed to be done differently. Are there updates to Verity for ColdFusion 7 that I'm not aware of? Would attaching a working and a non-working PDF file to this post be helpful?

    Inspiring
    September 9, 2009

    I've been able to isolate this problem to the fonts used and embedded in the PDF files. The older PDF files in the index use Interstate and Truesdell as fonts. At some point the fonts were switched to Berthold Akzidenz Grotesk and Apollo MT, and the content of those files is no longer searchable with Verity. I tested this by building a small collection with versions of the PDF files in both the old and the new fonts. If I send cfsearch blank criteria, it finds all of the documents, but if I send any text string from the documents as the criteria, it finds only the PDF files with the Interstate and Truesdell fonts. Any ideas as to why, and is there a way around this issue?
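
One way to confirm which fonts each file actually declares, without opening it in Acrobat, is to scan the raw bytes for `/BaseFont` entries. The Python below is a minimal sketch under a few assumptions: the font dictionaries are stored uncompressed (typical for PDF 1.4 files, since object streams only arrived in PDF 1.5), the helper name `base_font_names` is hypothetical, and any font names used to try it out are placeholders rather than values read from the real files. It only lists font names; it does not explain why Verity fails to extract the text.

```python
import re
from typing import Set

# Matches "/BaseFont /Name" or "/BaseFont/Name" and captures the name token.
_BASEFONT = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()%]+)")

# Subset-embedded fonts carry a six-letter tag such as "ABCDEF+" before the name.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def base_font_names(data: bytes) -> Set[str]:
    """Collect the distinct /BaseFont names found in raw PDF bytes."""
    names = set()
    for raw in _BASEFONT.findall(data):
        name = raw.decode("latin-1")
        # Drop the subset tag so the same font compares equal across files.
        names.add(_SUBSET_PREFIX.sub("", name))
    return names
```

Running it on `open(path, "rb").read()` for one old-font file and one new-font file, then comparing the two sets, shows whether the font difference is the only thing that changed between them.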

    Inspiring
    September 23, 2009

    Hi Adam,

    I finally got back to work after a long vacation and had a chance to test all of this on my production server. I still cannot run any text search that matches the 200901_TenTopTours_Teasdale.pdf file. If I run the searchcollection.cfm template, it returns both files correctly, but if I specify a value in the criteria URL variable, the Verity collection returns only the TENTOP_TRUES.pdf file. These are exactly the same results I see when using the CF Administrator to create the collections.

    Here are the steps I take when running this test. First I run the searchcollection.cfm template that you provided without any URL variable, and it returns both PDF files. I then add the criteria URL variable with the word teasdale as its value, and only the TENTOP_TRUES.pdf file is returned. Does this behave differently on your test server? The URL to these files on my server is http://www.adventurecycling.org/testing/veritytests/ in case you would like to run it from my server.

    I understand that the fonts are embedded in these PDF files, but I'm wondering whether I need to have these fonts installed on my server for Verity to index the files correctly. Any other thoughts? I'm curious whether the same test works fine in your environment.


    Groan.

    I had typed a reasonable-sized response to you, but whilst trying to select some text to remove (clicking within this area, and dragging the mouse pointer), the thing decided to submit an entirely blank response.  The text editor for these forums is so variable in reliability this sort of thing doesn't even really surprise me any more.  This forum software is unbelievably and unreservedly shite.

    But anyway.

    Um, using your system, I see that you are getting quite different results for the same searches as I do.  For example if I search for "grizzly", which only appears in the 200901_TenTopTours_Teasdale.pdf file, I get a match.  You do not.  If I search for "flathead", I get matches in both; you only get the match in the other file.  No search that I can do on my system yields unpredictable results.

    So... um... I don't know what to suggest.

    --

    Adam

    Message was edited by: A Cameron