I have a directory of PDF files that I have indexed with Verity for several years so that full-text searches can be run against all of the documents. This has worked pretty well for some time, but recently we noticed that text search no longer works correctly for newer PDF documents. They are indexed in the collection and show up when I search with a blank string, which returns every document in the collection, but searching for any string of text from the document returns an empty result.
To test this I created a new collection containing several PDF files that I know work and several that I know do not. I used a simple form to supply keywords to my cfsearch tag and dumped the results. Searching for strings of text in the PDFs works for some files and not others. I examined the files in question and the only difference I can see is the version. The working files list the creator as Adobe InDesign CS3 (5.0.2) and the non-working files show the creator as Adobe InDesign CS3 (5.0.4). Has anyone else noticed this issue and found a solution or workaround? I really don't want to migrate to some other search function at this time, as this was an unexpected problem.
I found a few threads on the internet suggesting problems with Acrobat 9 files, but they describe it as a version 5 / version 7 issue. I used cffile read to look at the beginning of both a working and a non-working file, and both list PDF-1.4 at the start of the file. I hope someone else has found a way around this issue.
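A minimal sketch of that kind of header check, for anyone who wants to repeat it. The file names working.pdf and nonworking.pdf and the docs subdirectory are placeholders, and this is not necessarily the exact code used for the test described above:

```cfml
<!--- checkPdfHeader.cfm: hypothetical helper, names and paths are assumptions --->
<!--- Assumption: both sample PDFs sit in a "docs" subdirectory next to this template --->
<cfset aFiles = listToArray("working.pdf,nonworking.pdf")>
<cfloop from="1" to="#arrayLen(aFiles)#" index="i">
    <cfset sPath = expandPath("./docs/" & aFiles[i])>
    <cfif fileExists(sPath)>
        <!--- Read the file as text; only the first few characters matter here --->
        <cffile action="read" file="#sPath#" variable="sContent">
        <!--- A PDF normally begins with a header such as %PDF-1.4 --->
        <cfoutput>#htmlEditFormat(aFiles[i])#: #htmlEditFormat(left(sContent, 8))#<br /></cfoutput>
    <cfelse>
        <cfoutput>#htmlEditFormat(aFiles[i])#: file not found<br /></cfoutput>
    </cfif>
</cfloop>
```

Note that this reads each whole file into memory just to look at the first eight characters, so it is only sensible for a handful of test files. The header only reports the PDF version, so two files can match here and still differ in other ways.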
I'm wondering if there are any other items I should look at with these PDF files or the log entries. I do not see errors in the Verity log when indexing or optimizing these files, and they do show up in the results if a blank search string is passed to the cfsearch tag, so they are in the index. No combination of words from the text will be found, though. It seems that all of the files created in the last year fall into the non-working category. I don't mind recreating the PDF files if I understood what needed to be done differently. Are there updates to Verity for ColdFusion 7 that I'm not aware of? Would attaching a working and a non-working PDF file to this post be helpful?
I've been able to isolate this problem to the fonts used and embedded in the pdf files. Previous pdf files that we have in the index use Interstate and Truesdell as fonts. At some point they switched to
Can you create an example PDF using each font scheme, and attach them here?
--
Adam
Hi Adam,
Here is an example of a working and a non-working PDF file from my Verity collection. I notice the sizes are a good bit different, but from what I was told the difference is in the fonts embedded in the PDF. The TENTOP_Trues.pdf file uses Interstate and Truesdell for fonts and works successfully when searching for text. The 200901_TenTopTours_Teasdale.pdf uses
-John
Hi John
Just some preliminary findings here (none useful...). I've only got CF8+9 installed on my PC right now, so cannot test on CFMX7, but those files index & can be searched fine with both CF8 & CF9. This is good because I didn't think Verity had seen any new development since CFMX7 (Verity is a defunct product now, so no new bugs will get fixed), but it seems it did see some enhancement @ some point between CFMX7 and CF8.
I'll install CFMX7 later today and check on that too.
Even if you cannot get it working on CFMX7, you should be able to install the version of Verity which ships with CF8 or CF9 and direct CFMX7 to use that.
Or... well... let me test on CFMX7 first, and I'll scratch my head once I get some results in front of me.
--
Adam
Hi Adam,
Thanks so much for looking into this problem. I'm excited to hear that both files work with Verity on CF 8 and 9, as well as by the prospect of upgrading Verity on my CF 7 server. That would be much better than trying to move quickly to some other solution for this problem. I'm headed out on vacation for the next week and a half, but I will check back here when I can to see if there are any new developments in this thread. Thanks again.
-John
G'day John
I installed CFMX7.0.2 and ran my test rig on that. And it worked fine. Both PDF files indexed fine and are searchable (and the expected results are returned).
So whatever your problem is... it's not an innate problem with CFMX7.0.2 / Verity / those PDFs.
😞
I am running this code:
<!--- createCollection.cfm --->
<cfflush interval="64">
<cfset sCollection = "testUnreadablePdfs">
<cftry>
    Deleting collection…
    <cfcollection action="delete" path="#expandPath('./collection')#" collection="#sCollection#">
    Deleted<br />
    <cfcatch>
        Error deleting collection: <cfoutput>[#cfcatch.message#][#cfcatch.detail#]</cfoutput><br />
    </cfcatch>
</cftry>
<cftry>
    Creating collection…
    <cfcollection action="create" path="#expandPath('./collection')#" collection="#sCollection#">
    Created<br />
    <cfcatch>
        Error creating collection: <cfoutput>[#cfcatch.message#][#cfcatch.detail#]</cfoutput><br />
    </cfcatch>
</cftry>
<cftry>
    Indexing collection…
    <cfindex
        key="#expandPath('./docs/')#"
        urlpath="/shared/cf/cfml/tags/search/index/unreadablePdfs/docs/"
        extensions=".pdf"
        type="path"
        action="refresh"
        collection="#sCollection#"
        status="stIndex">
    Indexed<br />
    <cfdump var="#stIndex#"><br />
    <cfcatch>
        Error indexing collection: <cfoutput>[#cfcatch.message#][#cfcatch.detail#]</cfoutput><br />
    </cfcatch>
</cftry>
<cftry>
    Searching collection…
    <cfsearch name="qResults" collection="#sCollection#" criteria="">
    Done<br />
    <cfdump var="#qResults#"><br />
    <cfcatch>
        Error searching collection: <cfoutput>[#cfcatch.message#][#cfcatch.detail#]</cfoutput><br />
    </cfcatch>
</cftry>

<!--- searchCollection.cfm --->
<cfparam name="URL.criteria" default="*">
<cfset sCollection = "testUnreadablePdfs">
<cftry>
    Searching collection…
    <cfsearch name="qResults" collection="#sCollection#" criteria="#URL.criteria#">
    <cfdump var="#qResults#"><br />
    <cfcatch>
        Error searching collection: <cfoutput>[#cfcatch.message#][#cfcatch.detail#]</cfoutput><br />
    </cfcatch>
</cftry>
I can search for anything I like in either PDF you gave me, and Verity finds correct matches just fine.
If you run this same code, what do you get?
Note: you'll have to set up a "collection" and "docs" subdir in the dir that these two code files go in, plus you'll need to change the URLPATH attribute to reflect your environment. Obviously the PDFs should go in the docs subdir.
Enjoy your vacation.
--
Adam
Hi Adam,
I'm going to give this a try as soon as I get back. I have been building my collection directly from the CF administrator and when searching with a blank string it does return both pdf files but searching by any text in the pdf only returns the one file. I'm anxious to see the results that I get when building the collection with the code you used. I'll be sure to let you know what I find. One question, what OS are you using on your test bed? Thanks again for testing this out for me.
-John
On my test machine I'm running Windows Vista Home, 64-bit. My CF install is a Multiserver one for CF8 and CF9, but I could not get CFMX7 to start on my multiserver rig, so I installed that one as a standard install. Latest patches for CF8, latest release of CF9, vanilla install of CFMX7.0.2 (not patched). I'm pretty sure CF8+9 are both running the Sun 1.6.0_04 JVM (64-bit), and the CFMX7 one is running 1.4.2_19 (32-bit).
--
Adam
Hi Adam,
I finally got back to work after a long vacation and got a chance to test this all out on my production server. I'm still not able to do any text searches against the 200901_TenTopTours_Teasdale.pdf file. If I run the searchcollection.cfm template it returns both files correctly, but if I specify a value in the URL criteria variable the Verity collection only returns the TENTOP_TRUES.pdf file. These are exactly the same results that I see when using the CF Administrator to create the collections.
Here are the steps I'm taking for this test. First I run the searchcollection.cfm template that you provided without any URL variable, and it returns both PDF files as results. I then add the criteria URL variable, using the word teasdale as my criteria, and only the TENTOP_TRUES.pdf file is returned in the results. Is this behaving differently on your test server? The URL to these files on my server is http://www.adventurecycling.org/testing/veritytests/ in case you would like to try running it from my server.
I understand that the fonts are embedded in these pdf files but I'm really wondering if I need to have these fonts installed on my server for verity to correctly index these files. Any other thoughts? I'm curious if the same test works fine in your environment.
Groan.
I had typed a reasonable-sized response to you, but whilst trying to select some text to remove (clicking within this area, and dragging the mouse pointer), the thing decided to submit an entirely blank response. The text editor for these forums is so variable in reliability this sort of thing doesn't even really surprise me any more. This forum software is unbelievably and unreservedly shite.
But anyway.
Um, using your system, I see that you are getting quite different results for the same searches as I do. For example if I search for "grizzly", which only appears in the 200901_TenTopTours_Teasdale.pdf file, I get a match. You do not. If I search for "flathead", I get matches in both; you only get the match in the other file. No search that I can do on my system yields unpredictable results.
So... um... I don't know what to suggest.
--
Adam
Message was edited by: A Cameron
Well, I finally appear to have this issue resolved. I started comparing the Verity files on my production CF7 server with those on my standalone developer version of CF9. I noticed that the verity\k2\_nti40 folder was much larger on CF9, and after swapping out this folder, restarting the ColdFusion Search service and rebuilding my collection, it is now working properly. I also tried substituting the verity\k2\common folder, but I was unable to get the Search Service to run after that change. So at some point Verity was updated and that update allowed it to handle more fonts, but I'm unsure why my CF7 instance did not get this update. As long as it is working I'm now very happy. Thanks for the help in working through this issue, Adam.
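For anyone wanting to make the same comparison, the folder sizes can be checked with something like the rough sketch below. The two install roots (C:\CFusionMX7 and C:\ColdFusion9) are assumptions and will need changing to match your own servers; only the verity\k2\_nti40 part comes from the comparison described above.

```cfml
<!--- compareVerityFolders.cfm: hypothetical helper for comparing two installs --->
<!--- Assumption: both install roots below are placeholders; adjust them to your servers --->
<cfset stDirs = structNew()>
<cfset stDirs["CF7"] = "C:\CFusionMX7\verity\k2\_nti40">
<cfset stDirs["CF9"] = "C:\ColdFusion9\verity\k2\_nti40">
<cfloop collection="#stDirs#" item="sLabel">
    <cfif directoryExists(stDirs[sLabel])>
        <!--- recurse="yes" lists files in all subdirectories as well --->
        <cfdirectory action="list" directory="#stDirs[sLabel]#" name="qFiles" recurse="yes">
        <cfset nBytes = 0>
        <cfset nFiles = 0>
        <cfloop query="qFiles">
            <!--- Count only files; directory rows have type "Dir" --->
            <cfif qFiles.type EQ "File">
                <cfset nBytes = nBytes + qFiles.size>
                <cfset nFiles = nFiles + 1>
            </cfif>
        </cfloop>
        <cfoutput>#sLabel#: #nFiles# files, #nBytes# bytes<br /></cfoutput>
    <cfelse>
        <cfoutput>#sLabel#: directory not found<br /></cfoutput>
    </cfif>
</cfloop>
```

If the two installs are on different machines, run the template on each and compare the output. It would be sensible to keep a backup copy of the original folder before swapping anything, since replacing the verity\k2\common folder stopped the Search Service from starting.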