Windows Server 2008 R2 SP1, hosted remotely. Virtual server with 100 users; normally only 5 or 6 are on at a time.
ColdFusion serving from c:\inetpub\wwwroot\application_name
Documents stored at S:\docs (a virtual drive on the same server)
34,000 docs in 3,300 folders; total size, including non-indexed docs, is about 45 GB (PDF, HTM, TXT, RTF, and all variants of MS Office documents)
Collection indexing is taking days instead of hours, and it does not seem to matter whether it is Verity or Solr. Resource Monitor shows Solr creating the cache, and it flat-out blazes through that, but the only indication I have that it is ACTUALLY doing anything afterward is 50 to 70% CPU usage.
I increased the buffer to 80, but I am at a loss on how to speed this process up.
Any help will be greatly appreciated.
It's got to be the sheer volume of files you're trying to index. Solr is (normally) much faster than Verity.
Are you indexing via the CF Admin panel, or with the CFINDEX tag?
Using the Scheduler to fire off a CFM page.
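For reference, a scheduled indexing page along those lines might look like the following. This is a minimal sketch; the collection name and document root are placeholders, not the actual values from this setup:

```cfm
<!--- Hypothetical scheduled indexing page. "docsCollection" and the
      paths are placeholders for illustration. --->
<cfsetting requesttimeout="86400">

<!--- Refresh the collection from the document root, recursing into subfolders --->
<cfindex action="refresh"
         collection="docsCollection"
         type="path"
         key="S:\docs\"
         extensions=".pdf, .htm, .txt, .doc, .docx, .xls, .xlsx, .rtf"
         recurse="yes">
```

The long `requesttimeout` just keeps the scheduled request itself from timing out before the indexing call returns.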
I am at 56 hours and at my wits' end. The Verity collection only seems to take about 30 hours tops. Is there any way to speed this process up?
On this latest run I upped the min and max memory to 4 GB (from 256 MB).
It is just an index refresh of one set of docs, then an update from another folder. Heck, I can't even tell where in the process it is, and the Solr console is about useless.
Solr is hanging on certain MS Excel docs, but not all of them. One of the docs is 14 MB; another is 126 MB. Smaller ones seem to make it through. Nothing unusual about the Excel files; some do have drop-down sorting elements, but that is not true of all of them.
Solr blazes if I remove xls and xlsx from the file types.
So now, if I am doing an index on a folder, is there a way to tell it to move on if it runs "too" long?
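As far as I know, CFINDEX has no per-document timeout, but one workaround is to index folder by folder and wrap each call in a try/catch, so a folder that errors out gets logged and skipped instead of stalling the whole run. A sketch, with placeholder names (and with xls/xlsx left out of `extensions`, per the finding above):

```cfm
<!--- Sketch: index one folder at a time, skipping any folder that errors --->
<cfdirectory action="list" directory="S:\docs"
             type="dir" recurse="yes" name="qFolders">

<cfloop query="qFolders">
    <cftry>
        <cfindex action="update"
                 collection="docsCollection"
                 type="path"
                 key="#qFolders.directory#\#qFolders.name#\"
                 extensions=".pdf, .htm, .txt, .doc, .docx, .rtf"
                 recurse="no">
        <cfcatch type="any">
            <cflog file="solrIndexing"
                   text="Failed on #qFolders.directory#\#qFolders.name#: #cfcatch.message#">
        </cfcatch>
    </cftry>
</cfloop>
```

The log file then tells you which folders (and, by elimination, which files) are the problem children.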
Double-check those xls/xlsx files. Depending on how they were created, there might be extraneous data causing the collection to choke when trying to index them.
I know (for a fact) that if the Excel files were created by a ColdFusion template AND debugging is turned on (and the IP address of the client system is in the list of addresses allowed to see debugging information), then the debugging output is appended in a very loose way to the Excel sheet's data and can cause a lot of problems.
This happened to me on another project, and it took me almost four days to troubleshoot. Excel sheets were being created by the SpreadsheetNew() function via a .cfm file that (in the development environment only) had debugging information appended to every page. I finally viewed the source of the Excel sheet, saw the CF debugging information at the bottom, and turned off debugging for that page. Once I did that, there were no more issues with the Excel sheets created by that .cfm page.
So, check the source of the Excel file (I forget how, but there IS a way) to make sure there isn't a lot of "corrupted" data causing the collection to choke when indexing those files.
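If the files did come out of a CF template, suppressing debug output on the generating page is the usual fix; a one-line sketch:

```cfm
<!--- At the top of the .cfm that generates the Excel file:
      stop CF from appending debug output to the response --->
<cfsetting showdebugoutput="no">
```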
Good point. I know these Excel files were all created with Office 97 and above. I do have some corrupt files (mainly PDFs); I can get them to index on a short haul, and they just return a blank PDF when the link is clicked.
In testing I can manage to get it to index about 4,000 files in 80 directories before hitting the latest error: "Error_opening_new_searcher_exceeded_limit_of_maxWarmingSearchers4_try_again_later"
I have made adjustments to the Solr config to hold off on a commit until the end, but I do not think that is working.
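For what it's worth, those knobs live in the collection's solrconfig.xml. Something along these lines (values are illustrative, not the actual config) disables autocommit and zeroes out cache autowarming, which is what that error is complaining about:

```xml
<!-- solrconfig.xml (illustrative values) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Comment out autoCommit so commits only happen when you issue one -->
  <!--
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
  -->
</updateHandler>

<query>
  <!-- The error means commits are coming faster than new searchers can warm -->
  <maxWarmingSearchers>4</maxWarmingSearchers>
  <!-- autowarmCount="0" reduces the warming work done per commit -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
</query>
```

Note that ColdFusion keeps a solrconfig.xml per collection, so the edits have to go in the config for the collection being indexed, and Solr has to be restarted (or the core reloaded) to pick them up.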
Have you tried recreating the actual collection? It could be a corrupt collection causing this.
Thanks for the reply. Yes, I have. I even went as far as removing it from the XML and deleting the directories. 34,000 physical documents; I let it run for 6 days, and it finally returned 8,000 docs. Nothing in the logs as to why it did not index so many.
I took a different approach. Now I am indexing one folder at a time, and that works for a while, but then I run into the error "Error_opening_new_searcher_exceeded_limit_of_maxWarmingSearchers4_try_again_later". I am attempting to tell it not to autocommit by commenting that out, and I changed all autowarming counts to 0.
Not sure what else I can do.
The only other thing I can think of is adding a sleep after each update to slow down the searchers.
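In CF9 that could be as simple as a sleep() call between updates. A self-contained sketch (names are placeholders, and the 5-second pause is an arbitrary starting point, not a tuned value):

```cfm
<!--- Sketch: pause between per-folder updates so each commit's
      searcher can finish warming before the next one opens --->
<cfdirectory action="list" directory="S:\docs"
             type="dir" recurse="yes" name="qFolders">

<cfloop query="qFolders">
    <cfindex action="update"
             collection="docsCollection"
             type="path"
             key="#qFolders.directory#\#qFolders.name#\"
             recurse="no">
    <!--- Wait 5 seconds before the next update --->
    <cfset sleep(5000)>
</cfloop>
```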
There are a few tips here you can try if you haven't already: Tips for software engineer: Solr in Coldfusion 9