CF2023 Collection Issues

Report · Feb 02, 2024

Well folks, while I'm waiting to resolve the html data field validation bug, I started working on collections.

In CF 4.51 I have 6 collections. 3 of them are:

Basically, these are populated with .txt files which were extracted from resumes in the following formats: .doc, .docx, .html, .txt. There are roughly 20,000 files in each of them.

I am only able to index the folder ending in ho, and the resulting collection only contains 8,396 documents, whereas the folder contains 19,385 .txt files. I presume the CF2023 collection should contain 19,385 documents?

The other two folders bomb out when I try to index them resulting in zero documents being populated in the corresponding collections. I have checked the source folders and they only contain .txt files. Thus, I have no clue as to why the processing fails.

It would require far fewer code changes were we able to use our current collection scheme and simply add some additional code to accommodate .pdf resumes.

As an alternative approach, I tried creating a collection from a folder that contained 180 of the 60,040 raw resumes, covering all four of the aforementioned formats. And it did create 180 documents in the collection.

I am concerned that only 1 collection containing 60,040 documents would process too slowly. I would appreciate any opinions on this concern.

Thanks in advance for any help!

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"

Report · Feb 02, 2024

Alex, I think I have good news for you.

1) First, some background: if such cf "collection" processing happens to "bomb out", "fail", or "process too slowly", the suspect to pursue is generally not cf itself--though I realize it can seem so.

Instead, look to the underlying implementation of the open source Solr engine that CF is relying on, since cf9, to do such index importing and searching. This is similar to how your CF 4 (or 5-8) relied upon the commercial Verity engine to do that collection processing then.

And though I appreciate how that may have been "rock solid" for decades and "never needed any tweaking", don't hear me suggesting you need to become an expert in Solr or Solr "tuning". No, it's likely something very simple to solve. The default implementation may simply not be suited to the volume you're pushing (which can be about more than simply the "number of documents" you're trying to index, and if all at once).

2) So as with any diagnosis of problems, a key can be to find any available logs (or enable them, or add other diagnostics). And the Solr that CF enables has such logs--and they can be made to log still more when needed. Also, as the Solr engine is based on Java, one can also add Java monitoring/profiling tools to also better understand it. (If one uses FusionReactor to monitor cf, the same license can be used to monitor Solr running on the same machine.)

2a) And that leads to another point (which may not apply to you, but I'm writing as much for others who may find this thread): if the person installing CF chose the option to enable Solr, you'll find the logs in the cfusion/jetty/logs folder. This is because CF implements Solr under a jetty app server that's SEPARATE from CF and the CF process and jvm it runs on.

2b) If instead they downloaded and implemented it with the separate CF addon installer, then it will be in a folder they named on installing that (which might be a sibling to the coldfusion2023 folder as ColdFusionAdd-onServices--and same for prior versions of cf), which will again implement jetty as its own process (and service) and which will run in jvm separate from CF.

All of this stuff above does need to be understood even to implement the possibly simple solution, both to know where to find logs, and where to possibly make changes based on the diagnostics (and also where to enable such additional diagnostics, such as FR or jvm monitoring features).

3) Now to what I hope may be the simple solution: you may find indications in those Solr (jetty) logs that it has had an outofmemory error. This would indicate (usually) a need to simply increase the max heap size allocated to solr. You may find reference to that error in the file (for a given date) ending in stderrout.log in that jetty/logs folder. (I've added that last sentence newly since my first reply, to help folks reading from the top. I wasn't on a computer when I first offered this extended reply for Alex.)
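For readers who would rather not eyeball every log by hand, here is a minimal Python sketch of that check. It is not part of CF or Solr; it simply assumes the log names end in stderrout.log (as described above) and that a heap failure shows up as text containing "outofmemory" in any letter case, such as Java's usual java.lang.OutOfMemoryError message. The function name and the sample file are made up for illustration.

```python
from pathlib import Path
import tempfile


def find_oom_lines(log_dir, suffix="stderrout.log", needle="outofmemory"):
    """Return (file name, line number, text) for each log line mentioning the needle."""
    hits = []
    for log_file in sorted(Path(log_dir).glob(f"*{suffix}")):
        # errors="replace" keeps the scan going even if the log holds odd bytes
        with open(log_file, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle.lower() in line.lower():
                    hits.append((log_file.name, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Demo against a throwaway folder; point find_oom_lines at the real
    # jetty/logs folder of your own install instead.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "2024_02_03.stderrout.log"
        sample.write_text(
            "starting\njava.lang.OutOfMemoryError: Java heap space\n",
            encoding="utf-8",
        )
        for name, number, text in find_oom_lines(tmp):
            print(f"{name}:{number}: {text}")
```

Because the script reads the file contents rather than trusting the reported size, it also sidesteps the case where the OS shows a log as 0 bytes while it actually has content.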

Here again, where to do that depends on which of those two options above was used to install it. For the "normal" installation of it with CF, look in that cfusion/jetty folder for a file called jetty.lax. (In the "addons" install approach, ~~I have to check later where to change that, as I'm writing on my phone.~~, I've confirmed the file name is the same.) And note that despite that odd file extension, the jetty.lax is just a plain text file. (This is another update since my first reply, as you'll see below that Alex wondered in reply about that. I'm clarifying it now for the sake of other readers.)

And in that file there's a line that has jvm args (and again ~~I'll update this with the exact name of the line~~ I'm now updating this reply with that name) which starts with lax.nl.java.option.additional, and which has an xmx value. You'd want to consider increasing that, if indeed the logs indicate it's running out of "heap". You may find that simply doubling it is enough.

Make a copy of that file before editing it, then change it, then restart the solr service (called the "coldfusion addon service", regardless of how it was installed above.) And first, make sure it restarts: if you make a mistake in editing that file, it may not start. Then test your operation.
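If you'd like the copy-then-edit step to be repeatable, the following is a rough Python sketch, not anything shipped with CF. It assumes only what is stated above: the line of interest starts with lax.nl.java.option.additional and carries an -Xmx value. The helper names, the .bak backup name, and the one-line sample file are hypothetical; a real jetty.lax has many more lines and its additional-options line may hold other arguments too. You would still restart the add-on service yourself afterwards.

```python
import re
import shutil
import tempfile
from pathlib import Path

PREFIX = "lax.nl.java.option.additional"  # line name given in this thread


def double_xmx(lax_text, prefix=PREFIX):
    """Return lax_text with the -Xmx number doubled on the line starting with prefix."""
    def bump(match):
        return f"-Xmx{int(match.group(1)) * 2}{match.group(2)}"

    lines = []
    for line in lax_text.splitlines(keepends=True):
        if line.startswith(prefix):
            line = re.sub(r"-Xmx(\d+)([kKmMgG]?)", bump, line)
        lines.append(line)
    return "".join(lines)


def backup_and_double(lax_path):
    """Copy the file to <name>.bak, then rewrite it with the doubled -Xmx value."""
    lax_path = Path(lax_path)
    backup = lax_path.with_name(lax_path.name + ".bak")
    shutil.copy2(lax_path, backup)
    # latin-1 round-trips every byte, so the rest of the file is left untouched
    text = lax_path.read_bytes().decode("latin-1")
    lax_path.write_bytes(double_xmx(text).encode("latin-1"))
    return backup


if __name__ == "__main__":
    # Demo on a throwaway stand-in file, not a real jetty.lax
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "jetty.lax"
        demo.write_text("lax.nl.java.option.additional=-Xmx512m\n", encoding="latin-1")
        backup_and_double(demo)
        print(demo.read_text(encoding="latin-1").strip())
```

Keeping the backup next to the original means a bad edit can be undone by copying the .bak file back before restarting the service.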

If you still get an outofmemory error, try doubling it again (keeping an eye on available memory on your machine, of course). Sometimes a problem like yours needs less additional memory than you have on your phone. 🙂

4) Finally, the error in the logs may be something else. And perhaps other readers here will chime in with tweaks they've made to other aspects of how solr is configured.

For now, please consider what I've offered, and let us know if it might get you going, or what other info you may find or want to share.

/Charlie (troubleshooter, carehart.org)

Report · Feb 02, 2024

Hey Charlie,

As always, much appreciate your time/effort/advice.

First, when I installed CF2023, since I had no knowledge of solr, I took the default install.

Found the log you suggested & don't see anything in it that indicated a problem to my uneducated mind. I have attached a file with the log entries for my last attempt to create an "ag" collection as ag_solr_log.txt.

I was also able to find jetty.lax and attached a copy of it to this reply. Not sure what program I should use to open it. So I made a copy and opened it with notepad just to see if I could find the "args" value. Only see two references to "args". Have copied/pasted the section containing the two references below.

*****************************

lax.jar

# LAX.COMMAND.LINE.ARGS
# ---------------------
# what will be passed to the main method -- be sure to quote arguments with spaces in them

lax.command.line.args=$CMD_LINE_ARGUMENTS$

#

*********************************************

Please advise. Thanks.

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"

Report · Feb 03, 2024

@AlexCraig , just to clarify: does your Collections question pertain to CF 4.51 or to CF 2023?

Report · Feb 03, 2024

I have zero problems with CF 4.51. It has been running without issue since 1997. The Collections issue is with CF 2023 & their move to solr.

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"

Report · Feb 03, 2024

I did a bit of research and found the file which contains java.args values. It is jvm.config, which on my machine is in .... cfusion\bin. The values were "-Xms512m -Xmx1024m" which I changed to "-Xms1024m -Xmx2048m" without the quotes. I doubled both values assuming one is minimum and the other is maximum. I then restarted the CF addon service and tried to index the collections which were problematic. Still no joy.

I then repeated the above process doubling the values to "-Xms2048m -Xmx4096m". Alas, still no joy.

I'm going to try rebooting the server with those values. But I doubt it will help.

Any additional thoughts would be appreciated.

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"

Report · Feb 03, 2024

As I suspected, rebooting the server did not help.

Did a bit more research and took a look at the Application log. It said there was an error at line 130 of indexcollection.cfm.

Line 130 is:

"NUL EOT ETX J SOH NUL DC3 J solr_alias_required BS ETX L SOH NUL SUB L An alias name is required. BS ETX T SOH NUL SUB"

I'm guessing this is misleading and that it choked for another reason as I was able to create a collection without an alias for a much smaller amount of data.

Beyond that, I am clueless!

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"

Report · Feb 03, 2024

Alex:

Sorry for being MIA for a couple of days in replies.
No, it's NOT the jvm.config you should have changed. That controls CF, not the addon service. Please set that back to the original values.
Instead, yes, it's the jetty.lax file, as I had said. (And I have confirmed that it's that file regardless of whether one has implemented the addon service via the CF installer or via the available addon service installer.)
Sorry to hear you were confused about editing it, not recognizing the file extension. But yes, any editor would do, as you found with Notepad.
No, it was not THAT line (which said the word "args"). What I had said was first, "there's a line that has jvm args" and then that "which has an xmx value." Somehow, you lost track of that between finding and then opening the file. 🙂
Instead, as I can now report, the line starts with "lax.nl.java.option.additional". THAT has the xmx arg I was referring to. And I just confirmed in both a cf2023 and 2021 version of that jetty.lax, the default is -Xmx512m. And THAT is what I was proposing you double, so to 1024 to start.
(I would recommend you first revert the change to the CF jvm.config file and restart CF, just in case it's using a lot of memory now because you told it that it could grow to 4g. After restarting that, restarting this add-on service should have no problem using 500m more.)
Now, do your test. And if you find there's any problem, then double it again (and restart the add-on service).
As for the log you shared, that's what's called the "request" log, and it can be useful--but it would not show the heap error I wondered about. Have you looked at the other log file in that folder (for the day you had the error, as it rotates each day that service is running), whose name ends in stderrout.log (like 2024_02_03.stderrout.log)? Don't ignore it because the OS reports it is a 0-byte file. Sometimes that is lying, and there IS content in the file.

Hope among all these, we get you going.

/Charlie (troubleshooter, carehart.org)

Report · Feb 03, 2024

Oh, and while I don't recognize that error you found, I'll note that since you found it in the application.log, that means it was indeed an error in CF itself--not the add-on service, or Solr, or Jetty. If you were able to move on regardless of the error, great. If it recurs, you may want to find if it happens in conjunction with any error tracked in those jetty logs as discussed elsewhere here.

And as I said in my very first reply, a tool like FusionReactor can help a lot when one wants to watch "what's going on" inside of something like this add-on service (whether for Solr or the PDFg processing). For now, let's hold off on pursuing that here, but I did want to reiterate the point--perhaps more for others reading along in the future, who may have other challenges.

/Charlie (troubleshooter, carehart.org)

Report · Feb 03, 2024

Tried with a value of 1024. Then went to 2048. Restarted the addon service each time. No joy.

While I'd have been interested in getting them populated just for the sake of accomplishment, at this point those collections will become superfluous, all things considered, including the large amount of code changes needed to implement solr and solr's ability to deal with .pdf docs, which negates the need for a custom.dll.

Might as well go with a single Resumes collection now that I've gotten it to populate, and worry about splitting the collection into multiple categories or collections as warranted.

BTW, I will also need to populate a TPI_Jobs collection using about a half dozen varchar datafields from a Sql/Server table.

I don't suppose you can lay your hands on a sample .html page with the sample code format I need to use to get that job done? It would save me a lot of research grunt work.

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"

Report · Feb 03, 2024

If you're asking me, I have no idea what you're referring to. An html page to do what?

Finally, as for your ongoing challenges, they can be solved. There would be some reason for the failure. You'd just need to diagnose the problem, using any of the various techniques I've proposed (over the different replies).

But I'm sensing you're running out of steam, and patience, and may be resigned to working around the problem. Hope it works out.

/Charlie (troubleshooter, carehart.org)

Report · Feb 04, 2024

Morning Charlie, with respect to your last question:

In CF 4.51 we have a Verity collection called TPI_Jobs for almost 7000 job requisitions.

It contains indexed data drawn from 6 different varchar fields pertaining to each job from a table in Sql/Server.

I've been trying to find a generic example of the code I would need to populate a solr collection from multiple datafields specific to each job from a Sql/Server table.

With respect to your comment about "working around the problem": I have now realized the scheme we used in Verity to render applicant data searchable can be improved with Solr. The Verity scheme involved populating 3 collections drawn from .txt candidate data from 3 different folders.

With Solr we can simply use the data from a single folder containing the resumes themselves, with the added benefit of improving our system to also be able to deal with .pdf resumes (which our current system is not capable of handling).

I was trying to avoid changing as much of the code as possible. My main job is running a recruiting company. But I'm just going to have to bite the bullet.

So in short, there is no longer a need to populate collections which will no longer be used in the upgraded system. At this point, I would only be doing it for the sake of problem solving when I no longer have a problem to solve. Hope all of this makes sense. 😉

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"

Report · Feb 03, 2024

OK. I managed to successfully create a collection from a resume source folder containing 60,000 documents.

The first time I tried, it threw a requesttimeout error. After increasing the default timeout from 30 to 600 seconds it succeeded. Note: Just for kicks I watched the memory usage in task manager for the pertinent CF app as it processed, and it ticked up to over 2600 MB at times. The size of the resulting collection was a smidge under 1 GB.

For whatever reason, I cannot get the much smaller collections that come from folders that contain 1/3 the number of documents to process successfully.

I can live with trying to use the outsized Resume collection instead of the 3 collections that we currently employ to reduce processing time for searches. I suspect the collection will need to be broken down to get acceptable processing times.

Worse, I took a long look at our use of Verity; and it is so prevalent on so many pages with very complex code that this conversion will be an absolute bear. I really need to see if I can get Adobe to license a copy of CF 9.X for me!! 😉

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"

Report · Feb 03, 2024

Good to hear you got it going, at least for one collection. (BTW, I was writing my previous reply just now while you wrote this one...but what I said may still help you or other folks.)

So you say you increased a timeout from its default of 30 to 600 seconds. Great. What default was that? (The CF Admin "request timeout" default is 60 seconds, so it must be some other. And the jetty.lax file itself does not have any property with "timeout" in the name.)

Second, as for your observing the memory use, was it really CF that hit a gig? Did you happen to also watch the addon service? It would be a java.exe (in the "details" tab of task manager. In the "processes" tab it can have different names that may not be obvious.)

Third, as for the collections that do NOT work, again let's see if the previous suggestions (to increase the heap size in jetty.lax) might get you going, after a restart of the addon service of course.

Finally, as for your concern about the bear of complexity in converting your CFML from using verity to solr, can you share some of the things you're encountering? Many found it was pretty straightforward, though yes there were SOME incompatibilities. Sadly, Adobe won't license CF9 for you...and of course one would not want to run CF9 as it's not gotten security updates for over 10 years. And while you may wonder if you can just license verity (and point CF at that), you'll find out that it's VERY expensive: Adobe was doing a pretty sweet deal to include it back then.

But perhaps while I'm writing you may be sharing more, so let me end this, for now. Also, I may tweak my earlier replies a bit, to help folks reading along sooner.

/Charlie (troubleshooter, carehart.org)

Report · Feb 03, 2024

Am going to compose a couple of replies before I lose track of the reply content I have in mind.

1)

>So you say you increased a timeout from its default of 30 to 600 seconds. Great. What default was that? (The CF Admin "request timeout" default is 60 seconds, so it must be some other. And the jetty.lax file itself does not have any property with "timeout" in the name.)

Yes it is the Admin "Timeout Requests after seconds". I had it set to 30 seconds as I was mirroring my CF 4.51 config when I did the initial setup.

2) It was the Adobe Cold Fusion Launcher app that I was monitoring. I think you inadvertently transposed the numbers re: memory & collection size. I stated:

>Note: Just for kicks I watched the memory usage in task manager for the pertinent CF app as it processed; and ticked up to over 2600 MB at times. The size of the resulting collection was a smidge under 1 GB.

BTW, the memory was down around 450 MB before I started the collection process. Hmmm ... if my math is right, that means it was using over 2 GB of memory. And I am dead certain of the figures as I watched it for quite awhile. 😉

OK. I changed the jvm.config file back to the defaults and restarted both the CF App & Add-on services.

Won't have much time to do more work on it tonight. But I'll get back on it tomorrow.

Of course, thanks a bunch for your assistance!

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"

Report · Feb 06, 2024

I am returning to this thread to update it with a complete answer to the problem which is the primary subject of the thread.

I left it dangling because we won't be using the collections involved in our migration to CF 2023. But since I hate being beaten by these machines I decided to take the time to determine the root of the problem.

Bottom line, it turns out there were about a dozen corrupt .txt files among those used to populate each of the three collections. This was due to a minor bug involving a custom tag we use in CF 4.51 to convert .doc & .docx files to .txt files to import into collections. (That is, importing Word docs directly into a collection was not possible with Verity collections in CF 4.51.)
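For anyone hunting for similar bad files, here is a rough Python sketch of one way to flag candidates before indexing. The post above does not say what the corruption looked like, so the rule used here is purely an assumption: a .txt file is treated as suspect if it is empty or contains control bytes (such as NUL) that plain text would not normally hold. The function name and the sample files are made up for illustration.

```python
from pathlib import Path
import tempfile

# Assumption: plain-text resumes should not contain control bytes below 0x20,
# apart from tab, line feed, form feed and carriage return.
SUSPECT_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0C, 0x0D))


def find_suspect_files(folder):
    """Return (file name, reason) for each .txt file that looks damaged."""
    suspects = []
    for path in sorted(Path(folder).glob("*.txt")):
        data = path.read_bytes()
        if not data:
            suspects.append((path.name, "empty file"))
        elif len(data.translate(None, SUSPECT_BYTES)) != len(data):
            # translate() drops the suspect bytes; a shorter result means some were present
            suspects.append((path.name, "contains control bytes"))
    return suspects


if __name__ == "__main__":
    # Demo on throwaway files; point find_suspect_files at a real source folder instead.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "good.txt").write_bytes(b"Jane Doe\r\nRecruiter\r\n")
        (Path(tmp) / "bad.txt").write_bytes(b"John\x00Doe")
        for name, reason in find_suspect_files(tmp):
            print(name, "->", reason)
```

A file flagged this way is only a candidate for a closer look; moving the flagged files aside and re-running the index is a quick way to confirm whether they were the cause.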

Even though in the aggregate it affected a very small percentage of search results, it is good to know, as it provides further impetus to complete the migration.

Also, while repopulating collections, I have discovered that I was not able to process more than somewhere between 5000 and 6000 .txt files at a time, or the process seems to stall. It doesn't appear to be a memory problem. I had tried to populate 10,000 files in a single batch with no corrupt records and the process simply stopped processing. After I broke it down into batches of about 5000 each, both processed successfully. Bottom line, if you are having trouble populating a large collection, try breaking it down into smaller batches.
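As a small illustration of the batching idea, here is a Python sketch that only plans the batches; how each batch is then handed to the indexer (separate folders, separate runs) is left open, since that part is not described above. The batch size of 5000 comes from the observation in this post, and the function names are hypothetical.

```python
from pathlib import Path


def batches(items, size=5000):
    """Yield consecutive slices of items, each holding at most size entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_batches(folder, size=5000):
    """Group the .txt files in folder into batches of at most size files."""
    files = sorted(Path(folder).glob("*.txt"))
    return list(batches(files, size))


if __name__ == "__main__":
    # Stand-in for a folder listing: 10,000 made-up file names
    names = [f"resume_{n:05d}.txt" for n in range(10000)]
    for number, batch in enumerate(batches(names), start=1):
        print(f"batch {number}: {len(batch)} files, first is {batch[0]}")
```

With the 19,385-file folder from the first post, this would give three batches of 5000 and a final one of 4,385.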

Alex Craig, General Manager
"Avid Saltwater Fly Fisherman"
