CF2023 Collection Issues

Question

Well folks, while I'm waiting to resolve the html data field validation bug, I started working on collections.

In CF 4.51 I have 6 collections. 3 of them are:

Basically, these are populated with .txt files which were extracted from resumes in the following formats: .doc, .docx, .html, and .txt. There are roughly 20,000 files in each of them.

I am only able to index the folder ending in ho, and its collection only contains 8,396 documents, whereas the folder contains 19,385 .txt files. I presume the CF2023 collection should contain 19,385 documents?

The other two folders bomb out when I try to index them resulting in zero documents being populated in the corresponding collections. I have checked the source folders and they only contain .txt files. Thus, I have no clue as to why the processing fails.

It would require far fewer code changes if we were able to use our current collection scheme and simply add some code to accommodate .pdf resumes.

As an alternative approach, I tried creating a collection from a folder that contained 180 of the 60,040 raw resumes in all four of the aforementioned formats. And it did create 180 documents in the collection.

I am concerned that only 1 collection containing 60,040 documents would process too slowly. I would appreciate any opinions on this concern.

Thanks in advance for any help!

AlexCraig · Accepted Answer

I am returning to this thread to update it with a complete answer to the problem which is the primary subject of the thread.

I left it dangling because we won't be using the collections involved in our migration to CF 2023. But since I hate being beaten by these machines I decided to take the time to determine the root of the problem.

Bottom line: it turns out there were about a dozen corrupt .txt files among those used to populate each of the three collections. This was due to a minor bug in a custom tag we use in CF 4.51 to convert .doc and .docx files to .txt files for import into collections (importing Word docs directly into a collection was not possible with Verity collections in CF 4.51).
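For anyone hunting for similar bad files, here is a minimal Python sketch of one way to flag suspects before indexing. The thread does not say what made these files corrupt, so the heuristic (empty files, or files containing control bytes other than tab, line feed, form feed, and carriage return) is purely an assumption, and the names `looks_corrupt` and `find_suspect_files` are hypothetical.

```python
from pathlib import Path

# Control bytes that are normal in plain text: tab, LF, FF, CR.
ALLOWED_CONTROL = {9, 10, 12, 13}


def looks_corrupt(data: bytes) -> bool:
    """Heuristic only (an assumption, not the check used in the thread):
    flag empty files and files containing unexpected control bytes."""
    if not data:
        return True
    return any(b < 32 and b not in ALLOWED_CONTROL for b in data)


def find_suspect_files(folder: str) -> list[Path]:
    """Return the .txt files in `folder` that the heuristic flags."""
    return sorted(
        p for p in Path(folder).glob("*.txt") if looks_corrupt(p.read_bytes())
    )
```

Files flagged this way are only candidates; they would still need a manual look before being excluded from a collection.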

Even though, in the aggregate, it affected a very small percentage of search results, it is good to know, as it provides further impetus to complete the migration.

Also, while repopulating collections, I discovered that I was not able to process more than somewhere between 5,000 and 6,000 .txt files at a time before the process seemed to stall. It doesn't appear to be a memory problem. I tried to populate 10,000 files in a single batch with no corrupt records, and the process simply stopped. After I broke it down into two batches of about 5,000 each, both processed successfully. Bottom line: if you are having trouble populating a large collection, try breaking it down into smaller batches.
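As a rough illustration of the batching idea, here is a minimal Python sketch that splits a folder's .txt files into groups of at most 5,000. The batch size comes from the observation above; the function names are hypothetical, and how each batch is then fed to the collection (for example, by copying it to its own folder and indexing that folder) is left open.

```python
from pathlib import Path
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int = 5000) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_batches(folder: str, size: int = 5000) -> list[list[Path]]:
    """Group the .txt files in `folder` into batches, in sorted order."""
    files = sorted(Path(folder).glob("*.txt"))
    return [list(batch) for batch in batches(files, size)]
```

With the 19,385 files mentioned in the question, this would give three batches of 5,000 and one of 4,385.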

Charlie Arehart · Answer

Alex, I think I have good news for you.

1) First, some background: if such cf "collection" processing happens to "bomb out", "fail", or "process too slowly", the suspect to pursue is generally not cf itself--though I realize it can seem so.

Instead, look to the underlying implementation of the open source Solr engine that CF is relying on, since cf9, to do such index importing and searching. This is similar to how your CF 4 (or 5-8) relied upon the commercial Verity engine to do that collection processing then.

And though I appreciate how that may have been "rock solid" for decades and "never needed any tweaking", don't hear me suggesting you need to become an expert in Solr or Solr "tuning". No, it's likely something very simple to solve. The default implementation may simply not be suited to the volume you're pushing (which can be about more than simply the "number of documents" you're trying to index, and if all at once).

2) So as with any diagnosis of problems, a key can be to find any available logs (or enable them, or add other diagnostics). And the Solr that CF enables has such logs--and they can be made to log still more when needed. Also, as the Solr engine is based on Java, one can add Java monitoring/profiling tools to better understand it. (If one uses FusionReactor to monitor cf, the same license can be used to monitor Solr running on the same machine.)

2a) And that leads to another point (which may not apply to you, but I'm writing as much for others who may find this thread): if the person installing CF chose the option to enable Solr, you'll find the logs in the cfusion/jetty/logs folder. This is because CF implements solr under a jetty app server that's SEPARATE from CF and the CF process and jvm it runs on.

2b) If instead they downloaded and implemented it with the separate CF addon installer, then it will be in a folder they named on installing that (which might be a sibling to the coldfusion2023 folder as ColdFusionAdd-onServices--and same for prior versions of cf), which will again implement jetty as its own process (and service) and which will run in a jvm separate from CF.

All of the above does need to be understood even to implement the possibly simple solution, both to know where to find logs, and where to possibly make changes based on the diagnostics (and also where to enable such additional diagnostics, such as FR or jvm monitoring features).

3) Now to what I hope may be the simple solution: you may find indications in those Solr (jetty) logs that it has had an outofmemory error. This would indicate (usually) a need to simply increase the max heap size allocated to solr. You may find reference to that error in the file (for a given date) ending in stderrout.log in that jetty/logs folder. (I've added that last sentence newly since my first reply, to help folks reading from the top. I wasn't on a computer when I first offered this extended reply for Alex.)
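To save some scrolling through those logs, here is a minimal Python sketch that looks for out-of-memory errors in the files ending in stderrout.log. It assumes the error appears with the standard Java class name `java.lang.OutOfMemoryError`; the function names are hypothetical, and the log folder has to be supplied according to how Solr was installed (see 2a and 2b above).

```python
from pathlib import Path

# Standard Java exception class name; assumed to appear verbatim in the log.
OOM_MARKER = "java.lang.OutOfMemoryError"


def oom_lines(text: str) -> list[str]:
    """Return the lines of `text` that mention an out-of-memory error."""
    return [line for line in text.splitlines() if OOM_MARKER in line]


def scan_logs(log_dir: str) -> dict[str, list[str]]:
    """Map each *stderrout.log file name in `log_dir` to its matching lines."""
    hits: dict[str, list[str]] = {}
    for path in sorted(Path(log_dir).glob("*stderrout.log")):
        found = oom_lines(path.read_text(errors="replace"))
        if found:
            hits[path.name] = found
    return hits
```

An empty result only means this particular error was not logged; as noted in point 4 below, the logs may show something else entirely.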

Here again, where to do that depends on which of those two options above was used to install it. For the "normal" installation of it with CF, look in that cfusion/jetty folder for a file called jetty.lax. (In the "addons" install approach, I've confirmed the file name is the same.) And note that despite that odd file extension, the jetty.lax is just a plain text file. (This is another update since my first reply, as you'll see below that Alex wondered in reply about that. I'm clarifying it now for the sake of other readers.)

And in that file there's a line that has jvm args (and I'm now updating this reply with the exact name of the line) which starts with lax.nl.java.option.additional, and which has an xmx value. You'd want to consider increasing that, if indeed the logs indicate it's running out of "heap". You may find that simply doubling it is enough.
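To make the "doubling" concrete, here is a minimal Python sketch that takes one line of text and returns it with the first `-Xmx` value doubled. The function name is hypothetical, and the sample line in the usage note is an assumed shape, not a copy of a real jetty.lax; check the actual line in your own file.

```python
import re

# Matches a max-heap argument such as -Xmx512m or -Xmx2g.
XMX = re.compile(r"-Xmx(\d+)([kKmMgG])")


def double_xmx(line: str) -> str:
    """Return `line` with its first -Xmx value doubled, unit unchanged."""
    def bump(match: re.Match) -> str:
        return f"-Xmx{int(match.group(1)) * 2}{match.group(2)}"

    new_line, count = XMX.subn(bump, line, count=1)
    if count == 0:
        raise ValueError("no -Xmx setting found on this line")
    return new_line
```

For example, a hypothetical line `lax.nl.java.option.additional=-Xms256m -Xmx512m` would come back with `-Xmx1024m`, leaving the rest of the line untouched.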

Make a copy of that file before editing it, then change it, then restart the solr service (called the "coldfusion addon service", regardless of how it was installed above.) And first, make sure it restarts: if you make a mistake in editing that file, it may not start. Then test your operation.

If you still get an outofmemory error, try doubling it again (keeping an eye on available memory on your machine, of course). Sometimes a problem like yours needs less additional memory than you have on your phone. 🙂

4) Finally, the error in the logs may be something else. And perhaps other readers here will chime in with tweaks they've made to other aspects of how solr is configured.

For now, please consider what I've offered, and let us know if it might get you going, or what other info you may find or want to share.
