AlexCraig
Inspiring
February 2, 2024
Answered

CF2023 Collection Issues

  • February 2, 2024
  • 2 replies
  • 2000 views

Well folks, while I'm waiting to resolve the html data field validation bug, I started working on collections.

 

In CF 4.51 I have 6 collections.  3 of them are:

TEN_Apps_ag

TEN_Apps_ho

TEN_Apps_pz

Basically, these are populated with .txt files which were extracted from resumes in the following formats: .doc, .docx, .html, and .txt. There are roughly 20,000 files in each of them.

 

I am only able to index the folder ending in ho, and the resulting collection only contains 8,396 documents, whereas the folder contains 19,385 .txt files. I presume the CF2023 collection should contain 19,385 documents?

 

The other two folders bomb out when I try to index them, resulting in zero documents being populated in the corresponding collections. I have checked the source folders and they only contain .txt files, so I have no clue as to why the processing fails.
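For reference, the kind of call I'm making is the standard cfcollection/cfindex approach, roughly like this (the collection name and paths below are placeholders, not my actual code):

<!--- Create the collection, then index every .txt file in the source folder --->
<cfcollection action="create"
              collection="TEN_Apps_ag"
              path="C:\ColdFusion2023\cfusion\collections\">

<cfindex collection="TEN_Apps_ag"
         action="refresh"
         type="path"
         key="D:\resumes\txt\ag\"
         extensions=".txt"
         recurse="true">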

 

It would require far fewer code changes if we were able to use our current collection scheme and simply add some additional code to accommodate .pdf resumes.

 

As an alternative approach, I tried creating a collection from a folder that contained 180 of the 60,040 raw resumes in all four of the aforementioned formats, and it did create 180 documents in the collection.

 

I am concerned that a single collection containing 60,040 documents would process too slowly. I would appreciate any opinions on this concern.

 

Thanks in advance for any help!

This topic has been closed for replies.

2 replies

AlexCraig
AlexCraig (Author) · Correct answer
Inspiring
February 6, 2024

I am returning to this thread to update it with a complete answer to the problem which is the primary subject of the thread.

I left it dangling because we won't be using the collections involved in our migration to CF 2023. But since I hate being beaten by these machines, I decided to take the time to determine the root of the problem.

Bottom line, it turns out there were about a dozen corrupt .txt files among those used to populate each of the three collections. This was due to a minor bug in a custom tag we use in CF 4.51 to convert .doc and .docx files to .txt files for import into collections. (Importing Word docs into a collection directly was not possible with Verity collections in CF 4.51.)

Even though, in the aggregate, it affected a very small percentage of search results, it is good to know, as it provides further impetus to complete the migration.

Also, while repopulating the collections, I discovered that I was not able to process more than somewhere between 5,000 and 6,000 .txt files in one run before the process seemed to stall. It doesn't appear to be a memory problem. I tried to populate 10,000 files in a single batch with no corrupt records, and the process simply stopped processing. After I broke it down into batches of about 5,000 each, both processed successfully. Bottom line: if you are having trouble populating a large collection, try breaking the files down into smaller batches.
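For anyone wanting a starting point, the batching can be approximated with something roughly like this (the paths, collection name, and batch size are placeholders, and this is a sketch rather than my production code):

<!--- List the source .txt files, then index them one at a time,
      pausing after each batch of ~5,000 to let Solr catch up --->
<cfdirectory action="list" directory="D:\resumes\txt\ag" filter="*.txt" name="qFiles">

<cfset batchSize = 5000>
<cfloop from="1" to="#qFiles.recordCount#" index="i">
    <cfindex collection="TEN_Apps_ag"
             action="update"
             type="file"
             key="#qFiles.directory[i]#\#qFiles.name[i]#">
    <cfif i mod batchSize eq 0>
        <cfset sleep(2000)>
    </cfif>
</cfloop>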

Alex Craig, General Manager, "Avid Saltwater Fly Fisherman"
Charlie Arehart
Community Expert
February 2, 2024

Alex, I think I have good news for you.

 

1) First, some background: if such cf "collection" processing happens to "bomb out", "fail", or "process too slowly", the suspect to pursue is generally not cf itself--though I realize it can seem so.

 

Instead, look to the underlying implementation of the open source Solr engine that CF has relied on, since CF9, to do such index importing and searching. This is similar to how your CF 4 (or 5-8) relied upon the commercial Verity engine to do that collection processing then.

 

And though I appreciate how that may have been "rock solid" for decades and "never needed any tweaking", don't hear me suggesting you need to become an expert in Solr or Solr "tuning". No, it's likely something very simple to solve. The default implementation may simply not be suited to the volume you're pushing (which can be about more than simply the "number of documents" you're trying to index, and if all at once).

 

2) So as with any diagnosis of problems, a key can be to find any available logs (or enable them, or add other diagnostics). And the Solr that CF enables has such logs--and they can be made to log still more when needed. Also, as the Solr engine is based on Java, one can add Java monitoring/profiling tools to better understand it. (If one uses FusionReactor to monitor CF, the same license can be used to monitor Solr running on the same machine.)

 

2a) And that leads to another point (which may not apply to you, but I'm writing as much for others who may find this thread): if the person installing CF chose the option to enable Solr, you'll find the logs in the cfusion/jetty/logs folder. This is because CF implements Solr under a Jetty app server that's SEPARATE from CF and from the CF process and JVM it runs in.

 

2b) If instead they downloaded and implemented it with the separate CF add-on installer, then it will be in a folder they named when installing that (which might be a sibling to the coldfusion2023 folder, such as ColdFusionAdd-onServices--and the same for prior versions of CF), which will again implement Jetty as its own process (and service) and which will run in a JVM separate from CF.

 

All this stuff above does need to be understood even to implement the possibly simple solution, both to know where to find the logs and where to possibly make changes based on the diagnostics (and also where to enable such additional diagnostics, such as FR or JVM monitoring features).

 

3) Now to what I hope may be the simple solution: you may find indications in those Solr (Jetty) logs that it has had an OutOfMemory error. This would indicate (usually) a need to simply increase the max heap size allocated to Solr. You may find reference to that error in the file (for a given date) ending in stderrout.log in that jetty/logs folder. [I've added that last sentence since my first reply, to help folks reading from the top. I wasn't on a computer when I first offered this extended reply for Alex.]

 

Here again, where to do that depends on which of those two options above was used to install it. For the "normal" installation of it with CF, look in that cfusion/jetty folder for a file called jetty.lax. (In the "add-ons" install approach, I had to check later where to change that, as I was writing on my phone; I've since confirmed the file name is the same.) And note that despite that odd file extension, jetty.lax is just a plain text file. (This is another update since my first reply, as you'll see below that Alex wondered about that in his reply. I'm clarifying it now for the sake of other readers.)

 

And in that file there's a line that has JVM args (I said I'd update this reply with the exact name of the line, and have now done so): it starts with lax.nl.java.option.additional, and it has an -Xmx value. You'd want to consider increasing that, if indeed the logs indicate it's running out of "heap". You may find that simply doubling it is enough.
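To illustrate (the other arguments on that line vary by CF version, so the "..." below just stands in for whatever is already there; only the -Xmx value changes):

lax.nl.java.option.additional=... -Xmx512m ...

would become, after doubling the max heap:

lax.nl.java.option.additional=... -Xmx1024m ...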

 

Make a copy of that file before editing it, then change it, then restart the Solr service (called the "ColdFusion add-on service", regardless of how it was installed above). And first, make sure it restarts: if you make a mistake in editing that file, it may not start. Then test your operation.

 

If you still get an outofmemory error, try doubling it again (keeping an eye on available memory on your machine, of course). Sometimes a problem like yours needs less additional memory than you have on your phone. 🙂 

 

4) Finally, the error in the logs may be something else. And perhaps other readers here will chime in with tweaks they've made to other aspects of how solr is configured.

 

For now, please consider what I've offered, and let us know if it might get you going, or what other info you may find or want to share. 

/Charlie (troubleshooter, carehart.org)
AlexCraig
AlexCraig (Author)
Inspiring
February 3, 2024

As I suspected, rebooting the server did not help.

Did a bit more research and took a look at the Application log.  It said there was an error at line 130 of indexcollection.cfm.

Line 130 is:

"NUL EOT ETX J SOH NUL DC3 J solr_alias_required BS ETX L SOH NUL SUB L An alias name is required. BS ETX T SOH NUL SUB"

 

I'm guessing this is misleading and that it choked for another reason, as I was able to create a collection without an alias for a much smaller amount of data.

 

Beyond that, I am clueless!

Alex Craig, General Manager, "Avid Saltwater Fly Fisherman"
AlexCraig
AlexCraig (Author)
Inspiring
February 4, 2024

Alex:

  1. Sorry for being MIA for a couple of days in replies.
  2. No, it's NOT the jvm.config you should have changed. That controls CF, not the addon service. Please set that back to the original values. 
  3. Instead, yes, it's the jetty.lax file, as I had said. (And I have confirmed that it's that file regardless of whether one has implemented the addon service via the CF installer or via the available addon service installer.)
  4. Sorry to hear you were confused about editing it, not recognizing the file extension. But yes, any editor would do, as you found with np++.
  5. No, it was not THAT line (which said the word "args"). What I had said was first, "there's a line that has jvm args" and then that "which has an xmx value." Somehow, you lost track of that between finding and then opening the file. 🙂
  6. Instead, as I can now report, the line starts with "lax.nl.java.option.additional". THAT has the xmx arg I was referring to. And I just confirmed in both a cf2023 and a cf2021 version of that jetty.lax, the default is -Xmx512m. And THAT is what I was proposing you double, so to 1024 to start.
  7. (I would recommend you make that change back to the CF jvm.config file and restart CF, just in case it's using a lot of memory now because you told it it could grow to 4g. After restarting that, restarting this add-on service should have no problem using 500m more.)
  8. Now, do your test. And if you find there's any problem, then double it again (and restart the add-on service).
  9. As for the log you shared, that's what's called the "request" log, and it can be useful--but it would not show the heap error I wondered about. Have you looked at the other log file in that folder (for the day you had the error, as it rotates each day that service is running), whose name ends in stderrout.log (like 2024_02_03.stderrout.log)? Don't ignore it just because the OS reports it as a 0-byte file. Sometimes that is lying, and there IS content in the file.

 

Hope among all these, we get you going.


Tried with a value of 1024.  Then went to 2048.  Restarted the addon service each time.  No joy.

While I'd have been interested in getting them populated just for the sake of accomplishment, at this point those collections will become superfluous, all things considered, including the large amount of code changes needed to implement Solr, and Solr's ability to deal with .pdf docs negating the need for a custom .dll.

 

Might as well go with a single Resumes collection now that I've gotten it to populate, and worry about splitting it into multiple categories or collections as warranted.
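From what I've read, Solr collections support a category attribute on both cfindex and cfsearch, so splitting a single Resumes collection logically might look roughly like this (the names and paths are placeholders, and this is untested):

<!--- Tag each document with a category at index time... --->
<cfindex collection="Resumes"
         action="update"
         type="file"
         key="D:\resumes\txt\ag\12345.txt"
         category="ag">

<!--- ...then restrict searches to that category --->
<cfsearch collection="Resumes"
          criteria="project manager"
          category="ag"
          name="qResults">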

 

BTW, I will also need to populate a TPI_Jobs collection using about a half dozen varchar datafields from a SQL Server table.

 

I don't suppose you can lay your hands on a sample .html page with the sample code format I need to use to get that job done?  It would save me a lot of research grunt work.
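In case it helps frame the question, my guess based on the docs is something roughly like this (the datasource, table, and column names are made up), but I'd appreciate confirmation:

<!--- Query the job fields, then feed the query into the collection --->
<cfquery name="qJobs" datasource="myDSN">
    SELECT job_id, job_title, job_city, job_state, job_summary, job_requirements
    FROM   Jobs
</cfquery>

<cfindex collection="TPI_Jobs"
         action="refresh"
         type="custom"
         query="qJobs"
         key="job_id"
         title="job_title"
         body="job_title,job_city,job_state,job_summary,job_requirements">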

Alex Craig, General Manager, "Avid Saltwater Fly Fisherman"