We recently migrated our ColdFusion 2021 environment to ColdFusion 2023 on Windows Server 2022 using Oracle JDK 17.0.16. We've noticed that on ColdFusion 2023, whenever we restart the server, there is a chance that each instance will generate a JVM error log and dump file in Coldfusion2023\instance_name\bin. The log (named in the pattern hs_err_pidXXXX.log, where XXXX is a 4-digit number) reports an EXCEPTION_ACCESS_VIOLATION (see the error details at the bottom). Also, a 1 GB+ memory dump file named hs_err_pidXXXX.mdmp is generated.
We have various ideas for how to clean up these files, but are concerned that they indicate some problem in our CF2023 environment. So far, we have not seen any sign that the environment is unstable, and the files are only generated when the server is rebooted, not when manually shutting down the CF services. We use the Adobe-provided Oracle JDK build, but have reproduced this with Red Hat OpenJDK. We have an ongoing ticket with Adobe, who have tried various combinations of JVM flags with us, but we would like to see if anyone in the community has encountered something similar.
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ff9ab9557e0, pid=4428, tid=11096
#
# JRE version: Java(TM) SE Runtime Environment (17.0.16+12) (build 17.0.16+12-LTS-247)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.16+12-LTS-247, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, windows-amd64)
# Problematic frame:
# C [net.dll+0x57e0]
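For anyone comparing several of these logs across instances, here is a minimal Python sketch that pulls the exception, pid/tid and problematic frame out of an hs_err header. The function name `parse_hs_err_header` and the regular expressions are illustrative assumptions, written against the header lines shown above rather than any official log format specification.

```python
import re

# Matches e.g. "# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x..., pid=4428, tid=11096"
_EXC = re.compile(
    r"#\s+(EXCEPTION_\w+) \((0x[0-9a-fA-F]+)\) at pc=(0x[0-9a-fA-F]+), pid=(\d+), tid=(\d+)"
)
# Matches e.g. "# C [net.dll+0x57e0]"
_FRAME = re.compile(r"#\s+(\w)\s+\[(.+?)\+(0x[0-9a-fA-F]+)\]")


def parse_hs_err_header(text):
    """Return a dict with whatever header fields could be found in the log text."""
    info = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        exc = _EXC.match(line)
        if exc:
            info["exception"], info["code"], info["pc"] = exc.group(1, 2, 3)
            info["pid"], info["tid"] = int(exc.group(4)), int(exc.group(5))
        elif line.startswith("# Problematic frame:") and i + 1 < len(lines):
            # The frame itself is on the line after the "Problematic frame:" label.
            frame = _FRAME.match(lines[i + 1])
            if frame:
                info["frame_type"], info["module"], info["offset"] = frame.groups()
    return info
```

Grouping the results by `module` would quickly show whether every crash lands in net.dll or whether other native libraries are involved.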
OK Matthew, so that's a couple of things. 🙂 Let me offer a few thoughts on each, which I hope may help you solve this.
1) So first, that's a valuable clue that you've observed it happens ONLY on cf startup after a reboot, but not on subsequent cf startups. That could suggest there's some race condition of something that CF needs to call upon but which has itself NOT YET been completely started, after a reboot (but is working fine later). Of course, it's hard to know FOR SURE what that would be, and it could take additional work to figure that out.
a) So one thing you could do is change the Windows service for CF so that instead of being of type "automatic" it's "automatic (delayed start)". That causes it to start about 2 minutes after the box reboots.
As for making such an auto delayed service wait still longer, there is no built-in feature in the Services panel to control that, but there IS a pretty simple registry tweak which allows you to control that delay time for a given service. For details, see resources such as this one. (There are also tools that can facilitate that additional startup delay functionality.)
b) Another way to look at the problem is instead configure the CF service to start AFTER whatever service it may be conflicting with (need to await its completion first). If we KNEW what that service was, you could change the CF service definition to make it "dependent" on that/them.
Again, for now it's not clear what service may be in conflict (needs to start first). Given the error you show about a net.dll exception, which suggests it may be some need to await completion of the networking stack, one guess is the DNS and TCPIP stack. And to modify the CF2023 service to be dependent on those (await their having started first), you could issue this command (at an admin command prompt):
sc config "ColdFusion 2023 Application Server" depend= Tcpip/Dnscache
Do you find that this problem ALWAYS happens after each reboot? If so, then try rebooting after this change and see if it still happens. If it doesn't, great. But again, it's not clear these ARE what CF needs to await. BTW, to remove those dependencies, this would work:
sc config "ColdFusion 2023 Application Server" depend= ""
In conclusion, it's certainly possible that JUST changing the service to BE delayed will suffice. Try that first. Or changing the dependencies may help. Perhaps try that next. Finally, tweaking the delay time may be necessary. Perhaps try that last.
2) Finally, as for preventing the mdmp (crash minidump), you should be able to disable it by adding the JVM arg:
-XX:-CreateMinidumpOnCrash
As for putting that into CF, you may know you can do that either in the CF Admin ("java and jvm" and its "jvm arguments" field) or in the cfusion/bin/jvm.config file (on its java.args line), which is what the Admin edits. Either way, note that the dashes are important, and you do NOT want to put that on a new line, but instead add it (such as to the end) of the args there. There should also be a space before and after any such arg you add.
I would recommend you make a copy of the jvm.config file before you edit it (or change the cf admin jvm settings), so you can recover if you make some mistake. And also, I recommend you restart CF immediately to make sure it still starts (don't wait for the box reboot--as CF may not start at all if you make a mistake).
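If you'd rather script that edit than do it by hand, here is a minimal Python sketch of the idea: back up the file, then append the flag to the existing java.args line, separated by a space, and only if it is not already there. The function names `add_jvm_arg` and `patch_jvm_config` are hypothetical, and the sketch assumes the args line starts with `java.args=`, so check that against your own jvm.config first.

```python
import shutil
from pathlib import Path


def add_jvm_arg(config_text, arg):
    """Return config_text with arg appended to the java.args line (if missing)."""
    out = []
    found = False
    for line in config_text.splitlines():
        if line.startswith("java.args="):
            found = True
            existing = line[len("java.args="):].split()
            if arg not in existing:
                # Same line, single space before the new arg, never a new line.
                line = line.rstrip() + " " + arg
        out.append(line)
    if not found:
        raise ValueError("no java.args line found")
    return "\n".join(out) + "\n"


def patch_jvm_config(path, arg):
    """Copy jvm.config to jvm.config.bak, then write the patched content."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(add_jvm_arg(path.read_text(), arg))
```

As with a manual edit, restart CF right after running something like this so that any mistake shows up immediately, and restore the .bak copy if it does not start.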
Let us know how things go, if you try any of these.
Hi Charlie,
Thanks for your thorough response - it's much appreciated! Our team will review the suggestions and let you know how things go.
Thanks,
Matt
It looks to me like there is something wrong with your Java 17.0.16 installation. Or perhaps there is a problem with the way the installation works with ColdFusion 2023. I would therefore suggest you uninstall, then reinstall Java.
The steps I would recommend:
For that matter, reinstalling 17.0.16 alone could prove if there was "something wrong with your Java 17.0.16 installation"--I mean if doing that somehow fixed things.
Note a key aspect of bkbk's suggestion: you do want to stop CF before installing the JVM, if the update would be placed into the same folder as the previous version, which is what the installer does by default. If you'd not been careful about that in this most recent install, Matthew, perhaps that's an issue. Still, it would seem curious if that would only affect CF startup after a reboot, as you've explained.
Anyway, let's see what you may find, whichever steps you may follow: yours or either of ours.
Thanks, Charlie, for the remarks.
Hi,
We did actually do an uninstall and reinstall of the Adobe-provided Oracle JDK. We shut down the services and reinstalled into the same folder (we don't use the default directory) and were able to reproduce the issue. As I mentioned, we also installed the Red Hat OpenJDK and ran the same test - we were able to reproduce the issue regardless of which JDK we were using.
Any news, @matthewr4300865, as the week ends? Not pressing, just curious.
Hi Charlie,
I reviewed your post - I think I wasn't clear that the errors occur on the shutdown side of rebooting. However, in the spirit of your suggestions to review service startup order, we did do a test where, instead of restarting directly, we ran a batch file to shut down each instance before proceeding with the restart. This worked, so I assume the issue is that Windows is not allowing our ColdFusion services to shut down cleanly before restarting, and is causing them to crash instead.
We can go with this approach of running a batch file to shut down services - we'll need to handle cases where a service doesn't shut down in a reasonable amount of time, and it won't help if a reboot is initiated by some other means. So far, we have no other indicator that this is causing a problem, but I am still concerned that we experienced these net.dll crashes in the first place, when this isn't the experience of other ColdFusion 2023 sites.
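As a rough illustration of that stop-before-reboot approach, including the case where a service does not stop in a reasonable time, here is a minimal Python sketch. The function name `stop_services`, the default timeout and the injectable `run` parameter are assumptions made for the example; it relies only on `net stop` blocking until the service has stopped and returning a nonzero exit code on failure.

```python
import subprocess


def stop_services(names, timeout_s=60, run=subprocess.run):
    """Stop each Windows service with `net stop`.

    Returns the names that failed or did not stop within timeout_s seconds.
    Note: `net stop` also returns nonzero when a service is already stopped,
    so such services are reported here as well.
    """
    not_stopped = []
    for name in names:
        try:
            result = run(["net", "stop", name], timeout=timeout_s, capture_output=True)
            if result.returncode != 0:
                not_stopped.append(name)
        except subprocess.TimeoutExpired:
            # The service did not stop in time; leave it for the caller to decide.
            not_stopped.append(name)
    return not_stopped
```

A wrapper script could call this with the instance service names (for example "ColdFusion 2023 Application Server") and only proceed to `shutdown /r /t 0` when the returned list is empty, logging or alerting otherwise. It would not cover reboots started by other means, as noted above.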
FYI - we did try the flag to suppress the minidump file and it did work, so the worst we will have to manage is the error logs.
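For managing the leftover error logs, one possible sketch in Python is a small cleanup routine that deletes hs_err files older than a chosen age from an instance's bin folder. The function name `prune_hs_err` and the 14-day default are assumptions for illustration; the file patterns follow the hs_err_pidXXXX.log and hs_err_pidXXXX.mdmp names described earlier in the thread.

```python
import time
from pathlib import Path


def prune_hs_err(bin_dir, max_age_days=14, now=None):
    """Delete hs_err_pid*.log / hs_err_pid*.mdmp files older than max_age_days.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for pattern in ("hs_err_pid*.log", "hs_err_pid*.mdmp"):
        for f in Path(bin_dir).glob(pattern):
            # Only touch regular files whose last modification is before the cutoff.
            if f.is_file() and f.stat().st_mtime < cutoff:
                f.unlink()
                removed.append(f.name)
    return sorted(removed)
```

Run from a scheduled task against each instance's bin folder, this keeps recent logs available for Adobe support while stopping old ones from piling up.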
Thanks for the update, and the clarification. Sorry I misinterpreted it as a problem of startup rather than shutdown. So that's indeed different. And glad a couple of ideas helped still.
And no, there should not be the crash on shutdown either. So the next question would be: how long is cf taking to stop when you do that at the cli? It shouldn't be more than 30 seconds.
There WILL be an answer AS TO WHY it takes longer. And I already have ideas and diagnostics to consider. But let's start with this first question. It would be a lot to write to convey what I have in mind, and if one thing didn't work then the results of it could influence what I'd propose as the next.
If we might solve the problem in less than an hour of a shared desktop consulting session together, might your folks be open to that? You'd not pay for time you didn't value. For more, including my rates, approach, satisfaction guarantee, online calendar and more, see carehart.org/consulting.
We did run some metrics on how long it was taking for each instance to shut down using net stop from a batch file, and all instances were shutting down within 12 to 13 seconds (and not generating any errors). Interestingly, Adobe support has provided a patch - we'll test this out next week and see if this resolves the issue.
The cause of the issue is obviously the Java 17.0.16 that ColdFusion is using.***
So, the most likely solution is:
*** The likely causes:
1. Bitness:
Double-check for 32/64-bit mismatches. Are Java, Windows and ColdFusion all 64-bit?
2. Native networking and socket shutdown issue:
The error message tells us the original problem was in net.dll (the native networking library). This tells me that the crash is very likely triggered during cleanup of network resources (sockets, threads) when a ColdFusion instance stops or restarts.
On restart, ColdFusion may not gracefully shut down threads and connectors. So if a socket is still in a weird state, the native layer will complain.
3. Mismatch between Windows Server 2022 and JDK 17.0.16 or a bug in that specific native “net.dll” version:
The native library version (on Windows Server 2022) plus how ColdFusion uses it might have hit a bug.
4. Race conditions on restart:
Does your application have persistent network jobs ("blocking" HTTP requests)?
On restart there might be network resources (threads, sockets, timers) that are still active when the Java Virtual Machine or ColdFusion attempts to tear them down. The fact it always happens on restart is a strong clue that such a race condition may be involved.
5. External native interference:
Antivirus, firewall, Windows network filter drivers, VPN/network adapter drivers and custom network libraries imported into ColdFusion may interfere with net.dll. So look into them.
Bkbk, it seems you've missed that Matthew has clarified that CF does run. It only sometimes fails, and that only on shutdown.
The JVM issues you outline are indeed among the valid reasons someone may find CF to NOT start at all. But that's not Matthew's problem here.
We updated to the latest 17.0.17 Oracle JDK as provided by Adobe, and are reproducing the issue there. For #1, we're using 64-bit OS, Java and ColdFusion. We did run a test where we shut down services before shutting down the server and didn't produce any errors, so it's possibly some issue with not gracefully shutting down. Adobe did just provide a patch, so we'll see if that helps.
Hi @matthewr4300865, thanks for taking the time to test the suggestions and for updating us. Your findings point even more towards a net.dll bug.
Here then are two more suggestions in that direction (do both)