I have a web application that makes extensive use of websockets for call center agents. There are up to 50 simultaneous users who are triggering messages and listening on channels for chat, phone calls and broadcast messaging. After about a month in production with no issues, we had two failures yesterday where messages stopped being processed. So far, it is unclear whether the .cfc extension handlers were receiving the browser websocket messages during the failure, but it is certain that the server was not publishing back out to the browser listeners.
I searched the CF and IIS logs and found NO record of any anomalous events during the stalled periods (application.log, exception.log, etc.). Restarting the ColdFusion Application Service causes messages to start flowing immediately, but there is no recovery (no catching up) for messages sent during the stall.
All other website functionality appeared to be working properly during the stall.
I'm awaiting the next failure so that I can diagnose further, but I'd rather it not fail (duh).
Has this issue been reported previously (I could not find a similar report)?
Are there any limits to the number of simultaneous registered websocket clients in CF11 (ultimately need headroom for 250 simultaneous users)?
Is there any other location where errors might have been captured?
Is there a record or queue for websocket messages that are processed by the server before/during publishing?
Is there any way to detect this issue in real-time without building a heartbeat websocket?
Windows 2012 Server R2
ColdFusion 11 Update 2 (rolled back immediately in December from Update 3 due to its breaking of websockets over SSL)
Websockets over SSL using RapidSSL cert over Port 8577
All comments and assistance welcomed.
Was there any solution found to this issue?
We are having the same issue, again in a call centre environment. The subscribed channel will suddenly, without warning, no longer exist. All other website functions work as normal, and there is no error in any log. The only way to get things running again is to restart the CF Application Service. We use CF's built-in websocket server rather than the proxy via IIS, which we were considering moving to.
We are running CF2021 Standard on Windows Server 2019.
Darrell (and Kevin), I can report that I was helping a client just last week who was having the same problem. They too found the client code getting a response from the server that the channel somehow "doesn't exist or is not running".
(FWIW, unlike Darrell, they WERE using the iis connector setup via cf's wsproxyconfig tool. So it seems that's not a factor in this problem.)
And like you both, I helped them confirm there were no cf logs obviously reflecting the issue. That said, I suggested they look back through the logs since the problem could have happened some minutes or hours before they were aware of the problem, and some log entry/entries may have happened at a point in time which may be harder to spot. I'd not yet heard back, but I leave that as an idea for you here.
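To make that look-back less tedious, here is a minimal sketch of how one might pull out every log entry from the hours before the stall was noticed. It assumes the usual CF log layout of quoted, comma-separated columns in the order severity, thread, date, time, application, message, with dates as MM/DD/YY; check a line of your own logs first, as that layout is an assumption here. The names `entries_before`, `noticed_at` and `lookback_hours` are made up for illustration.

```python
import csv
from datetime import datetime, timedelta

def entries_before(lines, noticed_at, lookback_hours=6.0):
    """Return (timestamp, severity, message) for rows in the lookback window.

    ASSUMPTION: each row is severity, thread, date (MM/DD/YY),
    time (HH:MM:SS), application, message. Adjust to match your logs.
    """
    start = noticed_at - timedelta(hours=lookback_hours)
    hits = []
    for row in csv.reader(lines):
        if len(row) < 6:
            continue  # blank lines or fragments that are not full entries
        try:
            stamp = datetime.strptime(row[2] + " " + row[3], "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # header row, or a date format other than the assumed one
        if start <= stamp <= noticed_at:
            hits.append((stamp, row[0], row[5]))
    return hits
```

You would pass it an open file object for application.log, exception.log and so on (opened with `newline=""`), along with the time someone first reported the channel as gone, then read through whatever comes back.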
I also confirmed that fusionreactor (the cf monitor they were using--and which I love) had no features for monitoring WebSocket/wss calls (in or out). I've asked if they may ever add that, but for now they've said it's not on their road map.
Finally, I also asked Adobe if they had anything in their PMT monitor (new since cf2018), as again I've seen nothing in it. They've not responded other than saying the question was passed to the cf team.
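With neither monitor watching websocket traffic, the fallback is some form of heartbeat check, even though the original post hoped to avoid one. The sketch below shows only the detection logic, not a CF or monitor feature: something subscribed to a channel calls `message_received()` whenever a message arrives, and a scheduled check asks `is_stalled()`. The class name, method names and the silence threshold are all hypothetical, and it assumes a channel that normally sees traffic (or a periodic test message) within that threshold.

```python
import time

class StallWatchdog:
    """Flags a channel as stalled when nothing has arrived for too long."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence = max_silence_seconds
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = clock()        # treat start-up as the first "message"

    def message_received(self):
        # Call this from whatever code receives messages on the channel.
        self.last_seen = self.clock()

    def is_stalled(self):
        # True once the silence exceeds the configured threshold.
        return self.clock() - self.last_seen > self.max_silence
```

A check like this would at least record when the channel went quiet, which narrows the window to search in the logs.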
So it's indeed interesting to hear of this hitting at least you three. I suspect it happens more widely, but like you guys I've not found other discussion of it--regarding cf or otherwise (as it's almost certainly based on an underlying java--and perhaps tomcat--library). It is indeed a mystery, for now.
If nothing else, it would be great if we could find a way to reset things short of restarting cf. But without a clue of where the problem lies, that too remains a mystery for now.
It's indeed an interesting thought to consider whether it may be a matter of hitting a limit on how many connections are allowed in cf standard. When the feature first came out (in cf10?), that number was quite limited. And it was lifted in the next release. I don't recall if it was made unlimited.
I sure would expect cf to give a far clearer error if that's what these are hitting. And indeed that's also why I'd think there would be SOME means to monitor the state of such a limit. Again, I await any clarification from Adobe on that.
I really look forward to finding resolution on this for you all.
Sorry I don't have answers, only commiseration for now (and that one other diagnostic possibility, of some log entry happening only at the time the channel fails).