How To: CF11 Clustering without Multicast (AWS)

Report · Aug 15, 2014

This week I've been working on getting clustering set up for a client. Initially we were using CF10 with the latest patches. Ideally we wanted non-sticky load balancing with session replication: really high availability, with the option to reboot a server at any time without waiting for session draining or losing customers if a node goes down. Adam Cameron points out, in "Problem with session replication with CF10 clustering" on Adam Cameron's CFML Blog, that CF10 has an issue where there is no option to turn on session replication. After trying various fixes I still could not get sessions to replicate, so we moved to CF11, which resolves that issue. There is a bug open for CF10 with some weird responses, but I never saw any sort of fix for it.

CF11, as noted, solves this odd issue, so I thought we were in the clear. Following the limited cluster setup guides found online, there is some manual configuration to do on the remote instance. First, I am not sure if the default cfusion instance just can't be used as a member of a cluster but I had a hard time ever getting it to work. So both the local and remote instance use new CF11 instances created from within the Instance Manager. The instructions in "Adobe ColdFusion 10 * Enabling clustering for load balancing and failover" are mostly correct in that you have to copy the <Cluster> node to the remote instance. One issue pointed out in a few places is that the <Cluster> block has to go inside the <Host> node, not after it. CF10, CF11 and maybe even CF9 put the block after the </Host> tag (and the documentation suggests putting it there), which, in my experience, does not work.
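
As a rough sketch of the placement described above (a fragment, not a complete server.xml): the Host name "localhost" matches the log output later in this thread, and the children of <Cluster> are elided here on purpose. Keep whatever other attributes and child elements your instance's <Host> already has.

```xml
<Host name="localhost">
  <!-- Existing Host children stay exactly as they are. -->

  <!-- The Cluster block goes here, INSIDE the Host element. -->
  <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
           channelSendOptions="8" channelStartOptions="3">
    <!-- Manager, Channel, Valve and ClusterListener elements -->
  </Cluster>
</Host>
<!-- A Cluster block placed here, after the closing Host tag, is the layout reported above as not working. -->
```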

After everything was configured and I started up my test, I could not get the remote node to respond at all. Looking in the CF error log, I constantly saw this line:

INFO: Manager [/]: skipping state transfer. No members active in cluster group.

Digging into the Tomcat clustering discussions, this basically means the cluster couldn't find the remote instance. By default CF uses the multicast cluster support in Tomcat and doesn't offer an option to do anything different. Researching this, I found that AWS does not support broadcast nor multicast in EC2. Further research showed how Tomcat can be configured for static cluster membership, so I modified the server.xml files to match and voilà: a cluster with session replication. Using the ELB on AWS we have sticky sessions disabled (basically round-robin style requests) and the requests bounce evenly between the instance members. The session IDs, however, stay the same on each page load even though the request is going to a different host.

So here is what the cluster node of the server.xml looks like:

<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="8" channelStartOptions="3">
  <Manager notifyListenersOnReplication="true" expireSessionsOnShutdown="false" className="org.apache.catalina.ha.session.DeltaManager"/>
  <Channel className="org.apache.catalina.tribes.group.GroupChannel">
    <!-- Multicast membership, commented out because it does not work on EC2 -->
    <!--<Membership port="45564" dropTime="3000" address="228.0.0.4" className="org.apache.catalina.tribes.membership.McastService" frequency="500"/>-->
    <!-- This instance listens for cluster traffic on port 4001 -->
    <Receiver port="4001" autoBind="100" address="auto" selectorTimeout="5000" maxThreads="6" className="org.apache.catalina.tribes.transport.nio.NioReceiver"/>
    <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
      <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"/>
    </Sender>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpPingInterceptor"/> <!-- ADDED -->
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>
    <!-- ADDED: static entry describing the remote instance (its receiver port and IP) -->
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
      <Member className="org.apache.catalina.tribes.membership.StaticMember"
        port="4002"
        host="172.31.33.220"
        domain="delta-static"
        uniqueId="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}"
      />
    </Interceptor>
  </Channel>
  <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter=""/>
  <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve"/>
  <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>
  <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/>
</Cluster>

You can see the <Membership> node is commented out (this is the multicast function). The TcpPingInterceptor and the StaticMembershipInterceptor are added. The receiver port on this instance is 4001 and on the remote instance it is 4002, so the static member entry on this instance uses port 4002 to contact the remote host, and vice versa. In other words, the remote instance uses the same <Cluster> node with the ports switched and the host IP address changed on the static member. The uniqueId then rotates on each member, going from {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} to {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0}.
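
To make the "ports switched" description concrete, here is a sketch of only the two elements that differ on the remote instance; everything else in its <Cluster> node stays the same as on the first instance. FIRST_INSTANCE_PRIVATE_IP is a placeholder, not a real value: substitute the private IP of the instance whose receiver listens on 4001.

```xml
<!-- Remote instance: its own receiver listens on 4002 -->
<Receiver port="4002" autoBind="100" address="auto" selectorTimeout="5000" maxThreads="6" className="org.apache.catalina.tribes.transport.nio.NioReceiver"/>

<!-- Remote instance: the static member entry points back at the first instance (receiver port 4001) -->
<Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
  <Member className="org.apache.catalina.tribes.membership.StaticMember"
    port="4001"
    host="FIRST_INSTANCE_PRIVATE_IP"
    domain="delta-static"
    uniqueId="{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0}"
  />
</Interceptor>
```

The domain value is kept identical on both sides, and each uniqueId describes the member being pointed at, not the instance holding the config.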

Of course each additional member of the cluster will mean manual changes to every existing member (to add a static member entry for the new node), but that seems a small price to pay to not have to move our entire environment off AWS.
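
As an illustration of that manual change, the sketch below shows what the first instance's static membership block might look like after a third node joins. It assumes Tomcat accepts more than one <Member> inside a single StaticMembershipInterceptor; the third node's port (4003), its THIRD_INSTANCE_PRIVATE_IP placeholder and its uniqueId are hypothetical values that simply continue the pattern above and were not part of the tested setup.

```xml
<Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
  <!-- Existing entry for the second instance, unchanged -->
  <Member className="org.apache.catalina.tribes.membership.StaticMember"
    port="4002"
    host="172.31.33.220"
    domain="delta-static"
    uniqueId="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}"
  />
  <!-- Hypothetical new entry for a third instance -->
  <Member className="org.apache.catalina.tribes.membership.StaticMember"
    port="4003"
    host="THIRD_INSTANCE_PRIVATE_IP"
    domain="delta-static"
    uniqueId="{2,3,4,5,6,7,8,9,10,11,12,13,14,15,0,1}"
  />
</Interceptor>
```

The same kind of entry would be needed on the second instance, and the new node would list both existing members.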

Report · Aug 19, 2014

Hi Scott,

Thanks for the very informative post. No doubt it will be helpful information for folks.

A couple of comments, if I may?

>default cfusion instance just can't be used as a member of a cluster but I had a hard time ever getting it to work. So both the local and remote instance use new CF11 instances created from within the Instance Manager.

I think you can use the default cfusion instance, though I prefer not to. Like you, I keep the default instance apart from the clustered instances so I can still manage the instances or the cluster if the need arises.

>AWS does not support broadcast nor multicast in EC2

Very interesting. I wonder whether it was some kind of AWS EC2 security group denying the default CF multicast port traffic between CF instances (each CF instance on a separate EC2 instance, I am guessing).

Regards, Carl.

Report · Aug 20, 2014

Thanks Carl. Maybe in my testing the default instance scenario never had the other proper configurations in place. Good to know.

From the EC2 perspective, Amazon has commented on their forums that they do not allow multicast or broadcast traffic. They have spent a few years asking what it would be used for and soliciting opinions, but I haven't seen any movement on allowing it.

Report · Aug 26, 2014

Hope I am not hijacking your excellent post.

Some details to add from my findings in an AWS EC2 environment.

From the CMD prompt, a clustered CF11 instance starting:

Aug 26, 2014 11:23:44 PM org.apache.catalina.ha.session.DeltaManager startInternal
INFO: Register manager / to cluster element Host with name localhost
Aug 26, 2014 11:23:44 PM org.apache.catalina.ha.session.DeltaManager startInternal
INFO: Starting clustering manager at /
Aug 26, 2014 11:23:44 PM org.apache.catalina.ha.session.DeltaManager getAllClusterSessions
INFO: Manager [/], requesting session state from org.apache.catalina.tribes.membership.StaticMember[tcp://172.31.21.168:4001,172.31.21.168,4001, alive=0, securePort=-1, UDP Port=-1, id={1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 }, payload={}, command={}, domain={100 101 108 116 97 45 115 116 97 ...(12)}, ]. This operation will timeout if no session state has been received within 60 seconds.
Aug 26, 2014 11:23:45 PM org.apache.catalina.ha.session.DeltaManager waitForSendAllSessions
INFO: Manager [/]; session state send at 8/26/14 11:23 PM received in 125 ms.
Aug 26, 2014 11:23:45 PM org.apache.catalina.ha.session.JvmRouteBinderValve startInternal
INFO: JvmRouteBinderValve started

From the CMD prompt, CF11 instance output when the other cluster member has been restarted:

Aug 26, 2014 11:22:47 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberDisappeared
INFO: Received member disappeared:org.apache.catalina.tribes.membership.StaticMember[tcp://172.31.25.175:4002,172.31.25.175,4002, alive=0, securePort=-1, UDP Port=-1, id={0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 }, payload={}, command={}, domain={100 101 108 116 97 45 115 116 97 ...(12)}, ]
Aug 26, 2014 11:23:06 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.StaticMember[tcp://172.31.25.175:4002,172.31.25.175,4002, alive=0, securePort=-1, UDP Port=-1, id={0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 }, payload={}, command={}, domain={100 101 108 116 97 45 115 116 97 ...(12)}, ]
Aug 26, 2014 11:23:06 PM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector performBasicCheck
INFO: Suspect member, confirmed alive.[org.apache.catalina.tribes.membership.StaticMember[tcp://172.31.25.175:4002,172.31.25.175,4002, alive=0, securePort=-1, UDP Port=-1, id={0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 }, payload={}, command={}, domain={100 101 108 116 97 45 115 116 97 ...(12)}, ]]

When running CF11 via services.msc (as you normally would), similar details are recorded in ColdFusion11\clustered_instance\logs\coldfusion-error.log. The latter part of the log shows the other clustered instance being stopped and started.

Aug 26, 2014 11:40:31 PM org.apache.catalina.ha.session.DeltaManager startInternal
INFO: Register manager / to cluster element Host with name localhost
Aug 26, 2014 11:40:31 PM org.apache.catalina.ha.session.DeltaManager startInternal
INFO: Starting clustering manager at /
Aug 26, 2014 11:40:31 PM org.apache.catalina.ha.session.DeltaManager getAllClusterSessions
INFO: Manager [/], requesting session state from org.apache.catalina.tribes.membership.StaticMember[tcp://172.31.21.168:4001,172.31.21.168,4001, alive=0, securePort=-1, UDP Port=-1, id={1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 }, payload={}, command={}, domain={100 101 108 116 97 45 115 116 97 ...(12)}, ]. This operation will timeout if no session state has been received within 60 seconds.
Aug 26, 2014 11:40:31 PM org.apache.catalina.ha.session.DeltaManager waitForSendAllSessions
INFO: Manager [/]; session state send at 8/26/14 11:40 PM received in 141 ms.
Aug 26, 2014 11:40:31 PM org.apache.catalina.ha.session.JvmRouteBinderValve startInternal
INFO: JvmRouteBinderValve started
Aug 26, 2014 11:40:31 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-8501"]
Aug 26, 2014 11:40:31 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["ajp-bio-8012"]
Aug 26, 2014 11:40:31 PM com.adobe.coldfusion.launcher.Launcher run
INFO: Server startup in 44274 ms
Aug 26, 2014 11:42:04 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberDisappeared
INFO: Received member disappeared:org.apache.catalina.tribes.membership.StaticMember[tcp://172.31.21.168:4001,172.31.21.168,4001, alive=0, securePort=-1, UDP Port=-1, id={1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 }, payload={}, command={}, domain={100 101 108 116 97 45 115 116 97 ...(12)}, ]
Aug 26, 2014 11:42:23 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.StaticMember[tcp://172.31.21.168:4001,172.31.21.168,4001, alive=0, securePort=-1, UDP Port=-1, id={1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 }, payload={}, command={}, domain={100 101 108 116 97 45 115 116 97 ...(12)}, ]
Aug 26, 2014 11:42:23 PM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector performBasicCheck
INFO: Suspect member, confirmed alive.[org.apache.catalina.tribes.membership.StaticMember[tcp://172.31.21.168:4001,172.31.21.168,4001, alive=0, securePort=-1, UDP Port=-1, id={1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 }, payload={}, command={}, domain={100 101 108 116 97 45 115 116 97 ...(12)}, ]]

Hope that adds to the usefulness of this thread.

Regards, Carl.
