Cluster member cannot rejoin cluster and logs unending loop of WARN in JBoss EAP Cluster
https://access.redhat.com/kb/docs/DOC-52966
Article ID: 52966 - Created on: Aug 7, 2009 2:33 PM - Last Modified: Mar 28, 2011 11:33 AM

Issue

The server is unresponsive and logs a large number of JGroups warnings.
server.log is filled with:

WARN [org.jgroups.protocols.pbcast.GMS] join(1.2.3.4:7900) sent to 1.2.3.4:7900 timed out, retrying

This is the unending loop of WARN messages described in JGRP-130: https://jira.jboss.org/jira/browse/JGRP-130?focusedCommentId=12333614#action_12333614
A race condition is also known to cause this in EAP 5.x: https://issues.jboss.org/browse/JGRP-1282

Environment

JBoss Enterprise Application Platform (EAP) 4.3.0, 5.0
JBoss Cluster

Diagnosis

This warning can also occur when the node is very busy, but in that case the warning eventually stops. Check how busy the node is.
Check the Failure Detection (FD) settings (timeout and max_tries) in the JGroups configuration ( $JBOSS_HOME/server/ /deploy/cluster-service.xml ).
Ensure that the address and port of the joining node and of the node the request is sent to are identical in the WARN message, as in: join(1.2.3.4:7900) sent to 1.2.3.4:7900
If they are not identical, it is not this issue but most likely a communication problem with other cluster nodes.

Resolution

Fixed in JGroups 2.8 ( https://jira.jboss.org/jira/browse/JGRP-130 ).
Workaround: before restarting a node, wait for ( FD timeout * max_tries ), which is 50 seconds by default in cluster-service.xml.

Root Cause

When a member P crashes and is then restarted, if FD is used and P is restarted *before* it is excluded by the other nodes, then P is a new member *under the same old address*. Since it lost all of its state (e.g. retransmission table), retransmission requests sent to the new P will fail. By default, the FD protocol has a timeout of 10 seconds and 5 retries, which gives 50 seconds. After that, the other nodes remove node P.
Then, when node P is restarted, it rejoins on a different channel. If the node is restarted before timeout * max_tries has elapsed, the issue occurs.

The fix in JGroups 2.8 was a design change that prevents the issue from occurring; it is not feasible to backport it to JBoss EAP 4.x or 5.x.

The typical cause of this issue is that a TCP clustering node that is the master leaves the cluster, its departure is not detected immediately by FD_SOCK, and the node is restarted before FD detects that it is gone.

A race condition in EAP 5.x can also trigger it: https://issues.jboss.org/browse/JGRP-1282
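
The address comparison described under Diagnosis can be automated when server.log is large. The following is a minimal sketch in Python, not part of the original article. It assumes the warning has exactly the wording quoted above; the function name classify_join_warning and the labels it returns are hypothetical and chosen only for illustration.

```python
import re

# Pattern for the GMS warning quoted in this article, for example:
#   join(1.2.3.4:7900) sent to 1.2.3.4:7900 timed out, retrying
# Assumption: the log line uses exactly this wording.
JOIN_TIMEOUT = re.compile(
    r"join\((?P<joiner>[^)\s]+)\) sent to (?P<target>\S+) timed out"
)


def classify_join_warning(line):
    """Classify one log line.

    Returns "same-address" when the joining address and the target address
    are identical (the symptom described in this article),
    "different-address" when they differ (more likely a communication
    problem with other nodes), or None when the line is not a join
    timeout warning.
    """
    match = JOIN_TIMEOUT.search(line)
    if match is None:
        return None
    if match.group("joiner") == match.group("target"):
        return "same-address"
    return "different-address"


if __name__ == "__main__":
    sample = ("WARN [org.jgroups.protocols.pbcast.GMS] "
              "join(1.2.3.4:7900) sent to 1.2.3.4:7900 timed out, retrying")
    print(classify_join_warning(sample))
```

Running classify_join_warning over each line of server.log and counting the "same-address" results gives a quick indication of whether the node is stuck in the loop described here or is simply unable to reach another member.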
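
The wait time used in the workaround, FD timeout * max_tries, can be computed from the JGroups configuration instead of assuming the default. The sketch below is illustrative only: it assumes the FD protocol appears as an XML element named FD with timeout (in milliseconds) and max_tries attributes, and the function name fd_exclusion_window_seconds and the SAMPLE fragment are hypothetical, not copied from a shipped cluster-service.xml.

```python
import xml.etree.ElementTree as ET


def fd_exclusion_window_seconds(xml_text):
    """Return FD timeout * max_tries in seconds, or None if not available.

    Assumption: the configuration contains an <FD .../> element whose
    "timeout" attribute is in milliseconds and whose "max_tries" attribute
    is an integer.
    """
    root = ET.fromstring(xml_text)
    fd = next(root.iter("FD"), None)
    if fd is None:
        return None
    timeout_ms = fd.get("timeout")
    max_tries = fd.get("max_tries")
    if timeout_ms is None or max_tries is None:
        # Do not guess defaults; report that the value is not configured.
        return None
    return int(timeout_ms) * int(max_tries) / 1000.0


# Illustrative fragment using the values named in this article
# (10 second timeout, 5 tries).
SAMPLE = """
<Config>
  <FD timeout="10000" max_tries="5"/>
</Config>
"""

if __name__ == "__main__":
    print(fd_exclusion_window_seconds(SAMPLE))  # 50.0
```

A crashed node should stay down for at least the returned number of seconds before it is restarted, so that the remaining members have time to exclude it.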