Replacing Networker Nodes
An end-to-end scenario on replacing a Red Hat OpenStack Platform Networker node
Abstract
Preface
In certain circumstances, a node with a Networker profile (as described in Tagging Nodes into Profiles) in a high availability cluster might fail. In these situations, you must remove the node from the cluster and replace it with a new Networker node. This procedure assumes that the replacement node has already been discovered and that it can connect to the other nodes in the cluster over the network.
This section provides instructions on how to replace a Networker node. The process involves running the openstack overcloud deploy command to update the overcloud with a request to replace a Networker node.
This procedure was prepared for Telecom Italia to address Bugzilla #1578502. This procedure requires the hotfix for Bugzilla #1600178. The following procedure only applies to high availability environments. Do not use this procedure if your environment has only one Networker node.
Chapter 1. Preliminary Checks
Before attempting to replace an overcloud Networker node, it is important to check the current state of the Red Hat OpenStack Platform environment. Checking the current state can help avoid complications during the Networker replacement process. Use the following list of preliminary checks to determine if it is safe to perform a Networker node replacement. Run all commands for these checks on the undercloud.
Check the current status of the overcloud stack on the undercloud:

[stack@director ~]$ source stackrc
[stack@director ~]$ openstack stack list --nested
The overcloud stack and its child stacks should have a status of either CREATE_COMPLETE or UPDATE_COMPLETE.

Perform a backup of the undercloud databases:

[stack@director ~]$ mkdir /home/stack/backup
[stack@director ~]$ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz
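Optionally, confirm that the dump file was written and is a valid gzip archive before continuing. A minimal check, assuming the backup path used above:

[stack@director ~]$ ls -lh /home/stack/backup/dump_db_undercloud.sql.gz
[stack@director ~]$ gzip -t /home/stack/backup/dump_db_undercloud.sql.gz && echo "backup archive is valid"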
- Ensure the undercloud contains 10 GB free storage to accommodate image caching and conversion when provisioning the new node.
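One way to confirm the available space is with df. This is a sketch only; it assumes the image cache lives under /var on the undercloud, so adjust the path if your layout differs:

[stack@director ~]$ df -h /var
[stack@director ~]$ [ "$(df -BG --output=avail /var | tail -1 | tr -dc '0-9')" -ge 10 ] && echo "at least 10 GB free" || echo "less than 10 GB free"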
Check the status of Pacemaker on the running Networker nodes. For example, if 192.168.0.47 is the IP address of a running Networker node, use the following command to get the Pacemaker status:

[stack@director ~]$ ssh heat-admin@192.168.0.47 'sudo pcs status'

Replace the example IP address with the IP address of a running Networker node. The output should show all services running on the existing nodes and stopped on the failed node.
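To run the same check against every running Networker node in one pass, you can use a small loop. This is a sketch only, assuming 192.168.0.47 and 192.168.0.46 are the reachable Networker IP addresses in your environment:

[stack@director ~]$ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i 'sudo pcs status | grep -iE "stopped|failed"' ; done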
Check the following parameters on each node of the overcloud’s MariaDB cluster:
wsrep_local_state_comment: Synced
wsrep_cluster_size: 2

Use the following command to check these parameters on each running Networker node, replacing the example 192.168.0.47 and 192.168.0.46 IP addresses with IP addresses from the cluster:

[stack@director ~]$ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql -p\$(sudo hiera -c /etc/puppet/hiera.yaml mysql::server::root_password) --execute=\"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done
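If you prefer a quick pass/fail view instead of reading the raw output, the following sketch flags any node whose wsrep_local_state_comment is not Synced. It reuses the same example IP addresses and hiera lookup as the command above:

[stack@director ~]$ for i in 192.168.0.47 192.168.0.46 ; do echo -n "$i: " ; ssh heat-admin@$i "sudo mysql -p\$(sudo hiera -c /etc/puppet/hiera.yaml mysql::server::root_password) -N --execute=\"SHOW STATUS LIKE 'wsrep_local_state_comment';\"" | grep -q Synced && echo "Synced" || echo "WARNING: not Synced" ; done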
Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Networker node, use the following command to get the status:

[stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo docker exec \$(sudo docker ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"
The running_nodes key should only show the two available nodes and not the failed node.

Check the nova-compute service on the director node:

[stack@director ~]$ sudo systemctl status openstack-nova-compute
[stack@director ~]$ openstack hypervisor list
The output should show all non-maintenance mode nodes as up.

Make sure all undercloud services are running:

[stack@director ~]$ sudo systemctl -t service
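To surface only problem cases, you can also ask systemd for services in a failed state; an empty result means no undercloud service has failed:

[stack@director ~]$ sudo systemctl list-units -t service --state=failed --no-legend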
Chapter 2. Node Replacement
Identify the index of the node to remove. The node index is the suffix on the instance name in the openstack server list output. For example:

[stack@director ~]$ openstack server list
+--------------------------------------+------------------------+
| ID                                   | Name                   |
+--------------------------------------+------------------------+
| 861408be-4027-4f53-87a6-cd3cf206ba7a | overcloud-compute-0    |
| 0966e9ae-f553-447a-9929-c4232432f718 | overcloud-compute-1    |
| 9c08fa65-b38c-4b2e-bd47-33870bff06c7 | overcloud-compute-2    |
| a7f0f5e1-e7ce-4513-ad2b-81146bc8c5af | overcloud-controller-0 |
| cfefaf60-8311-4bc3-9416-6a824a40a9ae | overcloud-controller-1 |
| 97a055d4-aefd-481c-82b7-4a5f384036d2 | overcloud-controller-2 |
| 844c9a88-713a-4ff1-8737-6410bf551d4f | overcloud-networker-0  |
| aef7c27a-f0b4-4814-b0ff-aaf8d05ad721 | overcloud-networker-1  |
| c2e40164-c659-4849-a28f-507eb7edb79f | overcloud-networker-2  |
+--------------------------------------+------------------------+
In this example, the aim is to remove the overcloud-networker-1 node and replace it with overcloud-networker-3. First, set the node into maintenance mode so that director does not re-provision the failed node. Correlate the instance ID from the openstack server list output with the node ID from the openstack baremetal node list output. For example:

[stack@director ~]$ openstack baremetal node list
+-------------------------+------+---------------------------------+
| UUID                    | Name | Instance UUID                   |
+-------------------------+------+---------------------------------+
| 36404147-7c8a-41e6-8c72 | None | 7bee57cf-4a58-4eaf-b851         |
| 91eb9ac5-7d52-453c-a017 | None | None                            |
| 75b25e9a-948d-424a-9b3b | None | None                            |
| 038727da-6a5c-425f-bd45 | None | 763bfec2-9354-466a-ae65         |
| dc2292e6-4056-46e0-8848 | None | 2017b481-706f-44e1-852a         |
| c7eadcea-e377-4392-9fc3 | None | 5f73c7d7-4826-49a5-b6be         |
| da3a8d19-8a59-4e9d-923a | None | cfefaf60-8311-4bc3-9416         |
| 807cb6ce-6b94-4cd1-9969 | None | c07c13e6-a845-4791-9628         |
| 0c245daa-7817-4ae9-a883 | None | 844c9a88-713a-4ff1-8737         |
| e6499ef7-3db2-4ab4-bfa7 | None | aef7c27a-f0b4-4814-b0ff         |
| 7545385c-bc49-4eb9-b13c | None | c2e40164-c659-4849-a28f         |
+-------------------------+------+---------------------------------+
(truncated UUIDs)
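If you prefer to look up the Bare Metal node programmatically rather than reading the table, a minimal sketch, assuming overcloud-networker-1 is the instance being replaced:

[stack@director ~]$ INSTANCE_UUID=$(openstack server show overcloud-networker-1 -f value -c id)
[stack@director ~]$ openstack baremetal node list -f value -c UUID -c "Instance UUID" | grep "$INSTANCE_UUID"

The first field of the matching line is the node UUID to use in the next step.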
Set the node into maintenance mode:

[stack@director ~]$ openstack baremetal node maintenance set \
    e6499ef7-3db2-4ab4-bfa7-ef59539bf972
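You can confirm that the node is now in maintenance mode before continuing; the following check should print True:

[stack@director ~]$ openstack baremetal node show e6499ef7-3db2-4ab4-bfa7-ef59539bf972 -f value -c maintenance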
Tag the new node with the networker profile:

[stack@director ~]$ openstack baremetal node set --property \
    capabilities='profile:networker,boot_option:local' \
    91eb9ac5-7d52-453c-a017-c0e3d823efd0
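To confirm that the capability was applied, you can inspect the node properties; the output should include profile:networker:

[stack@director ~]$ openstack baremetal node show 91eb9ac5-7d52-453c-a017-c0e3d823efd0 -f value -c properties | grep profile:networker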
Create a ~/templates/remove-networker.yaml YAML file that defines the node index to remove:

parameters:
  NetworkerRemovalPolicies: [{'resource_list': ['1']}]
Create a ~/templates/node-count-networker.yaml file and set the total count of Networker nodes in the file. For example, if the cluster has 3 Networker nodes, the file will look like this:

parameter_defaults:
  OvercloudNetworkerFlavor: networker
  NetworkerCount: 3
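If you prefer to create both template files from the command line, a minimal sketch that writes the same content shown above:

[stack@director ~]$ mkdir -p ~/templates
[stack@director ~]$ cat > ~/templates/remove-networker.yaml <<'EOF'
parameters:
  NetworkerRemovalPolicies: [{'resource_list': ['1']}]
EOF
[stack@director ~]$ cat > ~/templates/node-count-networker.yaml <<'EOF'
parameter_defaults:
  OvercloudNetworkerFlavor: networker
  NetworkerCount: 3
EOF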
Redeploy the overcloud, including the node-count-networker.yaml and remove-networker.yaml environment files:

[stack@director ~]$ openstack overcloud deploy --templates \
    -e ~/templates/node-count-networker.yaml \
    -e ~/templates/remove-networker.yaml [OTHER OPTIONS]
If you passed any extra environment files or options when you created the overcloud, pass them again here to avoid making undesired changes to the overcloud. However, note that -e ~/templates/remove-networker.yaml is only required once, for this replacement deployment.
The director removes the old node, creates a new one, and updates the overcloud stack. Check the status of the overcloud stack using the following command:
[stack@director ~]$ openstack stack list --nested
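While the update is running, you can filter for stacks that are still in progress or have failed; an empty result means all stacks have completed:

[stack@director ~]$ openstack stack list --nested | grep -E "IN_PROGRESS|FAILED"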
Verify that the new Networker node is listed and that the old one has been removed:
[stack@director ~]$ openstack server list
+--------------------------------------+------------------------+
| ID                                   | Name                   |
+--------------------------------------+------------------------+
| 861408be-4027-4f53-87a6-cd3cf206ba7a | overcloud-compute-0    |
| 0966e9ae-f553-447a-9929-c4232432f718 | overcloud-compute-1    |
| 9c08fa65-b38c-4b2e-bd47-33870bff06c7 | overcloud-compute-2    |
| a7f0f5e1-e7ce-4513-ad2b-81146bc8c5af | overcloud-controller-0 |
| cfefaf60-8311-4bc3-9416-6a824a40a9ae | overcloud-controller-1 |
| 97a055d4-aefd-481c-82b7-4a5f384036d2 | overcloud-controller-2 |
| 844c9a88-713a-4ff1-8737-6410bf551d4f | overcloud-networker-0  |
| c2e40164-c659-4849-a28f-507eb7edb79f | overcloud-networker-2  |
| 425a0828-b42f-43b0-940c-7fb02522753a | overcloud-networker-3  |
+--------------------------------------+------------------------+
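A quick way to confirm the replacement without reading the full table is to filter the server names:

[stack@director ~]$ openstack server list -f value -c Name | grep networker

The output should list overcloud-networker-0, overcloud-networker-2, and overcloud-networker-3, and should no longer include overcloud-networker-1.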
Chapter 3. Neutron Cleanup and Rescheduling
After replacing a Networker node using the previous procedure, remove all neutron agents that were registered on the removed Networker node from the database. This ensures that they do not show up as dead agents and that DHCP resources are automatically rescheduled to other Networker nodes.
Source overcloudrc to gain admin credentials on the overcloud:

[stack@director ~]$ source ~/overcloudrc
Verify that four agents exist for overcloud-networker-1 (metadata, L3, Open vSwitch, and DHCP) and that they are marked dead, as indicated by xxx:

[stack@director ~]$ neutron agent-list -c id -c binary -c host -c alive | grep overcloud-networker-1

| 8377-66d75323e466 | neutron-metadata-agent | overcloud-networker-1 | xxx |
| b55d-797668c33670 | neutron-l3-agent       | overcloud-networker-1 | xxx |
| 9dcb-00a9e32ecde4 | neutron-ovs-agent      | overcloud-networker-1 | xxx |
| be83-e4d932984654 | neutron-dhcp-agent     | overcloud-networker-1 | xxx |
(truncated UUIDs)
Capture the UUIDs of the agents registered for the removed overcloud-networker-1 node:

[stack@director ~]$ AGENT_UUIDS=$(neutron agent-list -c id -c binary -c host -c alive -f value | grep overcloud-networker-1 | cut -d' ' -f1)
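Before deleting anything, you can confirm that the variable holds the four expected UUIDs:

[stack@director ~]$ echo "$AGENT_UUIDS" | wc -w

The command should print 4.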
Delete any remaining overcloud-networker-1 agents from the database:

[stack@director ~]$ for agent in $AGENT_UUIDS; do neutron agent-delete $agent ; done

Deleted agent(s): 5024f9b5-7ad9-4692-8377-66d75323e466
Deleted agent(s): 9f49adba-50a1-48ca-b55d-797668c33670
Deleted agent(s): b66221f8-61cf-4017-9dcb-00a9e32ecde4
Deleted agent(s): b6b1e492-9420-4406-be83-e4d932984654
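Verify that no agents remain registered for the removed node; an empty result confirms the cleanup:

[stack@director ~]$ neutron agent-list -c id -c binary -c host -c alive | grep overcloud-networker-1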
Chapter 4. Rescheduling Tenant Routers
Reschedule all tenant routers on all Networker nodes.
Verify that all the existing L3 agents are marked alive, as indicated by :-), and that the number of agents is correct. In the foregoing examples, there were three Networker nodes, so there should be three neutron-l3-agent lines. For example:

[stack@director ~]$ openstack network agent list -c ID -c Binary -c Host -c Alive | grep neutron-l3-agent

| 41d3-ab4e-66f1267ce4f8 | neutron-l3-agent | overcloud-networker-0 | :-) |
| 4ba6-9696-623759039af8 | neutron-l3-agent | overcloud-networker-2 | :-) |
| 4112-b3e3-e93fb3826ce7 | neutron-l3-agent | overcloud-networker-3 | :-) |
(truncated UUIDs)
Ensure that all routers are associated with an agent. Start by setting the number of agents that should host each router. This should match the max_l3_agents_per_router setting in the neutron configuration (the default is 3):

[stack@director ~]$ export MAX_L3_AGENTS=3
Warning
If you are not using l3-ha, set MAX_L3_AGENTS to 1.
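If you are unsure which value is configured, one way to check is to read it from the neutron configuration on a running Networker node. This is a sketch only: it assumes the containerized configuration path shown below and the example IP address used earlier, so adjust both for your environment:

[stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo grep -E '^(max_l3_agents_per_router|dhcp_agents_per_network)' /var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf"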
Once the MAX_L3_AGENTS variable is set, continue by running the following script in the console (or from a Bash file):

MAX_L3_AGENTS=${MAX_L3_AGENTS:-3}
# All L3 agent UUIDs currently registered
L3_AGENT_UUIDS=$(openstack network agent list -c ID -c Binary -f value | grep neutron-l3-agent | cut -d' ' -f1)
ROUTER_UUIDS=$(openstack router list -c ID -f value)
for router_id in $ROUTER_UUIDS; do
    echo "Processing router $router_id"
    # Agents already hosting this router
    R_AGENTS=$(neutron l3-agent-list-hosting-router $router_id -f value -c id)
    # Shuffle the candidate agents so that new assignments are spread evenly
    SHUFF_AGENTS=$(shuf -e $L3_AGENT_UUIDS)
    N_AGENTS=$(echo $R_AGENTS | wc -w)
    if [ "$MAX_L3_AGENTS" -gt "$N_AGENTS" ]; then
        for agent_id in $SHUFF_AGENTS; do
            if echo "$R_AGENTS" | grep "$agent_id" >/dev/null ; then
                # Skip this agent; the router is already associated with it
                continue
            fi
            neutron l3-agent-router-add $agent_id $router_id
            N_AGENTS=`expr $N_AGENTS + 1`
            if [ "$N_AGENTS" -ge "$MAX_L3_AGENTS" ]; then
                break
            fi
        done
    fi
done
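After the script completes, you can verify that each router is hosted by the expected number of L3 agents; a minimal sketch reusing the commands from the script above:

for router_id in $(openstack router list -c ID -f value); do
    N=$(neutron l3-agent-list-hosting-router $router_id -f value -c id | wc -l)
    echo "Router $router_id is hosted by $N L3 agent(s)"
done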
Chapter 5. Rescheduling Tenant DHCP Services
OpenStack enables automatic DHCP failover by default. This procedure ensures that existing networks are properly scheduled across several DHCP agents.
Configure an environment variable to match the NeutronDhcpAgentsPerNetwork (dhcp_agents_per_network) configuration setting in the overcloud deployment templates. The default is 3:

[stack@director ~]$ export MAX_DHCP_AGENTS=3
Once the MAX_DHCP_AGENTS variable is set, run the following script in the console (or from a Bash file):

MAX_DHCP_AGENTS=${MAX_DHCP_AGENTS:-3}
# All live DHCP agent UUIDs currently registered
DHCP_AGENT_UUIDS=$(openstack network agent list -c ID -c Binary -c Alive -f value | grep neutron-dhcp-agent | grep True | cut -d' ' -f1)
# Networks that have at least one DHCP-enabled subnet
DHCP_NETWORK_UUIDS=$(openstack subnet list --dhcp -c Network -f value)
for network_id in $DHCP_NETWORK_UUIDS; do
    echo "Processing network $network_id"
    # Agents already hosting this network
    NET_AGENTS=$(neutron dhcp-agent-list-hosting-net $network_id -c id -c alive -f value | grep ":-)" | cut -f1 -d' ')
    # Shuffle the candidate agents so that new assignments are spread evenly
    SHUFF_AGENTS=$(shuf -e $DHCP_AGENT_UUIDS)
    N_AGENTS=$(echo $NET_AGENTS | wc -w)
    if [ "$MAX_DHCP_AGENTS" -gt "$N_AGENTS" ]; then
        for agent_id in $SHUFF_AGENTS; do
            if echo "$NET_AGENTS" | grep "$agent_id" >/dev/null ; then
                # Skip this agent; the network is already associated with it
                continue
            fi
            neutron dhcp-agent-network-add $agent_id $network_id
            N_AGENTS=`expr $N_AGENTS + 1`
            if [ "$N_AGENTS" -ge "$MAX_DHCP_AGENTS" ]; then
                break
            fi
        done
    fi
done
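Similarly, you can verify that each DHCP-enabled network is now served by the expected number of DHCP agents; a minimal sketch reusing the commands from the script above (sort -u removes duplicate network IDs when a network has several subnets):

for network_id in $(openstack subnet list --dhcp -c Network -f value | sort -u); do
    N=$(neutron dhcp-agent-list-hosting-net $network_id -f value -c id | wc -l)
    echo "Network $network_id is served by $N DHCP agent(s)"
done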