Replacing Networker Nodes

In certain circumstances, a node with a Networker profile (as described in [sect-Tagging_Nodes_into_Profiles]) in a high availability cluster might fail. In these situations, you must remove the failed node from the cluster and replace it with a new Networker node, and ensure that the new node connects to the other nodes in the cluster.

This section provides instructions on how to replace a Networker node. The process involves running the openstack overcloud deploy command to update the overcloud with the replacement node.

Important
The following procedure applies only to high availability environments. Do not use this procedure if your environment contains only one Networker node.

Preliminary Checks

Before attempting to replace an overcloud Networker node, check the current state of your Red Hat OpenStack Platform environment to help avoid complications during the replacement process. Use the following list of preliminary checks to determine whether it is safe to perform a Networker node replacement. Run all commands for these checks on the undercloud.

  1. Check the current status of the overcloud stack on the undercloud:

    $ source stackrc
    (undercloud) $ openstack stack list --nested

    The overcloud stack and its subsequent child stacks should have either a CREATE_COMPLETE or UPDATE_COMPLETE status.

  2. Perform a backup of the undercloud databases:

    (undercloud) $ mkdir /home/stack/backup
    (undercloud) $ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz
  3. Ensure the undercloud contains 10 GB of free storage to accommodate image caching and conversion when provisioning the new node (see the disk space check after this list).

  4. Check the status of Pacemaker on the running Networker nodes. For example, if 192.168.0.47 is the IP address of a running Networker node, use the following command to get the Pacemaker status:

    (undercloud) $ ssh heat-admin@192.168.0.47 'sudo pcs status'

    The output should show all services running on the existing nodes and stopped on the failed node.

  5. Check the following parameters on each node of the overcloud’s MariaDB cluster:

    • wsrep_local_state_comment: Synced

    • wsrep_cluster_size: 2

      Use the following command to check these parameters on each running Networker node (in this example, 192.168.0.47 and 192.168.0.46 are the IP addresses of the running nodes):

      (undercloud) $ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql -p\$(sudo hiera -c /etc/puppet/hiera.yaml mysql::server::root_password) --execute=\"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done
  6. Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Networker node, use the following command to get the status:

    (undercloud) $ ssh heat-admin@192.168.0.47 "sudo docker exec \$(sudo docker ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"

    The running_nodes key should only show the two available nodes and not the failed node.

  7. Check the nova-compute service on the director node:

    (undercloud) $ sudo systemctl status openstack-nova-compute
    (undercloud) $ openstack hypervisor list

    The output should show all non-maintenance mode nodes as up.

  8. Make sure all undercloud services are running:

    (undercloud) $ sudo systemctl -t service
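
Step 3 does not include a command. As a minimal check, you can inspect free space with the df utility, and for step 8 you can also list any failed units directly; the paths shown are an assumption based on a default undercloud layout:

(undercloud) $ df -h /var /home
(undercloud) $ sudo systemctl list-units --state=failed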

Node Replacement

  1. Identify the index of the node to remove. The node index is the suffix on the instance name in the openstack server list output. For example:

    [stack@director ~]$ openstack server list
    +--------------------------------------+------------------------+
    | ID                                   | Name                   |
    +--------------------------------------+------------------------+
    | 861408be-4027-4f53-87a6-cd3cf206ba7a | overcloud-compute-0    |
    | 0966e9ae-f553-447a-9929-c4232432f718 | overcloud-compute-1    |
    | 9c08fa65-b38c-4b2e-bd47-33870bff06c7 | overcloud-compute-2    |
    | a7f0f5e1-e7ce-4513-ad2b-81146bc8c5af | overcloud-controller-0 |
    | cfefaf60-8311-4bc3-9416-6a824a40a9ae | overcloud-controller-1 |
    | 97a055d4-aefd-481c-82b7-4a5f384036d2 | overcloud-controller-2 |
    | 844c9a88-713a-4ff1-8737-6410bf551d4f | overcloud-networker-0  |
    | aef7c27a-f0b4-4814-b0ff-aaf8d05ad721 | overcloud-networker-1  |
    | c2e40164-c659-4849-a28f-507eb7edb79f | overcloud-networker-2  |
    +--------------------------------------+------------------------+

    In this example, the aim is to remove the overcloud-networker-1 node and replace it with overcloud-networker-3. First, set the node into maintenance mode so that the director does not re-provision the failed node. Correlate the instance ID from the openstack server list output with the node ID from the openstack baremetal node list output (a scripted lookup is also sketched after this procedure). For example:

    [stack@director ~]$ openstack baremetal node list
    +------------------------+------+--------------------------------------+
    | UUID                   | Name | Instance UUID                        |
    +------------------------+------+--------------------------------------+
    | 36404147-7c8a-41e6-8c72| None | 7bee57cf-4a58-4eaf-b851              |
    | 91eb9ac5-7d52-453c-a017| None | None                                 |
    | 75b25e9a-948d-424a-9b3b| None | None                                 |
    | 038727da-6a5c-425f-bd45| None | 763bfec2-9354-466a-ae65              |
    | dc2292e6-4056-46e0-8848| None | 2017b481-706f-44e1-852a              |
    | c7eadcea-e377-4392-9fc3| None | 5f73c7d7-4826-49a5-b6be              |
    | da3a8d19-8a59-4e9d-923a| None | cfefaf60-8311-4bc3-9416              |
    | 807cb6ce-6b94-4cd1-9969| None | c07c13e6-a845-4791-9628              |
    | 0c245daa-7817-4ae9-a883| None | 844c9a88-713a-4ff1-8737              |
    | e6499ef7-3db2-4ab4-bfa7| None | aef7c27a-f0b4-4814-b0ff              |
    | 7545385c-bc49-4eb9-b13c| None | c2e40164-c659-4849-a28f              |
    +------------------------+------+--------------------------------------+
    (truncated UUIDs)
  2. Set the node into maintenance mode.

    [stack@director ~]$ openstack baremetal node maintenance set \
                  e6499ef7-3db2-4ab4-bfa7-ef59539bf972
  3. Tag the new node with the networker profile.

    [stack@director ~]$ openstack baremetal node set --property \
         capabilities='profile:networker,boot_option:local' \
         e6499ef7-3db2-4ab4-bfa7-ef59539bf972
  4. Create a ~/templates/remove-networker.yaml YAML file that defines the node index to remove:

    parameters:
      NetworkerRemovalPolicies:
        [{'resource_list': ['1']}]
  5. Set the total count of Networker nodes in the ~/templates/node-count-networker.yaml file. For example, if the cluster has three Networker nodes, the file looks like this:

    parameter_defaults:
      OvercloudNetworkerFlavor: networker
      NetworkerCount: 3
  6. Redeploy the overcloud including the node-count-networker.yaml and remove-networker.yaml environment files:

    [stack@director ~]$ openstack overcloud deploy --templates \
        -e ~/templates/node-count-networker.yaml \
        -e ~/templates/remove-networker.yaml [OTHER OPTIONS]

    If you passed any extra environment files or options when you created the overcloud, pass them again here to avoid making undesired changes to the overcloud. However, the -e ~/templates/remove-networker.yaml file is only required for this single deployment run.
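
The correlation in step 1 between Compute instances and bare metal nodes can also be scripted. The following is a minimal sketch that assumes the node names from the example above:

[stack@director ~]$ INSTANCE_UUID=$(openstack server show overcloud-networker-1 -f value -c id)
[stack@director ~]$ openstack baremetal node list -f value -c UUID -c "Instance UUID" | grep "$INSTANCE_UUID"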

The director removes the old node, creates a new one, and updates the overcloud stack. Check the status of the overcloud stack using the following command:

[stack@director ~]$ openstack stack list --nested

Verify that the new Networker node is listed and that the old one has been removed:

[stack@director ~]$ openstack server list
+--------------------------------------+------------------------+
| ID                                   | Name                   |
+--------------------------------------+------------------------+
| 861408be-4027-4f53-87a6-cd3cf206ba7a | overcloud-compute-0    |
| 0966e9ae-f553-447a-9929-c4232432f718 | overcloud-compute-1    |
| 9c08fa65-b38c-4b2e-bd47-33870bff06c7 | overcloud-compute-2    |
| a7f0f5e1-e7ce-4513-ad2b-81146bc8c5af | overcloud-controller-0 |
| cfefaf60-8311-4bc3-9416-6a824a40a9ae | overcloud-controller-1 |
| 97a055d4-aefd-481c-82b7-4a5f384036d2 | overcloud-controller-2 |
| 844c9a88-713a-4ff1-8737-6410bf551d4f | overcloud-networker-0  |
| c2e40164-c659-4849-a28f-507eb7edb79f | overcloud-networker-2  |
| 425a0828-b42f-43b0-940c-7fb02522753a | overcloud-networker-3  |
+--------------------------------------+------------------------+
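
As an additional check, you can confirm that the containerized services on the new node have started. The following is a sketch only: 192.168.0.48 is an assumed ctlplane IP address for overcloud-networker-3, so substitute the address that openstack server list reports in your environment.

[stack@director ~]$ openstack server list -c Name -c Networks | grep overcloud-networker-3
[stack@director ~]$ ssh heat-admin@192.168.0.48 'sudo docker ps --format "{{.Names}}: {{.Status}}"'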

Neutron Cleanup and Rescheduling

After replacing a Networker node by using the previous procedure, remove all neutron agents for the removed node from the database so that they do not show up as dead agents, and so that DHCP resources are automatically rescheduled to the other Networker nodes.

  1. Source the overcloudrc file to gain admin credentials for the overcloud deployment.

    [stack@director ~]$ source ~/overcloudrc
  2. Verify that four agents (metadata, L3, Open vSwitch, and DHCP) exist for overcloud-networker-1 and are marked dead, as indicated by xxx:

    [stack@director ~]$ neutron agent-list -c id -c binary -c host -c alive  | grep overcloud-networker-1
    | 8377-66d75323e466 | neutron-metadata-agent    | overcloud-networker-1 | xxx |
    | b55d-797668c33670 | neutron-l3-agent          | overcloud-networker-1 | xxx |
    | 9dcb-00a9e32ecde4 | neutron-openvswitch-agent | overcloud-networker-1 | xxx |
    | be83-e4d932984654 | neutron-dhcp-agent        | overcloud-networker-1 | xxx |
    (truncated UUIDs)
  3. Capture the UUIDs of the agents registered for the removed overcloud-networker-1.

    [stack@director ~]$ AGENT_UUIDS=$(neutron agent-list -c id -c binary -c host -c alive -f value | grep overcloud-networker-1 | cut -d\  -f1)
  4. Delete any remaining overcloud-networker-1 agents from the database.

    [stack@director ~]$ for agent in $AGENT_UUIDS; do neutron agent-delete $agent ; done
    Deleted agent(s): 5024f9b5-7ad9-4692-8377-66d75323e466
    Deleted agent(s): 9f49adba-50a1-48ca-b55d-797668c33670
    Deleted agent(s): b66221f8-61cf-4017-9dcb-00a9e32ecde4
    Deleted agent(s): b6b1e492-9420-4406-be83-e4d932984654
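
To confirm the cleanup, rerun the agent listing from step 2. The command should now return no output for the removed node:

[stack@director ~]$ neutron agent-list -c id -c binary -c host -c alive | grep overcloud-networker-1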

Rescheduling Tenant Routers

Reschedule the tenant routers across all Networker nodes.

  1. Verify that all the existing L3 agents are marked alive, as indicated by :-), and that the number of agents is correct. In the preceding examples there are three Networker nodes, so there should be three neutron-l3-agent lines. For example:

    [stack@director ~]$ openstack network agent list -c ID -c Binary -c Host -c Alive | grep neutron-l3-agent
    | 41d3-ab4e-66f1267ce4f8 | neutron-l3-agent | overcloud-networker-0 | :-) |
    | 4ba6-9696-623759039af8 | neutron-l3-agent | overcloud-networker-2 | :-) |
    | 4112-b3e3-e93fb3826ce7 | neutron-l3-agent | overcloud-networker-3 | :-) |
    (truncated UUIDs)
  2. Ensure that all routers are associated with an agent. Start by setting the number of agents that should host each router. This value should match the max_l3_agents_per_router setting in the neutron configuration (the default is 3).

    [stack@director ~]$ export MAX_L3_AGENTS=3
    Warning
    If you are not using l3-ha, set MAX_L3_AGENTS to 1.

    Once the MAX_L3_AGENTS variable is set, continue by running the following script in the console (or from a Bash file).

     MAX_L3_AGENTS=${MAX_L3_AGENTS:-3}
     L3_AGENT_UUIDS=$(openstack network agent list -c ID -c Binary -f value | grep neutron-l3-agent | cut -d\  -f1)
     ROUTER_UUIDS=$(openstack router list -c ID -f value)

     for router_id in $ROUTER_UUIDS; do

        echo "Processing router $router_id"

        R_AGENTS=$(neutron l3-agent-list-hosting-router $router_id -f value -c id)
        SHUFF_AGENTS=$(shuf -e $L3_AGENT_UUIDS)
        N_AGENTS=$(echo $R_AGENTS | wc -w)

        if [ "$MAX_L3_AGENTS" -gt "$N_AGENTS" ]; then
            for agent_id in $SHUFF_AGENTS; do

                if echo "$R_AGENTS" | grep "$agent_id" >/dev/null ; then
                    # skipping agent, since router is already associated to it
                    continue
                fi
                neutron l3-agent-router-add $agent_id $router_id

                N_AGENTS=$(expr $N_AGENTS + 1)
                if [ "$N_AGENTS" -ge "$MAX_L3_AGENTS" ]; then
                    break
                fi
            done
        fi
     done
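
After the script completes, you can spot-check the result with the same commands that the script uses: each router should now report the expected number of L3 agents. This sketch assumes that you run it in the same shell session, so that ROUTER_UUIDS is still set:

for router_id in $ROUTER_UUIDS; do
   echo "Router $router_id is hosted by $(neutron l3-agent-list-hosting-router $router_id -f value -c id | wc -l) L3 agents"
done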

Rescheduling Tenant DHCP Services

OpenStack Networking enables automatic DHCP failover by default. This procedure ensures that existing networks are properly scheduled across multiple DHCP agents.

  1. Configure an environment variable to match the NeutronDhcpAgentsPerNetwork (dhcp_agents_per_network) configuration setting in the overcloud deployment templates. The default is 3.

    [stack@director ~]$ export MAX_DHCP_AGENTS=3
  2. Once the MAX_DHCP_AGENTS variable is set, run the following script in the console (or from a Bash file).

     MAX_DHCP_AGENTS=${MAX_DHCP_AGENTS:-3}
     DHCP_AGENT_UUIDS=$(openstack network agent list -c ID -c Binary -c Alive -f value | grep neutron-dhcp-agent | grep True | cut -d\  -f1)
     DHCP_NETWORK_UUIDS=$(openstack subnet list --dhcp -c Network -f value)

     for network_id in $DHCP_NETWORK_UUIDS; do

        echo "Processing network $network_id"

        NET_AGENTS=$(neutron dhcp-agent-list-hosting-net $network_id -c id -c alive -f value | grep ":-)" | cut -f1 -d\ )
        SHUFF_AGENTS=$(shuf -e $DHCP_AGENT_UUIDS)
        N_AGENTS=$(echo $NET_AGENTS | wc -w)

        if [ "$MAX_DHCP_AGENTS" -gt "$N_AGENTS" ]; then
            for agent_id in $SHUFF_AGENTS; do

                if echo "$NET_AGENTS" | grep "$agent_id" >/dev/null ; then
                    # skipping agent, since network is already associated to it
                    continue
                fi
                neutron dhcp-agent-network-add $agent_id $network_id

                N_AGENTS=$(expr $N_AGENTS + 1)
                if [ "$N_AGENTS" -ge "$MAX_DHCP_AGENTS" ]; then
                    break
                fi
            done
        fi
     done
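
Similarly, after the DHCP script completes, you can spot-check that each network is now hosted by the expected number of DHCP agents. This sketch assumes the same shell session, so that DHCP_NETWORK_UUIDS is still set:

for network_id in $DHCP_NETWORK_UUIDS; do
   echo "Network $network_id is hosted by $(neutron dhcp-agent-list-hosting-net $network_id -f value -c id | wc -l) DHCP agents"
done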