TIM End to End OVS-DPDK Troubleshooting Guide
How to use the end to end OVS-DPDK troubleshooting procedures
Abstract
- Preface
- 1. Preliminary Checks
- 2. High Packet Loss in the TX Queue of the Instance’s Tap Interface
- 3. TX Drops on Instance VHU Interfaces with Open vSwitch DPDK
- 4. Interpreting the output of the pmd-stats-show command in Open vSwitch with DPDK
- 5. Attaching and Detaching SR-IOV ports in nova
- 6. Configure and Test LACP Bonding with Open vSwitch DPDK
- 7. Deploying different bond modes with OVS DPDK
- 8. Receiving the Could not open network device dpdk0 (No such device) in ovs-vsctl show message
- 9. Insufficient Free Host Memory Pages Available to Allocate Guest RAM with Open vSwitch DPDK
- 10. Troubleshoot OVS DPDK PMD CPU Usage with perf and Collect and Send the Troubleshooting Data
- 11. Using virsh emulatorpin in virtual environments with NFV
- 11.1. Symptom
- 11.2. Solution
- 11.2.1. qemu-kvm Emulator Threads
- 11.2.2. Default Behavior for Emulator Thread Pinning
- 11.2.3. The Current Implementation for Emulator Thread Pinning in OpenStack nova (OpenStack Platform 10)
- 11.2.4. Later Changes to OpenStack nova (OpenStack Platform 12 and Above) for Emulator Thread Pinning
- 11.2.5. About the Impact of isolcpus on Emulator Thread Scheduling
- 11.2.6. Optimal Location of Emulator Threads
- 11.3. Diagnosis
- 12. Mixing System (Kernel Space) and netdev (DPDK User Space) Datapath OVS Bridges
- 13. DPDK vhost-user port does not Send or Receive Traffic
- 14. Using ovs-tcpdump on vhost-user Interfaces
Preface
This troubleshooting guide provides detailed procedures for OVS-DPDK workloads. The procedures documented in this guide supersede the Knowledge Base articles.
Chapter 1. Preliminary Checks
Before using any of the procedures in this document, perform the following procedures:
Chapter 2. High Packet Loss in the TX Queue of the Instance’s Tap Interface
Use this procedure to determine the cause of packet loss in the TX queue and how to diagnose the problem.
2.1. Symptom
During a test of a VNF using host-only networking, high packet loss can be observed in the TX queue of the instance’s tap interface. The test setup sends packets from one VM on a node to another VM on the same node. The packet loss appears in bursts.
The following example shows a high number of dropped packets in the tap’s TX queue.
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500034259301 132047795 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5481296464 81741449 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
2.2. Diagnosis
This procedure deals with drops on tap (kernel path) interfaces. For drops on vhost-user interfaces in the user datapath, see https://access.redhat.com/solutions/3381011
TX drops are due to interference between the instance’s vCPU and other processes on the hypervisor. The TX queue of the tap interface is a buffer that can store packets for a short while in case the instance cannot pick them up. This happens if the instance’s CPU is held off from running (or freezes) for long enough.
A tuntap is a virtual device where one end is a kernel network interface, and the other end is a user space file descriptor. tuntap can run in two modes:
- Tap mode feeds L2 ethernet frames with L2 header into the device, and expects to receive the same out from user space. This mode is used for VMs.
- Tun mode feeds L3 IP packets with L3 header into the device, and expects to receive the same out from user space. This mode is mostly used for VPN clients.
In KVM networking, the user space file descriptor is owned by the qemu-kvm process. Any frames that are sent into the tap (TX from the hypervisor’s perspective) end up as L2 frames inside qemu-kvm, which can then feed those frames to the virtual network device in the VM as network packets received into the virtual network interface (RX from the VM’s perspective).
The key concept with tuntap is: hypervisor TX == VM RX. The opposite is also true: hypervisor RX == VM TX.
There is no "ring buffer" of packets on a virtio-net device. This means that if the tuntap device’s TX queue fills up because the VM is not receiving (either fast enough or at all) then there is nowhere for new packets to go, and the hypervisor sees TX loss on the tap.
If you notice TX loss on a tuntap, then increasing the tap txqueuelen is one way to help avoid that, similar to increasing the RX ring buffer to stop receive loss on a physical NIC.
However, this assumes the VM is just "slow" and "bursty" at receive. If the VM is not executing fast enough all the time, or otherwise not receiving at all, then tuning the TX queue length won’t help. You will need to find out why the VM is not running or receiving.
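To tell whether a longer queue has any effect, it helps to read the TX drop counter of the tap repeatedly. The following is a minimal sketch, not part of the original procedure: the helper name tx_dropped is hypothetical, and it assumes the `ip -s link` output layout shown in the example above (a "TX:" header row followed by a value row).

```shell
# tx_dropped: read `ip -s link ls dev <tap>` output on stdin and print the
# TX "dropped" counter, which is the 4th column of the row after "TX:".
# The function name and the assumed output layout are illustrative only.
tx_dropped() {
    awk '
        /^[ \t]*TX:/ { want = 1; next }   # header row of the TX counters
        want         { print $4; exit }   # value row: bytes packets errors dropped
    '
}

# Example on a hypervisor (the tap name is a placeholder):
#   ip -s link ls dev tapc18eb09e-01 | tx_dropped
```

If the counter keeps growing after the queue length has been raised, the VM is probably not running or not receiving, and queue tuning alone will not help.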
If you only need to improve VM packet handling performance, you can enable virtio-net multiqueue on the hypervisor and then balance the resulting virtual device interrupts across different cores inside the VM. This is documented in the libvirt domain spec for KVM (it can be done with virsh edit on a RHEL KVM hypervisor).
If you cannot configure virtio-net multiqueue in Red Hat OpenStack Platform, consider configuring RPS inside the VM to balance receive load across multiple CPU cores in software. See scaling.txt in the kernel-doc package, or see the RPS section in the RHEL product documentation.
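RPS is configured by writing a hexadecimal CPU bitmap to the rps_cpus file of each receive queue, as described in scaling.txt. The sketch below computes such a bitmap for the first N CPUs. The helper name rps_mask and the interface name eth0 are assumptions made for illustration, and the helper is only intended for guests with fewer than 32 CPUs, because larger bitmaps are written as comma-separated groups.

```shell
# rps_mask N: print the hexadecimal CPU bitmap that selects CPUs 0..N-1.
# Hypothetical helper, intended for N below 32.
rps_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

# Example inside the VM (eth0 and rx-0 are placeholders for the real
# interface and queue):
#   rps_mask 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

For a guest with four vCPUs, `rps_mask 4` prints `f`, which lets all four CPUs process received packets for that queue.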
2.2.1. Workaround
Increasing the TX queue helps deal with these microfreezes at the cost of higher latency and other disadvantages.
txqueuelen can be temporarily increased via:
/sbin/ip link set tap<uuid> txqueuelen <new queue length>
txqueuelen can be permanently increased via a udev rule:
cat <<'EOF'>/etc/udev/rules.d/71-net-txqueuelen.rules
SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", ATTR{tx_queue_len}="10000"
EOF
After reloading udev or rebooting the system, new tap interfaces will come up with a queue length of 10000. For example:
[root@overcloud-compute-0 ~]# ip link ls | grep tap
29: tap122be807-cd: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 5505 qdisc pfifo_fast master qbr122be807-cd state UNKNOWN mode DEFAULT group default qlen 10000
2.2.2. Diagnostic Steps
To verify the above, use the following script:
[root@ibm-x3550m4-9 ~]# cat generate-tx-drops.sh
#!/bin/bash

trap 'cleanup' INT

cleanup() {
  echo "Cleanup ..."
  if [ "x$HPING_PID" != "x" ]; then
    echo "Killing hping3 with PID $HPING_PID"
    kill $HPING_PID
  fi
  if [ "x$DD_PID" != "x" ]; then
    echo "Killing dd with PID $DD_PID"
    kill $DD_PID
  fi
  exit 0
}

VM_IP=10.0.0.20
VM_TAP=tapc18eb09e-01
VM_INSTANCE_ID=instance-00000012
LAST_CPU=$( lscpu | awk '/^CPU\(s\):/ { print $NF - 1 }' )
# this is a 12 core system, we are sending everything to CPU 11,
# so the taskset mask is 800 so set dd affinity only for last CPU
TASKSET_MASK=800

# pinning vCPU to last pCPU
echo "virsh vcpupin $VM_INSTANCE_ID 0 $LAST_CPU"
virsh vcpupin $VM_INSTANCE_ID 0 $LAST_CPU

# make sure that: nova secgroup-add-rule default udp 1 65535 0.0.0.0/0
# make sure that: nova secgroup-add-rule default tcp 1 65535 0.0.0.0/0
# make sure that: nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0

# --fast, --faster or --flood can also be used
echo "hping3 -u -p 5000 $VM_IP --faster > /dev/null "
hping3 -u -p 5000 $VM_IP --faster > /dev/null &
HPING_PID=$!

# take five samples, five seconds apart, before dd is started
echo "hping is running, but dd not yet:"
for i in {1..5}; do
  date
  echo "ip -s -s link ls dev $VM_TAP"
  ip -s -s link ls dev $VM_TAP
  sleep 5
done

echo "Starting dd and pinning it to the same pCPU as the instance"
echo "dd if=/dev/zero of=/dev/null"
dd if=/dev/zero of=/dev/null &
DD_PID=$!
echo "taskset -p $TASKSET_MASK $DD_PID"
taskset -p $TASKSET_MASK $DD_PID

# take five more samples while dd competes with the vCPU
for i in {1..5}; do
  date
  echo "ip -s -s link ls dev $VM_TAP"
  ip -s -s link ls dev $VM_TAP
  sleep 5
done

cleanup
Log into the instance and start dd if=/dev/zero of=/dev/null to generate additional load on its only vCPU. Note that this is for demonstration purposes. You can repeat the same test with and without load from within the VM. It doesn’t matter. TX drop only occurs when another process on the hypervisor is stealing time from the instance’s vCPU.
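The script above hardcodes the taskset mask 800, which is only correct when the last CPU is CPU 11. As a minimal sketch, the mask for any single CPU index can be derived as a one-bit hexadecimal bitmap. The helper name cpu_mask is hypothetical and not part of the original script.

```shell
# cpu_mask N: print the hexadecimal affinity mask that selects only CPU N.
# Hypothetical helper; bit N of the mask corresponds to CPU N.
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# Possible use in the script instead of the fixed value:
#   TASKSET_MASK=$(cpu_mask "$LAST_CPU")
```

On the 12 core system from the example, `cpu_mask 11` prints `800`. As an alternative, taskset also accepts a CPU list with the -c option (for example `taskset -cp 11 <pid>`), which avoids computing a mask at all.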
The following example shows an instance before the test:
%Cpu(s): 22.3 us, 77.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1884108 total,  1445636 free,    90536 used,   347936 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1618720 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
30172 root      20   0  107936    620    528 R 99.9  0.0   0:05.89 dd
Run the following script and look at the dropped packets in the TX queue. Drops only occur when the dd process is stealing a significant amount of processing time from the instance’s CPU.
[root@ibm-x3550m4-9 ~]# ./generate-tx-drops.sh
virsh vcpupin instance-00000012 0 11

hping3 -u -p 5000 10.0.0.20 --faster > /dev/null
hping is running, but dd not yet:
Tue Nov 29 12:28:22 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500034259301 132047795 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5481296464 81741449 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Tue Nov 29 12:28:27 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500055729011 132445382 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5502766282 82139038 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Tue Nov 29 12:28:32 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500077122125 132841551 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5524159396 82535207 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Tue Nov 29 12:28:37 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500098181033 133231531 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5545218358 82925188 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Tue Nov 29 12:28:42 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500119152685 133619793 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5566184804 83313451 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Starting dd and pinning it to the same pCPU as the instance
dd if=/dev/zero of=/dev/null
taskset -p 800 8763
pid 8763's current affinity mask: fff
pid 8763's new affinity mask: 800
Tue Nov 29 12:28:47 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500140267091 134010698 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5587300452 83704477 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Tue Nov 29 12:28:52 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500159822749 134372711 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5606853168 84066563 0 11188074 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Tue Nov 29 12:28:57 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500179161241 134730729 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5626179144 84424451 0 11223096 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Tue Nov 29 12:29:02 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500198344463 135085948 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5645365410 84779752 0 11260740 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Tue Nov 29 12:29:07 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500217014275 135431570 0 0 0 0
    RX errors: length crc frame fifo missed
    0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5664031398 85125418 0 11302179 0 0
    TX errors: aborted fifo window heartbeat transns
    0 0 0 0 0
Cleanup ...
Killing hping3 with PID 8722
Killing dd with PID 8763
[root@ibm-x3550m4-9 ~]#
--- 10.0.0.20 hping statistic ---
3919615 packets transmitted, 0 packets received, 100% packet loss
round-trip min/avg/max = 0.0/0.0/0.0 ms
The following example shows an instance during the test with dd stealing on the hypervisor (look at the 45% st value):
%Cpu(s):  7.0 us, 27.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi, 20.2 si, 45.4 st
KiB Mem :  1884108 total,  1445484 free,    90676 used,   347948 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1618568 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
30172 root      20   0  107936    620    528 R 54.3  0.0   1:00.50 dd
Note that ssh may become sluggish during the second half of the test on the instance, including the possibility of timing out if the test runs too long.
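The effect of the competing dd process is easier to see as the growth of the TX dropped counter between samples than in the raw output. The following sketch prints the difference between consecutive counter values. The helper name counter_deltas is hypothetical, and the input values in the example are the five TX dropped readings taken after dd was started in the test run above.

```shell
# counter_deltas: read one counter sample per line on stdin and print the
# increase between each pair of consecutive samples. Hypothetical helper.
counter_deltas() {
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# TX dropped values sampled every 5 seconds from 12:28:47 to 12:29:07:
printf '%s\n' 11155280 11188074 11223096 11260740 11302179 | counter_deltas
```

For this run the increases are 32794, 35022, 37644 and 41439 packets per 5-second interval, whereas the counter stayed at 11155280 for all samples taken before dd was pinned to the instance's pCPU.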
2.3. Solution
Increasing the TX queue helps deal with these microfreezes. However, the real solution would be complete isolation with CPU pinning and isolcpus in the kernel parameters. Refer to Configure CPU pinning with NUMA in OpenStack for further details.
Chapter 3. TX Drops on Instance VHU Interfaces with Open vSwitch DPDK
Use this procedure to determine the cause of TX drops on instance VHU interfaces and how to diagnose the problem.
3.1. Symptom
The vhost-user interface (VHU) exchanges packets with the virtual machine. This interface allows the packet to go from the vswitch directly to the guest using the virtio transport without passing through the kernel or qemu processes.
The VHU is mostly implemented by DPDK librte_vhost that also offers functions to send or receive batches of packets. However, the backend of VHU is a virtio ring provided by qemu to exchange packets with the virtual machine. The virtio ring has a special format comprised of descriptors and buffers.
The TX/RX statistics are the OVS view of things: TX means TX from the OVS perspective, which is RX from the VM’s point of view. If the VM does not pull packets fast enough, for whatever reason, OVS faces a full TX queue and drops packets.
3.1.1. Explanation for Spurious Drops
The reason for TX drops on the vhost-user device is a lack of space in the virtio ring. The virtio ring is located in the guest’s memory and it works like a queue where the vhost-user pushes packets and the VM consumes them. If the VM isn’t fast enough to consume the packets, the virtio ring runs out of buffers and the vhost-user drops packets.
More explicitly, the only reason for vhost-user ports to have TX drops is that the guest is not fetching packets fast enough, which causes the vhost-user port to run out of buffer space.
You can use the perf and ftrace tools to investigate possible causes for spurious drops. Perf can count the number of scheduler switches, which could show that the qemu thread was preempted by another thread. Ftrace can show how long a preemption lasted and why it occurred. Timer interrupts (kernel ticks), for instance, preempt the qemu threads at the cost of at least two context switches. The timer interrupt can also run RCU callbacks, which take an unpredictable amount of time. CPU power management and Hyper-Threading can also disrupt the qemu thread. Note that these are just some of the possible reasons why the VM is not consuming packets fast enough from the virtio ring.
- PERF: perf rpm in rhel-7-server-rpms/7Server/x86_64. For more information, see: About Perf
- FTRACE: trace-cmd info rhel-7-server-rpms/7Server/x86_64. For more information, see: About Ftrace
3.1.2. Explanation for other drops
The current implementation in OpenStack Platform 10 uses vhostuser ports. In the case of vhostuser ports, OVS is the server and qemu is the client. Regardless of whether a nova instance reboots from within the VM, is rebooted with nova, or is stopped and restarted, the vhost-user (VHU) port continues to exist on the bridge. Frames hit the port based on flow and/or MAC learning rules and increase the tx_drop counter because the consumer (the VM) is down, and with it the VHU port:
# in this example, the VM was stopped with `nova stop <UUID>`:
[root@overcloud-compute-0 network-scripts]# ovs-vsctl list interface vhubd172106-73 | grep _state
admin_state         : up
link_state          : down
This is similar to what happens when the kernel port is shut down with ip link set dev <br internal port name> down and frames are dropped in userspace.
When the VM comes up again, it connects to the same VHU socket as before and starts consuming frames from the virtio ring buffer. TX drops stop and traffic is transmitted normally again. To change the TX and RX queue lengths for DPDK, see section 3.1.3.
3.1.3. Increasing the TX and RX queue lengths for DPDK
You can change TX and RX queue lengths for DPDK with the following OpenStack Director template modifications:
NovaComputeExtraConfig:
    nova::compute::libvirt::rx_queue_size: '"1024"'
    nova::compute::libvirt::tx_queue_size: '"1024"'
The following example shows the validation:
[root@overcloud-compute-1 ~]# ovs-vsctl get interface vhu9a9b0feb-2e status
{features="0x0000000150208182", mode=client, num_of_vrings="2", numa="0", socket="/var/lib/vhost_sockets/vhu9a9b0feb-2e", status=connected, "vring_0_size"="1024", "vring_1_size"="1024"}
Due to kernel limitations, the queue size cannot be increased beyond 1024.
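When many ports have to be validated, the vring sizes can be extracted from the status column instead of being read by eye. This is a minimal sketch that assumes the status format shown in the validation example above. The helper name vring_sizes is hypothetical.

```shell
# vring_sizes: read the output of `ovs-vsctl get interface <port> status`
# on stdin and print one "vring_<n>_size=<value>" line per virtio ring.
# Hypothetical helper; it relies on the key="value" layout shown above.
vring_sizes() {
    tr ',{}' '\n\n\n' |
        sed -n 's/^ *"\(vring_[0-9]*_size\)"="\([0-9]*\)".*/\1=\2/p'
}

# Example on a compute node (the port name is a placeholder):
#   ovs-vsctl get interface vhu9a9b0feb-2e status | vring_sizes
```

For the port in the example this prints `vring_0_size=1024` and `vring_1_size=1024`, which confirms that both queue sizes were applied.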
3.2. Diagnosis
TX drops towards the vhost user ports are observed when the guest cannot receive packets. Networks eventually drop packets and, in most cases, that’s not an issue. TCP, for instance, can easily recover. But other use-cases might have more strict requirements with less tolerance to packet drops.
DPDK accelerated OVS is used because the kernel datapath is too slow. The same happens inside the guest, so if the guest is running a regular kernel, the guest might not be able to run at the same pace as the host, and drops might happen.
3.3. Solution
A good first step is to make sure that the vCPUs running the VM are fully allocated to the VM and are not shared with unrelated tasks.
If the VM gets all CPU power possible, then look inside of the guest to make sure it is properly tuned.
Of course, running kernel datapath inside the guest is slower than running a DPDK application.
Chapter 4. Interpreting the output of the pmd-stats-show command in Open vSwitch with DPDK
This procedure tells you how to interpret the output of the pmd-stats-show command (ovs-appctl dpif-netdev/pmd-stats-show) in Open vSwitch (OVS) with DPDK.
4.1. Symptom
An issue you could encounter when using the ovs-appctl dpif-netdev/pmd-stats-show command is that the gathered statistics accumulate from the time the PMD was started. This means that statistics collected before the current load are also reflected, which can give an inaccurate measurement.
4.2. Diagnosis
To obtain more current and useful output, put the system into a steady state and reset the statistics that you want to measure:
# put system into steady state
ovs-appctl dpif-netdev/pmd-stats-clear
# wait <x> seconds
sleep <x>
ovs-appctl dpif-netdev/pmd-stats-show
Here’s an example of the output:
[root@overcloud-compute-0 ~]# ovs-appctl dpif-netdev/pmd-stats-clear && sleep 10 && ovs-appctl dpif-netdev/pmd-stats-show | egrep 'core_id (2|22):' -A9
pmd thread numa_id 0 core_id 22:
    emc hits:17461158
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:4948219259 (25.81%)
    processing cycles:14220835107 (74.19%)
    avg cycles per packet: 1097.81 (19169054366/17461158)
    avg processing cycles per packet: 814.43 (14220835107/17461158)
--
pmd thread numa_id 0 core_id 2:
    emc hits:14874381
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:5460724802 (29.10%)
    processing cycles:13305794333 (70.90%)
    avg cycles per packet: 1261.67 (18766519135/14874381)
    avg processing cycles per packet: 894.54 (13305794333/14874381)
Note that core_id 2 is mainly busy, spending 70% of the time processing and 30% of the time polling.
    polling cycles:5460724802 (29.10%)
    processing cycles:13305794333 (70.90%)
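The processing percentage is the processing cycle count divided by the sum of polling and processing cycles. The sketch below recomputes that share per PMD thread from the raw counters, which can be handy when the output is post-processed in scripts. The helper name pmd_busy is hypothetical, and the parsing assumes the output layout shown in this chapter, which may differ in other OVS versions.

```shell
# pmd_busy: read `ovs-appctl dpif-netdev/pmd-stats-show` output on stdin and
# print, per PMD thread, the share of cycles spent processing packets.
# Hypothetical helper; assumes the counter layout shown in this chapter.
pmd_busy() {
    awk '
        /core_id/ { core = $NF; sub(/:$/, "", core) }
        /polling cycles:/ { split($0, f, ":"); poll = f[2] + 0 }
        /processing cycles:/ {
            split($0, f, ":"); proc = f[2] + 0
            total = poll + proc
            if (total > 0)
                printf "core_id %s: %.2f%% processing\n", core, 100 * proc / total
        }
    '
}

# Example on a compute node:
#   ovs-appctl dpif-netdev/pmd-stats-show | pmd_busy
```

With the counters of core_id 2 above, 13305794333 / (5460724802 + 13305794333) gives 70.90%, which matches the percentage that OVS prints.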
In this output, miss indicates packets that were not classified in the DPDK datapath ('emc' or 'dp' classifier). Under normal circumstances, they would then be sent to the ofproto layer. On some rare occasions, due to a flow revalidation lock or if the ofproto layer returns an error, the packet is dropped. In this case, lost is also incremented to indicate the loss.
For more details, see: https://software.intel.com/en-us/articles/ovs-dpdk-datapath-classifier
    emc hits:14874381
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
4.3. Solution
This section explains the procedures for resolving the problem.
4.3.1. Idle PMD
The following example shows a system where the core_ids serve the PMDs that are pinned to dpdk0, with only management traffic flowing through dpdk0:
[root@overcloud-compute-0 ~]# ovs-appctl dpif-netdev/pmd-stats-clear && sleep 10 && ovs-appctl dpif-netdev/pmd-stats-show | egrep 'core_id (2|22):' -A9
pmd thread numa_id 0 core_id 22:
    emc hits:0
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:12613298746 (100.00%)
    processing cycles:0 (0.00%)
--
pmd thread numa_id 0 core_id 2:
    emc hits:5
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:12480023709 (100.00%)
    processing cycles:14354 (0.00%)
    avg cycles per packet: 2496007612.60 (12480038063/5)
    avg processing cycles per packet: 2870.80 (14354/5)
4.3.2. PMD under load test with packet drop
The following example shows a system where the core_ids serve the PMDs that are pinned to dpdk0, with a load test flowing through dpdk0, causing a high number of RX drops:
[root@overcloud-compute-0 ~]# ovs-appctl dpif-netdev/pmd-stats-clear && sleep 10 && ovs-appctl dpif-netdev/pmd-stats-show | egrep 'core_id (2|4|22|24):' -A9
pmd thread numa_id 0 core_id 22:
    emc hits:35497952
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:1446658819 (6.61%)
    processing cycles:20453874401 (93.39%)
    avg cycles per packet: 616.95 (21900533220/35497952)
    avg processing cycles per packet: 576.20 (20453874401/35497952)
--
pmd thread numa_id 0 core_id 2:
    emc hits:30183582
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:2
    lost:0
    polling cycles:1497174615 (6.85%)
    processing cycles:20354613261 (93.15%)
    avg cycles per packet: 723.96 (21851787876/30183584)
    avg processing cycles per packet: 674.36 (20354613261/30183584)
Where packet drops occur, you can see a high ratio of processing cycles vs polling cycles (more than 90% processing cycles):
    polling cycles:1497174615 (6.85%)
    processing cycles:20354613261 (93.15%)
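The per-packet averages in this output can be reproduced from the raw counters: the packet count is the sum of emc hits, megaflow hits and miss, and the cycle total is the sum of polling and processing cycles. The following sketch redoes that arithmetic for one PMD thread. The helper name pmd_avg and its argument order are hypothetical.

```shell
# pmd_avg EMC_HITS MEGAFLOW_HITS MISS POLLING_CYCLES PROCESSING_CYCLES
# Print the packet count and the average cycles per packet for one PMD.
# Hypothetical helper that mirrors the averages printed by pmd-stats-show.
pmd_avg() {
    awk -v emc="$1" -v mega="$2" -v miss="$3" -v poll="$4" -v proc="$5" '
        BEGIN {
            packets = emc + mega + miss
            if (packets == 0) { print "0 packets"; exit }
            printf "%d packets, %.2f cycles/packet, %.2f processing cycles/packet\n",
                   packets, (poll + proc) / packets, proc / packets
        }
    '
}

# Counters of core_id 2 from the load test above:
pmd_avg 30183582 0 2 1497174615 20354613261
```

For core_id 2 this yields 30183584 packets, 723.96 cycles per packet and 674.36 processing cycles per packet, the same figures that OVS reports. Note that the two missed packets are part of the divisor.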
4.3.3. PMD under load test with 50% of Mpps capacity
The following example shows a system where the core_ids serve the PMDs that are pinned to dpdk0, with a load test flowing through dpdk0, sending 6.4 Mpps, which is around 50% of the maximum capacity of this dpdk0 interface (around 12.85 Mpps):
[root@overcloud-compute-0 ~]# ovs-appctl dpif-netdev/pmd-stats-clear && sleep 10 && ovs-appctl dpif-netdev/pmd-stats-show | egrep 'core_id (2|4|22|24):' -A9
pmd thread numa_id 0 core_id 22:
    emc hits:17461158
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:4948219259 (25.81%)
    processing cycles:14220835107 (74.19%)
    avg cycles per packet: 1097.81 (19169054366/17461158)
    avg processing cycles per packet: 814.43 (14220835107/17461158)
--
pmd thread numa_id 0 core_id 2:
    emc hits:14874381
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:5460724802 (29.10%)
    processing cycles:13305794333 (70.90%)
    avg cycles per packet: 1261.67 (18766519135/14874381)
    avg processing cycles per packet: 894.54 (13305794333/14874381)
Where the pps are ca. half of the maximum for the interface, you can see a lower ratio of processing cycles vs polling cycles (ca. 70% processing cycles):
    polling cycles:5460724802 (29.10%)
    processing cycles:13305794333 (70.90%)
4.3.4. Hit vs miss vs lost
The following excerpts from the man pages cover the subject:
man ovs-vswitchd
(...)
   DPIF-NETDEV COMMANDS
       These commands are used to expose internal information (mostly statistics) about the ``dpif-netdev'' userspace datapath. If there is only one datapath (as is often the case, unless dpctl/ commands are used), the dp argument can be omitted.

       dpif-netdev/pmd-stats-show [dp]
              Shows performance statistics for each pmd thread of the datapath dp. The special thread ``main'' sums up the statistics of every non pmd thread. The sum of ``emc hits'', ``masked hits'' and ``miss'' is the number of packets received by the datapath. Cycles are counted using the TSC or similar facilities (when available on the platform). To reset these counters use dpif-netdev/pmd-stats-clear. The duration of one cycle depends on the measuring infrastructure.
(...)

man ovs-dpctl
(...)
       dump-dps
              Prints the name of each configured datapath on a separate line.

       [-s | --statistics] show [dp...]
              Prints a summary of configured datapaths, including their datapath numbers and a list of ports connected to each datapath. (The local port is identified as port 0.) If -s or --statistics is specified, then packet and byte counters are also printed for each port.

              The datapath numbers consists of flow stats and mega flow mask stats.

              The "lookups" row displays three stats related to flow lookup triggered by processing incoming packets in the datapath. "hit" displays number of packets matches existing flows. "missed" displays the number of packets not matching any existing flow and require user space processing. "lost" displays number of packets destined for user space process but subsequently dropped before reaching userspace. The sum of "hit" and "miss" equals to the total number of packets datapath processed.
(...)

man ovs-vswitchd
(...)
       dpctl/show [-s | --statistics] [dp...]
              Prints a summary of configured datapaths, including their datapath numbers and a list of ports connected to each datapath. (The local port is identified as port 0.) If -s or --statistics is specified, then packet and byte counters are also printed for each port.

              The datapath numbers consists of flow stats and mega flow mask stats.

              The "lookups" row displays three stats related to flow lookup triggered by processing incoming packets in the datapath. "hit" displays number of packets matches existing flows. "missed" displays the number of packets not matching any existing flow and require user space processing. "lost" displays number of packets destined for user space process but subsequently dropped before reaching userspace. The sum of "hit" and "miss" equals to the total number of packets datapath processed.
(...)
Some of the documentation refers to the kernel datapath, so when it says user space processing it means that the packet is not classified in the kernel sw caches (the equivalents of emc and dpcls) and is sent to the ofproto layer in userspace.
Chapter 5. Attaching and Detaching SR-IOV ports in nova
Use this procedure to attach and detach SR-IOV ports correctly.
5.1. Symptom
Cannot attach or detach SR-IOV ports in nova in Red Hat OpenStack Platform 10 and later. Nova logs report No conversion for VIF type hw_veb yet.
5.2. Diagnosis
You cannot attach or detach SR-IOV ports to an instance that has already been spawned. SR-IOV ports need to be attached at instance creation.
5.3. Solution
The following example attempts to attach interfaces after an instance boot:
RHEL_INSTANCE_COUNT=1
NETID=$(neutron net-list | grep provider1 | awk '{print $2}')
for i in `seq 1 $RHEL_INSTANCE_COUNT`;do
#  nova floating-ip-create provider1
  portid1=`neutron port-create sriov1 --name sriov1 --binding:vnic-type direct | awk '$2 == "id" {print $(NF-1)}'`
  portid2=`neutron port-create sriov2 --name sriov2 --binding:vnic-type direct | awk '$2 == "id" {print $(NF-1)}'`
  openstack server create --flavor m1.small --image rhel --nic net-id=$NETID --key-name id_rsa sriov_vm${i}
  serverid=`openstack server list | grep sriov_vm${i} | awk '{print $2}'`
  status="NONE"
  while [ "$status" != "ACTIVE" ]; do
    echo "Server $serverid not active ($status)" ; sleep 5 ;
    status=`openstack server show $serverid | grep -i status | awk '{print $4}'`
  done
  nova interface-attach --port-id $portid1 $serverid
  nova interface-attach --port-id $portid2 $serverid
done
This fails with the following error:
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <type 'exceptions.KeyError'> (HTTP 500) (Request-ID: req-36b544f4-91a6-442e-a30d-6148220d1449)
The correct method is to spawn an instance directly with SR-IOV ports:
RHEL_INSTANCE_COUNT=1
NETID=$(neutron net-list | grep provider1 | awk '{print $2}')
for i in `seq 1 $RHEL_INSTANCE_COUNT`;do
#  nova floating-ip-create provider1
  portid1=`neutron port-create sriov1 --name sriov1 --binding:vnic-type direct | awk '$2 == "id" {print $(NF-1)}'`
  portid2=`neutron port-create sriov2 --name sriov2 --binding:vnic-type direct | awk '$2 == "id" {print $(NF-1)}'`
  openstack server create --flavor m1.small --image rhel --nic net-id=$NETID --nic port-id=$portid1 --nic port-id=$portid2 --key-name id_rsa sriov_vm${i}
done
This works without issues.
Chapter 6. Configure and Test LACP Bonding with Open vSwitch DPDK
OVS bonds with LACP may not be supported depending on the version of OSP you are using. Please check the product documentation to verify that OVS bonds with LACP are supported.
To use Open vSwitch DPDK to configure and test LACP bonding, you need to:
- Configure the switch ports for LACP.
- Configure Linux kernel bonding for LACP as a baseline.
- Configure OVS DPDK bonding for LACP.
This topic describes switch configuration with a Dell S4048-ON switch. The RHEL and OVS configuration is the same regardless of the switch vendor, but other vendors' switch operating systems use different syntax to configure LACP.
6.1. Configuring the Switch Ports for LACP
Reset the switch interfaces to their default settings:
S4048-ON-sw#config t
S4048-ON-sw(conf)#default int te1/2
S4048-ON-sw(conf)#default int te1/7
Configure the port-channel and other port settings as shown here:
S4048-ON-sw(conf)#int range te1/2,te1/7
S4048-ON-sw(conf-if-range-te-1/2,te-1/7)#port-channel-protocol lacp
S4048-ON-sw(conf-if-range-te-1/2,te-1/7-lacp)#
S4048-ON-sw(conf-if-range-te-1/2,te-1/7-lacp)#port-channel 1 mode active
S4048-ON-sw(conf-if-range-te-1/2,te-1/7-lacp)#end
S4048-ON-sw#config t
S4048-ON-sw(conf)#int range te1/2,te1/7
S4048-ON-sw(conf-if-range-te-1/2,te-1/7)# no ip address
S4048-ON-sw(conf-if-range-te-1/2,te-1/7)# mtu 9216
S4048-ON-sw(conf-if-range-te-1/2,te-1/7)# flowcontrol rx on tx off
S4048-ON-sw(conf-if-range-te-1/2,te-1/7)# no shutdown
S4048-ON-sw(conf-if-range-te-1/2,te-1/7)#end
S4048-ON-sw#show run int te1/2
!
interface TenGigabitEthernet 1/2
 no ip address
 mtu 9216
 flowcontrol rx on tx off
!
 port-channel-protocol LACP
  port-channel 1 mode active
 no shutdown
Configure the VLANs:
S4048-ON-sw#config t
S4048-ON-sw(conf)#int range vlan901-909
S4048-ON-sw(conf-if-range-vl-901-909)#tagged Port-channel 1
S4048-ON-sw(conf-if-range-vl-901-909)#end
S4048-ON-sw#
Verify VLAN tagging:
S4048-ON-sw#show vlan id 902
Codes: * - Default VLAN, G - GVRP VLANs, R - Remote Port Mirroring VLANs, P - Primary, C - Community, I - Isolated
       O - Openflow, Vx - Vxlan
Q: U - Untagged, T - Tagged
   x - Dot1x untagged, X - Dot1x tagged
   o - OpenFlow untagged, O - OpenFlow tagged
   G - GVRP tagged, M - Vlan-stack
   i - Internal untagged, I - Internal tagged, v - VLT untagged, V - VLT tagged

    NUM    Status    Description    Q Ports
    902    Active    Tenant         T Po1()
                                    T Te 1/1,1/3-1/6,1/8-1/20
Verify the LACP configuration:
S4048-ON-sw#show lacp 1
Port-channel 1 admin up, oper down, mode lacp
LACP Fast Switch-Over Disabled
Actor System ID: Priority 32768, Address 1418.7789.9a8a
Partner System ID: Priority 0, Address 0000.0000.0000
Actor Admin Key 1, Oper Key 1, Partner Oper Key 1, VLT Peer Oper Key 1
LACP LAG 1 is an individual link
LACP LAG 1 is a normal LAG

A - Active LACP, B - Passive LACP, C - Short Timeout, D - Long Timeout
E - Aggregatable Link, F - Individual Link, G - IN_SYNC, H - OUT_OF_SYNC
I - Collection enabled, J - Collection disabled, K - Distribution enabled
L - Distribution disabled, M - Partner Defaulted, N - Partner Non-defaulted,
O - Receiver is in expired state, P - Receiver is not in expired state

Port Te 1/2 is disabled, LACP is disabled and mode is lacp
Port State: Not in Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEHJLMP Key 1 Priority 32768
  Partner is not present

Port Te 1/7 is enabled, LACP is enabled and mode is lacp
Port State: Not in Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEHJLMP Key 1 Priority 32768
  Partner is not present
6.2. Configuring Linux Kernel Bonding for LACP as a Baseline
Start with the simplest scenario: configure Linux kernel bonding as a baseline and verify that the switch and RHEL can form an LACP bond before you involve OVS DPDK.
Move all interfaces to the kernel space and test with kernel space bonding. In this example, p1p1 maps to bus address 0000:04:00.0 and p1p2 maps to bus address 0000:04:00.1.
[root@baremetal ~]# driverctl unset-override 0000:04:00.0
[root@baremetal ~]# driverctl unset-override 0000:04:00.1
Load the bonding driver, configure a bond interface (bond10), and enslave interfaces p1p1 and p1p2:
[root@baremetal ~]# modprobe bonding miimon=100 mode=4 lacp_rate=1
[root@baremetal ~]# ip link add name bond10 type bond
[root@baremetal ~]# ifenslave bond10 p1p1 p1p2
Illegal operation; the specified master interface 'bond10' is not up.
[root@baremetal ~]# ip link set dev bond10 up
[root@baremetal ~]# ifenslave bond10 p1p1 p1p2
Verify LACP from RHEL:
[root@baremetal ~]# cat /proc/net/bonding/bond10
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: a0:36:9f:e3:dd:c8
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 2
    Actor Key: 13
    Partner Key: 1
    Partner Mac Address: 14:18:77:89:9a:8a

Slave Interface: p1p1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:36:9f:e3:dd:c8
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: monitoring
Partner Churn State: monitoring
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: a0:36:9f:e3:dd:c8
    port key: 13
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 32768
    system mac address: 14:18:77:89:9a:8a
    oper key: 1
    port priority: 32768
    port number: 203
    port state: 63

Slave Interface: p1p2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:36:9f:e3:dd:ca
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: monitoring
Partner Churn State: monitoring
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: a0:36:9f:e3:dd:c8
    port key: 13
    port priority: 255
    port number: 2
    port state: 63
details partner lacp pdu:
    system priority: 32768
    system mac address: 14:18:77:89:9a:8a
    oper key: 1
    port priority: 32768
    port number: 208
    port state: 63
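The port state: 63 values in the output above are the LACP state byte printed as a decimal number. The following is a minimal sketch that decodes such a value into flag names, assuming the standard IEEE 802.3ad bit order (bit 0 is activity, bit 7 is expired). The function name decode_lacp_port_state is made up for this illustration and is not part of any package.

```shell
#!/bin/sh
# decode_lacp_port_state is a hypothetical helper, not part of any package.
# It assumes the IEEE 802.3ad state byte layout, lowest bit first.
decode_lacp_port_state() (
    state=$1
    bit=0
    for name in activity timeout aggregation synchronization \
                collecting distributing defaulted expired; do
        # Print the flag name when the corresponding bit is set.
        if [ $(( (state >> bit) & 1 )) -eq 1 ]; then
            printf '%s ' "$name"
        fi
        bit=$(( bit + 1 ))
    done
    echo
)

decode_lacp_port_state 63
```

For 63, the six low bits are set and the defaulted and expired bits are clear. A set timeout bit means the short timeout, which is consistent with the LACP rate: fast line in the same output.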
Verify LACP from the switch:
S4048-ON-sw#show lacp 1
Port-channel 1 admin up, oper up, mode lacp
LACP Fast Switch-Over Disabled
Actor System ID: Priority 32768, Address 1418.7789.9a8a
Partner System ID: Priority 65535, Address a036.9fe3.ddc8
Actor Admin Key 1, Oper Key 1, Partner Oper Key 13, VLT Peer Oper Key 1
LACP LAG 1 is an aggregatable link
LACP LAG 1 is a normal LAG

A - Active LACP, B - Passive LACP, C - Short Timeout, D - Long Timeout
E - Aggregatable Link, F - Individual Link, G - IN_SYNC, H - OUT_OF_SYNC
I - Collection enabled, J - Collection disabled, K - Distribution enabled
L - Distribution disabled, M - Partner Defaulted, N - Partner Non-defaulted,
O - Receiver is in expired state, P - Receiver is not in expired state

Port Te 1/2 is enabled, LACP is enabled and mode is lacp
Port State: Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEGIKNP Key 1 Priority 32768
  Partner Admin: State BDFHJLMP Key 0 Priority 0
           Oper: State ACEGIKNP Key 13 Priority 255

Port Te 1/7 is enabled, LACP is enabled and mode is lacp
Port State: Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEGIKNP Key 1 Priority 32768
  Partner Admin: State BDFHJLMP Key 0 Priority 0
           Oper: State ACEGIKNP Key 13 Priority 255
S4048-ON-sw#
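The State strings in the switch output, such as ACEGIKNP, are built from the letters in the legend that the switch prints. As an illustrative sketch only, the following expands a State string into one line per flag. The function name decode_switch_lacp_flags is an assumption made for this example; the letter meanings are copied from the legend above.

```shell
#!/bin/sh
# decode_switch_lacp_flags is a hypothetical helper; the letter meanings are
# copied from the legend printed by "show lacp 1" on the switch.
decode_switch_lacp_flags() {
    # Split the string into one character per line and look up each letter.
    echo "$1" | fold -w 1 | while read -r flag; do
        case "$flag" in
            A) echo "A: Active LACP" ;;
            B) echo "B: Passive LACP" ;;
            C) echo "C: Short Timeout" ;;
            D) echo "D: Long Timeout" ;;
            E) echo "E: Aggregatable Link" ;;
            F) echo "F: Individual Link" ;;
            G) echo "G: IN_SYNC" ;;
            H) echo "H: OUT_OF_SYNC" ;;
            I) echo "I: Collection enabled" ;;
            J) echo "J: Collection disabled" ;;
            K) echo "K: Distribution enabled" ;;
            L) echo "L: Distribution disabled" ;;
            M) echo "M: Partner Defaulted" ;;
            N) echo "N: Partner Non-defaulted" ;;
            O) echo "O: Receiver is in expired state" ;;
            P) echo "P: Receiver is not in expired state" ;;
            *) echo "$flag: unknown flag" ;;
        esac
    done
}

decode_switch_lacp_flags ACEGIKNP
```

A port that is part of the bundle shows G, I, and K in its Oper state, while a port that is not in the bundle shows H, J, and L instead.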
Remove the bonding configuration:
[root@baremetal ~]# ip link del dev bond10
[root@baremetal ~]#
To change the bonding mode, see "How to change the bonding mode without rebooting the system?"
6.3. Configuring OVS DPDK Bonding for LACP
The next step is to configure an LACP bond within OVS DPDK.
6.3.1. Prepare Open vSwitch
Make sure that huge pages, IOMMU, and CPU isolation parameters are set on the RHEL kernel command line:
[root@baremetal bonding]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.17.1.el7.x86_64 root=UUID=fa414390-f78d-49d4-a164-54615a32977b ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on isolcpus=2,4,6,8,10,12,14,16,18,22,24,26,28,30,32,34,36,38,3,5,7,9,11,13,15,17,19,23,25,27,29,31,33,35,37,39 skew_tick=1 nohz=on nohz_full=2,4,6,8,10,12,14,16,18,22,24,26,28,30,32,34,36,38,3,5,7,9,11,13,15,17,19,23,25,27,29,31,33,35,37,39 rcu_nocbs=2,4,6,8,10,12,14,16,18,22,24,26,28,30,32,34,36,38,3,5,7,9,11,13,15,17,19,23,25,27,29,31,33,35,37,39 tuned.non_isolcpus=00300003 intel_pstate=disable nosoftlockup
Configure OVS for DPDK:
[root@baremetal bonding]# ovs-vsctl list Open_vSwitch | grep other
other_config : {}
[root@baremetal bonding]# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init="true"
[root@baremetal bonding]# ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask=0x17c0017c
[root@baremetal bonding]# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x00000001
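The pmd-cpu-mask and dpdk-lcore-mask values are hexadecimal bitmasks in which bit N selects CPU N. The following sketch expands a mask into CPU IDs so that it can be compared with the isolcpus list on the kernel command line. The function name mask_to_cpus is an assumption made for this example and is not an OVS command.

```shell
#!/bin/sh
# mask_to_cpus is a hypothetical helper, not an OVS command.
# It prints the CPU IDs selected by a hexadecimal mask, one per line.
mask_to_cpus() (
    mask=$(( $1 ))          # accepts 0x-prefixed hex such as 0x17c0017c
    cpu=0
    while [ "$mask" -gt 0 ]; do
        # The lowest bit tells whether the current CPU ID is selected.
        if [ $(( mask & 1 )) -eq 1 ]; then
            echo "$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
)

mask_to_cpus 0x17c0017c
```

For 0x17c0017c this lists CPUs 2, 3, 4, 5, 6, 8, 22, 23, 24, 25, 26, and 28, all of which appear in the isolcpus list shown earlier. The dpdk-lcore-mask value 0x00000001 selects CPU 0 only.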
Switch interfaces into user space:
[root@baremetal bonding]# ethtool -i p1p1 | grep bus
bus-info: 0000:04:00.0
[root@baremetal bonding]# ethtool -i p1p2 | grep bus
bus-info: 0000:04:00.1
[root@baremetal bonding]# driverctl set-override 0000:04:00.0 vfio-pci
[root@baremetal bonding]# driverctl set-override 0000:04:00.1 vfio-pci
Restart Open vSwitch with journalctl -u ovs-vswitchd -f & running in the background:
[root@baremetal bonding]# systemctl restart openvswitch
Apr 19 13:02:49 baremetal systemd[1]: Stopping Open vSwitch Forwarding Unit...
Apr 19 13:02:49 baremetal systemd[1]: Stopping Open vSwitch Forwarding Unit...
Apr 19 13:02:49 baremetal ovs-ctl[91399]: Exiting ovs-vswitchd (91202) [ OK ]
Apr 19 13:02:49 baremetal ovs-ctl[91399]: Exiting ovs-vswitchd (91202) [ OK ]
Apr 19 13:02:49 baremetal systemd[1]: Starting Open vSwitch Forwarding Unit...
Apr 19 13:02:49 baremetal systemd[1]: Starting Open vSwitch Forwarding Unit...
Apr 19 13:02:49 baremetal ovs-ctl[91483]: Starting ovs-vswitchd EAL: Detected 40 lcore(s)
Apr 19 13:02:49 baremetal ovs-ctl[91483]: Starting ovs-vswitchd EAL: Detected 40 lcore(s)
Apr 19 13:02:49 baremetal ovs-ctl[91483]: EAL: Probing VFIO support...
Apr 19 13:02:49 baremetal ovs-vswitchd[91509]: EAL: Probing VFIO support...
Apr 19 13:02:49 baremetal ovs-ctl[91483]: EAL: VFIO support initialized
Apr 19 13:02:49 baremetal ovs-vswitchd[91509]: EAL: VFIO support initialized
Apr 19 13:02:49 baremetal ovs-ctl[91483]: EAL: Probing VFIO support...
Apr 19 13:02:49 baremetal ovs-vswitchd[91509]: EAL: Probing VFIO support...
Apr 19 13:02:49 baremetal ovs-ctl[91483]: EAL: VFIO support initialized
Apr 19 13:02:49 baremetal ovs-vswitchd[91509]: EAL: VFIO support initialized
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: PCI device 0000:04:00.0 on NUMA socket 0
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: PCI device 0000:04:00.0 on NUMA socket 0
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: PCI device 0000:04:00.0 on NUMA socket 0
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: using IOMMU type 1 (Type 1)
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: using IOMMU type 1 (Type 1)
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: PCI device 0000:04:00.0 on NUMA socket 0
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: using IOMMU type 1 (Type 1)
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: using IOMMU type 1 (Type 1)
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: Ignore mapping IO port bar(2) addr: 3021
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: Ignore mapping IO port bar(2) addr: 3021
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: Ignore mapping IO port bar(2) addr: 3021
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: Ignore mapping IO port bar(2) addr: 3021
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: PCI device 0000:04:00.1 on NUMA socket 0
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: PCI device 0000:04:00.1 on NUMA socket 0
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: PCI device 0000:04:00.1 on NUMA socket 0
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: PCI device 0000:04:00.1 on NUMA socket 0
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: Ignore mapping IO port bar(2) addr: 3001
Apr 19 13:02:59 baremetal ovs-ctl[91483]: EAL: Ignore mapping IO port bar(2) addr: 3001
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: Ignore mapping IO port bar(2) addr: 3001
Apr 19 13:02:59 baremetal ovs-vswitchd[91509]: EAL: Ignore mapping IO port bar(2) addr: 3001
Apr 19 13:03:00 baremetal ovs-ctl[91483]: EAL: PCI device 0000:05:00.0 on NUMA socket 0
Apr 19 13:03:00 baremetal ovs-ctl[91483]: EAL: PCI device 0000:05:00.0 on NUMA socket 0
Apr 19 13:03:00 baremetal ovs-vswitchd[91509]: EAL: PCI device 0000:05:00.0 on NUMA socket 0
Apr 19 13:03:00 baremetal ovs-vswitchd[91509]: EAL: PCI device 0000:05:00.0 on NUMA socket 0
Apr 19 13:03:00 baremetal ovs-ctl[91483]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:03:00 baremetal ovs-ctl[91483]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:03:00 baremetal ovs-ctl[91483]: EAL: PCI device 0000:05:00.1 on NUMA socket 0
Apr 19 13:03:00 baremetal ovs-ctl[91483]: EAL: PCI device 0000:05:00.1 on NUMA socket 0
Apr 19 13:03:00 baremetal ovs-ctl[91483]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:03:00 baremetal ovs-ctl[91483]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:03:00 baremetal ovs-vswitchd[91509]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:03:00 baremetal ovs-vswitchd[91509]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:03:00 baremetal ovs-vswitchd[91509]: EAL: PCI device 0000:05:00.1 on NUMA socket 0
Apr 19 13:03:00 baremetal ovs-vswitchd[91509]: EAL: PCI device 0000:05:00.1 on NUMA socket 0
Apr 19 13:03:00 baremetal ovs-vswitchd[91509]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:03:00 baremetal ovs-vswitchd[91509]: EAL: probe driver: 8086:154d net_ixgbe
Apr 19 13:03:00 baremetal ovs-ctl[91483]: [ OK ]
Apr 19 13:03:00 baremetal ovs-ctl[91483]: [ OK ]
Apr 19 13:03:00 baremetal ovs-ctl[91483]: Enabling remote OVSDB managers [ OK ]
Apr 19 13:03:00 baremetal ovs-ctl[91483]: Enabling remote OVSDB managers [ OK ]
Apr 19 13:03:00 baremetal systemd[1]: Started Open vSwitch Forwarding Unit.
Apr 19 13:03:00 baremetal systemd[1]: Started Open vSwitch Forwarding Unit.
[root@baremetal bonding]#
6.3.2. Configure LACP Bond
Add the bond:
[root@baremetal bonding]# ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
[root@baremetal bonding]# ovs-vsctl add-bond ovsbr0 dpdkbond dpdk0 dpdk1 bond_mode=balance-tcp lacp=active -- set interface dpdk0 type=dpdk -- set Interface dpdk1 type=dpdk
Verify from Open vSwitch:
[root@baremetal bonding]# ovs-appctl lacp/show dpdkbond
---- dpdkbond ----
    status: active negotiated
    sys_id: a0:36:9f:e3:dd:c8
    sys_priority: 65534
    aggregation key: 1
    lacp_time: slow

slave: dpdk0: current attached
    port_id: 2
    port_priority: 65535
    may_enable: true

    actor sys_id: a0:36:9f:e3:dd:c8
    actor sys_priority: 65534
    actor port_id: 2
    actor port_priority: 65535
    actor key: 1
    actor state: activity aggregation synchronized collecting distributing

    partner sys_id: 14:18:77:89:9a:8a
    partner sys_priority: 32768
    partner port_id: 203
    partner port_priority: 32768
    partner key: 1
    partner state: activity timeout aggregation synchronized collecting distributing

slave: dpdk1: current attached
    port_id: 1
    port_priority: 65535
    may_enable: true

    actor sys_id: a0:36:9f:e3:dd:c8
    actor sys_priority: 65534
    actor port_id: 1
    actor port_priority: 65535
    actor key: 1
    actor state: activity aggregation synchronized collecting distributing

    partner sys_id: 14:18:77:89:9a:8a
    partner sys_priority: 32768
    partner port_id: 208
    partner port_priority: 32768
    partner key: 1
    partner state: activity timeout aggregation synchronized collecting distributing
[root@baremetal bonding]# ovs-appctl bond/show dpdkbond
---- dpdkbond ----
bond_mode: balance-tcp
bond may use recirculation: yes, Recirc-ID : 1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
next rebalance: 6817 ms
lacp_status: negotiated
active slave mac: a0:36:9f:e3:dd:c8(dpdk0)

slave dpdk0: enabled
    active slave
    may_enable: true

slave dpdk1: enabled
    may_enable: true
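In the lacp/show output, a healthy bond member is reported as current attached, and a member that lost its partner is reported differently (for example, defaulted detached). The following sketch filters lacp/show output for members that are not current attached. The function name lacp_unsynced_slaves is hypothetical, and the sketch assumes the line format shown in this chapter, with each slave: line on its own line; other OVS releases may label these lines differently.

```shell
#!/bin/sh
# lacp_unsynced_slaves is a hypothetical helper. It reads the text printed by
# "ovs-appctl lacp/show <bond>" on standard input and prints the name of every
# slave whose status line is not "current attached".
lacp_unsynced_slaves() {
    awk '$1 == "slave:" && !($3 == "current" && $4 == "attached") {
        name = $2
        sub(/:$/, "", name)   # strip the trailing colon from "dpdk1:"
        print name
    }'
}

# Usage on a host running OVS (not executed here):
#   ovs-appctl lacp/show dpdkbond | lacp_unsynced_slaves
```

An empty result means that every member negotiated LACP with the switch. Any name that is printed is a candidate for checking the matching switch port.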
Verify from the switch:
S4048-ON-sw#show lacp 1
Port-channel 1 admin up, oper up, mode lacp
LACP Fast Switch-Over Disabled
Actor System ID: Priority 32768, Address 1418.7789.9a8a
Partner System ID: Priority 65534, Address a036.9fe3.ddc8
Actor Admin Key 1, Oper Key 1, Partner Oper Key 1, VLT Peer Oper Key 1
LACP LAG 1 is an aggregatable link
LACP LAG 1 is a normal LAG

A - Active LACP, B - Passive LACP, C - Short Timeout, D - Long Timeout
E - Aggregatable Link, F - Individual Link, G - IN_SYNC, H - OUT_OF_SYNC
I - Collection enabled, J - Collection disabled, K - Distribution enabled
L - Distribution disabled, M - Partner Defaulted, N - Partner Non-defaulted,
O - Receiver is in expired state, P - Receiver is not in expired state

Port Te 1/2 is enabled, LACP is enabled and mode is lacp
Port State: Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEGIKNP Key 1 Priority 32768
  Partner Admin: State BDFHJLMP Key 0 Priority 0
           Oper: State ADEGIKNP Key 1 Priority 65535

Port Te 1/7 is enabled, LACP is enabled and mode is lacp
Port State: Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEGIKNP Key 1 Priority 32768
  Partner Admin: State BDFHJLMP Key 0 Priority 0
           Oper: State ADEGIKNP Key 1 Priority 65535
S4048-ON-sw#
6.3.3. Enabling / Disabling Ports from OVS
Individual ports can be enabled or shut down with ovs-ofctl mod-port <bridge> <port> [up|down].
Shut down a port:
[root@baremetal bonding]# ovs-ofctl mod-port ovsbr0 dpdk1 down
Verify the shutdown:
[root@baremetal bonding]# ovs-appctl lacp/show dpdkbond
---- dpdkbond ----
    status: active negotiated
    sys_id: a0:36:9f:e3:dd:c8
    sys_priority: 65534
    aggregation key: 1
    lacp_time: slow

slave: dpdk0: current attached
    port_id: 2
    port_priority: 65535
    may_enable: true

    actor sys_id: a0:36:9f:e3:dd:c8
    actor sys_priority: 65534
    actor port_id: 2
    actor port_priority: 65535
    actor key: 1
    actor state: activity aggregation synchronized collecting distributing

    partner sys_id: 14:18:77:89:9a:8a
    partner sys_priority: 32768
    partner port_id: 203
    partner port_priority: 32768
    partner key: 1
    partner state: activity timeout aggregation synchronized collecting distributing

slave: dpdk1: defaulted detached
    port_id: 1
    port_priority: 65535
    may_enable: false

    actor sys_id: a0:36:9f:e3:dd:c8
    actor sys_priority: 65534
    actor port_id: 1
    actor port_priority: 65535
    actor key: 1
    actor state: activity aggregation defaulted

    partner sys_id: 00:00:00:00:00:00
    partner sys_priority: 0
    partner port_id: 0
    partner port_priority: 0
    partner key: 0
    partner state:
[root@baremetal bonding]# ovs-appctl bond/show dpdkbond
---- dpdkbond ----
bond_mode: balance-tcp
bond may use recirculation: yes, Recirc-ID : 1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
next rebalance: 3315 ms
lacp_status: negotiated
active slave mac: a0:36:9f:e3:dd:c8(dpdk0)

slave dpdk0: enabled
    active slave
    may_enable: true

slave dpdk1: disabled
    may_enable: false
Verify on the switch:
S4048-ON-sw#show lacp 1
Port-channel 1 admin up, oper up, mode lacp
LACP Fast Switch-Over Disabled
Actor System ID: Priority 32768, Address 1418.7789.9a8a
Partner System ID: Priority 65534, Address a036.9fe3.ddc8
Actor Admin Key 1, Oper Key 1, Partner Oper Key 1, VLT Peer Oper Key 1
LACP LAG 1 is an aggregatable link
LACP LAG 1 is a normal LAG

A - Active LACP, B - Passive LACP, C - Short Timeout, D - Long Timeout
E - Aggregatable Link, F - Individual Link, G - IN_SYNC, H - OUT_OF_SYNC
I - Collection enabled, J - Collection disabled, K - Distribution enabled
L - Distribution disabled, M - Partner Defaulted, N - Partner Non-defaulted,
O - Receiver is in expired state, P - Receiver is not in expired state

Port Te 1/2 is enabled, LACP is enabled and mode is lacp
Port State: Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEGIKNP Key 1 Priority 32768
  Partner Admin: State BDFHJLMP Key 0 Priority 0
           Oper: State ADEGIKNP Key 1 Priority 65535

Port Te 1/7 is disabled, LACP is disabled and mode is lacp
Port State: Not in Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEHJLNP Key 1 Priority 32768
  Partner is not present
Bring up the port again:
[root@baremetal bonding]# ovs-ofctl mod-port ovsbr0 dpdk1 up
Verify from RHEL:
[root@baremetal bonding]# ovs-appctl bond/show dpdkbond
---- dpdkbond ----
bond_mode: balance-tcp
bond may use recirculation: yes, Recirc-ID : 1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
next rebalance: 7846 ms
lacp_status: negotiated
active slave mac: a0:36:9f:e3:dd:c8(dpdk0)

slave dpdk0: enabled
    active slave
    may_enable: true

slave dpdk1: enabled
    may_enable: true
[root@baremetal bonding]# ovs-appctl lacp/show dpdkbond
---- dpdkbond ----
    status: active negotiated
    sys_id: a0:36:9f:e3:dd:c8
    sys_priority: 65534
    aggregation key: 1
    lacp_time: slow

slave: dpdk0: current attached
    port_id: 2
    port_priority: 65535
    may_enable: true

    actor sys_id: a0:36:9f:e3:dd:c8
    actor sys_priority: 65534
    actor port_id: 2
    actor port_priority: 65535
    actor key: 1
    actor state: activity aggregation synchronized collecting distributing

    partner sys_id: 14:18:77:89:9a:8a
    partner sys_priority: 32768
    partner port_id: 203
    partner port_priority: 32768
    partner key: 1
    partner state: activity timeout aggregation synchronized collecting distributing

slave: dpdk1: current attached
    port_id: 1
    port_priority: 65535
    may_enable: true

    actor sys_id: a0:36:9f:e3:dd:c8
    actor sys_priority: 65534
    actor port_id: 1
    actor port_priority: 65535
    actor key: 1
    actor state: activity aggregation synchronized collecting distributing

    partner sys_id: 14:18:77:89:9a:8a
    partner sys_priority: 32768
    partner port_id: 208
    partner port_priority: 32768
    partner key: 1
    partner state: activity timeout aggregation synchronized collecting distributing
Verify from the switch:
S4048-ON-sw#show lacp 1
Port-channel 1 admin up, oper up, mode lacp
LACP Fast Switch-Over Disabled
Actor System ID: Priority 32768, Address 1418.7789.9a8a
Partner System ID: Priority 65534, Address a036.9fe3.ddc8
Actor Admin Key 1, Oper Key 1, Partner Oper Key 1, VLT Peer Oper Key 1
LACP LAG 1 is an aggregatable link
LACP LAG 1 is a normal LAG

A - Active LACP, B - Passive LACP, C - Short Timeout, D - Long Timeout
E - Aggregatable Link, F - Individual Link, G - IN_SYNC, H - OUT_OF_SYNC
I - Collection enabled, J - Collection disabled, K - Distribution enabled
L - Distribution disabled, M - Partner Defaulted, N - Partner Non-defaulted,
O - Receiver is in expired state, P - Receiver is not in expired state

Port Te 1/2 is enabled, LACP is enabled and mode is lacp
Port State: Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEGIKNP Key 1 Priority 32768
  Partner Admin: State BDFHJLMP Key 0 Priority 0
           Oper: State ADEGIKNP Key 1 Priority 65535

Port Te 1/7 is enabled, LACP is enabled and mode is lacp
Port State: Bundle
  Actor   Admin: State ACEHJLMP Key 1 Priority 32768
           Oper: State ACEGIKNP Key 1 Priority 32768
  Partner Admin: State BDFHJLMP Key 0 Priority 0
           Oper: State ADEGIKNP Key 1 Priority 65535
Chapter 7. Deploying different bond modes with OVS DPDK
Use this procedure to deploy different bond modes with OVS DPDK in Red Hat OpenStack Platform.
7.1. Solution
Make the following changes to the compute.yaml environment file. Note that this example also sets the MTU value to 2000.
(...)
  - type: ovs_user_bridge
    name: br-link
    mtu: 2000
    use_dhcp: false
    members:
      - type: ovs_dpdk_bond
        name: dpdkbond0
        ovs_options: "bond_mode=balance-slb"
        mtu: 2000
        ovs_extra:
          - set interface dpdk0 mtu_request=$MTU
          - set interface dpdk1 mtu_request=$MTU
        members:
          - type: ovs_dpdk_port
            name: dpdk0
            members:
              - type: interface
                name: p1p2
          - type: ovs_dpdk_port
            name: dpdk1
            members:
              - type: interface
                name: p1p1
(...)
Deploy or redeploy the Overcloud with the template changes made above. When complete, perform the following steps on an Overcloud node.
Verify the os-net-config configuration:
cat /etc/os-net-config/config.json | python -m json.tool
(...)
        {
            "members": [
                {
                    "members": [
                        {
                            "members": [
                                {
                                    "name": "p1p2",
                                    "type": "interface"
                                }
                            ],
                            "name": "dpdk0",
                            "type": "ovs_dpdk_port"
                        },
                        {
                            "members": [
                                {
                                    "name": "p1p1",
                                    "type": "interface"
                                }
                            ],
                            "name": "dpdk1",
                            "type": "ovs_dpdk_port"
                        }
                    ],
                    "mtu": 2000,
                    "name": "dpdkbond0",
                    "ovs_extra": [
                        "set interface dpdk0 mtu_request=$MTU",
                        "set interface dpdk1 mtu_request=$MTU"
                    ],
                    "ovs_options": "bond_mode=balance-slb",
                    "type": "ovs_dpdk_bond"
                }
            ],
            "mtu": 2000,
            "name": "br-link",
            "type": "ovs_user_bridge",
            "use_dhcp": false
        },
(...)
Verify the bond:
[root@overcloud-compute-0 ~]# ovs-appctl bond/show dpdkbond0
---- dpdkbond0 ----
bond_mode: balance-slb
bond may use recirculation: no, Recirc-ID : -1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
next rebalance: 9221 ms
lacp_status: off
active slave mac: a0:36:9f:e5:da:82(dpdk1)

slave dpdk0: enabled
    may_enable: true

slave dpdk1: enabled
    active slave
    may_enable: true
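To confirm that the deployed bond mode matches the value set in ovs_options, the bond_mode line of the bond/show output can be compared with the expected mode. The following is a minimal sketch; the function name check_bond_mode is made up for this example, and it assumes that bond_mode: starts at the beginning of its own line, as in the output above.

```shell
#!/bin/sh
# check_bond_mode is a hypothetical helper. It reads "ovs-appctl bond/show"
# output on standard input and compares bond_mode with the expected value.
check_bond_mode() (
    expected=$1
    # Take the value after "bond_mode:" from the first matching line.
    actual=$(awk -F': *' '$1 == "bond_mode" { print $2; exit }')
    if [ "$actual" = "$expected" ]; then
        echo "OK: bond_mode is $actual"
    else
        echo "MISMATCH: expected $expected, found ${actual:-nothing}"
        exit 1
    fi
)

# Usage on an Overcloud node (not executed here):
#   ovs-appctl bond/show dpdkbond0 | check_bond_mode balance-slb
```

The function exits with a non-zero status on a mismatch, so it can be used in a post-deployment check script.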
Chapter 8. Receiving the Could not open network device dpdk0 (No such device) in ovs-vsctl show message
8.1. Symptom
The output of ovs-vsctl show contains the message Could not open network device dpdk0 (No such device). Is the NIC in use supported in Red Hat OpenStack Platform? Is there a certified or supported hardware list for OSP with DPDK or SR-IOV?
8.2. Diagnosis
Red Hat only supports a subset of the Poll Mode Drivers (PMDs) listed in DPDK Supported Hardware.
PMDs that Red Hat does not support were disabled in the following openvswitch package release:
[root@overcloud-compute-0 ~]# rpm -q --changelog openvswitch | head -n9 | tail -n3
* Tue Aug 22 2017 Aaron Conole <aconole@redhat.com> - 2.6.1-14.git20161206
- Disable unsupported PMDs (#1482679)
  - software and hardware PMDs audited by the team
Upstream PMDs may have security and/or performance issues. For these reasons, a PMD needs to go through significant testing to pass Red Hat’s qualification tests.
Note that the list in /usr/share/doc/openvswitch-<version>/README.DPDK-PMDS shows enabled PMDs and that only a subset of these PMDs may actually be supported by Red Hat. Poll Mode Drivers not listed in README.DPDK-PMDS are not supported.
8.3. Solution
You can retrieve a list of enabled PMDs for the currently installed version of OVS in /usr/share/doc/openvswitch-<version>/README.DPDK-PMDS. The following example shows the enabled PMDs for openvswitch-2.6.1:
[root@overcloud-compute-0 ~]# cat /usr/share/doc/openvswitch-2.6.1/README.DPDK-PMDS
DPDK drivers included in this package:

E1000
ENIC
I40E
IXGBE
RING
VIRTIO

For further information about the drivers, see
http://dpdk.org/doc/guides-16.11/nics/index.html
This example shows the enabled PMDs for openvswitch-2.9.0:
[root@undercloud-r430 ~]# cat /usr/share/doc/openvswitch-2.9.0/README.DPDK-PMDS
DPDK drivers included in this package:

BNXT
E1000
ENIC
FAILSAFE
I40E
IXGBE
MLX4
MLX4_GLUE
MLX5
MLX5_GLUE
NFP
RING
SOFTNIC
VIRTIO

For further information about the drivers, see
http://dpdk.org/doc/guides-17.11/nics/index.html
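As an illustration, the following sketch checks whether a given driver name appears in a README.DPDK-PMDS file. The function name pmd_is_listed is an assumption made for this example. Keep in mind that a listed PMD is enabled in the package, which does not by itself mean that Red Hat supports it.

```shell
#!/bin/sh
# pmd_is_listed is a hypothetical helper.
#   $1: driver name, for example IXGBE (matched case-insensitively)
#   $2: path to a README.DPDK-PMDS file
# Returns 0 when the name appears as a whole word in the file.
pmd_is_listed() {
    # Put every whitespace-separated word on its own line, then require an
    # exact whole-line match so that MLX4 does not match MLX4_GLUE.
    tr -s '[:space:]' '\n' < "$2" | grep -qix -- "$1"
}

# Usage on a node with openvswitch installed (not executed here):
#   pmd_is_listed ixgbe /usr/share/doc/openvswitch-2.9.0/README.DPDK-PMDS \
#       && echo "enabled in this package"
```

A driver that is not found in the file is not enabled in the installed package, which is one possible reason for the No such device message.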
Chapter 9. Insufficient Free Host Memory Pages Available to Allocate Guest RAM with Open vSwitch DPDK
9.1. Symptom
When spawning an instance and scheduling it onto a compute node which still has sufficient pCPUs for the instance and also sufficient free huge pages for the instance memory, nova returns:
[stack@undercloud-4 ~]$ nova show 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc
(...)
| fault | {"message": "Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc. Last exception: internal error: process exited while connecting to monitor: 2017-11-23T19:53:20.311446Z qemu-kvm: -chardev pty,id=cha", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 492, in build_instances |
|       |     filter_properties, instances[0].uuid) |
|       |   File \"/usr/lib/python2.7/site-packages/nova/scheduler/utils.py\", line 184, in populate_retry |
|       |     raise exception.MaxRetriesExceeded(reason=msg) |
|       | ", "created": "2017-11-23T19:53:22Z"} |
(...)
In addition, /var/log/nova/nova-compute.log on the compute node contains the following ERROR message:
2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] 2017-11-23T19:53:20.477183Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/7-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM
Additionally, libvirt creates the following log file:
[root@overcloud-compute-1 qemu]# cat instance-00000006.log 2017-11-23 19:53:02.145+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-08-22-08:54:01, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu- kvm-rhev-2.9.0-10.el7), hostname: overcloud-compute-1.localdomain LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000006,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain- 5-instance-00000006/master-key.aes -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off -cpu SandyBridge,vme=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/5 -instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.8-5.el7ost,serial=4f88fcca-0cd3-4e19-8dc4-4436a54daff8,uuid=1b72e7a1-c298-4c92-8d2c -0a9fe886e9bc,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt /qemu/domain-5-instance-00000006/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3 -usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc /disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive- virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhu9758ef15-d2 -netdev vhost- user,chardev=charnet0,id=hostnet0 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:d6:89:65,bus=pci.0,addr=0x3 -add-fd set=0,fd=29 -chardev file,id=charserial0,path=/dev/fdset/0,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.2.8:2 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon- pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 2017-11-23T19:53:03.217386Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1) 2017-11-23T19:53:03.359799Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt /qemu/5-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM 2017-11-23 19:53:03.630+0000: shutting down, reason=failed 2017-11-23 19:53:10.052+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.3 (Red Hat, Inc. 
<http://bugzilla.redhat.com/bugzilla>, 2017-08-22-08:54:01, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu- kvm-rhev-2.9.0-10.el7), hostname: overcloud-compute-1.localdomain LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000006,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain- 6-instance-00000006/master-key.aes -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off -cpu SandyBridge,vme=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/6- instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.8-5.el7ost,serial=4f88fcca-0cd3-4e19-8dc4-4436a54daff8,uuid=1b72e7a1-c298-4c92-8d2c- 0a9fe886e9bc,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt /qemu/domain-6-instance-00000006/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3 -usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc /disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive- virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhu9758ef15-d2 -netdev vhost- user,chardev=charnet0,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:d6:89:65,bus=pci.0,addr=0x3 -add-fd set=0,fd=29 -chardev file,id=charserial0,path=/dev/fdset/0,append=on -device 
isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.2.8:2 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon- pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 2017-11-23T19:53:11.466399Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1) 2017-11-23T19:53:11.729226Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt /qemu/6-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM 2017-11-23 19:53:12.159+0000: shutting down, reason=failed 2017-11-23 19:53:19.370+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-08-22-08:54:01, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu- kvm-rhev-2.9.0-10.el7), hostname: overcloud-compute-1.localdomain LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000006,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain- 7-instance-00000006/master-key.aes -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off -cpu SandyBridge,vme=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/7- instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.8-5.el7ost,serial=4f88fcca-0cd3-4e19-8dc4-4436a54daff8,uuid=1b72e7a1-c298-4c92-8d2c -0a9fe886e9bc,family=Virtual Machine' 
-no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt /qemu/domain-7-instance-00000006/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3- usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc /disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive- virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhu9758ef15-d2 -netdev vhost- user,chardev=charnet0,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:d6:89:65,bus=pci.0,addr=0x3 -add-fd set=0,fd=29 -chardev file,id=charserial0,path=/dev/fdset/0,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.2.8:2 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon- pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 2017-11-23T19:53:20.311446Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1) 2017-11-23T19:53:20.477183Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt /qemu/7-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM 2017-11-23 19:53:20.724+0000: shutting down, reason=failed
9.2. Diagnosis
Without additional settings, nova does not know that a certain amount of hugepage memory is used by other processes. Out of the box, nova assumes that all hugepage memory is available for instances. Nova by default will first fill up NUMA node 0 if it believes that there are still free pCPUs and free hugepage memory on this NUMA node. This issue happens when:
- The requested pCPUs still fit into NUMA 0
- The combined memory of all existing instances plus the memory of the instance to be spawned still fits into NUMA node 0
- Another process such as OVS holds a certain amount of hugepage memory on NUMA node 0
9.2.1. Diagnostic Steps
On a hypervisor with 2MB hugepages and 512 free hugepages per NUMA node:
[root@overcloud-compute-1 ~]# cat /sys/devices/system/node/node*/meminfo | grep -i huge Node 0 AnonHugePages: 2048 kB Node 0 HugePages_Total: 1024 Node 0 HugePages_Free: 512 Node 0 HugePages_Surp: 0 Node 1 AnonHugePages: 2048 kB Node 1 HugePages_Total: 1024 Node 1 HugePages_Free: 512 Node 1 HugePages_Surp: 0
And with the following NUMA architecture:
[root@overcloud-compute-1 nova]# lscpu | grep -i NUMA NUMA node(s): 2 NUMA node0 CPU(s): 0-3 NUMA node1 CPU(s): 4-7
Where OVS reserves 512MB of hugepages per NUMA node:
[root@overcloud-compute-1 virt]# ovs-vsctl list Open_vSwitch | grep mem other_config : {dpdk-init="true", dpdk-lcore-mask="3", dpdk-socket-mem="512,512", pmd-cpu-mask="1e"}
Spawn instances with the following flavor (1 vCPU and 512 MB of memory):
[stack@undercloud-4 ~]$ nova flavor-show m1.tiny +----------------------------+-------------------------------------------------------------+ | Property | Value | +----------------------------+-------------------------------------------------------------+ | OS-FLV-DISABLED:disabled | False | | OS-FLV-EXT-DATA:ephemeral | 0 | | disk | 8 | | extra_specs | {"hw:cpu_policy": "dedicated", "hw:mem_page_size": "large"} | | id | 49debbdb-c12e-4435-97ef-f575990b352f | | name | m1.tiny | | os-flavor-access:is_public | True | | ram | 512 | | rxtx_factor | 1.0 | | swap | | | vcpus | 1 | +----------------------------+-------------------------------------------------------------+
The new instance will boot and will use memory from NUMA 1:
[stack@undercloud-4 ~]$ nova list | grep d98772d1-119e-48fa-b1d9-8a68411cba0b | d98772d1-119e-48fa-b1d9-8a68411cba0b | cirros-test0 | ACTIVE | - | Running | provider1=2000:10::f816:3eff:fe8d:a6ef, 10.0.0.102 |
[root@overcloud-compute-1 nova]# cat /sys/devices/system/node/node*/meminfo | grep -i huge Node 0 AnonHugePages: 2048 kB Node 0 HugePages_Total: 1024 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 AnonHugePages: 2048 kB Node 1 HugePages_Total: 1024 Node 1 HugePages_Free: 256 Node 1 HugePages_Surp: 0
nova boot --nic net-id=$NETID --image cirros --flavor m1.tiny --key-name id_rsa cirros-test0
The following instance fails to boot:
[stack@undercloud-4 ~]$ nova list +--------------------------------------+--------------+--------+------------+-------------+-----------------------------------------+ | ID | Name | Status | Task State | Power State | Networks | +--------------------------------------+--------------+--------+------------+-------------+-----------------------------------------+ | 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc | cirros-test0 | ERROR | - | NOSTATE | | | a44c43ca-49ad-43c5-b8a1-543ed8ab80ad | cirros-test0 | ACTIVE | - | Running | provider1=2000:10::f816:3eff:fe0f:565b, 10.0.0.105 | | e21ba401-6161-45e6-8a04-6c45cef4aa3e | cirros-test0 | ACTIVE | - | Running | provider1=2000:10::f816:3eff:fe69:18bd, 10.0.0.111 | +--------------------------------------+--------------+--------+------------+-------------+-----------------------------------------+
From the compute node, we can see that free hugepages on NUMA Node 0 are exhausted, whereas in theory there’s still enough space on NUMA node 1:
[root@overcloud-compute-1 qemu]# cat /sys/devices/system/node/node*/meminfo | grep -i huge Node 0 AnonHugePages: 2048 kB Node 0 HugePages_Total: 1024 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 AnonHugePages: 2048 kB Node 1 HugePages_Total: 1024 Node 1 HugePages_Free: 512 Node 1 HugePages_Surp: 0
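The per-node counters above can be converted into free hugepage memory, which makes it easier to compare against the flavor's RAM. The following is a minimal sketch, not part of the original procedure; the function name `free_hugepage_mb` is hypothetical, and it assumes a single hugepage size per node (2048 kB in this example).

```shell
#!/bin/bash
# free_hugepage_mb: read per-node meminfo lines on stdin and print the free
# hugepage memory per NUMA node in MB.
# PAGE_KB is the hugepage size in kB (assumption: one page size per node,
# 2048 for the 2MB pages used in this example).
free_hugepage_mb() {
    local page_kb=${1:-2048}
    awk -v kb="$page_kb" \
        '$3 == "HugePages_Free:" {printf "node%s %d MB\n", $2, $4 * kb / 1024}'
}

# Usage on the compute node:
#   cat /sys/devices/system/node/node*/meminfo | free_hugepage_mb 2048
```

With the values shown above, node 0 reports 0 MB free and node 1 reports 1024 MB free, so a 512 MB instance that nova pins to node 0 cannot be backed by hugepages there.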
/var/log/nova/nova-compute.log reveals that the instance CPU shall be pinned to NUMA node 0:
<name>instance-00000006</name> <uuid>1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc</uuid> <metadata> <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0"> <nova:package version="14.0.8-5.el7ost"/> <nova:name>cirros-test0</nova:name> <nova:creationTime>2017-11-23 19:53:00</nova:creationTime> <nova:flavor name="m1.tiny"> <nova:memory>512</nova:memory> <nova:disk>8</nova:disk> <nova:swap>0</nova:swap> <nova:ephemeral>0</nova:ephemeral> <nova:vcpus>1</nova:vcpus> </nova:flavor> <nova:owner> <nova:user uuid="5d1785ee87294a6fad5e2bdddd91cc20">admin</nova:user> <nova:project uuid="8c307c08d2234b339c504bfdd896c13e">admin</nova:project> </nova:owner> <nova:root type="image" uuid="6350211f-5a11-4e02-a21a-cb1c0d543214"/> </nova:instance> </metadata> <memory unit='KiB'>524288</memory> <currentMemory unit='KiB'>524288</currentMemory> <memoryBacking> <hugepages> <page size='2048' unit='KiB' nodeset='0'/> </hugepages> </memoryBacking> <vcpu placement='static'>1</vcpu> <cputune> <shares>1024</shares> <vcpupin vcpu='0' cpuset='2'/> <emulatorpin cpuset='2'/> </cputune> <numatune> <memory mode='strict' nodeset='0'/> <memnode cellid='0' mode='strict' nodeset='0'/> </numatune>
Also look at the nodeset='0' in the numatune section, which indicates that memory shall be claimed from NUMA 0.
9.3. Solution
Nova includes a feature which allows administrators to let it know how much hugepage memory is consumed by anything other than instances.
[root@overcloud-compute-1 virt]# grep reserved_huge /etc/nova/nova.conf -B1
[DEFAULT]
reserved_huge_pages=node:0,size:2048,count:512
reserved_huge_pages=node:1,size:2048,count:512
Set the size to the hugepage size. The default unit is KiB, so use 2048 for 2MB hugepages, or state the unit explicitly, for example 1GB for 1GB hugepages. Set the count to the number of hugepages that are used by OVS per NUMA node. For example, for 4096 MB of socket memory per NUMA node used by Open vSwitch with 1GB hugepages, set this to:
[DEFAULT]
reserved_huge_pages=node:0,size:1GB,count:4
reserved_huge_pages=node:1,size:1GB,count:4
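The arithmetic behind the count can be sketched as follows: divide the socket memory in MB by the hugepage size, per NUMA node. This is an illustration only; the function name `reserved_huge_pages_lines` is hypothetical, and it assumes that the socket memory is given in MB and the page size in KiB. It only accounts for the socket memory. If other processes hold hugepages, or OVS consumes more than its socket memory, compare the result with the per-node meminfo counters and use the larger figure.

```shell
#!/bin/bash
# reserved_huge_pages_lines: print one reserved_huge_pages line per NUMA node.
#   $1  comma-separated socket memory in MB, one value per node (e.g. 4096,4096)
#   $2  hugepage size in KiB (2048 for 2MB pages, 1048576 for 1GB pages)
# The count is rounded up so that partial pages are still reserved.
reserved_huge_pages_lines() {
    local socket_mem=$1 page_kb=$2 node=0 mb
    local IFS=','
    for mb in $socket_mem; do
        echo "reserved_huge_pages=node:${node},size:${page_kb},count:$(( (mb * 1024 + page_kb - 1) / page_kb ))"
        node=$((node + 1))
    done
}

# Usage on the compute node (reads the value configured in Open vSwitch):
#   socket_mem=$(ovs-vsctl get Open_vSwitch . other_config:dpdk-socket-mem | tr -d '"')
#   reserved_huge_pages_lines "$socket_mem" 1048576
```

For 4096 MB per node and 1GB pages this yields a count of 4 per node, which matches the example above; the size is printed in KiB rather than as `1GB`.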
See How to set reserved_huge_pages in /etc/nova/nova.conf in Red Hat OpenStack Platform 10 for details about how to implement this with OpenStack Director.
This option is undocumented in Red Hat OpenStack Platform 10: OpenStack nova.conf - configuration options
In Red Hat OpenStack Platform 11, this is documented here: OpenStack nova.conf - configuration options
reserved_huge_pages = None
(Unknown) Number of huge/large memory pages to reserved per NUMA host cell.
Possible values: A list of valid key=value which reflect NUMA node ID, page size (Default unit is KiB) and number of pages to be reserved.
reserved_huge_pages = node:0,size:2048,count:64
reserved_huge_pages = node:1,size:1GB,count:1
In this example we are reserving on NUMA node 0 64 pages of 2MiB and on NUMA node 1 1 page of 1GiB.
With debug enabled in /etc/nova/nova.conf, you should see the following in the logs after a restart of openstack-nova-compute:
[root@overcloud-compute-1 virt]# systemctl restart openstack-nova-compute
(...)
[root@overcloud-compute-1 virt]# grep reserved_huge_pages /var/log/nova/nova-compute.log | tail -n1
2017-12-19 17:56:40.727 26691 DEBUG oslo_service.service [req-e681e97d-7d99-4ba8-bee7-5f7a3f655b21 - - - - -] reserved_huge_pages = [{'node': '0', 'count': '512', 'size': '2048'}, {'node': '1', 'count': '512', 'size': '2048'}] log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2622
[root@overcloud-compute-1 virt]#
Chapter 10. Troubleshoot OVS DPDK PMD CPU Usage with perf and Collect and Send the Troubleshooting Data
10.1. Solution
10.1.1. Prerequisites
Use the steps in this section to install the components you need to perform troubleshooting.
Install perf on the compute node:
yum install perf -y
Install Open vSwitch debug RPMs:
subscription-manager repos --enable=rhel-7-server-openstack-10-debug-rpms
Install sysstat (needed for the pidstat command):
yum install sysstat -y
10.2. Diagnosis
Use the steps in this section to perform troubleshooting and collect the data.
10.2.1. PMD Threads
Determine where the PMD threads are running:
IFS=$'\n' ; for l in $(ps -T -p `pidof ovs-vswitchd` | grep pmd);do PID=`echo $l | awk '{print $2}'`; PMD=`echo $l | awk '{print $NF}'` ; PCPU=`taskset -c -p $PID | awk '{print $NF}'` ; echo "$PMD with PID $PID in on pCPU $PCPU"; done
For example:
[root@overcloud-compute-1 ~]# IFS=$'\n' ; for l in $(ps -T -p `pidof ovs-vswitchd` | grep pmd);do PID=`echo $l | awk '{print $2}'`; PMD=`echo $l | awk '{print $NF}'` ; PCPU=`taskset -c -p $PID | awk '{print $NF}'` ; echo "$PMD with PID $PID in on pCPU $PCPU"; done pmd545 with PID 412314 in on pCPU 2 pmd555 with PID 412315 in on pCPU 4 pmd550 with PID 412316 in on pCPU 6 pmd551 with PID 412317 in on pCPU 8 pmd553 with PID 412318 in on pCPU 22 pmd554 with PID 412319 in on pCPU 24 pmd549 with PID 412320 in on pCPU 26 pmd556 with PID 412321 in on pCPU 28 pmd546 with PID 412322 in on pCPU 3 pmd548 with PID 412323 in on pCPU 5 pmd547 with PID 412324 in on pCPU 23 pmd552 with PID 412325 in on pCPU 25
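As an alternative to the one-liner above, the same mapping can be read from a single `ps` call. This is a sketch under stated assumptions, not the documented procedure: the function name `pmd_threads` is hypothetical, and the `psr` column reports the CPU a thread last ran on rather than its affinity mask. For PMD threads that are pinned to a single pCPU the two are normally identical.

```shell
#!/bin/bash
# pmd_threads: read "SPID PSR COMMAND" lines on stdin and print one line per
# PMD thread. PSR is the processor the thread last ran on (not the affinity).
pmd_threads() {
    awk '$3 ~ /^pmd/ {printf "%s with PID %s last ran on pCPU %s\n", $3, $1, $2}'
}

# Usage on the compute node:
#   ps -T -o spid,psr,comm -p "$(pidof ovs-vswitchd)" | pmd_threads
```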
While reproducing the issue, run perf record and perf report and save the output. If this is for a Red Hat support ticket, provide the output of the following commands including the timestamps.
Create the script gather_perf_data_a.sh:
cat<<'EOF'>gather_perf_data_a.sh
#!/bin/bash -x
IFS=$'\n' ;
dir_name=/tmp/perf_record_a
mkdir -p ${dir_name}
rm -f ${dir_name}/*

# Record which PMD thread runs on which pCPU
for l in $(ps -T -p `pidof ovs-vswitchd` | grep pmd);do PID=`echo $l | awk '{print $2}'`; PMD=`echo $l | awk '{print $NF}'` ; PCPU=`taskset -c -p $PID | awk '{print $NF}'` ; echo "$PMD with PID $PID in on pCPU $PCPU"; done > ${dir_name}/pmds.txt

# First pass: record each PMD pCPU for 60 seconds with call graphs (-g)
for l in $(ps -T -p `pidof ovs-vswitchd` | grep pmd);do
  PID=`echo $l | awk '{print $2}'`;
  PMD=`echo $l | awk '{print $NF}'` ;
  PCPU=`taskset -c -p $PID | awk '{print $NF}'` ;
  echo "$PMD with PID $PID in on pCPU $PCPU";
  date
  perf record -C $PCPU -g -o perf_record_-g_$PCPU sleep 60 &
done

sleep 80

# Second pass: record each PMD pCPU for 60 seconds without call graphs
for l in $(ps -T -p `pidof ovs-vswitchd` | grep pmd);do
  PID=`echo $l | awk '{print $2}'`;
  PMD=`echo $l | awk '{print $NF}'` ;
  PCPU=`taskset -c -p $PID | awk '{print $NF}'` ;
  echo "$PMD with PID $PID in on pCPU $PCPU";
  date
  perf record -C $PCPU -o perf_record_$PCPU sleep 60 &
done

sleep 80

# Convert the recordings into text reports and remove the raw data
for f in perf_record_-g_*;do
  perf report -g -i $f | cat > ${dir_name}/perf_report_$f.txt ;
  rm -f $f
done

for f in perf_record_*;do
  perf report -i $f | cat > ${dir_name}/perf_report_$f.txt ;
  rm -f $f
done

archive_name="${dir_name}_`hostname`_`date '+%F_%H%M%S'`.tar.gz"
tar -czf $archive_name ${dir_name}
echo "Archived all data in archive ${archive_name}"
EOF
Execute the script:
chmod +x gather_perf_data_a.sh
./gather_perf_data_a.sh
If this is for a case that was opened with Red Hat support, attach the resulting tar archive to the case.
10.2.2. Additional Data
Create the script gather_perf_data_b.sh to collect additional data:
cat<<'EOF'>gather_perf_data_b.sh
#!/bin/bash -x
dir_name=/tmp/perf_record_b
mkdir -p ${dir_name}
rm -f ${dir_name}/*

# CPU usage per thread, sampled in the background while perf records
date > ${dir_name}/pidstat1.txt
pidstat -u -t -p `pidof ovs-vswitchd`,`pidof ovsdb-server` 5 12 >> ${dir_name}/pidstat1.txt &
perf record -p `pidof ovs-vswitchd` -g --call-graph dwarf sleep 60

sleep 20

# CPU usage per thread, one-second samples for 60 seconds
date > ${dir_name}/pidstat2.txt
pidstat -u -t -p `pidof ovs-vswitchd`,`pidof ovsdb-server` 1 60 >> ${dir_name}/pidstat2.txt

mv perf.data perf.data_openvswitch

# One report per thread ID, without and with call graphs
perf script -F tid -i perf.data_openvswitch | sort -u | grep -o '[0-9]*' | xargs -n1 -I{} perf report -i perf.data_openvswitch --no-children --percentage relative --stdio --tid {} -g none > ${dir_name}/perf_reports.txt
perf script -F tid -i perf.data_openvswitch | sort -u | grep -o '[0-9]*' | xargs -n1 -I{} perf report -i perf.data_openvswitch --no-children --percentage relative --stdio --tid {} > ${dir_name}/perf_reports_callgraph.txt

rm -f perf.data_openvswitch

archive_name="${dir_name}_`hostname`_`date '+%F_%H%M%S'`.tar.gz"
tar -czf $archive_name ${dir_name}
echo "Archived all data in archive ${archive_name}"
EOF
Execute the script:
chmod +x gather_perf_data_b.sh
./gather_perf_data_b.sh
Note: Make sure that there is sufficient disk space. The perf.data file can easily take up several gigabytes of disk space.
If this is for a Red Hat support ticket, attach the resulting tar archive to the case.
10.2.3. Open vSwitch Logs
Provide all Open vSwitch logs. Make sure that /var has sufficient disk space. Use df -h and du -sh /var/log/openvswitch to determine both the total size of OVS logs and free disk space on /var.
tar -cvzf /var/openvswitch_`hostname`_`date +"%F_%H%M%S"`.tar.gz /var/log/openvswitch
Attach the resulting file, for example, /var/openvswitch_overcloud-compute-0_2018-02-27_153713.tar.gz, to the support case for analysis.
Generate and provide an sosreport. Make sure that /var has sufficient disk space. Use df -h to determine free disk space on /var.
sosreport --batch --all-logs
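The disk space checks in this section can be scripted before creating the archive. The following is a minimal sketch; the function name `has_room` and the 10% safety margin are assumptions for illustration. It compares against the uncompressed log size, which is a conservative estimate because the tar archive is compressed.

```shell
#!/bin/bash
# has_room: succeed if AVAIL_KB is larger than NEEDED_KB plus a 10% margin
# (the margin is an arbitrary choice for this sketch).
has_room() {
    local avail_kb=$1 needed_kb=$2
    [ "$avail_kb" -gt $(( needed_kb + needed_kb / 10 )) ]
}

# Usage on the compute node:
#   avail=$(df -Pk /var | awk 'NR==2 {print $4}')
#   need=$(du -sk /var/log/openvswitch | awk '{print $1}')
#   has_room "$avail" "$need" && echo "enough space on /var" \
#       || echo "free up space on /var first"
```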
Chapter 11. Using virsh emulatorpin in virtual environments with NFV
Use this procedure to determine the impact of using virsh emulatorpin in virtual environments, particularly in Red Hat OpenStack Platform with NFV.
11.1. Symptom
This chapter describes the behavior of nova in Red Hat OpenStack Platform 10 and above, and how administrators should place qemu-kvm’s emulator threads to avoid spurious packet loss, especially when isolcpus is used.
In Red Hat OpenStack Platform 10, customers need a support exception to pin emulator threads. However, pinning emulator threads is strongly recommended by Red Hat in almost all NFV cases. Keeping emulator threads at the default values can significantly lower NFV throughput. Do not hesitate to open a ticket with Red Hat support and request a support exception.
11.2. Solution
Use the following topics to determine a solution.
11.2.1. qemu-kvm Emulator Threads
qemu-kvm emulator threads are any threads other than the ones running the actual vCPUs. See the following example.
[root@overcloud-compute-0 ~]# ps -Tp `pgrep -f instance-00000009` PID SPID TTY TIME CMD 364936 364936 ? 00:00:02 qemu-kvm 364936 364946 ? 00:00:00 qemu-kvm 364936 364952 ? 00:00:52 CPU 0/KVM 364936 364953 ? 00:00:26 CPU 1/KVM 364936 364954 ? 00:00:30 CPU 2/KVM 364936 364956 ? 00:00:00 vnc_worker
Thanks to the Linux CFS scheduler, emulator threads will normally float across the pCPUs that are defined in libvirt’s emulatorpin set.
In NFV contexts, emulator threads cause problems in combination with isolcpus because this kernel parameter excludes the isolated CPUs from the scheduler's load balancing, so CFS no longer distributes threads across them. In addition, even if isolcpus is not used, emulator threads may interrupt CPUs that are dedicated to packet processing within the instance and cause packet loss.
Examples of emulator threads include:
- qemu-kvm threads
- vnc_worker threads
- vhost-<qemu-kvm PID> kernel threads (when virtio-net is used with kernel networking on the hypervisor)
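The distinction between vCPU threads and emulator threads can be sketched as a filter over the `ps -T` output shown earlier in this section. This is an illustration only; the function name `emulator_threads` is hypothetical, and it assumes that vCPU threads are named `CPU <n>/KVM` as in the example (the instance was started with debug-threads=on). The vhost kernel threads are separate kernel threads and therefore do not appear in this listing.

```shell
#!/bin/bash
# emulator_threads: read `ps -T -p <qemu-kvm PID>` output on stdin and print
# the thread ID and name of every thread that is not a vCPU thread.
# Assumes the default column layout: PID SPID TTY TIME CMD.
emulator_threads() {
    awk 'NR > 1 && $5 != "CPU" {print $2, $5}'
}

# Usage on the compute node:
#   ps -Tp "$(pgrep -f instance-00000009)" | emulator_threads
```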
11.2.2. Default Behavior for Emulator Thread Pinning
By default, nova will configure an emulator thread pin set which spans the pCPUs assigned to all vCPUs. If isolcpus is not used and CFS can freely schedule the emulator threads, then this lets emulator threads float around freely among all pinned CPUs:
virsh dumpxml instance-0000001d (...) <vcpu placement='static'>4</vcpu> <cputune> <shares>4096</shares> <vcpupin vcpu='0' cpuset='34'/> <vcpupin vcpu='1' cpuset='14'/> <vcpupin vcpu='2' cpuset='10'/> <vcpupin vcpu='3' cpuset='30'/> <emulatorpin cpuset='10,14,30,34'/> </cputune> (...)
[root@overcloud-compute-0 ~]# virsh dumpxml instance-00000009 (...) <nova:vcpus>3</nova:vcpus> <vcpu placement='static'>3</vcpu> <vcpupin vcpu='0' cpuset='1'/> <vcpupin vcpu='1' cpuset='2'/> <vcpupin vcpu='2' cpuset='3'/> (...) <emulatorpin cpuset='1-3'/> (...)
This means that any of these CPUs can be preempted by qemu’s emulator threads. When these emulator threads wake up, they are scheduled on one of the CPUs of the vcpuset. If you do not repin and let the emulator threads float around freely, you risk spurious packet drop.
11.2.3. The Current Implementation for Emulator Thread Pinning in OpenStack nova (OpenStack Platform 10)
In Red Hat OpenStack Platform 10, there is no officially supported way to pin emulator threads. Temporarily, emulator threads can be moved to a set of pCPUs by using virsh emulatorpin (…) --live, as shown in the following example.
# Pin the emulator threads of instance-0000001d to pCPU 34
virsh emulatorpin instance-0000001d 34 --live
# Pin the emulator threads of instance-0000001d to pCPUs 32 and 34
virsh emulatorpin instance-0000001d 32,34 --live
These changes only last for the runtime of the instance. If the instance is stopped or rebooted, these changes are lost.
Permanent modifications require an external mechanism, such as a cron job, bash script or Ansible task. This has to be the subject of a support exception.
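One possible shape for such an external mechanism is sketched below. It is not a supported or official implementation: the function names are hypothetical, and applying one cpuset to every running instance is a simplification, because the right cpuset normally differs per instance. The functions only print the commands, so the result can be reviewed before it is applied.

```shell
#!/bin/bash
# emulatorpin_cmd: print the virsh command that pins the emulator threads
# of DOMAIN to CPUSET for the lifetime of the running instance.
emulatorpin_cmd() {
    local domain=$1 cpuset=$2
    echo "virsh emulatorpin ${domain} ${cpuset} --live"
}

# repin_all: read domain names on stdin (one per line, as printed by
# `virsh list --name`) and print one pin command per domain.
repin_all() {
    local cpuset=$1 domain
    while read -r domain; do
        [ -n "$domain" ] && emulatorpin_cmd "$domain" "$cpuset"
    done
    return 0
}

# Dry run (pCPU 34 is only an example value):
#   virsh list --name | repin_all 34
# Apply, for example from a cron job:
#   virsh list --name | repin_all 34 | sh
```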
11.2.4. Later Changes to OpenStack nova (OpenStack Platform 12 and Above) for Emulator Thread Pinning
In Red Hat OpenStack Platform 12 (Pike), the emulator threads can run on a dedicated pCPU. This is good for isolation and for virtual machines that run real time applications. However, this sacrifices one extra physical CPU just for the emulator threads. This extra physical CPU cannot be used for anything else.
There is an ongoing discussion for later versions of OpenStack nova, where you may be able to configure a mask that lets you choose a set of pCPUs that can be used for the emulator threads.
11.2.4.1. OSP 12
For information, see: Configure Emulator Threads to run on Dedicated Physical CPU
11.2.4.2. OSP 14
For details on the current progress of new features for emulator thread pinning, see Bug 1468004 and OpenStack Change 510897
At the time of this writing, the draft specified the following thread policies:
Valid THREAD-POLICY values are:
- ``share``: (default) The emulator threads float across the pCPUs associated to the guest. To place a workload's emulator threads on a set of isolated physical CPUs, set ``share`` and ``[compute]/cpu_shared_set`` configuration option to the set of host CPUs that should be used for best-effort CPU resources.
- ``isolate``: The emulator threads are isolated on a single pCPU.
11.2.5. About the Impact of isolcpus on Emulator Thread Scheduling
When isolcpus is used, the CFS scheduler does not balance threads across the isolated CPUs, and all emulator threads run on the lowest pCPU of the emulator pin set. As a consequence, without intervention or further configuration, one vCPU of the instance runs a high risk of resource contention with the emulator threads. This vCPU is prone to seeing high amounts of % steal.
Further details about this behavior can be found at Kernel.org Bugzilla – Bug 116701.
You can use a simple algorithm to determine which vCPU the emulator threads will overlap with:
PCPU=MIN([EMULATORPINSET])
VCPU=REVERSE_CPUSET(PCPU)
REVERSE_CPUSET := SELECT vcpu from `virsh dumpxml <instance name> | grep "cpuset=$PCPU"`
For example, in this instance, all emulator threads and children inherit affinity 1-3 from the default emulator pin set:
[root@overcloud-compute-0 ~]# taskset -a -c -p `pgrep -f instance-00000009` pid 364936's current affinity list: 1-3 pid 364946's current affinity list: 1-3 pid 364952's current affinity list: 1 pid 364953's current affinity list: 2 pid 364954's current affinity list: 3 pid 364956's current affinity list: 1-3 [root@overcloud-compute-0 ~]# ps -Tp `pgrep -f instance-00000009` PID SPID TTY TIME CMD 364936 364936 ? 00:00:02 qemu-kvm 364936 364946 ? 00:00:00 qemu-kvm 364936 364952 ? 00:00:51 CPU 0/KVM 364936 364953 ? 00:00:26 CPU 1/KVM 364936 364954 ? 00:00:30 CPU 2/KVM 364936 364956 ? 00:00:00 vnc_worker [root@overcloud-compute-0 ~]# pgrep -f vhost- | xargs -I {} taskset -a -c -p {} pid 364948's current affinity list: 1-3 pid 364949's current affinity list: 1-3 pid 364950's current affinity list: 1-3 [root@overcloud-compute-0 ~]#
In combination with isolcpus, all emulator threads and the vhost-* threads execute on pCPU1 and are never rescheduled:
cat /proc/sched_debug | sed '/^cpu#/,/^runnable/{//!d}' | grep vhost -C3 (...) cpu#1, 2099.998 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- watchdog/1 11 -2.995579 410285 0 0.000000 5025.887998 0.000000 0 / migration/1 12 0.000000 79 0 0.000000 3.375060 0.000000 0 / ksoftirqd/1 13 5172444.259776 54 120 0.000000 0.570500 0.000000 0 / kworker/1:0 14 5188475.472257 370 120 0.000000 14.707114 0.000000 0 / kworker/1:0H 15 8360.049510 10 100 0.000000 0.150151 0.000000 0 / kworker/1:1 2707 5045807.055876 16370 120 0.000000 793.611916 0.000000 0 / kworker/1:1H 2763 5187682.987749 11755 100 0.000000 191.949725 0.000000 0 / qemu-kvm 364936 3419.522791 50276 120 0.000000 2476.880384 0.000000 0 /machine.slice/machine-qemu\x2d6\x2dinstance\x2d00000009.scope/emulator qemu-kvm 364946 1270.815296 102 120 0.000000 23.204111 0.000000 0 /machine.slice/machine-qemu\x2d6\x2dinstance\x2d00000009.scope/emulator CPU 0/KVM 364952 52703.660314 53709 120 0.000000 52715.105472 0.000000 0 /machine.slice/machine-qemu\x2d6\x2dinstance\x2d00000009.scope/vcpu0 vnc_worker 364956 123.609634 1 120 0.000000 0.016849 0.000000 0 /machine.slice/machine-qemu\x2d6\x2dinstance\x2d00000009.scope/emulator vhost-364936 364948 3410.527677 1039 120 0.000000 84.254772 0.000000 0 /machine.slice/machine-qemu\x2d6\x2dinstance\x2d00000009.scope/emulator vhost-364936 364949 3407.341502 55 120 0.000000 2.894394 0.000000 0 /machine.slice/machine-qemu\x2d6\x2dinstance\x2d00000009.scope/emulator vhost-364936 364950 3410.395220 174 120 0.000000 10.969077 0.000000 0 /machine.slice/machine-qemu\x2d6\x2dinstance\x2d00000009.scope/emulator cpu#2, 2099.998 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- watchdog/2 16 -5.995418 410285 0 0.000000 
5197.571153 0.000000 0 / migration/2 17 0.000000 79 0 0.000000 3.384688 0.000000 0 / ksoftirqd/2 18 -7.031102 3 120 0.000000 0.019079 0.000000 0 / kworker/2:0 19 0.119413 39 120 0.000000 0.588589 0.000000 0 / kworker/2:0H 20 -1.047613 8 100 0.000000 0.086272 0.000000 0 / kworker/2:1 2734 1475469.236026 11322 120 0.000000 241.388582 0.000000 0 / CPU 1/KVM 364953 27258.370583 33294 120 0.000000 27269.017017 0.000000 0 /machine.slice/machine-qemu\x2d6\x2dinstance\x2d00000009.scope/vcpu1 cpu#3, 2099.998 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- watchdog/3 21 -5.996592 410285 0 0.000000 4970.777439 0.000000 0 / migration/3 22 0.000000 79 0 0.000000 3.886799 0.000000 0 / ksoftirqd/3 23 -7.035295 3 120 0.000000 0.014677 0.000000 0 / kworker/3:0 24 17.758583 38 120 0.000000 0.637152 0.000000 0 / kworker/3:0H 25 -1.047727 8 100 0.000000 0.077141 0.000000 0 / kworker/3:1 362530 154177.523420 83 120 0.000000 6.544285 0.000000 0 / CPU 2/KVM 364954 32456.061889 25966 120 0.000000 32466.719084 0.000000 0 /machine.slice/machine-qemu\x2d6\x2dinstance\x2d00000009.scope/vcpu2
11.2.6. Optimal Location of Emulator Threads
This section provides descriptions for placing emulator threads with:
- DPDK networking within the instance and netdev datapath in Open vSwitch
- DPDK networking within the instance and system datapath in Open vSwitch and kernel space networking on the hypervisor
- Kernel networking within the instance and netdev datapath in Open vSwitch
11.2.6.1. Optimal Placement of Emulator Threads with DPDK Networking Within the Instance and netdev datapath in Open vSwitch
In a scenario where DPDK runs within the instance, packet processing is done entirely in the user space. If recommended practices are followed, instance PMDs will run on CPUs 1 and above. vCPU0 remains for the OS and for interrupt handling. As the PMD CPUs within the instance run an active loop and need 100% of the CPU, they should not be preempted. Packet loss can occur if one of these vCPUs is preempted. Thus, the emulatorpin cpuset needs to be configured in such a way that it does not overlap with the physical CPUs that handle the virtual CPUs numbered 1 and above.
With DPDK networking within the instance, the optimal location for emulator threads is either the pCPU that is handling vCPU 0 or a dedicated physical CPU that is not handling any virtual CPUs at all.
If OVS-DPDK is used on the hypervisor and DPDK is used within the instance, prefer the physical CPU that handles vCPU 0.
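The pCPU that handles vCPU 0 can be read from the instance's libvirt XML. The following is a minimal sketch; the function name `vcpu0_pcpu` is hypothetical, and it assumes that vCPU 0 is pinned to exactly one pCPU, as in the examples in this chapter.

```shell
#!/bin/bash
# vcpu0_pcpu: read libvirt domain XML on stdin and print the cpuset that
# vCPU 0 is pinned to.
vcpu0_pcpu() {
    grep -o "<vcpupin vcpu='0' cpuset='[^']*'" | head -n1 | cut -d"'" -f4
}

# Usage on the compute node (the change lasts until the instance stops):
#   pcpu=$(virsh dumpxml instance-0000001d | vcpu0_pcpu)
#   virsh emulatorpin instance-0000001d "$pcpu" --live
```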
11.2.6.2. Optimal Placement of Emulator Threads with DPDK Networking Within the Instance and System datapath in Open vSwitch / Kernel Space Networking on the Hypervisor
In a scenario where DPDK runs within the instance, packet processing within the instance is done entirely in user space. If kernel space networking is used on the hypervisor, then packet processing on the hypervisor is executed within the kernel.
If recommended practices are followed, instance PMDs will run on vCPUs 1 and above. vCPU0 remains for the OS and for interrupt handling. As the PMD CPUs within the instance run an active loop and need 100% of the CPU, they should not be preempted. Packet loss can occur if one of these vCPUs is preempted. Thus, the emulatorpin cpuset needs to be configured in such a way that it does not overlap with the physical CPUs that handle the virtual CPUs numbered 1 and above.
With DPDK networking within the instance, the optimal location for emulator threads is either the pCPU that is handling vCPU 0, or a dedicated physical CPU that is not handling any virtual CPUs at all.
Note that in this scenario, packet processing for the vNIC queues is executed within vhost-<qemu-kvm PID> kernel threads of the hypervisor. Under high traffic, these kernel threads may generate a significant load. The optimal location of the emulator threads needs to be determined on a case by case basis.
[root@overcloud-compute-0 ~]# ps aux | grep vhost- root 364948 0.0 0.0 0 0 ? S 20:32 0:00 [vhost-364936] root 364949 0.0 0.0 0 0 ? S 20:32 0:00 [vhost-364936] root 364950 0.0 0.0 0 0 ? S 20:32 0:00 [vhost-364936]
11.2.6.3. Optimal Placement of Emulator Threads with Kernel Networking within the Instance and netdev datapath in Open vSwitch
With kernel networking within the instance, there are two options:
- Advanced optimization of interrupt distribution such as softirqs within the instance. In such a case, you do not have to sacrifice an additional pCPU for emulator threads and can tie the emulator threads to a pCPU that is not handling any network interrupts.
- Using a dedicated pCPU only for emulator threads. Place this pCPU on the same NUMA node as the vCPUs.
Due to the complexity of the first option, the second option is recommended.
11.3. Diagnosis
Use the procedures in this section to perform diagnosis.
11.3.1. The Demonstration Environment
The demonstration environment runs one instance: instance-0000001d. Its associated qemu-kvm process has the following PID:
[root@overcloud-compute-0 ~]# pidof qemu-kvm 73517
11.3.2. How Emulatorpin works
By default, Red Hat OpenStack Platform deploys the following settings:
virsh dumpxml instance-0000001d (...) <vcpu placement='static'>4</vcpu> <cputune> <shares>4096</shares> <vcpupin vcpu='0' cpuset='34'/> <vcpupin vcpu='1' cpuset='14'/> <vcpupin vcpu='2' cpuset='10'/> <vcpupin vcpu='3' cpuset='30'/> <emulatorpin cpuset='10,14,30,34'/> </cputune> (...)
This leads to an unpredictable allocation of the emulator threads, such as qemu-kvm, vnc_worker, and so on:
[root@overcloud-compute-0 ~]# ps -T -p 73517 PID SPID TTY TIME CMD 73517 73517 ? 00:00:00 qemu-kvm 73517 73527 ? 00:00:00 qemu-kvm 73517 73535 ? 00:00:06 CPU 0/KVM 73517 73536 ? 00:00:02 CPU 1/KVM 73517 73537 ? 00:00:03 CPU 2/KVM 73517 73538 ? 00:00:02 CPU 3/KVM 73517 73540 ? 00:00:00 vnc_worker [root@overcloud-compute-0 ~]# taskset -apc 73517 pid 73517's current affinity list: 10,14,30,34 pid 73527's current affinity list: 10,14,30,34 pid 73535's current affinity list: 34 pid 73536's current affinity list: 14 pid 73537's current affinity list: 10 pid 73538's current affinity list: 30 pid 73540's current affinity list: 10,14,30,34
[root@overcloud-compute-0 ~]# virsh vcpupin instance-0000001d | awk '$NF~/[0-9]+/ {print $NF}' | sort -n | while read CPU; do sed '/cpu#/,/runnable task/{//!d}' /proc/sched_debug | sed -n "/^cpu#${CPU},/,/^$/p" ; done cpu#10, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/10 64 0.000000 107 0 0.000000 90.232791 0.000000 0 / ksoftirqd/10 65 -13.045337 3 120 0.000000 0.004679 0.000000 0 / kworker/10:0 66 -12.892617 40 120 0.000000 0.157359 0.000000 0 / kworker/10:0H 67 -9.320550 8 100 0.000000 0.015065 0.000000 0 / kworker/10:1 17996 9695.675528 23 120 0.000000 0.222805 0.000000 0 / qemu-kvm 73517 1994.534332 27105 120 0.000000 886.203254 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator qemu-kvm 73527 722.347466 84 120 0.000000 18.236155 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator CPU 2/KVM 73537 3356.749162 18051 120 0.000000 3370.045619 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu2 vnc_worker 73540 354.007735 1 120 0.000000 0.047002 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator worker 74584 1970.499537 5 120 0.000000 0.130143 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator worker 74585 1970.492700 4 120 0.000000 0.071887 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator worker 74586 1982.467246 3 120 0.000000 0.033604 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator worker 74587 1994.520768 1 120 0.000000 0.076039 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator worker 74588 2006.500153 1 120 0.000000 0.004878 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator cpu#14, 2197.477 MHz runnable tasks: task 
PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/14 88 0.000000 107 0 0.000000 90.107596 0.000000 0 / ksoftirqd/14 89 -13.045376 3 120 0.000000 0.004782 0.000000 0 / kworker/14:0 90 -12.921990 40 120 0.000000 0.128166 0.000000 0 / kworker/14:0H 91 -9.321186 8 100 0.000000 0.016870 0.000000 0 / kworker/14:1 17999 6247.571171 5 120 0.000000 0.028576 0.000000 0 / CPU 1/KVM 73536 2274.381281 6679 120 0.000000 2287.691654 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu1 cpu#30, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/30 180 0.000000 107 0 0.000000 89.206960 0.000000 0 / ksoftirqd/30 181 -13.045892 3 120 0.000000 0.003828 0.000000 0 / kworker/30:0 182 -12.929272 40 120 0.000000 0.120754 0.000000 0 / kworker/30:0H 183 -9.321056 8 100 0.000000 0.018042 0.000000 0 / kworker/30:1 18012 6234.935501 5 120 0.000000 0.026505 0.000000 0 / CPU 3/KVM 73538 2474.183301 12595 120 0.000000 2487.479666 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu3 cpu#34, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/34 204 0.000000 107 0 0.000000 89.067908 0.000000 0 / ksoftirqd/34 205 -13.046824 3 120 0.000000 0.002884 0.000000 0 / kworker/34:0 206 -12.922407 40 120 0.000000 0.127423 0.000000 0 / kworker/34:0H 207 -9.320822 8 100 0.000000 0.017381 0.000000 0 / kworker/34:1 18016 10788.797590 7 120 0.000000 0.042631 0.000000 0 / CPU 0/KVM 73535 5969.227225 14233 120 0.000000 5983.425363 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu0
The emulator threads can be moved by using virsh emulatorpin:
virsh emulatorpin instance-0000001d 34
With this setting, the affinity for all non-CPU threads changes:
[root@overcloud-compute-0 ~]# ps -T -p 73517
  PID  SPID TTY          TIME CMD
73517 73517 ?        00:00:00 qemu-kvm
73517 73527 ?        00:00:00 qemu-kvm
73517 73535 ?        00:00:06 CPU 0/KVM
73517 73536 ?        00:00:02 CPU 1/KVM
73517 73537 ?        00:00:03 CPU 2/KVM
73517 73538 ?        00:00:02 CPU 3/KVM
73517 73540 ?        00:00:00 vnc_worker
[root@overcloud-compute-0 ~]# taskset -apc 73517
pid 73517's current affinity list: 34
pid 73527's current affinity list: 34
pid 73535's current affinity list: 34
pid 73536's current affinity list: 14
pid 73537's current affinity list: 10
pid 73538's current affinity list: 30
pid 73540's current affinity list: 34
Note that /proc/sched_debug contains historical data, so take the number of switches into account when you read it. In the following example, PID 73517 has already moved to cpu#34. The other emulator workers have not run since the last output and therefore still appear on cpu#10:
[root@overcloud-compute-0 ~]# virsh vcpupin instance-0000001d | awk '$NF~/[0-9]+/ {print $NF}' | sort -n | while read CPU; do sed '/cpu#/,/runnable task/{//!d}' /proc/sched_debug | sed -n "/^cpu#${CPU},/,/^$/p" ; done cpu#10, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/10 64 0.000000 107 0 0.000000 90.232791 0.000000 0 / ksoftirqd/10 65 -13.045337 3 120 0.000000 0.004679 0.000000 0 / kworker/10:0 66 -12.892617 40 120 0.000000 0.157359 0.000000 0 / kworker/10:0H 67 -9.320550 8 100 0.000000 0.015065 0.000000 0 / kworker/10:1 17996 9747.429082 26 120 0.000000 0.255547 0.000000 0 / qemu-kvm 73527 722.347466 84 120 0.000000 18.236155 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator CPU 2/KVM 73537 3424.520709 21610 120 0.000000 3437.817166 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu2 vnc_worker 73540 354.007735 1 120 0.000000 0.047002 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator cpu#14, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/14 88 0.000000 107 0 0.000000 90.107596 0.000000 0 / ksoftirqd/14 89 -13.045376 3 120 0.000000 0.004782 0.000000 0 / kworker/14:0 90 -12.921990 40 120 0.000000 0.128166 0.000000 0 / kworker/14:0H 91 -9.321186 8 100 0.000000 0.016870 0.000000 0 / kworker/14:1 17999 6247.571171 5 120 0.000000 0.028576 0.000000 0 / CPU 1/KVM 73536 2283.094453 7028 120 0.000000 2296.404826 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu1 cpu#30, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep 
---------------------------------------------------------------------------------------------------------- migration/30 180 0.000000 107 0 0.000000 89.206960 0.000000 0 / ksoftirqd/30 181 -13.045892 3 120 0.000000 0.003828 0.000000 0 / kworker/30:0 182 -12.929272 40 120 0.000000 0.120754 0.000000 0 / kworker/30:0H 183 -9.321056 8 100 0.000000 0.018042 0.000000 0 / kworker/30:1 18012 6234.935501 5 120 0.000000 0.026505 0.000000 0 / CPU 3/KVM 73538 2521.828931 14047 120 0.000000 2535.125296 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu3 cpu#34, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/34 204 0.000000 107 0 0.000000 89.067908 0.000000 0 / ksoftirqd/34 205 -13.046824 3 120 0.000000 0.002884 0.000000 0 / kworker/34:0 206 -12.922407 40 120 0.000000 0.127423 0.000000 0 / kworker/34:0H 207 -9.320822 8 100 0.000000 0.017381 0.000000 0 / kworker/34:1 18016 10788.797590 7 120 0.000000 0.042631 0.000000 0 / qemu-kvm 73517 2.613794 27706 120 0.000000 941.839262 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator CPU 0/KVM 73535 5994.533905 15169 120 0.000000 6008.732043 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu0
Note how thread 73517 moves to cpu#34. If you now interact with a VNC session, you can see that /proc/sched_debug shows the vnc_worker threads on cpu#34 as well.
[root@overcloud-compute-0 ~]# virsh vcpupin instance-0000001d | awk '$NF~/[0-9]+/ {print $NF}' | sort -n | while read CPU; do sed '/cpu#/,/runnable task/{//!d}' /proc/sched_debug | sed -n "/^cpu#${CPU},/,/^$/p" ; done cpu#10, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/10 64 0.000000 107 0 0.000000 90.232791 0.000000 0 / ksoftirqd/10 65 -13.045337 3 120 0.000000 0.004679 0.000000 0 / kworker/10:0 66 -12.892617 40 120 0.000000 0.157359 0.000000 0 / kworker/10:0H 67 -9.320550 8 100 0.000000 0.015065 0.000000 0 / kworker/10:1 17996 9963.300958 27 120 0.000000 0.273007 0.000000 0 / qemu-kvm 73527 722.347466 84 120 0.000000 18.236155 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator CPU 2/KVM 73537 3563.793234 26162 120 0.000000 3577.089691 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu2 cpu#14, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/14 88 0.000000 107 0 0.000000 90.107596 0.000000 0 / ksoftirqd/14 89 -13.045376 3 120 0.000000 0.004782 0.000000 0 / kworker/14:0 90 -12.921990 40 120 0.000000 0.128166 0.000000 0 / kworker/14:0H 91 -9.321186 8 100 0.000000 0.016870 0.000000 0 / kworker/14:1 17999 6247.571171 5 120 0.000000 0.028576 0.000000 0 / CPU 1/KVM 73536 2367.789075 9648 120 0.000000 2381.099448 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu1 cpu#30, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/30 180 0.000000 107 0 0.000000 89.206960 0.000000 0 / ksoftirqd/30 181 -13.045892 3 120 
0.000000 0.003828 0.000000 0 / kworker/30:0 182 -12.929272 40 120 0.000000 0.120754 0.000000 0 / kworker/30:0H 183 -9.321056 8 100 0.000000 0.018042 0.000000 0 / kworker/30:1 18012 6234.935501 5 120 0.000000 0.026505 0.000000 0 / CPU 3/KVM 73538 2789.628278 24788 120 0.000000 2802.924643 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu3 cpu#34, 2197.477 MHz runnable tasks: task PID tree-key switches prio wait-time sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- migration/34 204 0.000000 107 0 0.000000 89.067908 0.000000 0 / ksoftirqd/34 205 -13.046824 3 120 0.000000 0.002884 0.000000 0 / kworker/34:0 206 -12.922407 40 120 0.000000 0.127423 0.000000 0 / kworker/34:0H 207 -9.320822 8 100 0.000000 0.017381 0.000000 0 / kworker/34:1 18016 11315.391422 25 120 0.000000 0.196078 0.000000 0 / qemu-kvm 73517 471.930276 30975 120 0.000000 1295.543576 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator CPU 0/KVM 73535 6160.062172 19201 120 0.000000 6174.260310 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/vcpu0 vnc_worker 73540 459.653524 38 120 0.000000 7.535037 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator worker 78703 449.098251 2 120 0.000000 0.120313 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator worker 78704 449.131175 3 120 0.000000 0.066961 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator worker 78705 461.100994 4 120 0.000000 0.022897 0.000000 0 /machine.slice/machine-qemu\x2d1\x2dinstance\x2d0000001d.scope/emulator
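Because the switch counts matter when reading /proc/sched_debug, it can help to reduce the output to the emulator tasks only. The following is a minimal sketch under stated assumptions: the function name emulator_tasks is hypothetical, the file is assumed to list one task per line in the column layout shown above (task, PID, tree-key, switches, ...), and emulator threads are identified by a cgroup path that ends in /emulator.

```shell
#!/bin/bash
# Hypothetical helper: reads /proc/sched_debug style text on stdin and prints
# "<cpu> <pid> <switches>" for every task whose cgroup path ends in /emulator.
# Fields are counted from the end of the line so that task names containing
# spaces (for example "CPU 2/KVM") do not shift the columns.
emulator_tasks() {
    awk '
        /^cpu#/ { cpu = $1; sub(/,$/, "", cpu); next }
        NF >= 10 && $NF ~ /\/emulator$/ { print cpu, $(NF-8), $(NF-6) }
    '
}

# Usage on the compute node:
#   emulator_tasks < /proc/sched_debug
```

Running it twice and comparing the switch counts shows which emulator threads were scheduled in between, and on which CPU.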
Chapter 12. Mixing System (Kernel Space) and netdev (DPDK User Space) Datapath OVS Bridges
12.1. Symptom
Does Red Hat support mixing system (kernel space) and netdev (DPDK user space) datapath OVS bridges in Red Hat OpenStack Platform 10?
12.2. Solution
Together, the following issues prevent Red Hat from supporting the combination of datapath types netdev and system on the same compute host.
This situation may change in future versions of Red Hat OpenStack that may rely on plugins other than OVS ML2.
- Connecting different datapath type bridges in Open vSwitch
  On the same compute node, you can, theoretically, run interconnected bridges of both the datapath type system and the datapath type netdev. However, switching a frame between system and netdev bridges is extremely slow. The context switches, memory copies, and so on, that happen during a transition between user space and kernel space are much slower than switching operations that remain exclusively in either datapath type. Because of this, the two types of bridges must not be patched together with virtual patch cords.
- Neutron OVS ML2 and different datapath type bridges
  With Neutron's Open vSwitch ML2 Plugin, the virtual machine ports are connected to a single bridge named br-int. This bridge is either of the datapath type netdev, or of the datapath type system. This means that, in the end, any traffic flow configured through the OVS ML2 plugin will flow through br-int.
  With OVS ML2, there is no way to tell nova to connect a virtual machine port to a specific bridge, and all bridges are automatically connected to br-int through virtual patch cords by the plugin. This means that there is no way with OVS ML2 and nova to mix system and netdev datapath type bridges without impacting the performance significantly.
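To check whether a compute node already mixes both datapath types, the datapath_type column of the Bridge table can be inspected. The following is a minimal sketch, not a supported procedure: the function name datapath_mix is hypothetical, and it assumes that an empty datapath_type means the default system datapath and that ovs-vsctl is asked for bare CSV output as shown in the usage comment.

```shell
#!/bin/bash
# Hypothetical helper: reads "name,datapath_type" CSV lines on stdin and
# prints "mixed" if both system and netdev bridges are present, otherwise
# "uniform". An empty datapath_type is treated as "system" (assumption:
# this is the Open vSwitch default).
datapath_mix() {
    awk -F, '
        NF >= 1 { dp = ($2 == "" ? "system" : $2); seen[dp]++ }
        END {
            if (seen["system"] && seen["netdev"]) print "mixed"
            else print "uniform"
        }
    '
}

# Usage on the compute node:
#   ovs-vsctl --format=csv --data=bare --no-headings \
#       --columns=name,datapath_type list Bridge | datapath_mix
```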
Chapter 13. DPDK vhost-user port does not Send or Receive Traffic
13.1. Symptom
The DPDK vhost-user port does not send or receive any traffic and is listed as LINK_DOWN in ovs-ofctl show br-int in Red Hat OpenStack Platform 10.
In this example, the port state indicates LINK_DOWN, even though the instance is running:
[root@overcloud-compute-0 ~]# ovs-ofctl show br-int
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000a2752c2bf94a
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(int-br-link): addr:c6:87:d2:a8:f1:01
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vhub3fc898c-9c): addr:00:00:00:00:00:00
     config:     0
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
 4(vhost-user2): addr:00:00:00:00:00:00
     config:     0
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:a2:75:2c:2b:f9:4a
[root@overcloud-compute-0 ~]#
13.2. Solution
The instance needs to use hugepages for OVS-DPDK to work.
Use this procedure to configure an aggregate and a flavor: Create a Flavor and Deploy an Instance for OVS-DPDK.
For example:
Create an aggregate group, add a host, and set the CPU pinning and hugepage properties on the flavors:
openstack aggregate create --zone=dpdk dpdk
openstack aggregate add host dpdk overcloud-compute-0.localdomain
openstack flavor set --property hw:cpu_policy=dedicated --property hw:mem_page_size=large m1.tiny
openstack flavor set --property hw:cpu_policy=dedicated --property hw:mem_page_size=large m1.small
Ensure that the memory-backend-file exists and that it uses hugepages:
[root@overcloud-compute-0 tmp]# ps aux | grep qemu-kvm | sed 's/ -/\n-/g' | grep memory-backend-file
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/2-instance-00000006,share=yes,size=1073741824,host-nodes=0,policy=bind
Verify that the state of the instance's vhost-user port (vhub3fc898c-9c in this example) equals 0:
[root@overcloud-compute-0 ~]# ovs-ofctl show br-int
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000a2752c2bf94a
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(int-br-link): addr:c6:87:d2:a8:f1:01
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vhub3fc898c-9c): addr:00:00:00:00:00:00
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 4(vhost-user2): addr:00:00:00:00:00:00
     config:     0
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:a2:75:2c:2b:f9:4a
[root@overcloud-compute-0 ~]#
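On a bridge with many ports, the state check above can be reduced to a list of the ports that are still down. The following is a minimal sketch: the function name link_down_ports is hypothetical, and it assumes the usual multi-line ovs-ofctl show layout, in which each port line ("N(name): addr:...") is followed by indented config, state, and speed lines.

```shell
#!/bin/bash
# Hypothetical helper: reads `ovs-ofctl show <bridge>` output on stdin and
# prints the name of every port whose state line contains LINK_DOWN.
link_down_ports() {
    awk '
        / addr:/ {
            port = $1                      # for example "3(vhub3fc898c-9c):"
            sub(/^[^(]*\(/, "", port)      # drop the port number and "("
            sub(/\):$/, "", port)          # drop the trailing "):"
        }
        /state:/ && /LINK_DOWN/ { print port }
    '
}

# Usage on the compute node:
#   ovs-ofctl show br-int | link_down_ports
```

An empty result means that no port on the bridge reports LINK_DOWN.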
13.3. Diagnosis
Compare the following three qemu-kvm processes.
The following instance does not work, because its command line has no memory-backend-file object:
[root@overcloud-compute-0 tmp]# ps aux | grep qemu-kvm | sed 's/ -/\n -/g'
qemu 3542 2.5 4.5 1800584 354864 ? Sl 18:10 1:49 /usr/libexec/qemu-kvm
 -name guest=instance-00000005,debug-threads=on
 -S
 -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-instance-00000005/master-key.aes
 -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off
 -cpu SandyBridge,vme=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on
 -m 1024
 -realtime mlock=off
 -smp 1,sockets=1,cores=1,threads=1
 -uuid 3101b29c-50b6-49cc-96f2-f9723454f930
 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.7-11.el7ost,serial=ece3291e-67d2-4665-afb6-7e7774511eb2,uuid=3101b29c-50b6-49cc-96f2-f9723454f930,family=Virtual Machine
 -no-user-config
 -nodefaults
 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-instance-00000005/monitor.sock,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control
 -rtc base=utc,driftfix=slew
 -global kvm-pit.lost_tick_policy=delay
 -no-hpet
 -no-shutdown
 -boot strict=on
 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
 -drive file=/var/lib/nova/instances/3101b29c-50b6-49cc-96f2-f9723454f930/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none
 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhu97f52bf7-6b
 -netdev vhost-user,chardev=charnet0,id=hostnet0
 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:88:0d:e3,bus=pci.0,addr=0x3
 -add-fd set=0,fd=27
 -chardev file,id=charserial0,path=/dev/fdset/0,append=on
 -device isa-serial,chardev=charserial0,id=serial0
 -chardev pty,id=charserial1
 -device isa-serial,chardev=charserial1,id=serial1
 -device usb-tablet,id=input0,bus=usb.0,port=1
 -vnc 172.16.2.10:0
 -k en-us
 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2
 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
 -msg timestamp=on
This manual test instance works:
[root@overcloud-compute-0 tmp]# ps aux | grep qemu-kvm | sed 's/ -/\n -/g'
root 15139 57.9 0.4 1251124 38332 pts/0 Sl+ 19:17 2:11 /usr/libexec/qemu-kvm
 -enable-kvm
 -m 1024
 -smp 1
 -chardev socket,id=char0,path=/var/run/openvswitch/vhost-user2
 -netdev vhost-user,id=mynet1,chardev=char0,vhostforce
 -device virtio-net-pci,netdev=mynet1,mac=52:54:00:02:d9:01
 -object memory-backend-file,id=mem,size=1024M,mem-path=/dev/hugepages,share=on
 -numa node,memdev=mem
 -mem-prealloc
 -net user,hostfwd=tcp::10021-:22
 -net nic
 -vnc 0.0.0.0:10 rhel.qemu.qcow2
root 15693 0.0 0.0 112660 968 pts/1 S+ 19:20 0:00 grep
 --color=auto qemu-kvm
The following instance works, because the flavor properties add a hugepage-backed memory-backend-file object to its command line:
[root@overcloud-compute-0 tmp]# ps aux | grep qemu-kvm | sed 's/ -/\n -/g'
qemu 5710 9.6 0.3 1771876 28284 ? Rl 20:25 0:05 /usr/libexec/qemu-kvm
 -name guest=instance-00000006,debug-threads=on
 -S
 -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-instance-00000006/master-key.aes
 -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off
 -cpu SandyBridge,vme=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on
 -m 1024
 -realtime mlock=off
 -smp 1,sockets=1,cores=1,threads=1
 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/2-instance-00000006,share=yes,size=1073741824,host-nodes=0,policy=bind
 -numa node,nodeid=0,cpus=0,memdev=ram-node0
 -uuid 048d07c8-48f0-4f72-87c8-217edc801fa2
 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.7-11.el7ost,serial=ece3291e-67d2-4665-afb6-7e7774511eb2,uuid=048d07c8-48f0-4f72-87c8-217edc801fa2,family=Virtual Machine
 -no-user-config
 -nodefaults
 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-2-instance-00000006/monitor.sock,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control
 -rtc base=utc,driftfix=slew
 -global kvm-pit.lost_tick_policy=delay
 -no-hpet
 -no-shutdown
 -boot strict=on
 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
 -drive file=/var/lib/nova/instances/048d07c8-48f0-4f72-87c8-217edc801fa2/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none
 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhub3fc898c-9c
 -netdev vhost-user,chardev=charnet0,id=hostnet0
 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:64:2d:ae,bus=pci.0,addr=0x3
 -add-fd set=0,fd=27
 -chardev file,id=charserial0,path=/dev/fdset/0,append=on
 -device isa-serial,chardev=charserial0,id=serial0
 -chardev pty,id=charserial1
 -device isa-serial,chardev=charserial1,id=serial1
 -device usb-tablet,id=input0,bus=usb.0,port=1
 -vnc 172.16.2.10:0
 -k en-us
 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2
 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
 -msg timestamp=on
root 5870 0.0 0.0 112660 964 pts/0 S+ 20:26 0:00 grep
 --color=auto qemu-kvm
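The comparison above comes down to whether the qemu-kvm command line contains a memory-backend-file object backed by hugepages. The following is a minimal sketch of that check: the function name uses_hugepages is hypothetical, and it assumes that hugepages are mounted under /dev/hugepages as in the examples above.

```shell
#!/bin/bash
# Hypothetical helper: reads a qemu-kvm command line on stdin (for example
# /proc/<PID>/cmdline, where arguments are NUL-separated) and returns 0 if it
# contains a memory-backend-file object whose mem-path is under
# /dev/hugepages (assumed mount point), otherwise non-zero.
uses_hugepages() {
    tr '\0\n' '  ' | grep -q -- 'memory-backend-file,[^ ]*mem-path=/dev/hugepages'
}

# Usage on the compute node (PID taken from the examples above):
#   uses_hugepages < /proc/5710/cmdline && echo "hugepage backed"
```

For the three processes shown above, this check would fail for instance-00000005 and succeed for the manual test instance and for instance-00000006.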
Chapter 14. Using ovs-tcpdump on vhost-user Interfaces
14.1. Symptom
ovs-tcpdump works on physical interfaces, such as dpdk0, but not on vhost-user ports with long names.
14.2. Diagnosis
In Red Hat OpenStack environments, neutron creates vhu ports with fairly long names. The issue is that the vhost-user interface name is too long when ovs-tcpdump adds the mi prefix to neutron-generated vhu ports.
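The length problem can be checked before running ovs-tcpdump. The following is a minimal sketch: the function name mirror_name_fits is hypothetical, and it assumes that the limit in play is the Linux interface name limit of 15 characters (IFNAMSIZ of 16 bytes, including the terminating NUL).

```shell
#!/bin/bash
# Hypothetical helper: returns 0 if the mirror interface name that results
# from prefixing "mi" to the given port name fits into 15 characters
# (assumed limit: Linux IFNAMSIZ minus the terminating NUL).
mirror_name_fits() {
    local name="mi$1"
    [ "${#name}" -le 15 ]
}

# Usage:
#   mirror_name_fits dpdk0          && echo "ok"
#   mirror_name_fits vhuc359d121-a9 || echo "too long, use --mirror-to"
```

A neutron-generated vhu name is already 14 characters long, so the two-character prefix pushes it past the limit, whereas a short name such as dpdk0 stays well within it.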
14.3. Solution
ovs-tcpdump is shipped in the openvswitch-test RPM, which is part of the rhel-7-fast-datapath-rpms repository.
Using ovs-tcpdump may significantly lower the performance of the virtual switch, especially in deployments with DPDK. ovs-tcpdump creates a mirror that sends all mirrored user space packets to the kernel. In scenarios with high traffic, this consumes a lot of time in the kernel space and may affect OVS DPDK performance.
At the time of this writing, a proper fix has been implemented upstream but has not yet made it into Red Hat's RPMs.
Currently, the workaround is to use the --mirror-to parameter.
[root@overcloud-compute-0 ~]# virsh dumpxml 2 | grep vhu
      <source type='unix' path='/var/run/openvswitch/vhuc359d121-a9' mode='client'/>
      <source type='unix' path='/var/run/openvswitch/vhu8a801025-d6' mode='client'/>
      <source type='unix' path='/var/run/openvswitch/vhu8ffc8d2b-66' mode='client'/>
[root@overcloud-compute-0 ~]# ovs-tcpdump -i vhuc359d121-a9 --mirror-to test0 -c 10
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on test0, link-type EN10MB (Ethernet), capture size 262144 bytes
23:47:41.898750 IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
23:47:42.181745 IP6 :: > ff02::1:ff59:a2f3: ICMP6, neighbor solicitation, who has overcloud-compute-0, length 24
23:47:43.183749 IP6 overcloud-compute-0 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
23:47:43.745746 IP6 overcloud-compute-0 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
23:47:47.490332 ARP, Request who-has 192.0.2.9 tell controller-0.storage.localdomain, length 46
23:47:48.492499 ARP, Request who-has 192.0.2.9 tell controller-0.storage.localdomain, length 46
23:47:49.494479 ARP, Request who-has 192.0.2.9 tell controller-0.storage.localdomain, length 46
23:47:50.496537 ARP, Request who-has 192.0.2.9 tell controller-0.storage.localdomain, length 46
23:47:51.498487 ARP, Request who-has 192.0.2.9 tell controller-0.storage.localdomain, length 46
23:47:51.621942 IP gateway > 224.0.0.1: igmp query v2
10 packets captured
10 packets received by filter
0 packets dropped by kernel
[root@overcloud-compute-0 ~]#
Use the following approach as an alternative to installing the openvswitch-test rpm.
Using the following approach will significantly lower the performance of the virtual switch. Additionally, removing the port mirror may cause complete traffic loss to instances. Because of these limitations, only use this approach in a lab environment. The recovery is to reboot.
The functionality of ovs-tcpdump can also partly be reproduced with a script. Use the following steps for informational purposes only:
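Save and run the following script. It uses ovs-vsctl show to validate the bridge and vhu interface names:
#!/bin/bash

# Print usage together with the valid bridges and vhu interfaces, then exit.
show_help() {
    echo "Usage: $(basename "$0") (add|remove) <valid bridge name> <valid vhu interface name>"
    echo "Example: $(basename "$0") add br-int vhuabcdefgh1234"
    echo "Valid Bridges:"
    ovs-vsctl show | grep Bridge
    echo "Valid vhu interfaces:"
    ovs-vsctl show | grep vhu
    exit 1
}

ACTION=$1
BRIDGE=$2
VHU=$3

if [ "$ACTION" != "add" ] && [ "$ACTION" != "remove" ]; then
    show_help
fi
# Both names are required and must appear in the OVS configuration.
if [ -z "$BRIDGE" ] || ! ovs-vsctl show | grep Bridge | grep -q -- "$BRIDGE"; then
    show_help
fi
if [ -z "$VHU" ] || ! ovs-vsctl show | grep -q -- "$VHU"; then
    show_help
fi

if [ "$ACTION" == "add" ]; then
    # Create a dummy kernel interface and attach it to the bridge as the mirror output.
    ip link add name "${BRIDGE}-sniffer0" type dummy
    ip link set dev "${BRIDGE}-sniffer0" up
    ovs-vsctl add-port "${BRIDGE}" "${BRIDGE}-sniffer0"
    # Mirror traffic to and from the vhu port (referenced as @${BRIDGE}) to the sniffer port.
    ovs-vsctl -- set Bridge "${BRIDGE}" mirrors=@m \
        -- --id=@${BRIDGE}-sniffer0 get Port "${BRIDGE}-sniffer0" \
        -- --id=@${BRIDGE} get Port "${VHU}" \
        -- --id=@m create Mirror name=mymirror1 \
           select-dst-port=@${BRIDGE} \
           select-src-port=@${BRIDGE} \
           output-port=@${BRIDGE}-sniffer0
    echo "Traffic from $VHU on bridge $BRIDGE is being mirrored to ${BRIDGE}-sniffer0"
elif [ "$ACTION" == "remove" ]; then
    # Remove the mirror, then the sniffer port and its dummy interface.
    ovs-vsctl clear Bridge "${BRIDGE}" mirrors
    ovs-vsctl --if-exists del-port "${BRIDGE}" "${BRIDGE}-sniffer0"
    ip link del dev "${BRIDGE}-sniffer0" type dummy
fi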
Run tcpdump on the interface:
tcpdump -i ${BRIDGE}-sniffer0 -nne
Clean up:
ovs-vsctl clear bridge ${BRIDGE} mirrors
ovs-vsctl del-port ${BRIDGE}-sniffer0
ip link del dev ${BRIDGE}-sniffer0