TIM End to End OVS-DPDK Troubleshooting Guide
How to use the end to end OVS-DPDK troubleshooting procedures
Abstract
Preface
This troubleshooting guide provides detailed procedures for OVS-DPDK workloads. The procedures documented in this guide supersede the corresponding Knowledge Base articles.
Chapter 1. High Packet Loss in the TX Queue of the Instance’s Tap Interface
Use this procedure to diagnose packet loss in the TX queue of the instance's tap interface and determine its cause.
1.1. Symptom
During a test of a VNF using host-only networking, high packet loss can be observed in the TX queue of the instance’s tap interface. The test setup sends packets from one VM on a node to another VM on the same node. The packet loss appears in bursts.
The following example shows a high number of dropped packets in the tap’s TX queue.
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500034259301 132047795 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5481296464 81741449 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
1.2. Diagnosis
This procedure deals with drops on tap (kernel path) interfaces. For drops on vhost-user interfaces in the user datapath, see https://access.redhat.com/solutions/3381011
TX drops occur because of interference between the instance's vCPU and other processes on the hypervisor. The TX queue of the tap interface is a buffer that can store packets for a short while in case the instance cannot pick them up. This happens if the instance's CPU is held off from running (or freezes) for long enough.
A tuntap is a virtual device where one end is a kernel network interface and the other end is a user space file descriptor. A tuntap can run in two modes:
- Tap mode feeds L2 Ethernet frames with the L2 header into the device, and expects to receive the same back from user space. This mode is used for VMs.
- Tun mode feeds L3 IP packets with the L3 header into the device, and expects to receive the same back from user space. This mode is mostly used for VPN clients.
In KVM networking, the user space file descriptor is owned by the qemu-kvm process. Any frames sent into the tap (TX from the hypervisor's perspective) end up as L2 frames inside qemu-kvm, which can then feed those frames to the virtual network device in the VM as network packets received into the virtual network interface (RX from the VM's perspective).
The key concept with tuntap is: hypervisor TX == VM RX. The opposite is also true: hypervisor RX == VM TX.
There is no "ring buffer" of packets on a virtio-net device. This means that if the tuntap device’s TX queue fills up because the VM is not receiving (either fast enough or at all) then there is nowhere for new packets to go, and the hypervisor sees TX loss on the tap.
If you notice TX loss on a tuntap, increasing the tap txqueuelen is one way to help avoid it, similar to increasing the RX ring buffer to stop receive loss on a physical NIC.
However, this assumes the VM is just "slow" and "bursty" at receive. If the VM is not executing fast enough all the time, or otherwise not receiving at all, then tuning the TX queue length won’t help. You will need to find out why the VM is not running or receiving.
If you only need to improve VM packet handling performance, you can enable virtio-net multiqueue on the hypervisor and then balance the multiple virtual device interrupts on different cores inside the VM. This is documented in the libvirt domain specification for KVM (it can be done with virsh edit on a RHEL KVM hypervisor).
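For reference, here is a minimal sketch of what enabling multiqueue can look like on a plain KVM host; the instance name, guest interface name, and queue count of 4 are illustrative assumptions, not values from this environment:
# On the hypervisor, edit the domain and add a queues attribute to the
# interface's <driver> element (illustrative XML fragment):
#   virsh edit instance-00000012
#     <interface type='bridge'>
#       ...
#       <driver name='vhost' queues='4'/>
#     </interface>
#
# Inside the VM, activate the additional queues on the device (eth0 assumed):
ethtool -L eth0 combined 4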
If you cannot configure virtio-net multiqueue in Red Hat OpenStack Platform, consider configuring RPS inside the VM to balance the receive load across multiple CPU cores in software. See scaling.txt in the kernel-doc package, or the RPS section in the RHEL product documentation.
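As a rough sketch of configuring RPS inside the VM (the device name and CPU mask are assumptions; see scaling.txt for the full description of this interface):
# Spread receive processing for eth0's first RX queue across vCPUs 1-3
# (bitmask 0xe); adjust the device and mask to the guest's CPU layout.
echo e > /sys/class/net/eth0/queues/rx-0/rps_cpus
cat /sys/class/net/eth0/queues/rx-0/rps_cpus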
1.2.1. Workaround
Increasing the TX queue helps deal with these microfreezes at the cost of higher latency and other disadvantages.
txqueuelen can be temporarily increased with:
/sbin/ip link set tap<uuid> txqueuelen <new queue length>
txqueuelen can be permanently increased with a udev rule:
cat <<'EOF' > /etc/udev/rules.d/71-net-txqueuelen.rules
SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", ATTR{tx_queue_len}="10000"
EOF
After reloading udev or rebooting the system, new tap interfaces will come up with a queue length of 10000. For example:
[root@overcloud-compute-0 ~]# ip link ls | grep tap
29: tap122be807-cd: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 5505 qdisc pfifo_fast master qbr122be807-cd state UNKNOWN mode DEFAULT group default qlen 10000
1.2.2. Diagnostic Steps
To verify the above, use the following script:
[root@ibm-x3550m4-9 ~]# cat generate-tx-drops.sh
#!/bin/bash

trap 'cleanup' INT

cleanup() {
    echo "Cleanup ..."
    if [ "x$HPING_PID" != "x" ]; then
        echo "Killing hping3 with PID $HPING_PID"
        kill $HPING_PID
    fi
    if [ "x$DD_PID" != "x" ]; then
        echo "Killing dd with PID $DD_PID"
        kill $DD_PID
    fi
    exit 0
}

VM_IP=10.0.0.20
VM_TAP=tapc18eb09e-01
VM_INSTANCE_ID=instance-00000012
LAST_CPU=$( lscpu | awk '/^CPU\(s\):/ { print $NF - 1 }' )
# this is a 12 core system, we are sending everything to CPU 11,
# so the taskset mask is 800 so set dd affinity only for last CPU
TASKSET_MASK=800

# pinning vCPU to last pCPU
echo "virsh vcpupin $VM_INSTANCE_ID 0 $LAST_CPU"
virsh vcpupin $VM_INSTANCE_ID 0 $LAST_CPU

# make sure that: nova secgroup-add-rule default udp 1 65535 0.0.0.0/0
# make sure that: nova secgroup-add-rule default tcp 1 65535 0.0.0.0/0
# make sure that: nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0

# --fast, --faster or --flood can also be used
echo "hping3 -u -p 5000 $VM_IP --faster > /dev/null "
hping3 -u -p 5000 $VM_IP --faster > /dev/null &
HPING_PID=$!

echo "hping is running, but dd not yet:"
for i in { 1 .. 3 }; do
    date
    echo "ip -s -s link ls dev $VM_TAP"
    ip -s -s link ls dev $VM_TAP
    sleep 5
done

echo "Starting dd and pinning it to the same pCPU as the instance"
echo "dd if=/dev/zero of=/dev/null"
dd if=/dev/zero of=/dev/null &
DD_PID=$!
echo "taskset -p $TASKSET_MASK $DD_PID"
taskset -p $TASKSET_MASK $DD_PID

for i in { 1 .. 3 }; do
    date
    echo "ip -s -s link ls dev $VM_TAP"
    ip -s -s link ls dev $VM_TAP
    sleep 5
done

cleanup
Log into the instance and start dd if=/dev/zero of=/dev/null to generate additional load on its only vCPU. Note that this is for demonstration purposes; you can repeat the same test with or without load from within the VM and the result is the same. TX drops only occur when another process on the hypervisor steals time from the instance's vCPU.
The following example shows an instance before the test:
%Cpu(s): 22.3 us, 77.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1884108 total,  1445636 free,    90536 used,   347936 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1618720 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30172 root      20   0  107936    620    528 R  99.9  0.0   0:05.89 dd
Run the following script and have a look at the dropped packets in the TX queue. They only occur while the dd process is stealing a significant amount of processing time from the instance's CPU.
[root@ibm-x3550m4-9 ~]# ./generate-tx-drops.sh
virsh vcpupin instance-00000012 0 11
hping3 -u -p 5000 10.0.0.20 --faster > /dev/null
hping is running, but dd not yet:
Tue Nov 29 12:28:22 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500034259301 132047795 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5481296464 81741449 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Tue Nov 29 12:28:27 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500055729011 132445382 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5502766282 82139038 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Tue Nov 29 12:28:32 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500077122125 132841551 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5524159396 82535207 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Tue Nov 29 12:28:37 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500098181033 133231531 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5545218358 82925188 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Tue Nov 29 12:28:42 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500119152685 133619793 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5566184804 83313451 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Starting dd and pinning it to the same pCPU as the instance
dd if=/dev/zero of=/dev/null
taskset -p 800 8763
pid 8763's current affinity mask: fff
pid 8763's new affinity mask: 800
Tue Nov 29 12:28:47 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500140267091 134010698 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5587300452 83704477 0 11155280 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Tue Nov 29 12:28:52 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500159822749 134372711 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5606853168 84066563 0 11188074 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Tue Nov 29 12:28:57 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500179161241 134730729 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5626179144 84424451 0 11223096 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Tue Nov 29 12:29:02 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500198344463 135085948 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5645365410 84779752 0 11260740 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Tue Nov 29 12:29:07 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    5500217014275 135431570 0 0 0 0
    RX errors: length crc frame fifo missed
               0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    5664031398 85125418 0 11302179 0 0
    TX errors: aborted fifo window heartbeat transns
               0 0 0 0 0
Cleanup ...
Killing hping3 with PID 8722
Killing dd with PID 8763
[root@ibm-x3550m4-9 ~]#
--- 10.0.0.20 hping statistic ---
3919615 packets transmitted, 0 packets received, 100% packet loss
round-trip min/avg/max = 0.0/0.0/0.0 ms
The following example shows an instance during the test, while dd on the hypervisor steals CPU time from it (note the 45.4 st steal value):
%Cpu(s):  7.0 us, 27.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi, 20.2 si, 45.4 st
KiB Mem :  1884108 total,  1445484 free,    90676 used,   347948 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1618568 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30172 root      20   0  107936    620    528 R  54.3  0.0   1:00.50 dd
Note that ssh to the instance may become sluggish during the second half of the test, and may even time out if the test runs too long.
1.3. Solution
Increasing the TX queue helps deal with these microfreezes. However, the real solution is complete isolation with CPU pinning and isolcpus in the kernel parameters. Refer to Configure CPU pinning with NUMA in OpenStack for further details.
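As an illustrative sketch only (the flavor name, CPU ranges, and file locations are assumptions, not a prescription for your deployment), CPU pinning and isolation typically combine a dedicated CPU policy on the flavor with host CPU reservations and kernel isolation:
# Pin instance vCPUs to dedicated host CPUs via the flavor:
openstack flavor set m1.small --property hw:cpu_policy=dedicated

# On the compute node, reserve host CPUs for instances (nova.conf):
#   vcpu_pin_set = 2-11
# and keep those CPUs away from other host tasks with a kernel argument:
#   isolcpus=2-11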
Chapter 2. TX Drops on Instance VHU Interfaces with Open vSwitch DPDK
Use this procedure to diagnose TX drops on instance VHU interfaces and determine their cause.
2.1. Symptom
The vhost-user interface (VHU) exchanges packets with the virtual machine. This interface allows the packet to go from the vswitch directly to the guest using the virtio transport without passing through the kernel or qemu processes.
The VHU is mostly implemented by the DPDK librte_vhost library, which also offers functions to send or receive batches of packets. The backend of the VHU is a virtio ring, provided by qemu, that is used to exchange packets with the virtual machine. The virtio ring has a special format comprised of descriptors and buffers.
The TX/RX statistics are the OVS view of things: TX means transmit from the OVS perspective, which is RX from the VM's point of view. If the VM does not pull packets fast enough, for whatever reason, OVS sees a full TX queue and drops packets.
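A quick, hedged way to confirm such drops is to read the OVS-side counters of the vhost-user interface; the port name below is only an example:
# Dump the per-port counters kept by OVS for the vhost-user interface and
# look for the tx drop counter:
ovs-vsctl get interface vhu9a9b0feb-2e statistics | tr ',' '\n' | grep -i drop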
2.1.1. Explanation for Spurious Drops
The reason for TX drops on the vhost-user device is a lack of space in the virtio ring. The virtio ring is located in the guest’s memory and it works like a queue where the vhost-user pushes packets and the VM consumes them. If the VM isn’t fast enough to consume the packets, the virtio ring runs out of buffers and the vhost-user drops packets.
More explicitly, the only reason for vhost-user ports to have TX drops is that the guest is not fetching packets fast enough, which causes the vhost-user port to run out of buffer space.
You can use the perf and ftrace tools to investigate possible causes for spurious drops; a usage sketch follows the tool list below. Perf can count the number of scheduler switches, which can show whether the qemu thread was preempted by another thread. Ftrace can show how long a preemption lasted and what caused it. The timer interrupt (kernel tick), for instance, preempts the qemu thread and adds the cost of at least two context switches. The timer interrupt can also run RCU callbacks, which take an unpredictable amount of time. CPU power management and Hyper-Threading can also disrupt the qemu thread. Note that these are just some of the possible reasons why the VM is not consuming packets from the virtio ring fast enough.
- PERF: perf rpm in rhel-7-server-rpms/7Server/x86_64. For more information, see: About Perf
- FTRACE: trace-cmd rpm in rhel-7-server-rpms/7Server/x86_64. For more information, see: About Ftrace
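The following is a minimal sketch of how these tools might be used; the CPU number (2) and durations are assumptions and must be adapted to the pCPU that actually runs the instance's vCPU:
# Count scheduler switches on that pCPU for 10 seconds; a high count
# suggests the qemu vCPU thread is being preempted:
perf stat -e sched:sched_switch -C 2 sleep 10

# Record scheduling events on the same CPU (mask 4 = CPU 2) and inspect
# what preempted the vCPU thread and for how long:
trace-cmd record -e sched_switch -M 4 sleep 10
trace-cmd report | less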
2.1.2. Explanation for other drops
The current implementation in OpenStack Platform 10 uses vhost-user ports. With vhost-user ports, OVS is the server and qemu is the client. Regardless of whether a nova instance is rebooted from within the VM, rebooted with nova, or stopped and restarted, the vhost-user (VHU) port continues to exist on the bridge. Frames still hit the port based on flow and/or MAC learning rules and increase the tx_drop counter, because the consumer (the VM), and with it the vhu backend, is down:
# in this example, the VM was stopped with `nova stop <UUID>`:
[root@overcloud-compute-0 network-scripts]# ovs-vsctl list interface vhubd172106-73 | grep _state
admin_state         : up
link_state          : down
This is similar to what happens when a kernel port is shut down with ip link set dev <br internal port name> down and frames are dropped in userspace.
When the VM comes up again, it connects to the same vhu socket as before and starts consuming frames from the virtio ring buffer. TX drops stop and traffic is transmitted normally again. You can also increase the TX and RX queue lengths for DPDK, as described in the following section.
2.1.3. Increasing the TX and RX queue lengths for DPDK
You can change TX and RX queue lengths for DPDK with the following OpenStack Director template modifications:
NovaComputeExtraConfig:
  nova::compute::libvirt::rx_queue_size: '"1024"'
  nova::compute::libvirt::tx_queue_size: '"1024"'
The following example shows the validation:
[root@overcloud-compute-1 ~]# ovs-vsctl get interface vhu9a9b0feb-2e status
{features="0x0000000150208182", mode=client, num_of_vrings="2", numa="0", socket="/var/lib/vhost_sockets/vhu9a9b0feb-2e", status=connected, "vring_0_size"="1024", "vring_1_size"="1024"}
Due to kernel limitations, the queue size cannot be increased beyond 1024.
2.2. Diagnosis
TX drops towards the vhost-user ports are observed when the guest cannot receive packets. Networks eventually drop packets and, in most cases, that is not an issue; TCP, for instance, can easily recover. But other use cases might have stricter requirements with less tolerance for packet drops.
DPDK-accelerated OVS is used because the kernel datapath is too slow. The same applies inside the guest: if the guest is running a regular kernel datapath, it might not be able to keep pace with the host, and drops can happen.
2.3. Solution
A good first step is to make sure the vCPUs that run the VM are 100% allocated to the VM and not shared with other, unrelated tasks.
If the VM already gets all the CPU power possible, then look inside the guest to make sure it is properly tuned.
Keep in mind that running the kernel datapath inside the guest is slower than running a DPDK application there.
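A hedged sketch of such a check follows; the instance name and CPU number are examples taken from earlier in this guide:
# See which pCPUs the instance's vCPUs are pinned to:
virsh vcpupin instance-00000012

# Then check whether any other threads are scheduled on one of those pCPUs
# (pCPU 11 in this example):
ps -eLo pid,tid,psr,pcpu,comm | awk '$3 == 11'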
Chapter 3. Interpreting the output of the pmd-stats-show command in Open vSwitch with DPDK
This procedure tells you how to interpret the output of the pmd-stats-show command (ovs-appctl dpif-netdev/pmd-stats-show) in Open vSwitch (OVS) with DPDK.
3.1. Symptom
An issue you can encounter when using the ovs-appctl dpif-netdev/pmd-stats-show command is that the gathered statistics are accumulated from the time the PMD was started. This means that statistics collected before the current load are also reflected, which can give an inaccurate measurement.
3.2. Diagnosis
To obtain more current and useful output, put the system into a steady state and reset the statistics that you want to measure:
# put system into steady state
ovs-appctl dpif-netdev/pmd-stats-clear
# wait <x> seconds
sleep <x>
ovs-appctl dpif-netdev/pmd-stats-show
Here’s an example of the output:
[root@overcloud-compute-0 ~]# ovs-appctl dpif-netdev/pmd-stats-clear && sleep 10 && ovs-appctl dpif-netdev/pmd-stats-show | egrep 'core_id (2|22):' -A9
pmd thread numa_id 0 core_id 22:
    emc hits:17461158
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:4948219259 (25.81%)
    processing cycles:14220835107 (74.19%)
    avg cycles per packet: 1097.81 (19169054366/17461158)
    avg processing cycles per packet: 814.43 (14220835107/17461158)
--
pmd thread numa_id 0 core_id 2:
    emc hits:14874381
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:5460724802 (29.10%)
    processing cycles:13305794333 (70.90%)
    avg cycles per packet: 1261.67 (18766519135/14874381)
    avg processing cycles per packet: 894.54 (13305794333/14874381)
Note that core_id 2 is mostly busy, spending about 70% of its time processing and 30% of its time polling:
polling cycles:5460724802 (29.10%)
processing cycles:13305794333 (70.90%)
In this example, miss indicates packets that were not classified in the DPDK datapath ('emc' or 'dp' classifier). Under normal circumstances, they would then be sent to the ofproto layer. On rare occasions, due to a flow revalidation lock or if the ofproto layer returns an error, the packet is dropped. In this case, lost is also incremented to indicate the loss.
For more details, see: https://software.intel.com/en-us/articles/ovs-dpdk-datapath-classifier
emc hits:14874381
megaflow hits:0
avg. subtable lookups per hit:0.00
miss:0
lost:0
3.3. Solution
This section explains the procedures for resolving the problem.
3.3.1. Idle PMD
The following example shows a system where the core_ids serve the PMDs that are pinned to dpdk0, with only management traffic flowing through dpdk0:
[root@overcloud-compute-0 ~]# ovs-appctl dpif-netdev/pmd-stats-clear && sleep 10 && ovs-appctl dpif-netdev/pmd-stats-show | egrep 'core_id (2|22):' -A9
pmd thread numa_id 0 core_id 22:
    emc hits:0
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:12613298746 (100.00%)
    processing cycles:0 (0.00%)
--
pmd thread numa_id 0 core_id 2:
    emc hits:5
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:12480023709 (100.00%)
    processing cycles:14354 (0.00%)
    avg cycles per packet: 2496007612.60 (12480038063/5)
    avg processing cycles per packet: 2870.80 (14354/5)
3.3.2. PMD under load test with packet drop
The following example shows a system where the core_ids serve the PMDs that are pinned to dpdk0, with a load test flowing through dpdk0, causing a high number of RX drops:
[root@overcloud-compute-0 ~]# ovs-appctl dpif-netdev/pmd-stats-clear && sleep 10 && ovs-appctl dpif-netdev/pmd-stats-show | egrep 'core_id (2|4|22|24):' -A9
pmd thread numa_id 0 core_id 22:
    emc hits:35497952
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:1446658819 (6.61%)
    processing cycles:20453874401 (93.39%)
    avg cycles per packet: 616.95 (21900533220/35497952)
    avg processing cycles per packet: 576.20 (20453874401/35497952)
--
pmd thread numa_id 0 core_id 2:
    emc hits:30183582
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:2
    lost:0
    polling cycles:1497174615 (6.85%)
    processing cycles:20354613261 (93.15%)
    avg cycles per packet: 723.96 (21851787876/30183584)
    avg processing cycles per packet: 674.36 (20354613261/30183584)
Where packet drops occur, you can see a high ratio of processing cycles vs polling cycles (more than 90% processing cycles):
polling cycles:1497174615 (6.85%)
processing cycles:20354613261 (93.15%)
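To confirm that the drops are happening on the physical DPDK port while the PMDs are saturated, you can, as a hedged example with dpdk0 as the assumed port name, read that port's counters:
# Look for rx drop counters on the DPDK physical port:
ovs-vsctl get interface dpdk0 statistics | tr ',' '\n' | grep -i drop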
3.3.3. PMD under load test with 50% of Mpps capacity
The following example shows a system where the core_ids serve the PMDs that are pinned to dpdk0, with a load test flowing through dpdk0 that sends 6.4 Mpps, around 50% of the maximum capacity of this dpdk0 interface (around 12.85 Mpps):
[root@overcloud-compute-0 ~]# ovs-appctl dpif-netdev/pmd-stats-clear && sleep 10 && ovs-appctl dpif-netdev/pmd-stats-show | egrep 'core_id (2|4|22|24):' -A9
pmd thread numa_id 0 core_id 22:
    emc hits:17461158
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:4948219259 (25.81%)
    processing cycles:14220835107 (74.19%)
    avg cycles per packet: 1097.81 (19169054366/17461158)
    avg processing cycles per packet: 814.43 (14220835107/17461158)
--
pmd thread numa_id 0 core_id 2:
    emc hits:14874381
    megaflow hits:0
    avg. subtable lookups per hit:0.00
    miss:0
    lost:0
    polling cycles:5460724802 (29.10%)
    processing cycles:13305794333 (70.90%)
    avg cycles per packet: 1261.67 (18766519135/14874381)
    avg processing cycles per packet: 894.54 (13305794333/14874381)
Where the pps are about half of the maximum for the interface, you can see a lower ratio of processing cycles to polling cycles (about 70% processing cycles):
polling cycles:5460724802 (29.10%)
processing cycles:13305794333 (70.90%)
3.3.4. Hit vs miss vs lost
The following excerpts from the man pages describe the relevant counters:
man ovs-vswitchd
(...)
DPIF-NETDEV COMMANDS
    These commands are used to expose internal information (mostly statistics) about the ``dpif-netdev'' userspace datapath. If there is only one datapath (as is often the case, unless dpctl/ commands are used), the dp argument can be omitted.

    dpif-netdev/pmd-stats-show [dp]
        Shows performance statistics for each pmd thread of the datapath dp. The special thread ``main'' sums up the statistics of every non pmd thread. The sum of ``emc hits'', ``masked hits'' and ``miss'' is the number of packets received by the datapath. Cycles are counted using the TSC or similar facilities (when available on the platform). To reset these counters use dpif-netdev/pmd-stats-clear. The duration of one cycle depends on the measuring infrastructure.
(...)

man ovs-dpctl
(...)
    dump-dps
        Prints the name of each configured datapath on a separate line.

    [-s | --statistics] show [dp...]
        Prints a summary of configured datapaths, including their datapath numbers and a list of ports connected to each datapath. (The local port is identified as port 0.) If -s or --statistics is specified, then packet and byte counters are also printed for each port. The datapath numbers consists of flow stats and mega flow mask stats.
        The "lookups" row displays three stats related to flow lookup triggered by processing incoming packets in the datapath. "hit" displays number of packets matches existing flows. "missed" displays the number of packets not matching any existing flow and require user space processing. "lost" displays number of packets destined for user space process but subsequently dropped before reaching userspace. The sum of "hit" and "miss" equals to the total number of packets datapath processed.
(...)

man ovs-vswitchd
(...)
    dpctl/show [-s | --statistics] [dp...]
        Prints a summary of configured datapaths, including their datapath numbers and a list of ports connected to each datapath. (The local port is identified as port 0.) If -s or --statistics is specified, then packet and byte counters are also printed for each port. The datapath numbers consists of flow stats and mega flow mask stats.
        The "lookups" row displays three stats related to flow lookup triggered by processing incoming packets in the datapath. "hit" displays number of packets matches existing flows. "missed" displays the number of packets not matching any existing flow and require user space processing. "lost" displays number of packets destined for user space process but subsequently dropped before reaching userspace. The sum of "hit" and "miss" equals to the total number of packets datapath processed.
(...)
Some of this documentation refers to the kernel datapath, so when it says user space processing, it means the packet was not classified in the kernel software caches (the equivalents of emc and dpcls) and was sent to the ofproto layer in userspace.
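To see the hit/missed/lost lookup counters described above for the datapath, you can run the command from the man page excerpt (output formatting varies by version):
# Summary of the datapath, including flow lookup hit/missed/lost counters:
ovs-appctl dpctl/show -s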
Chapter 4. Attaching and Detaching SR-IOV ports in nova
Use this procedure to attach and detach SR-IOV ports correctly.
4.1. Symptom
Cannot attach or detach SR-IOV ports in nova in Red Hat OpenStack Platform 10 and later. Nova logs report No conversion for VIF type hw_veb yet.
4.2. Diagnosis
You cannot attach or detach SR-IOV ports to an instance that has already been spawned. SR-IOV ports need to be attached at instance creation.
4.3. Solution
The following example attempts to attach interfaces after an instance boot:
RHEL_INSTANCE_COUNT=1
NETID=$(neutron net-list | grep provider1 | awk '{print $2}')
for i in `seq 1 $RHEL_INSTANCE_COUNT`;do
  # nova floating-ip-create provider1
  portid1=`neutron port-create sriov1 --name sriov1 --binding:vnic-type direct | awk '$2 == "id" {print $(NF-1)}'`
  portid2=`neutron port-create sriov2 --name sriov2 --binding:vnic-type direct | awk '$2 == "id" {print $(NF-1)}'`
  openstack server create --flavor m1.small --image rhel --nic net-id=$NETID --key-name id_rsa sriov_vm${i}
  serverid=`openstack server list | grep sriov_vm${i} | awk '{print $2}'`
  status="NONE"
  while [ "$status" != "ACTIVE" ]; do
    echo "Server $serverid not active ($status)" ; sleep 5 ;
    status=`openstack server show $serverid | grep -i status | awk '{print $4}'`
  done
  nova interface-attach --port-id $portid1 $serverid
  nova interface-attach --port-id $portid2 $serverid
done
This fails with the following error:
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<type 'exceptions.KeyError'> (HTTP 500) (Request-ID: req-36b544f4-91a6-442e-a30d-6148220d1449)
The correct method is to spawn an instance directly with SR-IOV ports:
RHEL_INSTANCE_COUNT=1
NETID=$(neutron net-list | grep provider1 | awk '{print $2}')
for i in `seq 1 $RHEL_INSTANCE_COUNT`;do
  # nova floating-ip-create provider1
  portid1=`neutron port-create sriov1 --name sriov1 --binding:vnic-type direct | awk '$2 == "id" {print $(NF-1)}'`
  portid2=`neutron port-create sriov2 --name sriov2 --binding:vnic-type direct | awk '$2 == "id" {print $(NF-1)}'`
  openstack server create --flavor m1.small --image rhel --nic net-id=$NETID --nic port-id=$portid1 --nic port-id=$portid2 --key-name id_rsa sriov_vm${i}
done
This works without issues.