Myrinet timing

I've spent a lot of time lately improving and trying to understand the myrinet performance.

Here I show round trip performance between vxworks nodes. First of all, I see significantly better performance between the node "bdb" (formerly gb) and the other vxworks nodes. The myrinet card on this node is the newest board that does NOT have the Myrinet 2 functionality. (lanai 4: the very old sector broker boards; lanai 7: bdb and some of the l3 nodes; lanai 9: the new nodes used by trigger and some of the newest detectors.) The newer lanai 9 nodes are running in a "backward compatibility mode". The bdb node itself is also different, and I still don't know whether the performance difference is due to the drivers, the electronics, or the processor itself.


Round Trip Messages

Here I show the round trip times (I don't divide by two). I don't see any latencies > 400 usec, although there is a tail: about 15 of 50000 messages take more than 200 usec. This tail is node dependent, and I don't see it on the trigger nodes; I don't know its source.



Most messages take about 60-70 usec round trip, but there is a striking peak (on a log plot) at ~150 usec. I believe this is due to the way the myrinet control program on the NIC works. The physical layer of myrinet is NOT "reliable" (although it has low error rates). Reliability is guaranteed by having the receiver send ack signals back to the sender. However, the ack signals themselves are not reliable, so there is also a timeout, and the timeout is handled in a strange way: every millisecond a timer goes off, and all messages that have not yet been acknowledged are assumed to be bad and are queued for re-sending. Because the timer is out of sync with the messages, it is "normal" for messages to be interrupted this way. If you assume a message takes 20 usec to send, and that when you send round trips you are sending messages 1/2 of the time, you would expect about 1 in 100 trips to be interrupted (20 usec / 1000 usec x 1/2 = 1/100). This is about the ratio of messages in the second peak to the first.
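
To make this concrete, here is a minimal sketch of that kind of timer-driven retransmit logic. It is only an illustration, not the actual myrinet control program; the names and data structures are made up.

    /* Illustration only -- not the real myrinet control program. */
    #include <stdio.h>

    #define MAX_PENDING 32

    struct pending_msg {
        int in_use;   /* slot holds a message that has not been acked yet */
        int id;       /* message identifier */
    };

    static struct pending_msg pending[MAX_PENDING];

    /* Called when the receiver's ack for message 'id' comes back. */
    void on_ack(int id)
    {
        int i;
        for (i = 0; i < MAX_PENDING; i++)
            if (pending[i].in_use && pending[i].id == id)
                pending[i].in_use = 0;
    }

    /* Called once per millisecond.  Anything still unacknowledged is
     * assumed bad and queued for re-sending -- even a message that was
     * sent only a few microseconds before the timer fired, which is
     * presumably the source of the ~150 usec peak. */
    void on_ms_timer(void)
    {
        int i;
        for (i = 0; i < MAX_PENDING; i++)
            if (pending[i].in_use)
                printf("re-queue message %d for sending\n", pending[i].id);
    }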


Unidirectional messages



Timing while sending unidirectional messages is a bit more confusing. The distribution of times depends very much on which node sends to which node, and the tails on the distributions are much longer than for round-trip messages. The reasons for these observations are tied up with the internal buffering of the messages. This is not an issue with ping-pong messages, because there is never more than one message in the system at a time.

Before describing what is happening, it's useful to know the rates for unidirectional sends. The sustained average time per message depends almost completely on the sending node:

l2ana01 (linux lanai9) = 9.3 usec
bdb (vxworks lanai7) = 14 usec (user task high priority)
bdb (vxworks lanai7) = 17 usec (user task low priority)
l1/ctb (vxworks lanai9) = 20 usec
pmd (vxworks lanai9 slower cpu?) = 25 usec

You also need to know that to use myriLib you need at least two tasks. The first is the user task, which performs the message sends. The second is the receive task, which receives all information coming from the network card. This information consists of the arriving messages, and also the confirmations that outgoing messages have been sent.

There is some competition for resources between these two tasks, so I do observe some effects from inverting their priorities. I assume, of course, that the user task blocks when not in use; if you are polling you don't have a choice.

Sending unidirectional messages, I get better performance keeping the user task at high priority. This ensures that the card always has messages to send.
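
As a reference point, the two-task setup looks roughly like this under vxworks (where a lower number means a higher priority). The entry points and the priority numbers are only illustrative; the real myriLib startup code is of course different.

    /* Sketch only: illustrative priorities and entry points. */
    #include <vxWorks.h>
    #include <taskLib.h>

    void userTask(void)     { /* build and send messages via myriLib */ }
    void myriRecvTask(void) { /* drain arriving messages and "message sent" confirmations */ }

    void startMyriTasks(void)
    {
        /* vxworks: priority 0 is highest, 255 is lowest.  For unidirectional
         * sends the best results come from giving the user (sending) task
         * the higher priority, so the card never runs out of messages. */
        taskSpawn("tMyriUser", 90,  0, 20000, (FUNCPTR)userTask,
                  0,0,0,0,0,0,0,0,0,0);
        taskSpawn("tMyriRecv", 100, 0, 20000, (FUNCPTR)myriRecvTask,
                  0,0,0,0,0,0,0,0,0,0);
    }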



In this plot, you see spray messages sent from the slowest node (pmd) to a faster one (bdb). In the case where the user task has a high priority (uh), you see a nice peak at about 25 usec (the sustained rate) and no additional structure. What happens is that the user's buffers are filled up and one message is sent at a time.

However, when the priority of the user task is lower than that of the receive task, several messages get put into the queue while the NIC is processing the first one. Then, as messages finish getting processed, the receive task starves the user task while it processes the "event sent" notifications. The result is that the time distribution has a sharp peak at ~10 usec followed by another broad peak at 40-100 usec.

I hacked up the driver to keep a counter of the number of messages currently in the system. Then I take the difference between each successive message send to see how many messages finished being processed during this send. Here is the result for a low priority user task. (If the user task has high priority, the number is always 1 after the first 29 sends fill up the buffers.)
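
The bookkeeping is roughly the following (a reconstruction of the idea, not the actual driver hack; the names are made up). The same numbers give the corrected time per message, t/nserved, shown further down.

    /* Reconstruction of the instrumentation, not the real driver code. */
    #define NSENDS 50000

    static int nInFlight = 0;   /* messages handed to the card but not yet
                                 * confirmed -- incremented in the send path,
                                 * decremented by the receive task when a
                                 * "message sent" notification arrives */
    static int hist[32];        /* histogram of sends served per send */

    void sprayTest(void)
    {
        int prev = 0;
        int i;

        for (i = 0; i < NSENDS; i++) {
            /* ... myriMsgSend(...) here; the driver bumps nInFlight ... */

            /* One message was queued since the previous send, so the
             * number that finished ("were served") during this send is: */
            int nServed = prev + 1 - nInFlight;
            prev = nInFlight;

            if (nServed >= 0 && nServed < 32)
                hist[nServed]++;
            /* dividing the measured send time by nServed gives the
             * corrected time per message (t/nserved) used below */
        }
    }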



This feature is even more pronounced with a faster NIC doing the sending.



Here, for the low priority user task, you see exactly the same situation, although the distribution is shifted to slower values because more events can be processed at a time.

In addition, you see a new feature with the HIGH priority sender. In this case the user buffers are always completely full; however, I believe the same queueing effect occurs with the buffers in the destination (receiving) node. Unfortunately, I don't have any way to put hooks in to demonstrate that this is what leads to the peak in the high priority messages.

Here I show the rates corrected by the number of sends that were processed during the period of the send (t / nserved). Note that the rates are slightly inflated, because the time spent sending 0 messages is not in the data of this plot. However, you do see the point of buffering the messages in the first place -- the time per message drops as the number of sends handled at once increases.



Finally, I show the timing distributions for the spray messages on the trigger nodes. Sending from l1 to ctb (same speed), I don't see much structure. Sending from l2 I see strong evidence of the queueing effect.

Spray messages with empty buffers



The situation in L1 and on the DSM clients is different. There you need to do some processing every event that is independent of the myrinet card, while the myrinet card does its own work in parallel.

I have mimicked this situation (poorly) by sending 5 messages at a time and then waiting for 10 ms (the shortest wait in vxworks). This gives a good indication of how much of the latency is due to the host CPU.

We see that the times are much more deterministic, and very much shorter. Here we can also see the difference between the single message version of myriMsgSend and the multiple message version. The multiple message version saves about 10% (8 usec/msg compared to 9 usec/msg). The reason is just that the multiple message version saves several semaphore operations.
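
The test loop was essentially the one sketched below. The myriMsgSend arguments are placeholders (this is not the real prototype), and "myriMsgSendMultiple" is just a stand-in name for the multiple message version.

    /* Sketch of the "5 messages then wait" test.  The myriLib prototypes
     * are placeholders; only the call pattern matters. */
    #include <vxWorks.h>
    #include <taskLib.h>

    int myriMsgSend(int node, void *buf, int len);                       /* placeholder args */
    int myriMsgSendMultiple(int node, void *bufs[], int lens[], int n);  /* stand-in for the
                                                                          * multi-message call */

    void burstTest(int node, void *buf, int len, int useBatched)
    {
        void *bufs[5];
        int   lens[5];
        int   i;

        for (i = 0; i < 5; i++) {
            bufs[i] = buf;
            lens[i] = len;
        }

        for (;;) {
            if (useBatched) {
                /* one call: the semaphore operations happen once (~8 usec/msg) */
                myriMsgSendMultiple(node, bufs, lens, 5);
            } else {
                /* five single calls: semaphore operations per call (~9 usec/msg) */
                for (i = 0; i < 5; i++)
                    myriMsgSend(node, buf, len);
            }
            taskDelay(1);   /* one clock tick = 10 ms here, the shortest wait vxworks offers */
        }
    }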


Simplified trigger protocol



The final test I made was a much simplified version of the L2 protocol. Here L1 sends two messages, first to CTB and then to L2. CTB forwards its message to L2, and finally L2 returns a message to L1. I show the round trip times.
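
From L1's point of view, one cycle of this test looks something like the sketch below. The prototypes, node ids and the microsecond timer are placeholders; only the message flow (L1 to CTB, L1 to L2, CTB to L2, L2 back to L1) is the actual test.

    /* One cycle of the simplified trigger protocol, seen from L1.
     * Everything here is illustrative, not the real myriLib interface. */
    enum { NODE_CTB = 1, NODE_L2 = 2 };

    int myriMsgSend(int node, void *buf, int len);   /* placeholder prototype           */
    int waitForMsgFrom(int node);                    /* stand-in for a blocking receive */
    unsigned long usecTimestamp(void);               /* stand-in microsecond clock      */
    void recordRoundTrip(unsigned long usec);        /* fills the timing histogram      */

    void trigCycle(void *token, int len)
    {
        unsigned long t0 = usecTimestamp();

        myriMsgSend(NODE_CTB, token, len);   /* 1. L1 -> CTB                    */
        myriMsgSend(NODE_L2,  token, len);   /* 2. L1 -> L2                     */
                                             /* 3. CTB forwards its copy to L2  */
        waitForMsgFrom(NODE_L2);             /* 4. L2 -> L1 closes the cycle    */

        recordRoundTrip(usecTimestamp() - t0);
    }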



The result is quite similar to the results from ping-pong messages, except the times are somewhat longer (more messages) and the distribution is a bit broader, as would be expected. I do, unfortunately, see some very long events. They don't show up every time, because my test only runs 50000 cycles. My last plot is the same plot but for a run that had a long delay in it. It shows that the delayed event does not seem to be correlated with other delayed events. It also shows that the delayed events are not frequent enough to adversely affect the overall rate.

The rate of cycles taking > 2 ms is approximately 1 in 100,000 to 200,000 cycles.


Jeff Landgraf
Last modified: Tue Aug 19 18:25:12 EDT 2003