CPlant Portals

Portals

A portal consists of a portal table, possibly a matching list, and any combination of four types of memory descriptors. We regard these pieces as basic building blocks for other message passing paradigms. A library writer or runtime system designer should be able to pick the appropriate set of pieces and build a communication subsystem tailored to the needs of the particular library or runtime system being implemented.

As a proof of concept, and to make portals more user friendly to application programmers, we have implemented MPI, Intel NX, and nCUBE Vertex emulation libraries, as well as collective communication algorithms using portals as basic building blocks.

Portals have been designed to be efficient, portable, scalable, and flexible enough to support the projects above. The following sections discuss the portal table, the memory descriptors, and the matching list, and conclude with a simple example of how these building blocks can be combined to present a message passing system, such as MPI, to an application.

Portal Table

[Figure: The Portal Table]

A message arriving at a node contains in its header the portal number for which it is destined. The kernel uses this number as an index into the portal table. The entries in the portal table are maintained by the user (application or library code) and point to a matching list or a memory descriptor.

If a valid memory descriptor is present, the kernel sets up the DMA units and initiates transfer of the message body into the memory descriptor. If the portal table entry points to a matching list, the kernel traverses the matching list to find an entry that matches the criteria found in the current message header. If a match is found and the memory descriptor attached to that matching list entry is valid, the kernel starts a DMA transfer directly into the memory descriptor.
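The following sketch illustrates this dispatch step in C. The structure and function names (portal_entry_t, ptl_dispatch, and so on) are invented for readability and are not the actual CPlant data structures; the matching-list walk is left as a placeholder here and is sketched in the matching list section below.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical shapes -- not the real CPlant structures. */
    typedef struct mem_desc    mem_desc_t;     /* one of the four descriptor types */
    typedef struct match_entry match_entry_t;  /* node in a matching list          */

    typedef struct {
        enum { PTL_NONE, PTL_MEM_DESC, PTL_MATCH_LIST } kind;
        union {
            mem_desc_t    *md;     /* deposit straight into this descriptor */
            match_entry_t *mlist;  /* or walk this list first               */
        } u;
    } portal_entry_t;

    typedef struct {
        uint32_t portal;           /* index into the receiver's portal table */
        /* ... source rank, match bits, message length, and so on ...        */
    } msg_header_t;

    /* Kernel-side dispatch: the portal number carried in the header is used
     * as an index into the user-maintained portal table. */
    static mem_desc_t *ptl_dispatch(portal_entry_t *table, size_t nentries,
                                    const msg_header_t *hdr)
    {
        if (hdr->portal >= nentries)
            return NULL;                    /* invalid portal: discard message */

        portal_entry_t *e = &table[hdr->portal];
        switch (e->kind) {
        case PTL_MEM_DESC:
            return e->u.md;                 /* DMA directly into the descriptor */
        case PTL_MATCH_LIST:
            /* walk e->u.mlist for a matching entry (see Matching Lists below) */
            return NULL;                    /* placeholder in this sketch       */
        default:
            return NULL;                    /* no valid target: discard         */
        }
    }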

User level code sets up the data structures that make up a portal to tell the kernel how and where to receive messages. These data structures reside in user space, and no expensive kernel calls are necessary to change them. Therefore, they can be rapidly built and torn down as the communication needs of an application change.

Because these data structures reside in user space, the kernel must validate pointers and indices as it traverses them. This makes portals somewhat difficult to use directly, since the slightest error in setup forces the kernel to discard the incoming message. Most users, however, will not use portals directly, but will benefit from their presence in libraries.

Single Block

[Figure: Single Block Memory Descriptor]

Four types of memory descriptors can be used by an application to tell the kernel how and where data should be deposited. This method gives applications and libraries complete control over incoming messages. A memory descriptor is laid over the exact area of user memory where the kernel should put incoming data. Most memory copies can be avoided through the appropriate use of memory descriptors.

The least complex memory descriptor is for a single, contiguous region of memory. Senders can specify an offset within this block. This descriptor enables protocols where a number of senders cooperate and deposit their individual data items at specific offsets in the single block memory descriptor.

For example, the individual nodes of a parallel file server can read their stripes from disk and send them to the memory descriptor set up by the user's I/O library. The library does not need to know how many nodes the parallel server consists of, and the server nodes do not need to synchronize access to the user's memory.

Several options can be specified with a single block memory descriptor. In the parallel file server example, the offset into the memory descriptor is specified by the sender. Alternatively, the application that sets up the memory descriptor may control the offset. Instead of writing to the memory descriptor, other nodes have the option to read from it. It is also possible to have the kernel generate an acknowledgment when data is written to a portal.
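As a concrete, hedged sketch of these options, the structure below models a single block memory descriptor with flags for sender- versus receiver-managed offsets, read access, and acknowledgments. The field and flag names are invented for this example and are not the CPlant definitions.

    #include <stddef.h>

    /* Illustrative flags -- names are invented for this sketch. */
    #define MD_OFFSET_SENDER  0x1  /* sender supplies the offset in its header */
    #define MD_OFFSET_LOCAL   0x2  /* receiver advances the offset itself      */
    #define MD_ALLOW_READ     0x4  /* other nodes may read from this block     */
    #define MD_WANT_ACK       0x8  /* kernel generates an ack on each write    */

    typedef struct {
        void    *base;          /* start of the contiguous user buffer */
        size_t   length;        /* size of the buffer in bytes         */
        size_t   local_offset;  /* used only with MD_OFFSET_LOCAL      */
        unsigned flags;
    } single_block_md_t;

    /* Compute where an incoming message of 'len' bytes would be deposited.
     * Returns NULL if the write would run past the end of the block. */
    static void *sb_target(single_block_md_t *md, size_t sender_offset, size_t len)
    {
        size_t off = (md->flags & MD_OFFSET_SENDER) ? sender_offset
                                                    : md->local_offset;
        if (off > md->length || len > md->length - off)
            return NULL;                   /* does not fit: discard the message */
        if (md->flags & MD_OFFSET_LOCAL)
            md->local_offset = off + len;  /* advance for the next message      */
        return (char *)md->base + off;
    }

In the parallel file server example, each server node would use the sender-managed offset option and supply something like its stripe index times the stripe size as the offset, so the stripes land in the right places without any coordination among the server nodes.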

Independent Block

[Figure: Independent Block Memory Descriptor]

An independent block memory descriptor consists of a set of single blocks. Each block is written to or read from independently. That is, the first message will go into the first block, the second message into the second block, and so forth.

As with any memory descriptor, a message that does not fit is discarded and an error indicator is set on the receive side. This holds for each individual block in the independent block memory descriptor.

No offset is specified for this type of memory descriptor, but it is now possible to save the message header, the message body and header, or only the message body. The user also specifies whether the independent blocks should be used in a circular or linear fashion.
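The sketch below models an independent block memory descriptor as an array of blocks with a cursor that advances one block per message, either linearly or circularly. The type and field names are invented for this illustration.

    #include <stddef.h>

    /* Illustrative layout; names are not the CPlant definitions. */
    typedef enum { SAVE_BODY, SAVE_HEADER, SAVE_HEADER_AND_BODY } save_mode_t;

    typedef struct {
        void   *base;
        size_t  length;
        size_t  bytes_used;    /* 0 until a message is deposited here */
        int     error;         /* set if a message did not fit        */
    } block_t;

    typedef struct {
        block_t    *blocks;
        size_t      nblocks;
        size_t      next;      /* block the next incoming message will use   */
        int         circular;  /* wrap around instead of stopping at the end */
        save_mode_t mode;      /* header only, body only, or both            */
    } indep_block_md_t;

    /* Pick the block for the next incoming message: one message per block. */
    static block_t *ib_next_block(indep_block_md_t *md)
    {
        if (md->nblocks == 0)
            return NULL;
        if (md->next >= md->nblocks) {
            if (!md->circular)
                return NULL;           /* linear descriptor is exhausted */
            md->next = 0;              /* circular: reuse from the start */
        }
        return &md->blocks[md->next++];
    }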

Combined Block

[Figure: Combined Block Memory Descriptor]

A combined block memory descriptor is almost the same as an independent block memory descriptor. The difference is that data can flow from the end of one block into the next one in the list. A single message long enough to fill all blocks in a combined block memory descriptor will be scattered across all blocks. If the memory descriptor is read from, it can be used in gather operations.
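The loop below sketches the scatter behavior for the write case; memcpy stands in for the DMA transfer the kernel would actually perform, and all names are invented for this example.

    #include <stddef.h>
    #include <string.h>

    /* Minimal block shape for this sketch (hypothetical names). */
    typedef struct {
        void  *base;
        size_t length;
    } cb_block_t;

    /* Scatter 'len' bytes of message body across consecutive blocks, letting
     * data flow from the end of one block into the next.  Returns the number
     * of bytes actually deposited. */
    static size_t cb_scatter(cb_block_t *blocks, size_t nblocks,
                             const void *body, size_t len)
    {
        const char *src = body;
        size_t copied = 0;

        for (size_t i = 0; i < nblocks && copied < len; i++) {
            size_t chunk = len - copied;
            if (chunk > blocks[i].length)
                chunk = blocks[i].length;
            memcpy(blocks[i].base, src + copied, chunk);
            copied += chunk;
        }
        return copied;   /* a result smaller than 'len' means the message did not fit */
    }

The gather (read) case is symmetric: the same walk collects data from consecutive blocks into a single outgoing message.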

Dynamic Block

[Figure: Dynamic Memory Descriptor]

The last memory descriptor is the dynamic memory descriptor. Here, the user specifies a region of memory and the kernel treats it as a heap. For each incoming message, the kernel allocates enough memory out of this heap to deposit the message.

This memory descriptor is not as fast as the others, but it is very convenient to use if a user application cannot predict the exact sequence, the number, or the type of messages that will arrive. It is the user's responsibility to remove messages from the heap that are no longer needed.
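A minimal sketch of the per-message allocation follows, assuming the region behaves like a heap the kernel carves up. A simple bump allocator stands in for the real allocator, and the freeing of consumed messages (which the real heap must support so the user can remove them) is omitted. All names are invented.

    #include <stddef.h>

    /* Hypothetical dynamic descriptor: the user hands the kernel a region and
     * the kernel allocates a piece of it for every incoming message. */
    typedef struct {
        char   *heap;   /* user-supplied region     */
        size_t  size;   /* total size of the region */
        size_t  used;   /* bytes already handed out */
    } dynamic_md_t;

    static void *dyn_alloc(dynamic_md_t *md, size_t msg_len)
    {
        size_t need = (msg_len + 7) & ~(size_t)7;   /* keep 8-byte alignment */
        if (need > md->size - md->used)
            return NULL;                            /* heap is full: discard */
        void *p = md->heap + md->used;
        md->used += need;
        return p;
    }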

Matching Lists

A matching list can be inserted in front of any memory descriptor. This list allows the kernel to screen incoming messages and put them into a memory descriptor only if a message matches the criteria specified by the user.

Matching occurs on the source group identifier, the source group rank, and 64 match bits. A 64-bit mask selects which of the match bits an incoming message must match. The source group identifier and the source group rank can be wild-carded.
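A small sketch of this test follows, with invented names; the wildcard encoding shown here is an assumption made for the example.

    #include <stdbool.h>
    #include <stdint.h>

    #define ANY_GROUP ((uint32_t)-1)   /* wildcard values, invented */
    #define ANY_RANK  ((uint32_t)-1)   /* for this sketch           */

    typedef struct {
        uint32_t src_group;    /* required source group identifier, or ANY_GROUP */
        uint32_t src_rank;     /* required source group rank, or ANY_RANK        */
        uint64_t match_bits;   /* value the selected bits must equal             */
        uint64_t mask;         /* 1-bits select which match bits are compared    */
    } match_criteria_t;

    static bool msg_matches(const match_criteria_t *c,
                            uint32_t group, uint32_t rank, uint64_t bits)
    {
        if (c->src_group != ANY_GROUP && c->src_group != group) return false;
        if (c->src_rank  != ANY_RANK  && c->src_rank  != rank)  return false;
        return (bits & c->mask) == (c->match_bits & c->mask);
    }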

The matching list consists of a series of entries. Each entry points to a memory descriptor into which the message is deposited if a match occurs. The entries are triply linked: if there is no match, the kernel follows the first link to the next match list entry to be checked; if a match occurs but the message is too long to fit into the memory descriptor, the kernel follows the second link; and if the memory descriptor is not valid, the kernel follows the third link.

Building a matching list with the appropriate set of links and memory descriptors allows the implementation of many message passing protocols.
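The sketch below shows a triply linked entry and the traversal just described. It is kept self-contained, so it uses simplified stand-ins for the matching criteria and the memory descriptor; all names are invented for the illustration.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-ins so the sketch compiles on its own. */
    typedef struct { uint64_t match_bits, mask; } criteria_t;
    typedef struct { size_t space_left; } mem_desc_t;

    static bool crit_matches(const criteria_t *c, uint64_t bits)
    { return (bits & c->mask) == (c->match_bits & c->mask); }

    static bool md_fits(const mem_desc_t *md, size_t len)
    { return len <= md->space_left; }

    typedef struct match_entry match_entry_t;
    struct match_entry {
        criteria_t     crit;
        mem_desc_t    *md;           /* NULL models an invalid descriptor        */
        match_entry_t *no_match;     /* followed when the criteria do not match  */
        match_entry_t *too_long;     /* followed when the message does not fit   */
        match_entry_t *md_invalid;   /* followed when the descriptor is invalid  */
    };

    /* Kernel-side walk: the link taken depends on why an entry was rejected. */
    static mem_desc_t *ml_traverse(match_entry_t *e, uint64_t bits, size_t len)
    {
        while (e != NULL) {
            if (!crit_matches(&e->crit, bits))  e = e->no_match;
            else if (e->md == NULL)             e = e->md_invalid;
            else if (!md_fits(e->md, len))      e = e->too_long;
            else return e->md;                  /* deposit the message here */
        }
        return NULL;                            /* no usable entry: discard */
    }

Because each rejection reason has its own link, a single list can route unmatched messages, oversized messages, and messages that arrive while a descriptor is invalid down different paths.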

Portal Example

[Figure: Portal Example]

The above figure shows how the elements described in earlier sections can be combined to implement a message passing protocol.

Receive requests preposted by the user are inserted into the matching list. When a message arrives, the kernel goes through the matching list and tries to pair the message with one of these earlier receive requests. If a match is found, the message is deposited directly into the memory specified by the user.

If the user has not posted a receive yet, the search fails and the kernel reaches the last entry in the matching list. This entry points to a dynamic memory descriptor that serves as a large buffer for unmatched incoming messages. When the user issues a receive, this buffer is searched first (from user level). If nothing appropriate is found, the receive criteria are inserted into the matching list.

More complex and robust protocols can be built. For example, instead of storing whole messages in the dynamic memory descriptor and possibly filling it up very quickly, a second dynamic memory descriptor can be added at the end of the matching list. If the first one fills up, the kernel continues down the matching list and saves only the message header in the second dynamic memory descriptor. When a receive is posted for one of these messages, the protocol can then request that the body of the message be sent again.
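A minimal user-level model of this receive path is sketched below, assuming a helper that searches the unexpected-message buffer; every name here (recv_state_t, post_receive, search_unexpected) is hypothetical and only meant to show the order of operations.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct posted_recv {
        uint64_t            match_bits;  /* criteria for the expected message  */
        void               *user_buf;    /* where the kernel should deposit it */
        size_t              len;
        struct posted_recv *next;
    } posted_recv_t;

    typedef struct {
        posted_recv_t *preposted;   /* matching-list entries for preposted receives */
        /* user-level search of the dynamic descriptor holding unmatched messages */
        bool (*search_unexpected)(uint64_t match_bits, void *buf, size_t len);
    } recv_state_t;

    /* User-level receive: first look for an already-arrived message in the
     * unexpected buffer; only if that fails, prepost the receive so the kernel
     * can later deposit the message directly into the user's buffer. */
    static void post_receive(recv_state_t *st, posted_recv_t *rr)
    {
        if (st->search_unexpected &&
            st->search_unexpected(rr->match_bits, rr->user_buf, rr->len))
            return;                   /* the message was already here */

        rr->next = st->preposted;     /* otherwise insert into the matching list */
        st->preposted = rr;
    }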

Puma Portals

Work in Progress.

Linux Portals

Work in Progress.