Designing a Cluster Computer

Choosing a processor

The first step in designing a cluster is to choose the building block. The processing power, memory, and disk space of each node, as well as the communication bandwidth between the nodes, are all factors you can choose. You will need to decide which matter most based on the mixture of applications you intend to run on the cluster and the amount of money you have to spend.

PCs running Linux are by far the most common choice. They provide the best performance for the price at the moment, offering good CPU speed with cheap memory and disk space. They have smaller L2 caches than some more expensive workstations, which can limit SMP performance, and less main-memory bandwidth, which can limit performance for applications that do not reuse the data cache well. The availability of 64-bit PCI-X slots and memory up to 16 GBytes removes several bottlenecks, but new 64-bit architectures will still perform better for large-memory applications.

For applications that require more network performance than Gigabit Ethernet can provide, more expensive workstations may be the way to go. You will have fewer but faster nodes, requiring less overall communication, and the memory subsystem can support communication rates upwards of 200-800 MB/sec.

When in doubt, it is always a good idea to benchmark your code on the machines you are considering. If that is not possible, there are many generic benchmarks you can look at to help you decide. The HINT benchmark developed at the SCL, or a similar benchmark based on the DAXPY kernel shown below, shows the performance of each processor across a range of problem sizes.
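
A minimal version of such a DAXPY test is sketched below. It is only an illustration, not the HINT benchmark itself: the problem sizes, repetition counts, and timing method are arbitrary choices. It reports MFLOPS for the kernel y = a*x + y as the vectors grow from cache-sized to memory-sized, which is enough to see the general shape of the curve discussed next.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* DAXPY kernel: y = a*x + y */
    static void daxpy(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        int n;

        /* Sweep vector sizes from well inside cache to well beyond it. */
        for (n = 1024; n <= 8 * 1024 * 1024; n *= 2) {
            double *x = malloc(n * sizeof(double));
            double *y = malloc(n * sizeof(double));
            int i, r, reps;
            clock_t start;
            double secs, mflops;

            if (x == NULL || y == NULL) {
                fprintf(stderr, "out of memory\n");
                return 1;
            }
            for (i = 0; i < n; i++) {
                x[i] = 1.0;
                y[i] = 2.0;
            }

            /* Repeat enough times to get a measurable interval. */
            reps = (64 * 1024 * 1024) / n;

            start = clock();
            for (r = 0; r < reps; r++)
                daxpy(n, 3.0, x, y);
            secs = (double)(clock() - start) / CLOCKS_PER_SEC;

            /* Two flops (one multiply, one add) per element per repetition. */
            mflops = 2.0 * (double)n * reps / (secs * 1.0e6);
            printf("n = %8d   %8.1f MFLOPS   (y[0] = %g)\n", n, mflops, y[0]);

            free(x);
            free(y);
        }
        return 0;
    }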

If your application uses little memory, or heavily reuses the data cache, it will operate mainly on the left side of the graph. Here the clock rate is important, and the choice of compiler can make a big difference. If your application is large and does not reuse data much, the right side will be more representative and the memory speed will be the dominant factor.

Designing the network

Along with the basic building block, you will need to choose the fabric that connects the nodes. As explained above, this depends greatly on the applications you intend to run, the processors you choose, and how much money you have to spend.

Gigabit Ethernet is clearly the cheapest option. If your application can get by with a lower level of communication performance, it is inexpensive and reliable, but it scales only to around 24 nodes using a flat switch (a completely connected cluster, no topology).

There are many custom network options, such as Myrinet and Giganet. These are costly, but they provide the best performance at the moment. They do not scale well, and will therefore force you into a multilayered network topology of some sort. Don't go this route unless you know what you're doing.

Network           Bandwidth     Approximate cost        Scaling
Fast Ethernet     11.25 MB/sec  ~free                   > 100 nodes
Gigabit Ethernet  ~110 MB/sec   ~$100 / machine         up to ~24 nodes
Myrinet           ~200 MB/sec   > $1000 / machine       stackable small switches
SCI               150? MB/sec   $1100 / 4-port card     2D mesh
InfiniBand        ~800 MB/sec   $900/card + $1000/port  limited to small switches now

NetPIPE graphs can be very useful in characterizing the different network interconnects, at least from a point-to-point view. These graphs show the maximum bandwidth and the effect of latency pulling the performance down for smaller message sizes. They can also be very useful for fine-tuning a system, from the hardware to the drivers to the message-passing layer.
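
If NetPIPE is not at hand, a crude ping-pong test like the sketch below gives a feel for the same point-to-point behavior: bandwidth for a given message size, with latency dominating at small sizes. It is not NetPIPE itself; it simply bounces messages of increasing size between two MPI processes (MPI is discussed below), and the message sizes and repetition counts are arbitrary choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    /* Crude ping-pong test between ranks 0 and 1; run with 2 processes. */
    int main(int argc, char **argv)
    {
        int rank, size, msgsize, r;
        const int reps = 100;
        char *buf;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0)
                fprintf(stderr, "needs at least 2 processes\n");
            MPI_Finalize();
            return 1;
        }

        buf = malloc(1 << 22);                 /* up to 4 MB messages */

        for (msgsize = 1; msgsize <= (1 << 22); msgsize *= 4) {
            double start, elapsed;

            MPI_Barrier(MPI_COMM_WORLD);
            start = MPI_Wtime();

            for (r = 0; r < reps; r++) {
                if (rank == 0) {
                    MPI_Send(buf, msgsize, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, msgsize, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
                } else if (rank == 1) {
                    MPI_Recv(buf, msgsize, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                    MPI_Send(buf, msgsize, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }

            elapsed = MPI_Wtime() - start;
            if (rank == 0) {
                /* Each repetition moves the message both ways. */
                double mb_per_sec = 2.0 * msgsize * reps / (elapsed * 1.0e6);
                printf("%8d bytes   %8.2f MB/sec\n", msgsize, mb_per_sec);
            }
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Run it with two processes, one on each of two nodes, and compare the numbers for large messages against the table above.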

Which OS?

The choice of an OS is largely dictated by the machine that you choose. Linux is always an option on any machine, and is the most common choice. Many of the cluster computing tools were developed under Linux. Linux, and many compilers that run on it, are also available free.

With all that being said, there are PC clusters running Windows NT, IBM clusters running AIX, and we have even built a G4 cluster running Linux.

Loading up the software

I would recommend choosing one MPI implementation and going with it. PVM is still around, but MPI is the way to go (IMHO). LAM/MPI is distributed as an RPM, so it is the easiest to install, and it also performs reasonably well on clusters.
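
Once an implementation is installed, a minimal program like the sketch below is a quick sanity check. It assumes only the standard MPI C calls and the usual mpicc/mpirun wrappers.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal MPI test: each process reports its rank and host. */
    int main(int argc, char **argv)
    {
        int rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &namelen);

        printf("Process %d of %d running on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }

Compile it with mpicc and launch it across the nodes with mpirun (after lamboot, in the case of LAM/MPI); if every node reports in, the message-passing layer is working.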

There are many free compilers available, and the availability will of course depend on the OS you choose. For PCs running Linux, the GNU compilers are acceptable. The Intel compilers provide better performance in most cases for the Intel processors, and the pricing is reasonable. The Intel or PGI compilers may help on the AMD processors; however, the cluster licenses for the PGI compilers are prohibitively expensive at this point. For Linux on the Alpha processors, Compaq freely distributes the same compilers that are available under Tru64 Unix.

There are also many parallel libraries available, such as ScaLAPACK. For Linux PCs, you may also want to install an optimized BLAS library such as the Intel MKL or the one Sandia developed.
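
To illustrate what using such a library looks like, the sketch below multiplies two matrices through the standard CBLAS interface. The header name and link flags are assumptions that vary with the library you install (for example, cblas.h with ATLAS or mkl.h with the Intel MKL), so adjust them to match your setup.

    #include <stdio.h>
    #include <cblas.h>   /* header name varies by BLAS vendor */

    /* C = alpha*A*B + beta*C through the standard CBLAS dgemm routine. */
    int main(void)
    {
        enum { N = 512 };
        static double A[N * N], B[N * N], C[N * N];
        int i;

        for (i = 0; i < N * N; i++) {
            A[i] = 1.0;
            B[i] = 2.0;
            C[i] = 0.0;
        }

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N,       /* dimensions m, n, k          */
                    1.0, A, N,     /* alpha, A, leading dimension */
                    B, N,          /* B, leading dimension        */
                    0.0, C, N);    /* beta, C, leading dimension  */

        /* With these inputs every element of C should equal 2.0 * N. */
        printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0 * N);
        return 0;
    }

With a tuned BLAS the same call can run several times faster than an untuned reference implementation, which is the whole point of installing one.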

If you have many users on a cluster, it may be worthwhile to install a queueing system. PBS (the Portable Batch System) is currently the most advanced and is under heavy development. DQS can also handle multiprocessor jobs, but it is not quite as efficient.

You will also want users to have a quick view of the status of the cluster as a whole. Several status monitors are freely available, such as statmon, which was developed locally. None are yet where I'd like them to be, although commercial versions give a more active and interactive view.

Assembling the cluster

A freestanding rack costs around $100, and can hold 16 PCs. If you want to get fancier and reduce the footprint of your system, most machines can be ordered with rackmount attachments.

You will also need a way to connect a keyboard and monitor to each machine for when things go wrong. You can do this manually, or spend a little money on a KVM (keyboard, video, mouse) switch that makes it easy to access any computer.

Pre-built clusters

If you have no desire to assemble a system yourself, there are many vendors who sell complete clusters built to your design. These are 1U or 2U rackmounted nodes pre-configured to your specifications. They are compact, easy to set up and maintain, and usually come with good custom tools such as web-based status monitors. The price really isn't too much more than building your own systems now. Below is a partial list of vendors; others are listed at http://www.beowulf.org/beowulf/vendors/ .

Vendor         Processors                              Interconnects
Atipa          Xeon, Athlon, Opteron, Itanium2         Gigabit Ethernet, Myrinet, SCI
Microway       Xeon, Athlon, Opteron, Itanium2, Alpha  Gigabit Ethernet, Myrinet
RackSaver      Xeon, Athlon, Opteron, Itanium2         Gigabit Ethernet, Myrinet, SCI, InfiniBand
Aspen Systems  Xeon, Athlon, Opteron, Itanium2         Gigabit Ethernet, Myrinet
Einux          Xeon, Opteron                           Gigabit Ethernet, Myrinet, SCI, Quadrics, InfiniBand
eRacks         P4/Xeon, Athlon, Opteron                Gigabit Ethernet, ???
Linux Networx  Xeon, Athlon, ???                       Dual Fast Ethernet, Gigabit Ethernet, Myrinet, Quadrics

Cluster administration

With large clusters, it is common to have a dedicated master node that is the only machine connected to the outside world. This machine then acts as the file server and the compile node. This provides a single-system image to the user, who launches jobs from the master node without ever logging into the compute nodes.

There are boot disks available that can help in setting up the individual nodes of a cluster. Once the master is configured, these boot disks can be set up to perform a complete system installation on each node over the network. Most cluster administrators also develop other utilities, such as scripts that run a command on every node in the cluster. The rdist utility can also be very helpful.

If you purchase a cluster from a vendor, it should come with software installed to make it easy to use and maintain the system. If you build your own system, there are some software packages available to do the same. OSCAR is a fully integrated software bundle designed to make it easy to build a cluster. Scyld Beowulf is a commercial package that enhances the Linux kernel providing system tools that produce a cluster with a single system image.

If set up properly, a cluster can be relatively easy to maintain. The operations that you would normally do on a single machine simply need to be replicated across many machines. If you have a very large cluster, you should keep a few spare machines to make it easy to recover from hardware problems.



