Cluster Price/Performance Trends

Introduction

The price/performance of clusters installed at Fermilab for lattice QCD has fallen steadily over the last five years, with a halving time of around 1.2 years, as shown by the solid line in the graph below. Product roadmaps provided by vendors of system components make clear that this trend is likely to continue for the next several years. In this document, we discuss benchmarks leading to current performance expectations, and the roadmaps leading to expectations for future trends that are shown by the dotted line and the right-hand blue points in the graph below.
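
A halving time of 1.2 years corresponds to a simple exponential trend in price/performance:

    price/performance(t) = price/performance(t0) x 2^-((t - t0) / 1.2 years)

As a rough consistency check against the tables below, the $4.1/MFlop measured in October 2002 extrapolates to approximately 4.1 x 2^-(1.5/1.2), or about $1.7/MFlop, by Spring 2004, close to the $1.58/MFlop projected below for the cluster to be purchased then.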

To make the projections specific and concrete, we focus on single-processor Pentium 4 nodes and the Infiniband communications fabric. A number of other processor and network choices are available, however. Among processors, the other candidates are Intel Xeon processors, which are dual- (SMP) and higher- (MP) capable; the AMD Opteron and the forthcoming Intel (code name "Yamhill") x86 processors with 64-bit extensions; the Intel Itanium 64-bit processor family; and the IBM PowerPC 970, also known as the Apple G5. Alternative network choices include switched gigabit ethernet, gigabit ethernet meshes, Myrinet, SCI, and Quadrics.

[Notes on the graph above. See also the performance tables below.

  • The points near 1999 correspond to Steve Gottlieb's CANDYCANE cluster, Sandia's Roadrunner cluster, and an 8-node Myrinet cluster at Fermilab. MILC Staggered code was run on the first two clusters, and MILC Wilson code on the Fermilab cluster. The point near 2001 is the Fermilab 80-node Pentium III cluster, running MILC improved staggered code. The point near 2003 is the Fermilab 128-node 2.4 GHz Xeon cluster, again running MILC improved staggered. All projected points (blue, with error bars) are assumed to be Pentium 4E ("Prescott") clusters running improved staggered code with software prefetching.
  • All MILC improved staggered code results include the use of inline SSE routines which replace the standard MILC "C"-language SU3 matrix-vector multiplies in the dslash routines (a schematic C version of this site-local operation is given after these notes). Note that these are "Level 1" (QLA) optimizations in the parlance of the SciDAC software effort - that is, optimizations of the matrix algebra at individual sites. Further performance gains would result from the use of "Level 2" (QDP) or "Level 3" (full assembly language inverter) optimizations.
  • The two QCDOC points were obtained from articles from the CHEP'03 and Lattice'03 conferences, available at this QCDOC site (hep-lat/0309096 and hep-lat/0306023). The green point corresponds to the performance of standard MILC "C"-language single precision code. The red point corresponds to the performance of an assembly language asqtad inverter (i.e., Level 3 code). Note that this assembly language inverter is double precision, and that all other points on the graph represent single precision performance.
  • Local volumes for the QCDOC points are 4^4. Local volumes for the cluster points are 14^4. For scaling behaviour of MILC asqtad code with different local volumes, see this link.
]
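
The "Level 1" SU3 matrix-vector multiply referred to in the notes above is, in plain C, an operation of the following form. This is an illustrative sketch only; the type and function names are placeholders and are not the actual MILC definitions.

    /* Illustrative single-precision complex types and SU3 matrix-vector
       multiply.  This is a plain-C sketch of the site-local operation
       w = U * v that the inline SSE routines optimize; the names do not
       match the MILC source. */
    typedef struct { float real, imag; } complex_f;
    typedef struct { complex_f e[3][3]; } su3_matrix_f;   /* 3x3 complex matrix */
    typedef struct { complex_f c[3];    } su3_vector_f;   /* 3-component complex vector */

    /* w = U * v : in the usual counting, 9 complex multiplies plus
       6 complex adds, i.e. 66 floating point operations per call. */
    static void mult_su3_mat_vec_f(const su3_matrix_f *U,
                                   const su3_vector_f *v,
                                   su3_vector_f *w)
    {
        for (int i = 0; i < 3; i++) {
            float re = 0.0f, im = 0.0f;
            for (int j = 0; j < 3; j++) {
                re += U->e[i][j].real * v->c[j].real - U->e[i][j].imag * v->c[j].imag;
                im += U->e[i][j].real * v->c[j].imag + U->e[i][j].imag * v->c[j].real;
            }
            w->c[i].real = re;
            w->c[i].imag = im;
        }
    }

Level 1 optimizations replace exactly this kind of per-site routine; Level 2 and Level 3 optimizations work on whole fields and on the full inverter, respectively.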

Benchmarks

See this link for references to a number of benchmarks performed with MILC code on various platforms.

Intel ia32 Single Node Performance

All of the performance data reported on this page refer to the improved staggered inverter, reported as "CONGRAD" in the MILC code.

On recent ia32 processors, single node performance for problems that reside in main memory (rather than cache) is dictated by memory bandwidth. The plot below shows performance on the following processors:

  • 2.4 GHz Xeon with 400 MHz FSB ("front side bus" = effective frequency of transfers on the 64-bit wide data bus connecting the cpu to main memory)
  • 2.4 GHz Xeon with 533 MHz FSB
  • 2.8 GHz Pentium 4 with 800 MHz FSB
  • 2.8 GHz Pentium 4E ("Prescott") with 800 MHz FSB. The same binary was used on the three processors listed above; on the Prescott, a different binary that included software prefetching was used, because the software prefetching code gives no performance enhancement on non-Prescott ia32 processors. A schematic example of this kind of prefetching is shown below.
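
The software prefetching mentioned above works by issuing prefetch instructions for data a few sites ahead of the site currently being processed, hiding main memory latency. The fragment below is a minimal sketch using the SSE prefetch intrinsic; the data layout, prefetch distance, and cache hint are illustrative placeholders, not the values used in the actual benchmark binary.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    #define PF_DIST 4        /* illustrative prefetch distance, in sites */

    typedef struct { float v[24]; } site_t;   /* placeholder per-site data */

    static void compute_site(site_t *s)
    {
        /* stand-in for the per-site dslash arithmetic */
        for (int k = 0; k < 24; k++)
            s->v[k] *= 2.0f;
    }

    /* Process an array of per-site data, prefetching ahead so that the
       loads for site i + PF_DIST are in flight while site i is computed. */
    void process_sites(site_t *sites, int nsites)
    {
        for (int i = 0; i < nsites; i++) {
            if (i + PF_DIST < nsites)
                _mm_prefetch((const char *)&sites[i + PF_DIST], _MM_HINT_T0);
            compute_site(&sites[i]);
        }
    }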

ia32 Cluster Performance

The table below shows performance of the MILC improved staggered code running on clusters, using Steve Gottlieb's standard MILC benchmarking procedure (constant volume per node). For the P4 and P4E, the performance is estimated by using the same ratio of single node to cluster performance seen on the 128 Xeon node cluster. All data are for a lattice size of 14^4 per node. For the scaling behavior of MILC asqtad code running on clusters with different local volumes, see this link.

Single Node and Cluster Performance of MILC Staggered Code
Processor                   Network        Single Node Performance   Cluster Node Count   Cluster Performance (per node)   Cluster:Single Ratio
2.4 GHz Xeon, 400 MHz FSB   Myrinet 2000   783 MFlops                128                  588 MFlops                       0.751
2.4 GHz Xeon, 533 MHz FSB   Infiniband     889 MFlops                16                   691 MFlops                       0.777
2.8 GHz P4, 800 MHz FSB     --             1285 MFlops               --                   964 MFlops (est.)                0.75 (est.)
2.8 GHz P4E, 800 MHz FSB    --             1665 MFlops               --                   1249 MFlops (est.)               0.75 (est.)
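
The estimated entries in the table are obtained by applying the measured cluster-to-single-node ratio to the single node numbers: 0.75 x 1285 MFlops ≈ 964 MFlops for the P4, and 0.75 x 1665 MFlops ≈ 1249 MFlops for the P4E.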

Network Performance

In Spring 2004 Fermilab will purchase 128 P4E nodes and will reuse one of our existing Myrinet 2000 fabrics. These nodes will use motherboards with 64-bit, 66 MHz PCI-X slots for the Myrinet I/O cards. The red lines in the graph below show Steve Gottlieb's performance model, which estimates the network bandwidth required to sustain a given level of floating point performance in the dslash routine. Note that the version of the model used here is for standard staggered code - that is, only nearest-neighbor fetches are done. Naively, improved staggered requires four times the communication and twice the flops; carefully written code would need only three times the communication, rather than four, because the nearest-neighbor fetches can be reused for the next-next-nearest-neighbor fetch of the lattice point two sites away in a given dimension. For improved staggered, then, a conservative modification of this plot is simply to divide the MF (MFlops) labels on the straight lines by two.
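
The essential content of such a model can be sketched as follows. For a local volume of L^4 with nearest-neighbor gathers only, the bytes moved per dslash application scale with the surface (L^3) while the flops scale with the volume (L^4), so the required bandwidth per node is roughly proportional to the sustained flop rate divided by L. The C fragment below encodes this surface-to-volume estimate; the per-site byte and flop counts are assumptions chosen for illustration, not the exact constants of the model plotted in the graph.

    /* Rough surface-to-volume estimate of the receive bandwidth (MB/sec)
       needed to sustain a given flop rate in a nearest-neighbor staggered
       dslash on a local L^4 volume.  The constants are illustrative
       assumptions, not the exact values used in the model in the graph. */
    #include <stdio.h>

    static double required_recv_MBps(double sustained_mflops, int L)
    {
        const double bytes_per_gather = 24.0;   /* 3 complex floats, single precision */
        const double flops_per_site   = 570.0;  /* assumed flops/site, naive staggered dslash */

        double vol  = (double)L * L * L * L;    /* local volume                */
        double surf = 8.0 * (double)L * L * L;  /* 2 faces x 4 dims x L^3 sites */
        double bytes_per_flop = (surf * bytes_per_gather) / (flops_per_site * vol);

        /* MFlops/sec x bytes/flop = MBytes/sec */
        return sustained_mflops * bytes_per_flop;
    }

    int main(void)
    {
        /* e.g. 1665 MFlops sustained on a 14^4 local volume */
        printf("~%.0f MB/sec receive bandwidth\n", required_recv_MBps(1665.0, 14));
        return 0;
    }

With these illustrative constants, 1665 MFlops on a 14^4 local volume works out to a few tens of MBytes/sec of receive bandwidth per node for the nearest-neighbor case, and roughly twice that for improved staggered - comfortably below the PCI-X limit discussed below, which is the sense in which the reused Myrinet fabric is judged sufficient.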

In the graph, the green and blue lines show the measured bandwidths of the Myrinet 2000 network we will use on this cluster and of an Infiniband network. We used the Pallas MPI "sendrecv" test, a common I/O benchmark, on motherboards identical to those we will purchase. Note that the model uses only the receive bandwidth, while on real systems sends occur simultaneously; we therefore used the "sendrecv" benchmark to assess performance conservatively in this regime, plotting half of the measured "sendrecv" bandwidth. At the measured value of 1665 MFlops for 14^4 on the 2.8 GHz P4E, the model predicts that our Myrinet network should have sufficient bandwidth. Note that the PCI-X performance of these motherboards is disappointing: aggregate bidirectional bandwidth saturates at about 200 MB/sec, compared with the more than 700 MB/sec we have observed on Xeon motherboards. The PCI Express I/O slots in the motherboards available next summer should have better bandwidth than the Xeon motherboards, as well as improved latencies. The performance of the full inverter also depends on collective operations, which are much more sensitive to communication latency. (Note: "Torre Pines" is the Intel pre-release code name for the specific P4E motherboard used in the network measurements shown in the graph.)
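
The measurement itself is conceptually simple. The fragment below is a minimal stand-in for the Pallas "sendrecv" test (it is not the Pallas code): two ranks exchange fixed-size messages with MPI_Sendrecv in a timed loop, and the aggregate bidirectional bandwidth is the total bytes moved divided by the elapsed time. As in the plot, half of this figure is the conservative per-direction number.

    /* Minimal bidirectional bandwidth test in the spirit of the Pallas
       "sendrecv" benchmark (illustrative stand-in).  Run with 2 MPI ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 1 << 20;   /* 1 MB messages */
        const int reps   = 100;
        int rank, size;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) {
            if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        char *sendbuf = malloc(nbytes);
        char *recvbuf = malloc(nbytes);
        memset(sendbuf, 0, nbytes);
        int peer = (rank == 0) ? 1 : 0;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            /* each rank sends and receives simultaneously */
            MPI_Sendrecv(sendbuf, nbytes, MPI_BYTE, peer, 0,
                         recvbuf, nbytes, MPI_BYTE, peer, 0,
                         MPI_COMM_WORLD, &status);
        }
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0) {
            /* bytes sent plus bytes received per rank, per unit time */
            double aggregate_MBps = 2.0 * nbytes * reps / elapsed / 1.0e6;
            printf("aggregate bidirectional bandwidth: %.1f MB/sec\n", aggregate_MBps);
        }

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }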

The graph shows that present Infiniband performance is capable of keeping up with very fast processors. Infiniband also has a well-developed roadmap for improving performance over the next couple of years.

Cluster Price/Performance

The table below shows the measured price/performance of clusters used since 1998. The cost per node listed includes the network cost.

Price/Performance Measurements
Date        Cluster Size   Processor             Network         Sustained Performance   Cost/Node   Price/Performance
Nov 1998    32 nodes       350 MHz Pentium II    Fast Ethernet   50 MFlops/node          $2000       $40/MFlop
July 1999   64 nodes       450 MHz Pentium II    Fast Ethernet   63 MFlops/node          $2200       $35/MFlop
July 1999   8 nodes        400 MHz Pentium II    Myrinet         77 MFlops/node          $4170       $54/MFlop
Nov 2000    80 nodes       700 MHz Pentium III   Myrinet         185 MFlops/node         $2800       $15.1/MFlop
Oct 2002    128 nodes      2.4 GHz Xeon          Myrinet         780 MFlops/node         $3200       $4.1/MFlop

Predicted Cluster Price/Performance

The table below shows probable price/performance of future lattice clusters.

Price/Performance Predictions
Date          Cluster Size   Processor      Sustained Performance   Node Cost   Per-Node Network Cost   Price/Performance
Spring 2004   128            2.8 GHz P4E    1.2 GFlops/node         $900        $1000                   $1.58/MFlop
Late 2004     256            3.2 GHz P4E    1.4 GFlops/node         $900        $1000                   $1.36/MFlop
Late 2005     512            4.0 GHz P4E    1.9 GFlops/node         $900        $900                    $0.95/MFlop
Late 2006     512            5.0 GHz P4E    3.0 GFlops/node         $900        $500                    $0.47/MFlop
Discussion:
  • Spring 2004: Justified by the data presented above. We have firm price estimates for both the node cost (with 1 GB of memory per node) and the network cost (Infiniband), though we will not purchase a new network because we will reuse our existing Myrinet 2000 fabric.
  • Late 2004: The performance boost results from the faster processor and the improved bandwidth of the Infiniband network.
  • Late 2005: The performance boost results from the faster processor and improved processor memory bandwidth (1066 MHz FSB). We estimate a net Infiniband cost reduction - even though a cascaded network will likely be required, Infiniband list prices will likely have dropped, and there should be price incentives for this large a purchase.
  • Late 2006: The performance boost results from the faster processor and improved memory bandwidth (1066+ MHz FSB and fully buffered DIMM technology). By this time Infiniband will have either succeeded in the market, or an alternative will have arrived. It is likely that an integrated high performance network (i.e., a chip on the motherboard instead of a separate card) will drive the per node price down significantly. We have already heard of engineering designs for embedded Infiniband in late 2004.
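
In each row of the prediction table, the price/performance figure is simply the total cost per node (node plus network) divided by the sustained per-node performance; for Spring 2004, for example, ($900 + $1000) / 1200 MFlops ≈ $1.58/MFlop.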

Scalability

The network fabric exemplar used here, Infiniband, has the following characteristics, which enable scaling to large clusters (on the order of thousands of nodes) built from processors available in the next five years:
  • Bandwidth: Infiniband currently supports 4X and 12X links. All connections to computers are made through "host channel adapters" (HCA's) over 4X links, which support communications at 10 Gbit/sec in each direction; most HCA's on the market provide two independent 4X ports. In practice, PCI-X HCA's in Xeon systems exhibit peak bandwidths of 700 MBytes/sec, limited by the PCI-X bus rather than by the Infiniband fabric. PCI Express HCA's have been announced, and PCI Express motherboards will be available in Summer 2004. Over 8X PCI Express, Infiniband HCA's will exhibit bandwidths near the full 4X wire rate of 10 Gbit/sec simultaneously over both ports, or better than 4 GByte/sec aggregate bidirectional bandwidth.

    12X Infiniband links are available now and can be used to interconnect switches. These links will deliver 30 Gbit/sec bandwidth simultaneously in each direction. In a few years, when processors and memory buses have increased in performance, 12X Infiniband links will be available on HCA's.

    Note in the dslash network model graph for the Torre Pines motherboard, shown above, that the Infiniband performance is limited by the poor PCI implementation to 200 MByte/sec aggregate bidirectional bandwidth. The factor of five increase in bandwidth that will come with the PCI Express implementations available in Summer 2004 will easily allow sustained dslash computations at 10 GFlops, far in excess of the available processor performance (perhaps 2 GFlops sustained per cpu by mid 2004).

  • Fabric size: Most Infiniband switches are now based on the 24-port crossbar silicon available from Mellanox (Infiniscale III). This chip has driven the list price per port of 24-port Infiniband switches down to $300. Announced 144-port switches, available in Summer 2004, will have a list price per port of about $450. These switches can be cascaded to build larger fabrics. Examples of fabrics based on the older 8-port Mellanox silicon include the 1100-node Virginia Tech G5 cluster announced in Fall 2003, which uses twenty-four 96-port switches; the 192-node multi-vendor demonstration cluster at SC'2003; a 128-node cluster at Sandia; and an announced 512-node cluster at Riken in Japan.

    Currently, practical fabric sizes are limited to on the order of one thousand nodes by limitations in Infiniband software subnet managers. The principal Infiniband switch manufacturers all have subnet managers capable of handling fabrics of many thousands of nodes on their development roadmaps.

  • Latency: In Summer 2003, on a test cluster, we measured small-message MPI latency over Infiniband at 7 microseconds. Subsequent improvements in HCA drivers have lowered this figure to less than 6 microseconds. PCI Express is expected to lower latencies further, perhaps by 1 to 2 microseconds. A minimal ping-pong measurement of the kind that produces such figures is sketched below.
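
The sketch below is illustrative only and is not the benchmark actually used for the figures quoted above: rank 0 sends a small message, rank 1 echoes it back, and half the average round-trip time is reported as the one-way latency.

    /* Minimal MPI ping-pong latency sketch (illustrative stand-in).
       Run with 2 MPI ranks; additional ranks simply idle. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int reps = 1000;
        char msg = 0;
        int rank;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(&msg, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&msg, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(&msg, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(&msg, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0)   /* half the average round trip = one-way latency */
            printf("one-way latency: %.2f microseconds\n",
                   0.5 * elapsed / reps * 1.0e6);

        MPI_Finalize();
        return 0;
    }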

Alternatives to Infiniband include Myrinet and Quadrics (both very well established commercially, with many examples of production clusters of 1000 nodes and larger), SCI, switched gigabit ethernet, and gigabit ethernet meshes. Bandwidths similar to 4X Infiniband can be obtained with any of these, though multiple gigE links are required. Claimed latencies are:
  • Myrinet: 6.4 microseconds over GM
  • Infiniband: < 6 microseconds now, < 5 microseconds with PCI Express
  • Myrinet: < 4 microseconds over MX (not yet released)
  • SCI: 2.3 microseconds
  • Quadrics: 1.6 microseconds

The availability of multiple high performance fabric solutions, including Infiniband with multiple manufacturers, ensures continuing competition, resulting in improving performance and lower prices. Per-node list prices for Infiniband and Myrinet are currently $1000 and $900, respectively, for clusters of up to 128 nodes, with discounts available.