Cluster Price/Performance Trends
Introduction
The price/performance of clusters installed at Fermilab for lattice QCD
has fallen steadily over the last five years, with a halving time
of around 1.2 years, as shown by the solid line in the graph below.
Product roadmaps provided by vendors of system components make clear that
this trend is likely to continue for the next several years.
In this document, we discuss benchmarks leading to current performance
expectations, and the roadmaps leading to expectations for future
trends that are shown by the dotted line and the right-hand blue points
in the graph below.
To make the projections specific and concrete, we focus on single-processor
Pentium 4 nodes and the Infiniband communications fabric. However, a number
of other processor and network choices are available. Regarding processors,
the other candidates are Intel Xeon
processors, which are dual (SMP) and higher (MP) capable, the AMD ("Opteron")
and new Intel (code name "Yamhill") x86 architecture processors with 64-bit
extensions, the Intel Itanium 64-bit processor family, and the IBM PowerPC
970, also known as the Apple G5. Alternate network choices include switched
gigabit ethernet, gigabit ethernet meshes, Myrinet, SCI, and Quadrics.
[Notes on the graph above. See also the performance tables below.
- The points near 1999 correspond to Steve Gottlieb's CANDYCANE cluster,
Sandia's Roadrunner cluster, and an 8-node Myrinet cluster at
Fermilab. MILC Staggered code was run on the first two clusters, and
MILC Wilson code
on the Fermilab cluster. The point near 2001 is the Fermilab 80-node
Pentium III cluster, running MILC improved staggered code. The point near
2003 is the Fermilab 128-node 2.4 GHz Xeon cluster, again running
MILC improved staggered. All projected points (blue, with error bars) are
assumed to be Pentium 4E ("Prescott") clusters running improved
staggered code with software prefetching.
- All MILC improved staggered code results include the use of
inline SSE routines
which replace the standard MILC "C"-language SU3 matrix-vector
multiplies in the dslash routines (a plain-C sketch of this multiply
appears after these notes). Note that these are "Level 1"
optimizations (QLA), in the parlance of the SciDAC software effort -
that is, they are optimizations for matrix algebra at individual
sites. Further performance gains would result from the use of "Level
2" (QDP) or "Level 3" (full assembly language inverter) optimizations.
- The two QCDOC points were obtained from articles from the CHEP'03 and
Lattice'03 conferences, available at the QCDOC site
(hep-lat/0309096 and hep-lat/0306023).
The green point corresponds to the performance of standard MILC
"C"-language single precision code. The red point corresponds to the
performance of an assembly language asqtad inverter (i.e., Level 3 code).
Note that this assembly language inverter is double precision, and that
all other points on the graph represent single precision
performance.
- Local volumes for the QCDOC points are 4^4. Local volumes for the
cluster points are 14^4. For the scaling behavior of MILC asqtad code
with different local volumes, see this
link.
]
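As a reference for the "Level 1" optimization mentioned in the notes above, here is a minimal plain-C sketch of the single-site SU3 matrix-vector multiply that the inline SSE routines replace. It is illustrative only - the type names and layout are assumptions, not the actual MILC source.

    /* w = U * v, where U is a 3x3 complex (SU3) matrix and v, w are
       3-component complex color vectors.  This is the operation the
       inline SSE routines replace inside the dslash routines. */
    typedef struct { float real, imag; } complex_f;
    typedef struct { complex_f e[3][3]; } su3_matrix;
    typedef struct { complex_f c[3];    } su3_vector;

    static void mult_su3_mat_vec(const su3_matrix *u,
                                 const su3_vector *v,
                                 su3_vector *w)
    {
        for (int i = 0; i < 3; i++) {
            float re = 0.0f, im = 0.0f;
            for (int j = 0; j < 3; j++) {
                /* complex multiply-accumulate */
                re += u->e[i][j].real * v->c[j].real
                    - u->e[i][j].imag * v->c[j].imag;
                im += u->e[i][j].real * v->c[j].imag
                    + u->e[i][j].imag * v->c[j].real;
            }
            w->c[i].real = re;
            w->c[i].imag = im;
        }
    }

Each such multiply is counted as 66 flops in the usual MILC flop accounting, and dslash applies several of them per site, which is why memory bandwidth and prefetching matter so much for this code.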
Benchmarks
See this link for references to a
number of benchmarks performed with MILC code on various platforms.
Intel ia32 Single Node Performance
All of the performance data reported on this page refers to the improved
staggered inverter, reported as "CONGRAD" in the MILC code.
On recent ia32 processors, single node performance in main memory is dictated
by memory bandwidth. The plot below shows performance on the following
processors:
- 2.4 GHz Xeon with 400 MHz FSB ("front side bus" = effective frequency of
transfers on the 64-bit wide data bus connecting the cpu to main
memory)
- 2.4 GHz Xeon with 533 MHz FSB
- 2.8 GHz Pentium 4 with 800 MHz FSB
- 2.8 GHz Pentium 4E ("Prescott") with 800 MHz FSB. Note that the same
binary was used on the first three processors listed above; on the
Prescott, a different binary that included software prefetching was used
(a sketch of this technique follows this list). This software
prefetching code gives no performance enhancement on
non-Prescott ia32 processors.
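For illustration, the sketch below shows the general idea of the software prefetching used in the Prescott binary: while working on one site, the code asks the processor to begin loading the operands for an upcoming site. The loop structure, prefetch distance, and cache hint are assumptions made for this sketch (it reuses the types from the SU3 multiply sketch above), not the actual MILC/SSE implementation.

    #include <xmmintrin.h>   /* _mm_prefetch and _MM_HINT_* (ia32 SSE intrinsics) */

    #define PREFETCH_AHEAD 1  /* illustrative prefetch distance */

    void matvec_loop_with_prefetch(const su3_matrix *links,
                                   const su3_vector *src,
                                   su3_vector *dst, int nsites)
    {
        for (int i = 0; i < nsites; i++) {
            if (i + PREFETCH_AHEAD < nsites) {
                /* hint the hardware to start loading the next site's operands */
                _mm_prefetch((const char *)&links[i + PREFETCH_AHEAD], _MM_HINT_T0);
                _mm_prefetch((const char *)&src[i + PREFETCH_AHEAD],   _MM_HINT_T0);
            }
            mult_su3_mat_vec(&links[i], &src[i], &dst[i]);
        }
    }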
ia32 Cluster Performance
The table below shows performance of the MILC improved staggered code running
on clusters, using
Steve Gottlieb's
standard MILC benchmarking procedure
(constant volume per node). For the P4 and P4E, the cluster performance is
estimated by applying the ratio of cluster to single node performance seen on
the 128-node Xeon cluster (a simple check of this estimate appears after the
table below). All data are for a lattice size of 14^4 per node. For the
scaling behavior of MILC asqtad code running on clusters with different local
volumes, see this link.
Single Node and Cluster Performance of MILC Staggered Code
Processor | Network | Single Node Performance | Cluster Node Count | Cluster Performance | Cluster:Single Ratio
2.4 GHz Xeon, 400 MHz FSB | Myrinet 2000 | 783 MFlops | 128 | 588 MFlops | 0.751
2.4 GHz Xeon, 533 MHz FSB | Infiniband | 889 MFlops | 16 | 691 MFlops | 0.777
2.8 GHz P4, 800 MHz FSB | - | 1285 MFlops | - | 964 MFlops (est.) | 0.75 (est.)
2.8 GHz P4E, 800 MHz FSB | - | 1665 MFlops | - | 1249 MFlops (est.) | 0.75 (est.)
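The estimated P4 and P4E cluster entries above are simply the measured single node numbers scaled by the cluster:single ratio of roughly 0.75 observed on the 128-node Xeon cluster, as in this small check:

    #include <stdio.h>

    int main(void)
    {
        const double ratio = 0.75;   /* cluster:single ratio from the Xeon cluster */
        printf("2.8 GHz P4  : %.0f MFlops est.\n", 1285 * ratio);  /* ~964  */
        printf("2.8 GHz P4E : %.0f MFlops est.\n", 1665 * ratio);  /* ~1249 */
        return 0;
    }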
Network Performance
In Spring 2004 Fermilab will purchase 128 P4E nodes and will reuse one of our
existing Myrinet 2000 fabrics. These nodes will use motherboards with 64-bit,
66 MHz PCI-X slots for the Myrinet I/O cards. The
red lines in the graph below show
Steve Gottlieb's performance
model,
which estimates the network bandwidth required to sustain a given level of
floating point performance in the dslash routine. Note that the
version of the model used is for standard staggered code - that is, only
nearest-neighbor fetches are done. Naively, improved staggered requires four
times the communication and twice the number of flops of standard staggered
(carefully written code would require three times the communication, rather
than four, because the nearest-neighbor fetches can be reused for the
next-next-nearest-neighbor fetch of the lattice point two sites away in any
given dimension). For improved staggered, then, a conservative modification
of this plot simply divides the MFlops labels on the straight lines by two.
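To give a feel for the structure of such a model, the sketch below estimates the per-node receive bandwidth needed to sustain a given flop rate in a nearest-neighbor (standard staggered) dslash. The flop and byte counts used here are rough assumptions made for illustration; they are not the values in Gottlieb's model itself.

    #include <stdio.h>

    int main(void)
    {
        /* assumed per-site costs for standard staggered dslash (single precision) */
        const double flops_per_site   = 570.0;  /* ~8 SU3 mat-vec multiplies + adds */
        const double bytes_per_vector = 24.0;   /* 3 complex floats                 */

        const double L = 14.0;                  /* local volume 14^4                */
        /* roughly 8*L^3 of the L^4 local sites sit on a face of the local
           hypercube and need one off-node neighbor vector fetched per update      */
        double recv_bytes_per_site = (8.0 / L) * bytes_per_vector;

        double sustained_mflops = 1665.0;       /* measured P4E single node figure  */
        double required_MB_per_sec =
            sustained_mflops * recv_bytes_per_site / flops_per_site;

        printf("rough required receive bandwidth: %.0f MB/sec per node\n",
               required_MB_per_sec);            /* a few tens of MB/sec at 14^4     */
        return 0;
    }

At a large local volume such as 14^4 the surface-to-volume ratio is small, which is why even a modest network bandwidth can keep up; for improved staggered the required bandwidth at a given flop rate roughly doubles, which is the divide-the-labels-by-two adjustment described above.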
In the graph, the green and blue lines
show the measured bandwidths of the Myrinet 2000 network we will
use on this cluster, as well as of an Infiniband network. We used the Pallas
MPI "sendrecv" test, a common communications benchmark, on motherboards
identical to those we will purchase. Note that the model uses only the
receive bandwidth, whereas on real systems sends occur simultaneously; we
therefore used the "sendrecv" benchmark to assess performance conservatively
in this regime (thus, half of the measured "sendrecv" bandwidth was
plotted). At the measured value of 1665 MFlops for 14^4 on the 2.8 GHz P4E,
this model predicts that our Myrinet network should have sufficient bandwidth.
Note that the PCI-X performance of these motherboards is disappointing -
aggregate bidirectional bandwidth saturates at about 200 MB/sec, compared with
the more than 700 MB/sec we have observed on Xeon motherboards.
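For reference, the measurement is of the kind sketched below - a minimal stand-in for the Pallas "sendrecv" test, not the Pallas code itself. Two MPI ranks simultaneously send to and receive from each other, and the reported figure counts bytes moved in both directions; half of that number is the per-direction value plotted in the graph. The message size and repetition count here are arbitrary choices.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 1 << 20;   /* 1 MB messages */
        const int reps   = 100;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);

        char *sendbuf = calloc(nbytes, 1);
        char *recvbuf = calloc(nbytes, 1);
        int peer = 1 - rank;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++)
            MPI_Sendrecv(sendbuf, nbytes, MPI_BYTE, peer, 0,
                         recvbuf, nbytes, MPI_BYTE, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();

        if (rank == 0) {
            /* bytes sent plus bytes received per rank */
            double mbytes = 2.0 * (double)nbytes * reps / 1.0e6;
            printf("aggregate sendrecv bandwidth: %.1f MB/sec\n",
                   mbytes / (t1 - t0));
        }

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }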
The PCI Express I/O slots in the
motherboards available next Summer should have better bandwidth performance
than the Xeon motherboards, as well as improved latencies.
The performance of the full inverter also depends on collective operations,
which are much more sensitive to the latency of communications.
(Note: "Torre Pines" is the Intel pre-release code name for the specific P4E
motherboard used in the network measurements shown in the graph.)
The graph shows that present Infiniband performance is capable
of keeping up with very fast processors. Infiniband has a well-developed
roadmap for improving performance over the next couple of years.
Cluster Price/Performance
The table below shows the measured price/performance of clusters used since
1998. The cost per node listed includes the network cost.
Price/Performance Measurements
Date | Cluster Size | Processor | Network | Sustained Performance | Cost/node | Price/Performance
Nov 1998 | 32 nodes | 350 MHz Pentium II | Fast Ethernet | 50 MFlops/node | $2000 | $40/MFlop
July 1999 | 64 nodes | 450 MHz Pentium II | Fast Ethernet | 63 MFlops/node | $2200 | $35/MFlop
July 1999 | 8 nodes | 400 MHz Pentium II | Myrinet | 77 MFlops/node | $4170 | $54/MFlop
Nov 2000 | 80 nodes | 700 MHz Pentium III | Myrinet | 185 MFlops/node | $2800 | $15.1/MFlop
Oct 2002 | 128 nodes | 2.4 GHz Xeon | Myrinet | 780 MFlops/node | $3200 | $4.1/MFlop
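The halving time of roughly 1.2 years quoted in the introduction follows directly from the first and last rows of this table:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* $40/MFlop in Nov 1998 versus $4.1/MFlop in Oct 2002 */
        double years = 3.9;                 /* Nov 1998 -> Oct 2002 */
        double improvement = 40.0 / 4.1;
        double halving_time = years * log(2.0) / log(improvement);
        printf("price/performance halving time: %.2f years\n", halving_time);  /* ~1.2 */
        return 0;
    }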
Predicted Cluster Price/Performance
The table below shows probable price/performance of future lattice clusters.
Price/Performance Predictions
Date | Cluster Size | Processor | Sustained Performance | Node cost | Per-node network cost | Price/Performance
Spring 2004 | 128 | 2.8 GHz P4E | 1.2 GFlops/node | $900 | $1000 | $1.58/MFlop
Late 2004 | 256 | 3.2 GHz P4E | 1.4 GFlops/node | $900 | $1000 | $1.36/MFlop
Late 2005 | 512 | 4.0 GHz P4E | 1.9 GFlops/node | $900 | $900 | $0.95/MFlop
Late 2006 | 512 | 5.0 GHz P4E | 3.0 GFlops/node | $900 | $500 | $0.47/MFlop
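The Price/Performance column is just (node cost + per-node network cost) divided by the sustained performance per node; the small check below reproduces the table's figures:

    #include <stdio.h>

    int main(void)
    {
        struct { const char *date; double mflops, node_cost, net_cost; } rows[] = {
            { "Spring 2004", 1200.0, 900.0, 1000.0 },  /* -> $1.58/MFlop */
            { "Late 2004",   1400.0, 900.0, 1000.0 },  /* -> $1.36/MFlop */
            { "Late 2005",   1900.0, 900.0,  900.0 },  /* -> $0.95/MFlop */
            { "Late 2006",   3000.0, 900.0,  500.0 },  /* -> $0.47/MFlop */
        };
        for (int i = 0; i < 4; i++)
            printf("%-12s $%.2f/MFlop\n", rows[i].date,
                   (rows[i].node_cost + rows[i].net_cost) / rows[i].mflops);
        return 0;
    }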
Discussion:
- Spring 2004: Justified by the data presented above. We have strong price
estimates for both the node cost (1 GB memory per node) and the
network cost (Infiniband), though we will not be purchasing the
network because we are reusing our Myrinet 2000 fabric.
- Late 2004: The performance boost results from the faster processor and
the improved bandwidth of the Infiniband network.
- Late 2005: The performance boost results from the faster processor and
improved processor memory bandwidth (1066 MHz FSB). We estimate a net Infiniband
cost reduction - even though a cascaded network will likely be
required, Infiniband list prices will likely have dropped, and there
should be price incentives for this large a purchase.
- Late 2006: The performance boost results from the faster processor and
improved memory bandwidth (1066+ MHz FSB and fully buffered DIMM
technology). By this time Infiniband will have either succeeded in the
market, or an alternative will have arrived. It is likely that
an integrated high
performance network (i.e., a chip on the motherboard instead
of a separate card) will drive the per node price down significantly.
We have already heard of engineering designs for embedded Infiniband in
late 2004.
Scalability
The network fabric exemplar used here, Infiniband, has the following
characteristics which enable scaling to large (order thousands of nodes)
clusters of systems based on processors available in the next 5 years:
- Bandwidth: Infiniband currently supports 4X and 12X links. All
connections to computers (via "host channel adapters", or HCA's) are
via 4X links, which support communications at 10 Gbit/sec
in each direction. Most HCA's on the market provide two independent 4X
ports. In practice, PCI-X HCA's in Xeon systems exhibit peak
bandwidths of 700 MBytes/sec, limited by the PCI-X bus and not the
Infiniband fabric. PCI Express HCA's have been announced, and PCI
Express motherboards will be available in Summer 2004. Over 8X PCI
Express, Infiniband HCA's will exhibit bandwidths near the full wire rate of
10 Gbit/sec (for 4X Infiniband) simultaneously over both ports, or
better than 4 GByte/sec aggregate bidirectional bandwidth.
12X Infiniband links are available now and can be used to interconnect
switches. These links will deliver 30 Gbit/sec bandwidth simultaneously in
each direction. In a few years, when processors and memory buses have
increased in performance, 12X Infiniband links will be available on
HCA's.
Note in the dslash network model graph for the Torre Pines motherboard,
shown above, that the Infiniband performance is limited by the poor PCI
implementation to 200 MByte/sec aggregate bidirectional bandwidth. A
factor of five increase in bandwidth, which will result from the PCI
Express implementations available in Summer 2004, will easily allow sustained dslash
computations at 10 GFlops, far in excess of the available processor
performance (perhaps 2 GFlops sustained per cpu by mid 2004).
- Fabric size: Most Infiniband switches are now based on the 24-port crossbar
silicon available from Mellanox
(Infiniscale
III). This chip has driven the list price per port of 24-port
Infiniband switches to $300. Announced 144-port switches, available in
Summer 2004, will have a list price per port of about $450. These
switches can be cascaded to make larger fabrics. Examples
of fabrics based on the older, 8-port, Mellanox silicon include the
1100-node Virginia
Tech G5 cluster announced in Fall 2003, which uses 24 96-port
switches, the 192-node multi-vendor demonstration cluster at
SC'2003, a 128-node
cluster at
Sandia,
and an announced 512-node cluster at Riken in
Japan.
Currently, practical fabric sizes are limited to order one thousand
nodes because of limitations in Infiniband software subnet managers.
The principal Infiniband switch manufacturers all have development of
subnet managers sufficient for many-thousand-node fabrics on their
roadmaps.
- Latency: In Summer 2003 on a test cluster, we
measured small message MPI
latency over Infiniband at 7 microseconds. Subsequent improvements in
HCA drivers have lowered this figure to less than 6 microseconds. PCI
Express is expected to lower latencies further, perhaps by 1 to 2
microseconds.
Alternatives to Infiniband include Myrinet and Quadrics (both very well
established commercially with many examples of production clusters of size
1000 nodes and higher), SCI, switched gigabit ethernet, and gigabit ethernet
meshes. Bandwidths similar to 4X Infiniband can be obtained with any of
these, though for the ethernet options multiple gigE links are required. The
claimed latencies are:
- Myrinet: 6.4 microseconds over GM
- Infiniband: < 6 microseconds now, < 5 microseconds with PCI Express
- Myrinet: < 4 microseconds over MX (not released yet)
- SCI: 2.3 microseconds
- Quadrics: 1.6 microseconds
The availability of multiple high-performance fabric solutions, including
Infiniband with multiple manufacturers, ensures continuing competition,
resulting in improving performance and lower prices. Per-node list prices for
Infiniband and Myrinet are currently $1000 and $900, respectively, for
clusters up to 128 nodes, with discounts available.