CDF Central Analysis Facility - Server Evaluation

Hardware Configurations

Several different server hardware configurations have been evaluated. The picture below shows two of the packaged evaluation units from ASA Computers and Polywell Computers. Also shown is a Fibre Channel disk array, which was connected via an optical link to a CDF Level3 trigger node for evaluation.

ASA Server

ASA Computers Inc.
2354 Calle Del Mundo
Santa Clara, CA 95054
Telephone: (408) 654-2901 ext. 205
(408) 654-2900 (ask for Sean)
(800) REAL-PCS (1-800-732-5727)
Fax: (408) 654-2910
E-mail: sean@asacomputers.com
URL: http://www.asacomputers.com

-- Intel STL2 dual P3 ATX motherboard
-- Pentium III 1000 MHz, 133 MHz FSB, 256 KB cache
-- 1 GB SDRAM DIMM 32x72 PC-133 ECC Reg
-- 4U rackmount chassis
-- 400W PS2 / mini-redundant ATX power supply
-- Two 3ware 7850 8-port Ultra-IDE RAID-5 controllers
-- -- 16 WD 100 GB Ultra-100 IDE 7200 RPM hard drives
-- 2.4.2 kernel

The two pictures below show the ASA server:

Polywell Server

Jerry Tighe
Polywell Computers, Inc.
1461 San Mateo Ave
South San Francisco, CA 94080
E-mail: jerrytighe@polywell.com
URL: http://www.polywell.com
Phone: 1-800-999-1278 ext. 128
(650) 583-7222 ext. 128
Fax: (650) 583-1974

-- Poly 865DU3 dual P3-370 ATX motherboard with 2x Ultra160 SCSI
-- Pentium III 1000 MHz, 133 MHz FSB, 256 KB cache
-- 1 GB SDRAM DIMM 32x72 PC-133 ECC Reg
-- 20-bay cube server chassis
-- 2x 400W redundant ATX power supply
-- 3ware 6200 2-port Ultra-IDE RAID controller
-- -- 2 WD 40 GB Ultra-100 7200 RPM IDE hard drives (mirrored system disks)
-- 3ware 7850 8-port Ultra-IDE RAID-5 controller
-- -- 16 WD 100 GB Ultra-100 IDE 7200 RPM hard drives (RAID)
-- 2.4.3 kernel

The two pictures below show the Polywell server.

Level3 Node / Fibre Channel

-- Tyan Thunder LE S2510 dual P3 1 GHz
-- 256 MB RAM
-- QLogic ISP2200 Fibre Channel-to-SCSI PCI controller (64-bit PCI)
-- Chaparral K7413 Fibre Channel 1.3 TB disk array (eight 170 GB SCSI disks, hardware RAID controller)
-- Linux with 2.4.16 kernel

Duke Server

-- Supermicro P3TDLE dual P3 866 MHz
-- 1024 MB PC-133 ECC RAM
-- 3ware Escalade 7800 storage switch
-- 8 WD 100 GB 7200 RPM drives
-- RAID 5 configuration
-- ReiserFS
-- FNAL RedHat 7.1

Rackable 3U Server

Phuoc Vu - Account Manager
Rackable Systems
721 Charcot Avenue
San Jose, CA 95131
E-mail : pvu@rackable.com
URL : http://www.rackable.com
Phone: (408) 321-0290 ext. 328
(408) 835-6673 (mobile)

3U 13-bay hot-swap IDE storage server:
-- 3U CM hot-swap chassis, 300W redundant power supply
-- Intel SDS2 motherboard with 512 MB PC-133 ECC memory
-- 13 Maxtor 160 GB IDE drives (5400 RPM, 12 ms average seek time)
-- Two 3ware Escalade 7850 IDE RAID controllers
-- -- RAID 5 with 6 drives
-- -- RAID 5 with 7 drives (6 in the RAID set, 1 as OS/boot drive)
-- Loaded with RH 7.2 and 3ware 7.4 firmware (beta)

See also:
Advantages of Rackable servers and storage integrated solutions.
Rackable powers the Human Genome Project.

The four pictures below show the Rackable server:

Disk attached to fcdfcode1

-- Arena Indy 2400
-- 19'' Rackmount chassis
-- Ultra 160 SCSI-IDE
-- 12 160GB Maxtor 5400rpm hard drives

Local I/O Performance

For the purposes of the CDF CAF, we are interested in evaluating how fast the server hardware can deliver a secondary data set to the worker nodes in the farmlet. This has two main pieces: local read bandwidth (i.e. how fast can the server retrieve data from disk into memory?) and network bandwidth (i.e. how fast can it send data from memory to remote memory over the network connection via some protocol like NFS?). Write bandwidth is a comparatively minor concern (though it should not be completely ignored, since we will have to load the servers with new secondary data) and is therefore not systematically benchmarked.

This section deals with local disk read performance.

Benchmarking Method

The primary tool used in the local and network benchmarking presented here is a Perl script (disk_bench.pl) which forks and manages a user-requested number of parallel "dd" read processes to simulate simultaneous client access to server data. There are several reasons for writing a benchmarking script rather than simply using an existing multi-threaded I/O benchmark tool (e.g. tiobench):

-- knowledge of exactly what the test is doing without having to read source code
-- ease of customization to match specific tests (e.g. benchmarking multiple controllers at once)

The read benchmarking algorithm used by disk_bench.pl is as follows. For N simultaneous read threads to be benchmarked, N dd processes are forked and timed. The only subtlety is that while all the dd threads start at the same time, they do not all end at the same time. The consequence is that the last k dd jobs to finish are not "competing" with N-k other dd threads for their entire duration, so the throughput would be overestimated. To circumvent this problem, we fork not just N dd processes but N dd sequences, each consisting of "several" dd's, with only the first dd in each sequence timed for the aggregate throughput calculation. In practice, "several" means enough dd jobs to ensure that the last timed dd finishes before the final dd in any sequence finishes (this is checked in the script, and the user is warned of throughput overestimation if it occurs). As soon as all of the timed dd processes finish, all other dd's are killed. Memory caching effects also need to be addressed, as they can cause significant overestimates of the disk bandwidth. This is taken care of by first reading a large file from disk, where "large" is typically 1-2 times the size of the system RAM. Another alternative, not implemented, would be to simply remount the filesystem before forking the dd sequences.
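
To make the algorithm concrete, the sketch below shows one way the dd-sequence idea can be coded in Perl. It is illustrative only and is not the actual disk_bench.pl: the temporary-file bookkeeping, the number of dd's per sequence, and the convention of dividing the total bytes of the timed dd's by the longest timed dd duration are all assumptions of the sketch.

    #!/usr/bin/perl -w
    # Illustrative sketch of the disk_bench.pl approach (not the actual script).
    # For N "threads", fork N sequences of dd reads; only the first dd in each
    # sequence is timed, while the later dd's keep the disks busy so that every
    # timed read competes with N-1 other readers for its full duration.
    use strict;
    use Time::HiRes qw(time sleep);

    my $nthreads = shift || 4;
    my @files    = @ARGV or die "usage: $0 nthreads file [file ...]\n";
    my $bs       = "1M";   # dd read block size (1 MB, as in the tests above)
    my $ndd      = 5;      # dd's per sequence ("several" in the text)

    unlink glob "/tmp/ddtime.*";
    my @pids;
    for my $i (0 .. $nthreads - 1) {
        my $file = $files[$i % @files];   # alternate files -> alternate directories/controllers
        defined(my $pid = fork) or die "fork: $!";
        if ($pid == 0) {                  # child: one sequence of dd reads
            for my $j (1 .. $ndd) {
                my $t0 = time;
                system("dd if=$file of=/dev/null bs=$bs 2>/dev/null");
                next unless $j == 1;      # only the first dd is timed
                open my $fh, '>', "/tmp/ddtime.$$" or die $!;
                printf $fh "%.3f %d\n", time - $t0, -s $file;
                close $fh;
            }
            exit 0;
        }
        push @pids, $pid;
    }

    # once every sequence has reported its timed dd, stop the untimed ones
    sleep 0.5 until (() = glob "/tmp/ddtime.*") >= $nthreads;
    kill 'TERM', @pids;                   # (the real script also kills any leftover dd's)
    waitpid $_, 0 for @pids;

    my ($bytes, $tmax) = (0, 0);
    for my $f (glob "/tmp/ddtime.*") {
        open my $fh, '<', $f or next;
        my ($dt, $size) = split ' ', scalar <$fh>;
        $bytes += $size;
        $tmax   = $dt if $dt > $tmax;
        unlink $f;
    }
    printf "%d threads: ~%.1f MB/s aggregate\n", $nthreads, $bytes / $tmax / 1e6;

A run with 8 threads split across the two controllers would then look something like "disk_bench_sketch.pl 8 /data1/bench.dat /data2/bench.dat" (hypothetical paths).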

An important feature built into disk_bench.pl is the ability to read benchmarking files from multiple directories. This is useful, for example, when the data live in two different directories that sit on two different I/O controllers; in that case, a benchmark of either controller alone would underestimate the bandwidth available when data are accessed from both directories at once. Recall that the ASA and Polywell servers each have two RAID 5 volumes attached to different controllers. This feature is no longer needed when we stripe across the two controllers in a RAID 50 configuration (discussed later).

Results

The results of running disk_bench.pl to measure local disk read performance are presented in this section. Benchmarking files are 512 MB and the dd read block size is 1 MB unless otherwise noted.

Raid Configurations

For the Fibre Channel disk array attached to a Level 3 node, we compared different RAID configurations:


       #threads   Raid0   Raid3   Raid5 (aggregate throughput in MB/s)
       --------   -----   -----   -----
          1        42      39      39
          2        48      47      47
          4        59      57      59
          8        59      53      59
         16        60      57      60
         32        59      55      58
         60        50      46      48
        100        44      40      44
    

All three of the above RAID configurations give similar read throughput, as one would basically expect since all three stripe the data across the disks. If write performance were benchmarked, we would expect RAID 3 and RAID 5 to be slower than RAID 0 because of the parity overhead.

Journaling Filesystems

Any file server with O(1 TB) of disk should use a journaling filesystem. An unjournaled filesystem (like ext2) requires manual filesystem checking and cannot be restored as quickly and easily in the event of data corruption as a filesystem which logs its modifications. Simple practicalities also limit the use of unjournaled filesystems at this scale - fsck can take several hours to check a 1 TB filesystem!

There are currently four options for journaling filesystems:

1) ext3 - a simple extension of ext2 for journaling. In fact, an existing ext2 filesystem can be converted to ext3 in place without much trouble (see the example commands after this list), which makes it attractive for existing systems. Supported in 2.4.15 and later kernels.

2) ReiserFS - reportedly better performance than the other journaling filesystems on small files. Supported in the 2.4.16 kernel.

3) XFS - SGI's jfs. Needed to patch 2.4.16 kernel to get this to work.

4) JFS - IBM's jfs. Needed to patch 2.4.16 kernel to get this to work.
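
For example, an existing ext2 filesystem can be given an ext3 journal in place with tune2fs and then simply mounted as ext3 (device and mount point names hypothetical):

    tune2fs -j /dev/sda1
    mount -t ext3 /dev/sda1 /data1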

I repeated the exact same benchmark previously described, but using each of these journaling filesystems. The results are shown below (all for the RAID 3 configuration on the same FC + Level 3 node setup):


       #threads   ext2    ext3    ReiserFS    XFS    JFS (agg tput in MB/s)
       --------   -----   -----   --------    ---    ---
          1        39      38        37       42     43
          2        47      46        50       48     49
          4        57      57        57       59     58
          8        53      52        52       52     56
         16        57      57        56       58     58
         32        55      55        54       55     56
         60        46      44        42       49     44
        100        40      39        34       43     39
    

We can see that there is little variability in read performance among the various journaling filesystems and ext2 in these very limited tests. This is perhaps not surprising, as we would naively expect write performance to be more significantly impacted by the journal-updating overhead.

Raid 5 - One/Two Controller Results

We ran disk_bench.pl on the ASA and Polywell servers to benchmark local disk read performance. In this case benchmarking was done separately for "one controller" (a single benchmarking directory) and "two controllers" (two directories, one per controller). For the two-controller test with more than one dd thread (the single-thread case is identical for one and two controllers), the timed dd threads operate on benchmarking files from alternating directories (controllers). This corresponds to perfectly balanced controller load on the server, while the "one controller" benchmark represents maximally unbalanced controller load. The "one controller" and "two controller" results should therefore be considered the worst-case and best-case file server throughput, respectively, when simply using two separate hardware RAID 5 arrays. The results are shown below.


                    ASA server      Polywell server
       #threads   1 cont  2 cont    1 cont  2 cont (agg tput in MB/s)
       --------   -----   ------    ------  ------
          1         83      83        86      90
          2         57     117        64     128
          4         54     107        62     119
          8         47     108        43     109
         16         48      97        38      82
         32         47      94        36      72
         60         46      89        32      69
        100         40      82        26      66
    

Raid 50 - Striping Across Raid 5 Arrays

In this case, we stripe the data (and parity) across the two RAID 5 arrays. This outer striping is done in software, and setting up RAID 50 was really quite simple (a configuration sketch is given at the end of this subsection). The advantage of RAID 50 over two separate RAID 5 arrays is quite substantial, as shown below (Polywell server, 2.4.16 kernel, 512 KB chunk size, bs = dd block size):


       #threads   1M bs   64k bs (agg tput in MB/s)
       --------   -----   ------ 
          1        127     182
          2        147     190
          4        152     165
          8        141     146
         16        132     129
         32        127     120
    

The CPU and memory utilization for RAID 50 are higher than in the previous tests of plain hardware RAID 5, which is expected given the additional software RAID overhead. In fact, the dd threads began swapping in and out at 60 simultaneous threads on our 1 GB RAM system, which produced unreliable results (this is why they are not shown).
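
For reference, the outer software stripe can be described to the 2.4-era raidtools in /etc/raidtab roughly as follows. This is a sketch only: the /dev/sd* device names for the two 3ware RAID 5 volumes are assumptions, the 512 KB chunk size is the one quoted above, and mdadm could be used instead of raidtools.

    # /etc/raidtab -- software RAID 0 over the two hardware RAID 5 volumes (RAID 50)
    raiddev /dev/md0
        raid-level              0
        nr-raid-disks           2
        persistent-superblock   1
        chunk-size              512
        # first 3ware RAID 5 array (assumed device name)
        device                  /dev/sda1
        raid-disk               0
        # second 3ware RAID 5 array (assumed device name)
        device                  /dev/sdb1
        raid-disk               1

The array is then created with "mkraid /dev/md0", and /dev/md0 is formatted and mounted like any other block device.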

Data Integrity Checks

Ultra ATA employs Cyclic Redundancy Check (CRC) data verification for host bus transfers. This checks that data are transferred without error between the drive and the host controller, a CRC being computed for each burst of data by both the host and the drive. At the end of the burst, the host sends its CRC to the drive for comparison with the drive's CRC. Although the data are CRC checked, the commands (e.g. read, write) and command parameters (e.g. sector/cylinder addressing) are not. This means that corruption can still occur if data are written to or read from the wrong location on the disk, or if incorrect commands are communicated to the drive (the latter is probably less likely). With the very large number of disks in a ~100-server farm (~1600 disks), data integrity issues become a concern.

In order to check this, I wrote a simple Perl script (md5_checks.pl) which repeatedly reads a file from disk, computing the md5sum each time (MD5 is a different checksum from CRC, considered more robust in the sense that it is less likely that two different files will have the same checksum) and comparing it to the known checksum of the file.
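
A minimal Perl sketch of this kind of check (illustrative only, not the actual md5_checks.pl; file name and iteration count are arbitrary) is:

    #!/usr/bin/perl -w
    # Illustrative sketch of the md5_checks.pl idea (not the actual script):
    # repeatedly re-read a file from disk and compare its MD5 sum to a reference.
    # The file should be larger than RAM so that the re-reads really hit the disks.
    use strict;
    use Digest::MD5;

    my ($file, $niter) = @ARGV;
    die "usage: $0 file [iterations]\n" unless defined $file;
    $niter ||= 1000;

    my $ref = md5_of($file);             # reference checksum, computed once
    print "reference md5: $ref\n";

    for my $i (1 .. $niter) {
        my $sum = md5_of($file);
        if ($sum ne $ref) {
            print "iteration $i: CHECKSUM MISMATCH ($sum)\n";
        } elsif ($i % 100 == 0) {
            print "iteration $i: ok\n";
        }
    }

    sub md5_of {
        my ($f) = @_;
        open my $fh, '<', $f or die "cannot open $f: $!";
        binmode $fh;
        my $md5 = Digest::MD5->new;
        $md5->addfile($fh);              # streams the file, so memory use stays small
        close $fh;
        return $md5->hexdigest;
    }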

This script was run overnight simultaneously on both of the ASA's Raid 5 arrays without any failed checksums encountered. The script completed 12000 iterations of reading a random-byte file. This file was created with

      dd if=/dev/urandom of=./random.dat bs=1M count=2048

This probed data integrity at the level of 2e13 bits without seeing any bit errors. An excerpt from a recent email from Frank Wuerthwein:

Maxtor claims about their drives that they have less than 1 non-recoverable data error per 10E14bits read.

This suggests that we would have to run our test for a few more weeks to probe the validity of Maxtor's claim with our ASA server (or get more servers).

Network (Gigabit Ethernet) Performance

For these tests, we used a direct fiber link between two SysKonnect gigabit ethernet cards (SK-9843 SK-NET GE-SX single port, multimode fiber adapter).

Steve Heaton - Account Manager
SysKonnect, Inc.
1922 Zanker Rd.
San Jose, CA 95112
408-437-3840 Direct
408-437-3866 Fax
1-800-752-3334 Toll Free
E-Mail: sheaton@syskonnect.com
URL : http://www.syskonnect.com

Memory-to-Memory

To measure the bare performance of the Gigabit Ethernet card, we used a very short C program which works at the socket level. The server program allocates a fixed-size memory buffer, opens a socket, and waits until the client program connects. Once a connection is established, the server sends the same buffer contents repeatedly to the client and disconnects after a fixed number of repetitions. On the other side of the connection, the client just connects to the server and receives the data into the same buffer, overwriting it each time (without looking into the contents of the data), until the connection is closed by the server. This transfer was timed to get a measure of the network throughput.

The only overhead beyond the copy between the kernel buffer and the NIC buffer is the copy between the kernel buffer and the user-space buffer, but this step would exist in a typical application anyway.
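
The actual test program was a short piece of C; purely as an illustration of the logic, a rough Perl rendering of the two sides might look like this (port number, buffer size, and repetition count are arbitrary choices of the sketch):

    #!/usr/bin/perl -w
    # Rough Perl rendering of the short C socket test described above.
    # Run "net_bench.pl server" on one host and "net_bench.pl client <serverhost>"
    # on the other.
    use strict;
    use IO::Socket::INET;
    use Time::HiRes qw(time);

    my ($mode, $host) = @ARGV;
    die "usage: $0 server | client <serverhost>\n" unless $mode;
    my $port  = 5001;
    my $bufsz = 100_000;   # application write size ("TCP packet size" above)
    my $nrep  = 10_000;    # buffer sends per connection

    if ($mode eq 'server') {
        my $buf = 'x' x $bufsz;                 # fixed buffer, contents never change
        my $lsn = IO::Socket::INET->new(LocalPort => $port, Listen => 1, ReuseAddr => 1)
            or die "listen: $!";
        while (my $c = $lsn->accept) {
            print {$c} $buf for 1 .. $nrep;     # send the same buffer repeatedly
            close $c;                           # then disconnect
        }
    } else {
        my $c = IO::Socket::INET->new(PeerAddr => $host, PeerPort => $port)
            or die "connect: $!";
        my ($buf, $bytes, $t0) = ('', 0, time);
        # overwrite the same buffer each time, never inspecting the contents
        $bytes += $_ while $_ = sysread($c, $buf, $bufsz);
        printf "received %.0f MB: %.1f MB/s\n",
            $bytes / 1e6, $bytes / (time - $t0) / 1e6;
    }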

The table below shows the memory-to-memory results for different IP and TCP packet sizes. "in" and "cs" refer to the number of interrupts and context switches per second on the client processor (from "vmstat 5"), respectively. Interrupts are used by device drivers to get the processor's attention to perform some set of tasks. Context switches represent switching between user- and system-level processing, which affects both CPU latency and load. Therefore, in a very loose sense, these provide a relative measure of how hard the client has to work to process the incoming packets (higher numbers mean the processor is working harder, particularly for context switching).


      IP packet size(bytes)   TCP packet size(bytes)   in    cs    MB/s
      ---------------------   -----------------------  ----  ----   ----
             1000                     10               34k   34k     7
             1000                    100               41k   41k    70
             1000                   1000               58k    85    67
             1000                  10000               58k   100    67
             1000                 100000               56k   200    67

             9000                     10               32k   32k     8
             9000                    100               19k   13k    71
             9000                   1000               20k   15k   118
             9000                  10000               21k   15k   118
             9000                 100000               21k   15k   118
    

Notice that for reasonably sized TCP packets, we are only able to get near the full bandwidth of the gigabit link (118 MB/s, or about 940 Mbit/s) with so-called "Jumbo Frames" (9000 byte IP packets). The downside is that Jumbo Frames are not part of the Ethernet standard, so they are not guaranteed to work with all networking hardware.

Disk-to-Memory

To investigate what the maximum speed of an NFS-like file server might be, we used a program similar to the one in the memory-to-memory case. The difference is that the server program opens a given local file, reads it into the user-space buffer, and sends the contents of this buffer to the client over TCP. The client side is unchanged from the memory-to-memory case. The results are shown below (ASA server):


      IP packet size(bytes)   TCP packet size(bytes)   in    cs    MB/s
      ---------------------   -----------------------  ----  ----   ----
             1500                    100               38k   42k    30
             1500                   1000               42k    5k    65
             1500                  10000               39k    4k    46
             1500                 100000               33k    3k    44

             9000                    100               12k   11k    43
             9000                   1000               15k   14k    63
             9000                  10000               13k   10k    62
             9000                 100000               10k    7k    58
    

It should be noted that when these tests were done, we were only getting approximately 70 MB/s for local disk reads, rather than the 83 MB/s shown previously (see "Raid 5 - One/Two Controller Results"), for several reasons not worth detailing here. The point is that it is not clear whether the disk-to-memory throughput is limited by disk access or by network bandwidth. This test should be repeated with faster disk access to investigate the limits of remote file transfer with this method.

NFS Performance

In the current CAF model, the secondary data volume on a given server is NFS exported to each of the worker nodes in the farmlet. Therefore, we are ultimately interested in determining the aggregate throughput for multiple clients reading data from the server over NFS. The benchmarking procedure is identical to what was done for local I/O (i.e. using disk_bench.pl), except that the test is performed on the client using a benchmarking directory which is an NFS-mounted disk physically residing on the server.
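
For reference, this is just an ordinary NFS export on the server plus a mount on each worker node; a minimal sketch (all host names and paths hypothetical) is:

    # on the server, in /etc/exports (apply with "exportfs -a")
    /data1   fcdfwork*(ro)

    # on each worker node in the farmlet
    mount -t nfs fcdfserver:/data1 /data1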

The results for vanilla NFS (2.4.3 kernel, no mount options, standard 1500 byte frames) with the ASA server exporting benchmarking data (two controllers) to a single dual-Athlon client are shown below:


                       Client              Server
       #threads   %CPU   in    cs     %CPU   in    cs     MB/s   
       --------   ----  ----  ----    ----  ----  ----    ----
          1        34   10k   12k      50   10k   20k      34
          2        24    5k    9k      31    6k   10k      42
          4        27    6k   15k      31    8k   12k      35
          8        32    6k   32k      30    9k   14k      27
         16        25    5k   40k      16    6k    9k      23
         32        31    4k   55k      10    2k    5k      11
         60        67    3k   72k       8    3k    4k       7
        100        98    1k   63k       5    2k    3k      --
    

Notice that as the number of client read threads increases, the CPU load on the client becomes very large and the server CPU load consequently decreases. One should keep this in mind when trying to interpret these throughput scaling results, because in the actual farmlet configuration each client will only have to process incoming packets for a few (say, three or fewer) read threads. Therefore, we might expect the aggregate throughput scaling in a switched environment to be better than the results above if we are truly client CPU-limited. Of course, the relative server load will increase in this case, so we really need a more realistic network configuration to test the throughput scaling behavior over NFS.

Interrupt Moderation

We found in our CAF prototype tests that simultaneous clients (here, truly multiple clients connected to the server through a Cisco 2950 switch - Fast Ethernet client connections, Gigabit server connection) accessing data on the server would max out the throughput at ~45 MB/s with 100% server CPU utilization. It was noticed that the gigabit card was issuing ~1 CPU interrupt per IP packet sent along the PCI bus. The SysKonnect 9843 supports a feature called "dynamic interrupt moderation" (DIM) which seeks to reduce the server CPU load by grouping interrupts together so that one interrupt can handle several data packets. The open source kernel driver module in the 2.4.18 kernel (sk98lin) supports this feature; however, the module needs to be rebuilt after uncommenting some of the source code (diff of source). There is one tunable parameter, the number of interrupts issued by the SysKonnect card per second. You therefore get fewer interrupts per packet at the expense of increased packet latency (i.e. packet transfer into/out of the card is delayed because the card accumulates packets before it transfers them).

We found through studies of network throughput (over NFS) on both the ASA and Rackable servers that enabling DIM increases the aggregate dd read throughput to simultaneous clients by 5-10%, and decreases CPU utilization and the interrupt rate as expected. The increase in throughput was fairly insensitive to the number of interrupts per second specified in the device driver over a range of 15000 to 1000 int/sec; however, throughput decreased if this was made too small (200 int/sec was tested) due to increased packet latency.

Attempts at NFS Optimization

There are several ways one can boost NFS performance. A good starting point for learning about NFS-related network settings is the Linux NFS-HOWTO. This section discusses some of the attempts we made at boosting the NFS read performance measured by the disk_bench.pl benchmarking script. The test setup was as above - a direct Gigabit connection (SysKonnect) between the ASA server and a dual-Athlon client.

The first thing we tried was 9000 byte jumbo frames with NFS. The results are shown below (same vanilla 2.4.3 kernel NFS except jumbo frames):


       #threads     1 cont   2 cont   (agg tput in MB/s)
       --------     ------   ------
          1           47       44
          2           28       54
          4           19       45  
          8           17       37
         16           11       30
         32            8       14
         60            7       11
        100            -       10
    

Notice that with jumbo frames we see a ~30-35% increase in throughput for 512 MB file reads over NFS.

At the time of writing, the 2.4 kernel series has client-side NFS over TCP but only UDP on the server side. The UDP protocol has the advantage of placing no load on the server when the connection is not active. However, the disadvantage of UDP is that if a packet is lost during a large read or write, the whole request is retransmitted rather than just the lost packet, as would be the case with TCP. Experimental patches to the 2.4.17 kernel provide server-side NFS over TCP.

An important feature of NFS version 3 (available in 2.4 kernels) is support for large NFS transfer buffers (up to 32k). The current default size in 2.4 kernels is 8k. The same set of patches which provides NFS over TCP for 2.4.17 also increases the maximum NFS buffer size to 32k.
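
With the patched kernel, the larger transfer size is requested from the client side through the rsize/wsize mount options; a representative mount command (host and path names hypothetical) would be:

    mount -t nfs -o nfsvers=3,udp,rsize=32768,wsize=32768 fcdfserver:/data1 /data1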

Another handle on network throughput is the socket input queue size (also sometimes referred to as the "window size") used by the kernel to store requests while it processes them. The default in 2.2 and 2.4 kernels is 64k. This queue size can be changed through the /proc facility by

       echo 262144 > /proc/sys/net/core/rmem_default
       echo 262144 > /proc/sys/net/core/rmem_max
    

which changes the default and maximum kernel input queue sizes to 256k. The results of trying out the above changes (with the patched 2.4.17 kernel) on NFS performance are shown below:


       NFS buf(kB)  IP packet(bytes)  Protocol  Socket window(bytes)  MB/s
       -----------  ----------------  -----     -----------------     ----
           8             1500          UDP            default          34
          32             1500          UDP            default          47
           8             9000          UDP            default          47
           8             9000          TCP            default          35
          32             9000          TCP            default          56
          32             9000          TCP            256k             56
          32             9000          TCP             1M              55
          32             9000          TCP             4M              56
    

Here are the full read-thread scaling results from two controllers for the patched 2.4.17 kernel, 32k NFS buffers, jumbo frames, and UDP:


       #threads     MB/s
       --------     ----
          1          55
          2          72
          4          63
          8          69
         16          52
         32          33
         60          26
        100          19
    

In summary, there are configuration handles which can be used to increase NFS performance. We openly admit that we have taken a rather "phenomenological" approach to these NFS performance studies. Some additional care and thought should go into optimizing the final network configuration and system usage to ensure that we have found a robust maximum. Our intent here was simply to demonstrate handles for NFS performance enhancement, which I believe has been achieved.

Alternatives to NFS?

rootd

Rootd is a file server program which provides access to ROOT files over a network. It can be run by the superuser (on the official TCP port 1094) or privately (with a port number larger than 2048). To benchmark only the rootd connection, we wanted to avoid the overhead of analyzing the ROOT file structure. We therefore used the ReadBuffer() member function of the TNetFile class, which reads the file contents serially while ignoring all of the structure.

For the rootd case, we found that reading a big chunk of data at once improves the performance because of the inherent limitations of the script interpreter. The maximum read throughput we were able to get reading a ROOT file over rootd was 51 MB/s for local I/O and 36 MB/s over gigabit ethernet, basically consistent with the statement on the ROOT web page that rootd performance is comparable to NFS.

Rackable Server Performance

Cooling considerations

As previously indicated, maintaining adequate cooling of internal components is a serious consideration in the evaluation of server units. This is particularly important for this unit due to its increased component density compared to the 4U ASA server and the Polywell server. As with the ASA and Polywell servers, we found the Rackable server adequate in this regard. Ultimately, we are interested in the long-term stability of the unit's accurate serving of data files to clients - this is discussed in the Data Integrity part of this section.

As with the other servers, we used an IR gun to measure the temperature profile of the disks after a long period of heavy activity, as follows. Two separate infinite loops of the disk benchmarking utility bonnie were run overnight - one on each RAID 5 array (3ware controller). The tests were stopped, the machine shut down, and the disks pulled out and scanned with the IR gun (mostly over the rotation point of the disk) to measure the temperature profile, all as quickly as possible. The disks varied in temperature from 82 to 85 deg F, similar to the ASA and Polywell servers and reasonably well below the point at which one might become concerned (roughly 90 F as a general rule of thumb). Of course, long-term data stability tests in a realistic operating environment are the ultimate test of system cooling.

Local disk read performance

Here we test the data throughput and integrity for local disk reads, as was done for the ASA and Polywell servers.

Throughput

The benchmarking script disk_bench.pl was run on the server to test the local scaling of simultaneous read requests. The tests were run "as is" in terms of system configuration (i.e. the system was tested as configured by the vendor). Some characteristics of the vendor configuration:
-- RedHat 7.2 (2.4.7-10smp kernel)
-- Two separate ext3 volumes (751GB and 902GB) used for benchmarking, one for each 3ware controller

The disk_bench.pl results are shown below, with ASA and Polywell server results previously shown included for reference:


                  Rackable server      ASA server     Polywell server
       #threads   1 cont  2 cont    1 cont  2 cont    1 cont  2 cont (MB/s)
       --------   -----   ------    -----   ------    ------  ------
          1         60      66        83      83        86      90
          2         47      97        57     117        64     128
          4         30      66        54     107        62     119
          8         25      47        47     108        43     109
         16         25      42        48      97        38      82
         32         25      40        47      94        36      72
         60         23      35        46      89        32      69
        100         21      32        40      82        26      66
    

There are differences in both hardware and OS/software which make comparison of the Rackable unit with the ASA/Polywell servers difficult to interpret. These differences can be noted from the descriptions provided on this web page. They include:
-- Motherboard (SDS2 vs STL2)
-- Kernel (2.4.7 vs 2.4.2/3)
-- Benchmarked filesystem (ext3 vs ext2)
-- IDE disks + 3ware driver (160GB + beta driver vs 100GB + standard driver)

This last difference (disks + driver) is most likely to account for most of the performance difference. The WD 100GB drives are 7200 RPM, whereas the 160GB Maxtor drives are 5400 RPM with a 12 ms average seek time. One might expect the difference in RPM to dominate for a single thread, while seek time and driver issues become increasingly important as the number of simultaneous reads increases. In fact, the Rackable server read throughput is ~28% worse than the ASA server for one read thread, which is consistent with the 25% lower spindle speed of the 160GB Maxtor disks. The throughput deficit of the Rackable server becomes more substantial as the number of threads increases, reaching a factor of 2-3 at the highest thread counts. The source of this increased discrepancy is not clear, but may be an issue with the beta driver or increased disk seeking.

Additional Throughput Tests

In order to help understand the differences between the Rackable and ASA server performance, Rackable Systems sent us 13 Western Digital 100GB 7200 RPM drives (the same model as in the ASA server). These drives replaced the Maxtor 160GB drives in the Rackable server for testing, and the following results were obtained:


                               Rackable server
                   160GB Maxtor         100GB WD            ASA server 
       #threads   1cont  2cont  1cont(6) 1cont(7)  2cont   1cont  2cont  (MB/s)
       --------   -----  -----  -------- --------  -----   -----  -----
          1         60     66      58       45      58(6)   83      83 
          2         47     97      37       32      88      57     117 
          4         30     66      50       49      68      54     107 
          8         25     47      39       40      84      47     108 
         16         25     42      30       29      77      48      97 
         32         25     40      28       28      59      47      94 
         60         23     35      26       28      54      46      89 
        100         21     32      25       27      52      40      82 
    

where the number in parentheses after 1cont refers to the number of disks attached to the controller (recall that the Rackable server has 13 disks attached to two 3ware controllers). The most evident difference between the 160GB and 100GB drive results is the better scaling when the 100GB drives are used. The Rackable server throughput for our read tests is still below the ASA server performance. A dump of the Raid array summary from the 3ware 3dm utility is shown for the ASA server and Rackable server.

Data Integrity

Here we investigate the integrity of the data read locally using the md5_checks.pl script (see the previous discussion on this page). With the Rackable server we have our most extensive test of data integrity yet - 15800 iterations of reading a 1 GB file from the IDE RAID arrays without any bit errors (nearly one week of running on both arrays). This corresponds to ~3e14 bits read without error, and tests the quality and stability of the unit's ability to serve data (locally).

fcdfcode1 Performance

Throughput

The benchmarking script disk_bench.pl was run on fcdfcode1 to test the local scaling of simultaneous read requests. The results are shown below:


       #threads   (MB/s)
       --------   -----
          1         47
          2         48
          4         56
          8         56
         16         61
         32         60
         60         58
        100         56
    

Data Integrity

Here we investigate the integrity of the data read locally using the md5_checks.pl script (see the previous discussions on this page). With the disk array attached to fcdfcode1, we have successfully read an 8 GB file more than 1100 times with no data bit errors. This corresponds to a bit error rate (BER) of less than 1 in 8e13 bits.