Distributed File System Benchmarking

Parallel Distributed Systems Facility

NERSC Lawrence Berkeley National Lab

Stephen Chan sychan@lbl.gov
 

Motivations:


    One of the main problems facing scientific computing, especially in the field of High Energy Nuclear Physics (HENP), is the large quantities of data that are generated and need to be analyzed. Active and planned experiments expect to generate a PetaByte or more of data per year.
    Data analysis in the field of HENP can often be decomposed into embarrassingly parallel computations, which are relatively easy to scale on clusters of loosely interconnected machines. However, scaling the efficient handling of the data sets is not easy, and it can be considered an instance of the pervasive memory latency issue that is a bottleneck in high performance computing. While hardware prices for storage and networking have been steadily dropping, it isn't clear how well the software is evolving to deal with the increased storage capacities now possible.
    Distributed file systems make it possible to have the illusion of a large single storage system across a cluster of machines. In this document, we look into the state of contemporary distributed filesystems, and benchmark those that are potential candidates for production use. We are interested in high performance and stability, because of our production orientation.
    In the early part of 2001, PDSF ran into serious problems with NFS stability. When a high number of clients ( ~40) simultaneously generated traffic to our disk vaults, we found that the server would hang, and require a reboot to be functional. We got around the problem by throttling the number of simultaneous connections to the server, but it got us started looking for a better DFS.
 
 

Distributed File Systems (The Contenders):


    The platform we are interested in is Linux. In production Linux HENP clusters, we typically see 2.2 kernels, with a move to 2.4 based kernels as they become more production worthy. While there may be many sites that run Sun, Irix, AIX or even Windows, we are focusing on Linux because that is the platform we use and the one we are most concerned with.
 

   The following DFS's were considered:

    NFS - the incumbent. Available everywhere, and has fairly mature implementations. We examine NFS version 2 and version 3. Not considered high performance; however, everyone ends up using it.

    GFS - a new cluster file system out of the University of Minnesota, currently being developed by Sistina Software. It has reached a usable state, but is still not nearly as mature as NFS.

    AFS - the Andrew file system. AFS is scalable on a "worldwide" scale in terms of capacity, but it has never been considered to be very fast. Typically it is slower than NFS; however, the client side caching compensates somewhat.

    GPFS - the IBM cluster file system. Originally developed for IBM AIX based clusters, it is being ported to Linux.

    PVFS - the Parallel Virtual Filesystem. A high performance filesystem designed to stripe data across multiple server machines in order to parallelize access for clients.
 

Distributed File Systems - the first cut:


    Some of the contenders get DQ'ed:

    GPFS - holds a lot of promise, however IBM was unable to get GPFS running on the Alvarez cluster (an IBM built and managed cluster). Not production ready - perhaps not even beta ready.

    PVFS - in our tests, it was unstable, and the entire system was vulnerable to a single point of failure (the metadata server). Application code requires modification to run properly. Not ready for production.

Distributed File Systems - the second cut:


    AFS - fairly mature, and running in production at many sites, but deployment is very resource intensive, and requires specialists. In addition, the performance is typically slower than NFS. AFS has been released for Linux, but we do not have the resources to test it at this time.

    GFS - works as advertised, but it is still not clear that it is stable enough for production. GFS is actually designed to run on Fibre Channel SANs (Storage Area Networks). GFS traffic can be encapsulated in TCP, however there is a significant performance penalty. In addition, the lock management for GFS is still not mature - it depends on DMEP, an extension to the SCSI protocol that allows read/write access to memory ranges on SCSI devices. DMEP is not widely supported. An IP based lock manager, MEMEXPD, is available to handle locking on LAN/WAN installations. However, there is a lot of lock traffic, and it is highly sensitive to latency in the transport medium - as a consequence, scalability to high numbers of clients is not expected to be good (for IP, expect it to be less scalable than NFS in the near term). Sistina is also focusing on the 2.4 kernel, so updates to the code are not actively back ported to 2.2.

    Both these file systems are worth investigating; GFS especially looks like a good candidate for sites that have a SAN. However, for PDSF, they are not immediately compelling (we can't afford a SAN, and AFS is heavy and slow).
 

Distributed File Systems - the low hanging fruit:


    I wish I had better news, but NFS is the only realistic choice for our environment. That being said, how can we better understand NFS performance on Linux?
    Linux currently supports NFSv2 well on 2.2 kernels, and the best NFSv3 support is in the 2.4 kernels.

    Issues:

    - NFSv2 on 2.2 kernels has been known to hang the server under high load ( ~40 clients)
    - NFSv3 on 2.2 kernels does not fully implement the spec, and does not correctly support async writes

    - The NFSv3 spec supports client side caching, which improves write performance.
        - in practice, the NFSv2 implementation on 2.2 already uses the disk cache, so we get client side caching for free with NFSv2 on 2.2
    - Because NFSv3 on 2.2 doesn't support async writes, write performance is dismal
    - NFSv3 on 2.4 properly supports async writes, so write performance returns to reasonable levels

    Because of the poor support for NFSv3 in 2.2, we run some benchmarks, but do not waste our time testing all configurations.
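
    For reference, a minimal sketch of how a client mount might be issued for these tests is below. The server name, export path, and mount point are placeholders, not our actual configuration; nfsvers, rsize, and wsize are the standard Linux NFS mount options as we understand them.

        # Hypothetical sketch: mounting the test export as NFSv2 or NFSv3 with
        # the 8k transfer sizes used in our benchmarks. Names are placeholders.
        import subprocess

        SERVER_EXPORT = "nfsserver:/export/bench"   # hypothetical export
        MOUNT_POINT   = "/mnt/bench"                # hypothetical mount point

        def mount_nfs(version=2):
            # nfsvers selects the protocol version; rsize/wsize set the transfer size
            opts = "nfsvers=%d,rsize=8192,wsize=8192,hard,intr" % version
            subprocess.check_call(
                ["mount", "-t", "nfs", "-o", opts, SERVER_EXPORT, MOUNT_POINT])

        # e.g. mount_nfs(2) for NFSv2 on a 2.2 client, mount_nfs(3) for NFSv3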
 

Benchmarking Tools:


    The individual disk access patterns of PDSF applications can be characterized as large sequential accesses. Data sets are anywhere from several megabytes to a gigabyte in size; they are typically read straight through, and outputs are written out sequentially. Only a single job is run per CPU, so disk accesses are essentially sequential.
    However, multi-client access to the same server can result in randomizing the I/O patterns at the server side.
    Access patterns:

    Server side - possibly sequential, but likely to be random
    Client side - almost always sequential, possibly multiple files simultaneously accessed sequentially

    As a result, to simulate the load that would be generated on our servers, we should use a series of clients that perform sequential disk I/O, giving us the same sequential-at-the-client, random-at-the-server behavior.
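
    As a rough illustration (not part of the actual harness), the per-client pattern we want to reproduce boils down to a sequential pass over a large file; the path and read size below are arbitrary placeholders:

        # Minimal sketch of the sequential per-client access pattern.
        # Path and block size are placeholders, not taken from the real jobs.
        BLOCK_SIZE = 64 * 1024                 # read in 64KB chunks
        DATA_FILE  = "/mnt/bench/dataset.dat"  # hypothetical input file on NFS

        def sequential_read(path=DATA_FILE, block=BLOCK_SIZE):
            bytes_read = 0
            f = open(path, "rb")
            while True:
                buf = f.read(block)
                if not buf:          # EOF - the file is read straight through
                    break
                bytes_read += len(buf)
            f.close()
            return bytes_read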

    We actually examined several benchmarks: iozone, bonnie and postmark

    IOZone http://www.iozone.org - this is a very thorough file system benchmark that examines performance across a large range of parameters (file size, block size, # of threads, etc...). The implementation seems to also be more efficient than Bonnie, because for comparable performance levels, there is more CPU idle time using iozone. This is a fairly general tool that can be used to get far more information than you really wanted to know about IO performance.

    Bonnie http://www.textuality.com/bonnie/ - this is a fairly easy to use benchmark that measures performance for sequential and random IO, at the block and character level. The results it gives are generally similar to iozone for throughput. However, as mentioned above, IOZone can measure more things, and seems to run more efficiently.

    PostMark http://www.netapp.com/tech_library/3022.html - this is a benchmark that specifically examines lots of small random I/Os across a large number of files. It is meant to simulate the disk access patterns of NNTP servers. These access patterns don't really correspond to our HENP applications, but may be relevant to performance for compiling code.

    IOZone and Bonnie are the closest match for our applications. After some consideration, we use IOZone for the benchmark - even though it is somewhat more complicated to use, it is more efficient, making it more of a pure filesystem benchmark and less of a CPU benchmark.
    We do present some Bonnie benchmark results (in the Jumbo Frame benchmarks) - so long as we are doing Bonnie vs. Bonnie benchmarks, they are useful for seeing relative performance across different configurations.
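
    To make the IOZone runs concrete, a single-client invocation of the sequential write and read tests looks roughly like the following (wrapped in Python for consistency with the other sketches; the -s/-r/-i/-f options are the IOZone flags as we understand them, and the test file path is a placeholder):

        # Sketch of one IOZone run: sequential write (-i 0) then read (-i 1)
        # on a 2GB file with an 8KB record size, matching the NFS rsize/wsize.
        import subprocess

        def run_iozone(testfile="/mnt/bench/iozone.tmp", size="2g", record="8k"):
            cmd = ["iozone", "-s", size, "-r", record,
                   "-i", "0", "-i", "1", "-f", testfile]
            out = open("iozone-%s.out" % size, "w")
            subprocess.check_call(cmd, stdout=out)
            out.close()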
 

Benchmark methodology


    The main thing we want to understand is throughput over NFS, for different numbers of clients, over different OS and NFS versions. We are also interested in the performance difference between jumbo frame and normal frame sizes on ethernet. It is important that we test on a large enough file size to defeat caching. Note that for many of our applications, caching doesn't come into play because the data is only read or written once.
    To test the first aspect, we set aside a group of 11 machines on a relatively quiescent switch. Each of these machines is running only the benchmark, and whatever normal services/daemons one would find on a compute server or an NFS server.
    We have a scripted test harness that copies the benchmarking code and any configuration files over to all the nodes being tested. The following then occurs (a rough sketch of these steps appears after the list):
    1: the NFS server unmounts the partition being used for the testing, runs mkfs to create a fresh ext2 filesystem, remounts the partition and then exports it over NFS
    2: all the clients mount the exported partition using whatever parameters are set (NFS version, etc...)
    3: clocks are synchronized across all the nodes
    4: the actual test scripts are scheduled to be run at a certain time with the "at" command
    5: the master node waits for all the client nodes to signal that they are done, and then copies the results back from each client
    6: all the benchmark files on the client nodes are deleted, and the shared partition is unmounted
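
    Roughly, steps 1 through 4 look like the sketch below. The device name, export path, and script names are placeholders (the real harness is a set of shell scripts); this is only meant to show the sequence of operations.

        # Simplified sketch of the harness sequence. All names are placeholders.
        import subprocess

        TEST_DEVICE = "/dev/sda5"        # hypothetical partition used for the test
        EXPORT_DIR  = "/export/bench"    # hypothetical export point on the server

        def sh(cmd):
            subprocess.check_call(cmd, shell=True)

        def prep_server():
            # step 1: fresh ext2 filesystem, remount, export over NFS
            sh("umount %s || true" % EXPORT_DIR)
            sh("mkfs -t ext2 %s" % TEST_DEVICE)
            sh("mount %s %s" % (TEST_DEVICE, EXPORT_DIR))
            sh("exportfs -o rw '*:%s'" % EXPORT_DIR)

        def schedule_clients(clients, start_time="23:00"):
            # steps 2-4: clients mount the export, clocks are assumed to be
            # synchronized (e.g. with ntpdate), and the benchmark script is
            # queued with "at" so that all clients start at the same time.
            for host in clients:
                sh("ssh %s mount -t nfs -o rsize=8192,wsize=8192 "
                   "nfsserver:%s /mnt/bench" % (host, EXPORT_DIR))
                sh("ssh %s 'echo /usr/local/bin/run_bench.sh | at %s'"
                   % (host, start_time))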

    During the benchmark, client sar information is captured to a file for later analysis.
    The benchmark consists of running IOZone in Write and then Read tests, for a 2 Gigabyte file (larger than available cache memory on all clients and servers), and then on a 1 Gigabyte file (will fit in client cache).
    Since we are testing with such relatively large files, I throw in a separate test for different file sizes. This isn't really a test of the distributed filesystem, but it is useful to know in order to have a context for interpreting our results for 1GB and 2GB files.

    Currently the tests are run with 1 client, then 2 clients, then 5 clients and finally 10 clients. Eventually, we want to run with as many clients as it takes to kill off the server, but that is something we haven't had a chance to schedule yet.

    Ideally, we would want to run the very same tests with jumbo frame ethernet, but jumbo frame and normal frame ethernet don't coexist well, so for the time being we have only done some single client jumbo tests.

    We would also ideally run the tests multiple times and average the results; however, in our testing we found a surprising level of consistency in performance. The major variation comes in when caching effects kick in. For the time being, we use only one run for each configuration. This is flawed, but looking at the results, I think you will understand why I don't think it is a large issue.
 
 

Configuration:

    NFS server:
    Dual 866Mhz PIII with 512MB RAM
    3Ware Escalade 6200 series 4 channel IDE RAID, with 3 72GB drives striped,
        64k stripe size
    SysKonnect copper GigE NIC
    Redhat 6.2 (with custom 2.2.19 kernel) or Redhat 7.1 (with stock 2.4.3 kernel)
    rsize and wsize of 8192 on NFS
    NFS rpms for RH6.2:
        nfs-utils-0.3.1-0.6.x.1
        mount-2.10r-0.6.x

    NFS clients:
    Dual 1Ghz PIII with 2GB RAM
    local IDE drives using software striping
    SysKonnect copper GigE NIC
    Redhat 6.2 (with custom 2.2.19 kernel)
    rsize and wsize of 8192 on NFS mount
    NFS rpms for RH6.2:
        nfs-utils-0.3.1-0.6.x.1
        mount-2.10r-0.6.x

    Switch:
    Extreme Networks Summit 7i, 28+4 port gigE "wire speed" switch
 

Benchmark Results:

Local filesystem baseline:

    This is a baseline performance measure on the local filesystem of the NFS server.

             --Output--   --Input---
             --Block---   --Block---
Machine   MB    K/sec      K/sec
local    100   244216     150460
local   1000    72899      81738
local   2000    62885      78460
 

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          100 14414 100.2 220353 99.0 92188 99.9 14258 100.1 471726 96.7 21667.2 200.4
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          500 13967 100.0 85762 96.7 23259 54.5 13782 97.4 82939 66.7 740.4  8.5
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         1000 13828 99.9 67756 96.8 22829 51.6 14069 97.5 80783 62.5 358.7  3.4
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000 13775 99.9 58966 94.2 22918 51.3 14323 97.2 78885 57.5 261.1  2.7
 

Large files vs Small files:

    Here are some Bonnie results for different file sizes, across an NFSv2 mount on a 2.2.19 server:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
           10  5546 92.1 26251 35.9 23186 67.9  7071 100.1 177147 103.8 6343.7 77.7
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          100  5537 93.1 26336 76.1 23495 78.7  7041 100.0 177193 100.4 6298.3 89.8
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          250  5528 95.2 22608 49.9  9369 34.9  5603 96.7 15379 57.9 2375.9 83.2
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          500  5517 94.7 17783 33.3  6292 20.0  4889 88.6 16175 64.8 648.6 23.4
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         1000  5467 93.5 15677 23.2  5733 16.5  4995 90.4 15798 66.0 329.7 10.6

Analysis:
    This benchmark was run on a different machine than the main benchmarks - the client had only 256MB of RAM. You can see that the block read performance falls to realistic levels once the filesize exceeds the available client cache.
    You can also see that the write performance starts to decline as the filesize gets larger - block writing a 1GB file is roughly 40% slower than writing a 100MB file. However, performance does not decline much for 10MB - 250MB files. This seems to indicate that there is a caching effect for writes as well.
    This is not unreasonable: writes under NFSv2 are asynchronous by default, and if all reads and writes go through the local disk buffers, then the drop in performance past 250MB makes some sense.
 

Jumbo frame vs Normal Frame

    Here is a Bonnie run done on the same hardware, but with Jumbo Frame enabled for an MTU of 9000 bytes (big enough for the rsize/wsize of 8192).
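
    (For reference, enabling jumbo frames amounts to raising the MTU to 9000 bytes on the SysKonnect interfaces at both ends, plus the switch ports; the interface name below is a placeholder.)

        # Hypothetical sketch: raise the NIC MTU to 9000 to enable jumbo frames.
        import subprocess
        subprocess.check_call(["ifconfig", "eth1", "mtu", "9000"])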

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          100  5861 96.9 35267 73.7 30658 74.3  7033 99.9 177614 98.9 7089.3 95.7
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          500  5808 95.3 29198 47.3  7763 17.2  5489 83.2 26826 46.0 721.0 19.6
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         1000  5801 95.5 18802 29.1  7100 15.0  5741 87.5 28696 52.8 354.3 11.3
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000  5824 95.6 18665 27.1  7179 15.5  5711 87.2 27064 53.7 269.8  8.4

    Comparing the performance for similar file sizes:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
MTU-1500  100  5537 93.1 26336 76.1 23495 78.7  7041 100.0 177193 100.4 6298.3 89.8
MTU-9000  100  5861 96.9 35267 73.7 30658 74.3  7033 99.9 177614 98.9 7089.3 95.7

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
MTU-1500  500  5517 94.7 17783 33.3  6292 20.0  4889 88.6 16175 64.8 648.6 23.4
MTU-9000  500  5808 95.3 29198 47.3  7763 17.2  5489 83.2 26826 46.0 721.0 19.6

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
MTU-1500 1000  5467 93.5 15677 23.2  5733 16.5  4995 90.4 15798 66.0 329.7 10.6
MTU-9000 1000  5801 95.5 18802 29.1  7100 15.0  5741 87.5 28696 52.8 354.3 11.3

Analysis:
    In terms of read performance, so long as the file resides entirely within the client cache, jumbo frame makes no difference. However, for block writes we see up to a ~60% boost in performance (at 500MB), and for reads we see up to an 80% boost (sequential block reads at 1GB). Notice that on reads, jumbo frame increases throughput and also lowers CPU utilization.
    Jumbo frames provide a significant performance boost and, in most situations, lower relative CPU utilization (sometimes they lower absolute utilization as well).
 

Jumbo Frame NFSv2 vs. NFSv3

    This run is a repeat of the above jumbo frame run, but with NFSv3 on the 2.2.19 kernel.

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          100  3498 61.4  7483 23.8  6233 23.0  7015 100.0 178076 99.1 911.4 13.0
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          500  2597 45.5  4091 12.8  1291  5.7  5944 94.3 25866 56.2 216.5  6.7
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         1000  2071 36.1  2846  8.3   882  3.6  5950 94.0 26356 53.5 106.8  2.5
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000  1663 28.9  2104  5.9   513  1.9  5928 94.1 26081 55.0  81.8  2.3
 

    Comparing similar runs:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
NFSv2     100  5861 96.9 35267 73.7 30658 74.3  7033 99.9 177614 98.9 7089.3 95.7
NFSv3     100  3498 61.4  7483 23.8  6233 23.0  7015 100.0 178076 99.1 911.4 13.0

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
NFSv2     500  5808 95.3 29198 47.3  7763 17.2  5489 83.2 26826 46.0 721.0 19.6
NFSv3     500  2597 45.5  4091 12.8  1291  5.7  5944 94.3 25866 56.2 216.5  6.7

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
NFSv2    1000  5801 95.5 18802 29.1  7100 15.0  5741 87.5 28696 52.8 354.3 11.3
NFSv3    1000  2071 36.1  2846  8.3   882  3.6  5950 94.0 26356 53.5 106.8  2.5

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
NFSv2    2000  5824 95.6 18665 27.1  7179 15.5  5711 87.2 27064 53.7 269.8  8.4
NFSv3    2000  1663 28.9  2104  5.9   513  1.9  5928 94.1 26081 55.0  81.8  2.3

Analysis:
    NFSv3 on the 2.2 kernel seems to offer roughly equivalent performance for reads, but markedly slower performance on writes. By forcing synchronous writes under NFSv3, we lose, on the average, about 85% of our throughput on block writes.
    Friends don't let friends run NFSv3 on 2.2 kernels.
 

Varying the number of NFSD processes:

    The number of NFSD processes defaults to 8; however, I decided to see if there was a performance advantage to running more. I noticed that when running with as few as 1 client, all 8 nfsd processes would be running. In this experiment, I have 2 clients doing the IOZone workload against a single server. On the server I tried 8 daemons and then 16 daemons to see if there would be any performance difference.
    The server was running the 2.2.19 kernel, on normal frame gigE. Let's look only at aggregate throughput for reading and writing.
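
    (For reference, the nfsd count is changed by starting rpc.nfsd with a different thread count - on Red Hat this is normally driven by the nfs init script. A minimal sketch:)

        # Hypothetical sketch: bump the number of kernel nfsd threads from 8 to 16.
        import subprocess
        subprocess.check_call(["rpc.nfsd", "16"])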

             --Output--   --Input---
             --Block---   --Block---
Machine   MB    K/sec      K/sec
 8 nfsd 2000    20553      19542
16 nfsd 2000    20781      20309

 8 nfsd 1000    21964     538896
16 nfsd 1000    21027     536607

 8 nfsd  100    37796    1032889
16 nfsd  100    38283    1032815

Analysis:
    For all file sizes, the number of nfs daemons has basically no effect on read or write. At this point, we start to see the ridiculous read throughputs that are possible when the test file is entirely in memory. When IOZone does read tests on a file that resides entirely within the disk buffer cache, the throughput number is consistently about 500MB/sec.
    Apparently adding more nfs daemon processes has no effect on performance.

    This experiment also shows that for small files that can be fully cached on the client, the IOZone benchmarks are essentially a test of how fast the application can read data out of the disk cache. We notice that IOZone achieves much higher read throughput from the cache than Bonnie - Bonnie can read at 180MB/sec from the buffers, while IOZone can read over 500MB/sec.
 

Single Client NFSv2 2.2.19 kernel vs 2.4.3 kernel


    This is basically a head-to-head comparison of the 2.2 kernel versus the RH 7.1 2.4 kernel for NFSv2 file sharing. This comparison illustrates the main differences between the 2.2 and 2.4 kernels as NFS servers. The clients are still running the 2.2 kernels.

             --Output--   --Input---
             --Block---   --Block---
Machine   MB    K/sec      K/sec
2.2     2000    20402      24799
2.4     2000    24723      21150

2.2     1000    21011     517902
2.4     1000    25611      19726

2.2      100    27129     517731
2.4      100    35037     521637
 

Analysis:
    For write performance, the 2.4.3 kernel seems to provide a roughly 20% performance boost at the server side. The read performance is fairly close; however, the 500MB/sec+ throughputs are an artifact of the caches, and those numbers should be thrown out. The 2.2 kernel may have a slight performance edge when it comes to reads.
 

Multiple client scalability for NFSv2 2.2 kernel

    This shows how NFS throughput scales up as we add more clients. The throughput values are aggregate throughput.

               --Output--   --Input---
               --Block---   --Block---
Machine    MB    K/sec      K/sec
1 client   2000  20402      24799
2 clients  2000  20553      19542
5 clients  2000  20635      12183
10 clients 2000  20376      10475

1 client   1000    21011     silly
2 clients  1000    21694     silly (21674)
5 clients  1000    31893     silly (14432)
10 clients 1000    17672     silly (14025)

1 client    100    27129     silly
2 clients   100    37796     silly
5 clients   100    63928     silly
10 clients  100    43230     silly

Analysis:
    The benchmark does very little synchronization between clients, and runs the tests for different file sizes one right after the other. The larger files get done first, with the smaller ones following. As a consequence, the performance numbers for the smaller files are dubious, because clients will have started to fall out of sync with each other in the load they generate. On the next iteration through this test, we will increase the number of synchronization points.
    However, for the 2GB file, we are fairly certain that the numbers are representative.
    It is clear that write performance is basically constant, irrespective of the number of clients. It is also clear that read performance begins to deteriorate as the number of clients goes up.
    This may be a sign that, because writes are asynchronous, the OS is capable of doing write coalescing, whereas on reads there is nothing to mitigate the randomization caused by multiple clients reading at once.
 

Multiple client scalability for NFSv3, 2.2 kernel


    We don't bother to do this benchmark because NFSv3 performance is so poor under the 2.2 kernel.
 

Multiple client scalability for NFSv2, 2.4 kernel


    [to be completed]
 

Multiple client scalability for NFSv3, 2.4 kernel


    [to be completed]
 

Conclusions:


    - NFSv3 should be avoided on 2.2 kernels
    - Jumbo frames should be used when possible, however they should be used on
        dedicated jumbo frame networks to avoid frag/defrag overhead.
        You can see a 60% to 80% boost in performance with Jumbo.
    - 2.2 NFSv2 write performance does not scale up. However, it does not degrade under load
    - 2.2 NFSv2 read performance degrades significantly as the number of clients goes up
    - Linux disk buffer caching gives you free NFS read caching, great if you can stay in cache
    - As an NFSv2 server, the 2.4.3 kernel is roughly 20% faster on writes - read
        performance is unclear
    - Increasing the number of nfsd processes seems to have no effect on throughput
    - Write performance decreases as file sizes increase