Stephen Chan sychan@lbl.gov
One of the main problems facing scientific computing,
especially in the field of High Energy Nuclear Physics (HENP), is the large
quantities of data that are generated and need to be analyzed. Active and
planned experiments expect to generate a PetaByte or more of data per year.
Data analysis in the field of HENP can often be
decomposed into embarrassingly parallel computations, which are relatively
easy to scale on clusters of loosely interconnected machines. However,
scaling the efficient handling of the data sets is not easy, and it can
be considered an instance of the pervasive memory latency issue that
is a bottleneck in high performance computing. While hardware prices
for storage and networking have been dropping steadily, it isn't
clear how well the software is evolving to deal with the increased storage
capacities now possible.
Distributed file systems make it possible to have
the illusion of a large single storage system across a cluster of machines.
In this document, we look into the state of contemporary distributed filesystems,
and benchmark those that are potential candidates for production use. We
are interested in high performance and stability, because of our production
orientation.
In the early part of 2001, PDSF ran into serious
problems with NFS stability. When a large number of clients (~40) simultaneously
generated traffic to our disk vaults, we found that the server would hang,
and require a reboot to be functional. We worked around the problem by throttling
the number of simultaneous connections to the server, but it got us started
looking for a better DFS.
The platform we are interested in is Linux. In
production Linux HENP clusters, we typically see 2.2 kernels, with a move
to 2.4 based kernels as they become more production worthy. While there
may be many sites that run Sun, Irix, AIX or even Windows, we are focusing
on Linux because that is the platform we use, and the one we are most concerned
with.
The following DFS's were considered:
NFS - the incumbent. Available everywhere, and has fairly mature implementations. We examine NFS version 2 and version 3. Not considered high performance, however everyone ends up using it.
GFS - a new cluster file system out of University of Minnesota, currently being developed by Sistina Software. It has reached a usable state, but is still not nearly as mature as NFS
AFS - the Andrew file system. AFS is scalable on a "worldwide" scale in terms of capacity, but it has never been considered to be very fast. Typically it is slower than NFS, however the client side caching compensates somewhat
GPFS - the IBM cluster file system. Originally developed for IBM AIX based clusters, it is being ported to Linux.
PVFS - the Parallel Virtual Filesystem. A high performance
filesystem designed to stripe data across multiple server machines,
parallelizing access for clients.
Some of the contenders get DQ'ed:
GPFS - holds a lot of promise, however IBM was unable to get GPFS running on the Alvarez cluster (an IBM built and managed cluster). Not production ready - perhaps not even beta ready.
PVFS - in our tests, it was unstable, and the entire system was vulnerable to a single point of failure (the metadata server). Application code requires modification to run properly. Not ready for production.
AFS - fairly mature, and running in production
at many sites, but deployment is very resource intensive, and requires
specialists. In addition, the performance is typically slower than NFS.
AFS has been released for Linux, but we do not have the resources to test
it at this time.
GFS - works as advertised, but it is still not clear whether it is stable enough for production. GFS is actually designed to run on Fibre Channel SANs (Storage Area Networks). GFS traffic can be encapsulated in TCP, however there is a significant performance penalty. In addition, the lock management for GFS is still not mature - it depends on DMEP, an extension to the SCSI protocol that allows read/write access to memory ranges on SCSI devices. DMEP is not widely supported. An IP based lock manager, MEMEXPD, is available to handle locking on LAN/WAN installations. However, there is a lot of lock traffic, and it is highly sensitive to latency in the transport medium - as a consequence, scalability for high numbers of clients is not expected to be good (for IP, expect it to be less scalable than NFS in the near term). Sistina is also focusing on the 2.4 kernel, so updates to the code are not actively back ported to 2.2.
Both these file systems are worth investigating,
GFS especially looks like a good candidate for sites that have a SAN. However,
for PDSF, they are not immediately compelling (we can't afford a SAN, and
AFS is heavy and slow).
I wish I had better news, but NFS is the only
realistic choice for our environment. That being said, how can we better
understand NFS performance on Linux?
Linux currently supports NFSv2 well on 2.2 kernels,
and the best NFSv3 support is in the 2.4 kernels.
Issues:
- NFSv2 on 2.2 kernels has been known to hang the
server under high load (~40 clients)
- NFSv3 on 2.2 kernels does not fully implement
the spec, and does not correctly support async
- the NFSv3 spec supports client side caching, which
improves write performance
  - in reality, the NFSv2 implementation on 2.2 uses the disk cache
    already, so we get client side caching for free with NFSv2 and 2.2
- NFSv3 on 2.2 doesn't support async, and results
in dismal write performance
- NFSv3 on 2.4 properly supports async, so write
performance returns to reasonable levels
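For reference, the async behavior is controlled on the server side by the export options. A minimal /etc/exports entry might look like the following (the path and host pattern are hypothetical, not our actual configuration):

```
# /etc/exports on the server -- "async" lets the server acknowledge
# writes before they reach disk; this is the behavior that NFSv3 on
# 2.2 kernels fails to support properly
/export/data    *.nersc.gov(rw,async)
```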
Because of the poor support for NFSv3 in 2.2, we
run some benchmarks, but do not waste our time testing all configurations.
The individual disk access patterns of PDSF applications
can be characterized as large sequential accesses. Data sets are anywhere
from several megabytes to a gigabyte in size, and they are typically read
straight through, and outputs are written out sequentially. Only a
single job is run per CPU, so disk accesses are essentially sequential.
However, multi-client access to the same server
can result in randomizing the I/O patterns at the server side.
Access patterns:
Server side - possibly sequential, but likely to
be random
Client side - almost always sequential, possibly
multiple file simultaneously accessed sequentially
As a result, to simulate the load that would be generated on our servers, we should be using a series of clients that perform sequential disk I/O, giving us the same sequential at client, random at server behavior.
We actually examined several benchmarks: iozone, bonnie and postmark
IOZone http://www.iozone.org - this is a very thorough file system benchmark that examines performance across a large range of parameters (file size, block size, # of threads, etc...). The implementation seems to also be more efficient than Bonnie, because for comparable performance levels, there is more CPU idle time using iozone. This is a fairly general tool that can be used to get far more information than you really wanted to know about IO performance.
Bonnie http://www.textuality.com/bonnie/ - this is a fairly easy to use benchmark that measures performance for sequential and random IO, at the block and character level. The results it gives are generally similar to iozone for throughput. However, as mentioned above, IOZone can measure more things, and seems to run more efficiently.
PostMark http://www.netapp.com/tech_library/3022.html - this is a benchmark that specifically examines lots of small random I/Os across a large number of files. It is meant to simulate the disk access patterns of NNTP servers. These access patterns don't really correspond to our HENP applications, but may be relevant to performance for compiling code.
IOZone and Bonnie are the closest match for our applications.
After some consideration, we use IOZone for the benchmark - even though
it is somewhat more complicated to use, it is more efficient, making it
more of a pure filesystem benchmark and less of a CPU benchmark.
We do present some Bonnie benchmark results (in
the Jumbo Frame benchmarks) - so long as we are doing Bonnie vs. Bonnie
benchmarks, they are useful for seeing relative performance across different
configurations.
The main thing we want to understand is throughput
over NFS, for different numbers of clients, over different OS and NFS versions.
We are also interested in the performance difference between jumbo frame
versus normal frame sizes on ethernet. It is important that we test on
a large enough file size to defeat caching. Note that in many of our applications,
caching doesn't come into play because the data is only read or written
once.
To test the first aspect, we set aside a group of
11 servers on a relatively quiescent switch. Each of these machines is
running only the benchmark, plus whatever normal services/daemons one would
find on a compute server or an NFS server.
We have a scripted test harness that copies the
benchmarking code and any configuration files over to all the nodes being
tested. The following then occurs:
1: the NFS server unmounts the partition being used
for the testing, runs mkfs to create a fresh ext2 filesystem, remounts
the partition and then exports it over NFS
2: all the clients mount the exported partition
using whatever parameters are set (NFS version, etc...)
3: clocks are synchronized across all the nodes
4: the actual test scripts are scheduled to be run
at a certain time with the "at" command
5: the master node waits for all the client nodes
to signal that they are done, and then copies the results back to the master
node
6: all the benchmark files on the client nodes are
deleted, and the shared partition is unmounted
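The steps above can be sketched as a shell script. This is a simplified reconstruction, not the actual harness: the host names, device, and paths are made up, and the commands default to a dry run.

```shell
#!/bin/sh
# Sketch of the benchmark harness steps. Host names, device names, and
# paths here are assumptions, not the real PDSF configuration.
# RUN defaults to "echo" for a dry run; set RUN= to actually execute.
RUN=${RUN:-echo}

SERVER=nfs-server
EXPORT=/export/bench
CLIENTS="node01 node02 node03"

# step 1: fresh ext2 filesystem on the server, re-exported over NFS
prepare_server() {
    $RUN ssh "$SERVER" "umount $EXPORT && mkfs -t ext2 /dev/sda5 && mount $EXPORT && exportfs -a"
}

# step 2: clients mount the export with the parameters under test
mount_clients() {
    for c in $CLIENTS; do
        $RUN ssh "$c" "mount -o nfsvers=2,rsize=8192,wsize=8192 $SERVER:$EXPORT /mnt/bench"
    done
}

# steps 3-4: synchronize clocks, then schedule the run with at(1)
schedule_run() {
    for c in $CLIENTS; do
        $RUN ssh "$c" "ntpdate timehost && echo 'sh /tmp/run_iozone.sh' | at $1"
    done
}
```

Steps 5 and 6 (collecting results and cleaning up) follow the same pattern: loop over the clients, copy result files back with rcp/scp, then delete the benchmark files and unmount.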
During the benchmark, client sar information is captured
to a file for later analysis
The benchmark consists of running IOZone in Write
and then Read tests, for a 2 Gigabyte file (larger than available cache
memory on all clients and servers), and then on a 1 Gigabyte file (will
fit in client cache).
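The runs above boil down to IOZone invocations like the following. The flag meanings are taken from the IOZone documentation (-i 0 selects the write test, -i 1 the read test, -s the file size, -r the record size, -f the test file); the 64k record size and the file path are assumptions for illustration:

```shell
# Build the IOZone command line for one client run
# ($1 = file size in MB, $2 = test file on the NFS mount)
iozone_cmd() {
    echo "iozone -i 0 -i 1 -s ${1}m -r 64k -f $2"
}

iozone_cmd 2048 /mnt/bench/testfile   # 2GB: larger than any cache
iozone_cmd 1024 /mnt/bench/testfile   # 1GB: fits in the client cache
```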
Since we are testing with such relatively large
files, I throw in a separate test for different file sizes. This isn't
really a test of the distributed filesystem, but it is useful to know in
order to have a context for interpreting our results for 1GB and 2GB files.
Currently the tests are run with 1 client, then 2 clients, then 5 clients and finally 10 clients. Eventually, we want to run with as many clients as it takes to kill off the server, but that is something we haven't had a chance to schedule yet.
Ideally, we would want to run the very same tests with jumbo frame ethernet, but jumbo frame and normal frame ethernet don't coexist well, so for the time being we have only done some single client jumbo tests.
We would also ideally run the tests multiple times
and average the results; however, in testing we found a surprising
level of consistency in performance. The major variation comes in when
caching effects kick in. For the time being, we use only 1 run at each configuration.
This is flawed, but looking at the results, I think you will understand
why I don't think it is a large issue.
NFS clients:
Dual 1Ghz PIII with 2GB RAM
local IDE drives using software striping
SysKonnect copper GigE NIC
Redhat 6.2 (with custom 2.2.19 kernel)
rsize and wsize of 8192 on NFS mount
NFS rpms for RH6.2:
nfs-utils-0.3.1-0.6.x.1
mount-2.10r-0.6.x
Switch:
Extreme Networks Summit 7i, 28+4 port gigE "wire
speed" switch
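The client mount parameters above correspond to an /etc/fstab entry like the following (the server name and mount point are hypothetical):

```
# /etc/fstab entry on a client (server name and mount point are made up)
dvault01:/export/data  /mnt/data  nfs  nfsvers=2,rsize=8192,wsize=8192  0 0
```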
                   --Output--    --Input---
                   --Block---    --Block---
Machine      MB      K/sec         K/sec
local       100     244216        150460
local      1000      72899         81738
local      2000      62885         78460
            -------Sequential Output-------- ---Sequential Input-- --Random--
            -Per Char- --Block--- -Rewrite-- -Per Char- --Block---  --Seeks---
Machine  MB K/sec %CPU  K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
        100 14414 100.2 220353 99.0 92188 99.9 14258 100.1 471726 96.7 21667.2 200.4
        500 13967 100.0  85762 96.7 23259 54.5 13782  97.4  82939 66.7   740.4   8.5
       1000 13828  99.9  67756 96.8 22829 51.6 14069  97.5  80783 62.5   358.7   3.4
       2000 13775  99.9  58966 94.2 22918 51.3 14323  97.2  78885 57.5   261.1   2.7
            -------Sequential Output-------- ---Sequential Input-- --Random--
            -Per Char- --Block--- -Rewrite-- -Per Char- --Block---  --Seeks---
Machine  MB K/sec %CPU  K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
         10  5546  92.1 26251 35.9 23186 67.9  7071 100.1 177147 103.8 6343.7 77.7
        100  5537  93.1 26336 76.1 23495 78.7  7041 100.0 177193 100.4 6298.3 89.8
        250  5528  95.2 22608 49.9  9369 34.9  5603  96.7  15379  57.9 2375.9 83.2
        500  5517  94.7 17783 33.3  6292 20.0  4889  88.6  16175  64.8  648.6 23.4
       1000  5467  93.5 15677 23.2  5733 16.5  4995  90.4  15798  66.0  329.7 10.6
Analysis:
This benchmark was run on a different machine than
the main benchmarks - only 256MB of RAM on the client. You can see that
the block read performance falls to realistic levels once the filesize
exceeds the available client cache.
You can also see that the write performance starts
to decline as the filesize gets larger - block writing a 1GB file is roughly
40% slower than writing a 100MB file. However, performance does not decline
much for 10MB - 250MB files. This seems to indicate that there is a caching
effect for writes as well.
This is not unreasonable: writes under NFSv2
are asynchronous by default, so if all reads and writes go through the local
disk buffers, then the drop in performance past 250MB makes some sense.
            -------Sequential Output-------- ---Sequential Input-- --Random--
            -Per Char- --Block--- -Rewrite-- -Per Char- --Block---  --Seeks---
Machine  MB K/sec %CPU  K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
        100  5861  96.9 35267 73.7 30658 74.3  7033  99.9 177614  98.9 7089.3 95.7
        500  5808  95.3 29198 47.3  7763 17.2  5489  83.2  26826  46.0  721.0 19.6
       1000  5801  95.5 18802 29.1  7100 15.0  5741  87.5  28696  52.8  354.3 11.3
       2000  5824  95.6 18665 27.1  7179 15.5  5711  87.2  27064  53.7  269.8  8.4
Comparing the performance for similar file sizes:
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block---  --Seeks---
Machine    MB K/sec %CPU  K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
MTU-1500  100  5537  93.1 26336 76.1 23495 78.7  7041 100.0 177193 100.4 6298.3 89.8
MTU-9000  100  5861  96.9 35267 73.7 30658 74.3  7033  99.9 177614  98.9 7089.3 95.7
MTU-1500  500  5517  94.7 17783 33.3  6292 20.0  4889  88.6  16175  64.8  648.6 23.4
MTU-9000  500  5808  95.3 29198 47.3  7763 17.2  5489  83.2  26826  46.0  721.0 19.6
MTU-1500 1000  5467  93.5 15677 23.2  5733 16.5  4995  90.4  15798  66.0  329.7 10.6
MTU-9000 1000  5801  95.5 18802 29.1  7100 15.0  5741  87.5  28696  52.8  354.3 11.3
Analysis:
In terms of read performance, so long as the file
resides entirely within the client cache, jumbo frame makes no difference.
However for block writes we see up to a ~60% boost in performance (at 500MB)
and for reads we see up to an 80% boost (sequential reads at 1GB). Notice
that on reads, jumbo frame increases throughput and also lowers CPU utilization.
Jumbo frames provide a significant performance boost,
and in most situations they lower relative CPU utilization (sometimes they
lower absolute utilization as well).
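Enabling jumbo frames amounts to raising the interface MTU on every host involved. The following is a dry-run sketch (the interface name eth1 is an assumption); note that the switch and both endpoints must all carry MTU 9000 for this to help:

```shell
# Command used to enable jumbo frames on a SysKonnect GigE interface.
# The interface name is an assumption; run the command as root to apply.
MTU_CMD="ifconfig eth1 mtu 9000"
echo "$MTU_CMD"
```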
            -------Sequential Output-------- ---Sequential Input-- --Random--
            -Per Char- --Block--- -Rewrite-- -Per Char- --Block---  --Seeks---
Machine  MB K/sec %CPU  K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
        100  3498  61.4  7483 23.8  6233 23.0  7015 100.0 178076  99.1  911.4 13.0
        500  2597  45.5  4091 12.8  1291  5.7  5944  94.3  25866  56.2  216.5  6.7
       1000  2071  36.1  2846  8.3   882  3.6  5950  94.0  26356  53.5  106.8  2.5
       2000  1663  28.9  2104  5.9   513  1.9  5928  94.1  26081  55.0   81.8  2.3
Comparing similar runs:
           -------Sequential Output-------- ---Sequential Input-- --Random--
           -Per Char- --Block--- -Rewrite-- -Per Char- --Block---  --Seeks---
Machine MB K/sec %CPU  K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU   /sec %CPU
NFSv2  100  5861  96.9 35267 73.7 30658 74.3  7033  99.9 177614  98.9 7089.3 95.7
NFSv3  100  3498  61.4  7483 23.8  6233 23.0  7015 100.0 178076  99.1  911.4 13.0
NFSv2  500  5808  95.3 29198 47.3  7763 17.2  5489  83.2  26826  46.0  721.0 19.6
NFSv3  500  2597  45.5  4091 12.8  1291  5.7  5944  94.3  25866  56.2  216.5  6.7
NFSv2 1000  5801  95.5 18802 29.1  7100 15.0  5741  87.5  28696  52.8  354.3 11.3
NFSv3 1000  2071  36.1  2846  8.3   882  3.6  5950  94.0  26356  53.5  106.8  2.5
NFSv2 2000  5824  95.6 18665 27.1  7179 15.5  5711  87.2  27064  53.7  269.8  8.4
NFSv3 2000  1663  28.9  2104  5.9   513  1.9  5928  94.1  26081  55.0   81.8  2.3
Analysis:
NFSv3 on the 2.2 kernel seems to offer roughly equivalent
performance for reads, but markedly slower performance on writes. By forcing
synchronous writes under NFSv3, we lose, on average, about 85% of our
throughput on block writes.
Friends don't let friends run NFSv3 on 2.2 kernels.
                   --Output--    --Input---
                   --Block---    --Block---
Machine      MB      K/sec         K/sec
8 nfsd     2000      20553         19542
16 nfsd    2000      20781         20309
8 nfsd     1000      21964        538896
16 nfsd    1000      21027        536607
8 nfsd      100      37796       1032889
16 nfsd     100      38283       1032815
Analysis:
For all file sizes, the number of nfs daemons has
basically no effect on read or write. At this point, we start to see the
ridiculous read throughputs that are possible when the test file is entirely
in memory. When IOZone does read tests on a file that resides entirely
within the disk buffer cache, the throughput number is consistently about
500MB/sec.
Apparently adding more nfs daemon processes has
no effect on performance.
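For context, varying the daemon count is a one-line change. On Red Hat 6.2 the count is set via the RPCNFSDCOUNT variable in the nfs init script, or directly with rpc.nfsd (these names are given from memory; verify against your init script before relying on them):

```shell
# Command to set the number of NFS daemon threads on the server.
# (RPCNFSDCOUNT in /etc/rc.d/init.d/nfs accomplishes the same thing
# at boot; both names are from memory.)
NFSD_CMD="rpc.nfsd 16"
echo "$NFSD_CMD"
```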
This experiment also shows that for small files that
can be fully cached on the client, the IOZone benchmarks are essentially
a test of how fast the application can read data out of the disk cache.
We notice that IOZone reports much higher cached read throughput than
Bonnie - Bonnie reads at about 180MB/sec from the buffers, while IOZone
reads over 500MB/sec.
This is basically a head-to-head comparison
of the 2.2 kernel versus the RH 7.1 2.4 kernel for NFSv2 file sharing.
This comparison illustrates the main differences between the 2.2 and 2.4
kernels as NFS servers. The clients are still running the 2.2 kernels.
                   --Output--    --Input---
                   --Block---    --Block---
Machine      MB      K/sec         K/sec
2.2        2000      20402         24799
2.4        2000      24723         21150
2.2        1000      21011        517902
2.4        1000      25611         19726
2.2         100      27129        517731
2.4         100      35037        521637
Analysis:
For write performance, the 2.4.3 kernel seems to
provide a roughly 20% performance boost at the server side. The read performance
is fairly close, however the 500MB+ throughput is an artifact of the caches,
and those numbers should be thrown out. The 2.2 kernel may have a slight
performance edge when it comes to reads.
                   --Output--    --Input---
                   --Block---    --Block---
Machine      MB      K/sec         K/sec
1 client   2000      20402         24799
2 clients  2000      20553         19542
5 clients  2000      20635         12183
10 clients 2000      20376         10475
1 client   1000      21011         silly
2 clients  1000      21694         silly (21674)
5 clients  1000      31893         silly (14432)
10 clients 1000      17672         silly (14025)
1 client    100      27129         silly
2 clients   100      37796         silly
5 clients   100      63928         silly
10 clients  100      43230         silly
Analysis:
The benchmark does very little synchronization between
clients, and runs the tests for different file sizes one right after the
other. The larger files get done first, with the smaller ones following.
As a consequence, the performance numbers for the smaller files are dubious,
because the clients will have started to fall out of sync with each other in
the load they generate. On the next iteration of this test, we will
increase the number of synchronization points.
However, for the 2GB file, we are fairly certain
that the numbers are representative.
It is clear that write performance is basically
constant, irrespective of the number of clients. It is also clear that
read performance begins to deteriorate as the number of clients goes up.
This may be a sign that on writes, because it is
asynchronous, the OS is capable of doing write coalescing, whereas on reads,
there is nothing to mitigate the randomization from multiple client reads.
We don't bother to do this benchmark because
NFSv3 performance is so poor under the 2.2 kernel
[to be completed]
[to be completed]
- NFSv3 should be avoided on 2.2 kernels
- Jumbo frames should be used when possible; however, they should be used
on dedicated jumbo frame networks to avoid frag/defrag overhead. You can
see a 60% to 80% boost in performance with jumbo frames.
- 2.2 NFSv2 write performance does not scale up. However, it does not
degrade under load
- 2.2 NFSv2 read performance degrades significantly as the number of
clients goes up
- Linux disk buffer caching gives you free NFS read caching - great if
you can stay in cache
- As an NFSv2 server, the 2.4.3 kernel is roughly 20% faster on writes;
read performance is unclear
- Increasing the number of nfsd processes seems to have no effect on
throughput
- Write performance decreases as file sizes increase