The HSM benchmark requires us to do the following:

- dmget 252 large files of 5.3GB from Titanium tape, one file per tape
- dmget 648 small files of 450MB from the Disk Cache Manager (DCM)
- on each of two CXFS clients of the four DMF-managed filesystems, take half of each of the above sets of files and copy them to a scratch filesystem on that node (note: for this we use cxfscp, which uses direct access I/O)
- do all of this in 20 minutes

Note also:

1. The files are divided into six roughly equal groups. One group is in /arch1, one in /arch2, and one each in the top 16 and bottom 16 allocation groups of /arch3 and /arch4. No more detailed placement is attempted, on the argument that, given a stripe width of 6MB, every file is pretty well smeared out over all the LUNs of the stripe. For /arch1 and /arch2 this is ALL LUNs; for /arch3 and /arch4, the concat-followed-by-stripe layout means the only distinction worth worrying about is between the first and second half of the filesystem.

2. Twice as many files are placed on /arch3 and /arch4 as on /arch1 and /arch2, based on testing which suggested /arch3-4 are twice as fast as /arch1-2. Some of the HSM benchmark results indicate this may not be the case. Note that /arch1-2 are in production, with about 15 million files each; /arch3-4 have not yet been deployed for user files.

3. Each MDS (ac1 and ac2) has 96 processors (600MHz) and 96GB of memory. Each has 2 I-Bricks, 10 P-Bricks and one PX-Brick. Each P-Brick and PX-Brick is dual-NuMA-link attached to two of the 24 C-Bricks.

4. The T10K tape drives are direct-attached to 2Gbit HBAs loosely packed in the rack 3 P-Bricks.

5. The DMF process which does the dmget, dmatrc, has three threads. The first fills buffers from tape using direct access I/O with a 2MB I/O size. As a buffer fills, it is handed off to the "checksum" thread, which verifies the checksums in that buffer and reformats the data to omit the checksums, writing the result to another memory buffer. When that operation is complete for the target buffer, it is passed off to the third thread, which copies the buffer to disk using buffered I/O; at GFDL, DMF is configured to use a 2MB buffer size for this I/O. DMF always has at least two tape input buffers and two disk output buffers to alternate between.

6. The large-file gets are broken up into 21 streams, and each stream must do 12 dmgets; to complete in 1200 seconds, each iteration can therefore take no more than 100 seconds. Of that, about 60 seconds of the "test budget time" has been allocated to tape mount/dismount, including robot time, and about 40 seconds has been targeted for the data transfer. This implies a transfer rate of about 125MB/sec; we see at most about 100MB/sec. (See the sketch below.)
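To make the timing budget in note 6 easy to re-check, here is a small back-of-the-envelope sketch in Python. It is not part of the benchmark harness; the file size, stream count and 60/40 second budget split are simply the numbers quoted above, and the unit convention (decimal GB/MB versus binary) is an assumption, so the results are approximate.

    # Back-of-the-envelope check of the large-file timing budget (note 6).
    LARGE_FILES = 252        # large files, one per tape
    FILE_GB     = 5.3        # size of each large file
    STREAMS     = 21         # concurrent dmget streams
    WINDOW_SEC  = 1200       # the 20-minute test window
    MOUNT_SEC   = 60         # budget per get for mount/dismount and robot time

    gets_per_stream = LARGE_FILES // STREAMS        # 12 dmgets per stream
    sec_per_get     = WINDOW_SEC / gets_per_stream  # 100 seconds per iteration
    xfer_sec        = sec_per_get - MOUNT_SEC       # 40 seconds left for the data

    # ~132 MB/sec in decimal units, ~126 MiB/sec in binary units; the text
    # above rounds this to about 125MB/sec, against ~100MB/sec observed.
    implied_mb_per_sec = FILE_GB * 1000 / xfer_sec

    # Aggregate average rate for the large files alone: ~1.11 GB/sec.
    aggregate_gb_per_sec = LARGE_FILES * FILE_GB / WINDOW_SEC

    print(f"{gets_per_stream} gets per stream, {sec_per_get:.0f}s per get, "
          f"{xfer_sec:.0f}s of that for the transfer")
    print(f"implied transfer rate ~{implied_mb_per_sec:.0f}MB/sec, "
          f"aggregate ~{aggregate_gb_per_sec:.2f}GB/sec")

Note that under this budget the 60-second mount/dismount allowance, not the 40-second transfer, is the larger share of each 100-second iteration.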
To see how this plays out on the hardware, you should have the following documents, which may be found on ftp.gfdl.noaa.gov in /perm/bhs/GFDLconfig:

- GFDL.acio.061107.ps    AC1 & AC2 I/O Configuration (PS document)
- GFDL.2Gb.SAN.config    DMF-Managed Filesystem Layout at GFDL
- GFDL.hsmtest.text      This document, describing the HSM test

Given those documents, we make the following observations/claims:

1. Assuming a maximum tape bandwidth of 100MB/sec (about what we see when only a few tapes are moving), the 21 T10K drives impose a maximum load of 600MB/sec on any P-Brick in rack 3 (at most six drives feed any one P-Brick), and 300MB/sec per NuMA attach. Even at 120MB/sec per T10K (nominal, but more than we see), the maximum would be 720MB/sec per P-Brick and 360MB/sec per NuMA attach. The total large-file data movement is 252 x 5.3GB = about 1336GB; spread over 1200 seconds, that is an average of about 1.11GB/sec.

2. The DCM traffic comes through the four dual-port 2Gb HBAs in the PX-Brick. The total data moved for all of the small files is about 300GB; distributed as evenly as we can manage over the 1200 seconds, it adds about 0.25GB/sec of I/O bandwidth (average). Thus the total average I/O transfer rate for the dmgets is about 1.36GB/sec, and the total load on the four filesystems is about 2.72GB/sec once the copy on the CXFS client nodes is included. The HBAs and P-Bricks servicing the DCM, parenthetically, should not be taxed.

3. I/O operations to/from /arch1-2 should be fairly evenly spread over the HBAs in all four P-Bricks (101p24, 102p28, 101p20, 102p24); Topdisk pretty well confirms that. Similarly, I/O to the first half of /arch3 or /arch4 should be evenly distributed over 101p20 and 102p24, and I/O to the second half of the address space for those two filesystems should be spread over the HBAs in 101p24 and 102p28. Thus whenever we have I/O to one of the six target areas, we assume it divides evenly among the above-named P-Bricks. Assuming we have correctly divided the files on /arch3 and /arch4 between the two filesystem halves, we should be able to divide the 1.36GB/sec total given above by the 16 HBAs, giving an I/O rate per HBA of about 85MB/sec (680Mb/sec); the rate per P-Brick is about 340MB/sec, and per NuMA attach it is about 170MB/sec. If instead we take all tapes cranking at full rate as the basis for computing maximum transfers to disk (a rough guess at best), which would be 2.1GB/sec plus 0.25GB/sec for 2.35GB/sec total, the numbers become about 147MB/sec (1.18Gb/sec) per HBA, 588MB/sec per P-Brick and 294MB/sec per NuMA attach. (A sketch of this arithmetic follows this list.)

4. Total maximum inter-rack I/O traffic seems well within advertised limits.

5. Modelling the flows between racks 1+2 and rack 3 is much harder, and we are not making any claims in that department.
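As a companion to claim 3, the following sketch reproduces the per-HBA, per-P-Brick and per-NuMA fan-out arithmetic. The counts it uses (16 HBAs over the four named P-Bricks, four per P-Brick, each P-Brick dual-NuMA attached) are inferred from the arithmetic in claim 3 and should be checked against the configuration documents above; treat them as assumptions of the sketch.

    # Fan-out of the dmget write traffic over the HBAs serving /arch1-4 (claim 3).
    HBAS            = 16     # HBAs across 101p24, 102p28, 101p20, 102p24 (assumed)
    HBAS_PER_PBRICK = 4      # assumed: 16 HBAs spread over the four P-Bricks
    NUMA_PER_PBRICK = 2      # each P-Brick is dual-NuMA attached

    def fan_out(total_gb_per_sec, label):
        per_hba    = total_gb_per_sec * 1000 / HBAS
        per_pbrick = per_hba * HBAS_PER_PBRICK
        per_numa   = per_pbrick / NUMA_PER_PBRICK
        print(f"{label}: ~{per_hba:.0f}MB/sec per HBA, "
              f"~{per_pbrick:.0f}MB/sec per P-Brick, ~{per_numa:.0f}MB/sec per NuMA")

    # Average case: 1.11GB/sec of tape restores plus 0.25GB/sec of DCM traffic.
    fan_out(1.11 + 0.25, "average")      # ~85 / ~340 / ~170

    # Rough peak case: all 21 T10K drives at 100MB/sec plus the same DCM traffic.
    fan_out(21 * 0.100 + 0.25, "peak")   # ~147 / ~588 / ~294

The per-HBA figures can be compared directly against the roughly 200MB/sec line rate of a 2Gbit HBA, which is the comparison behind claim 4.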