The HSM benchmark requires us to do the following:

- dmget 252 large files of 5.3GB from Titanium tape, one file per tape
- dmget 648 small files of 450MB from the Disk Cache Manager (DCM)
- on each of two CXFS clients of the four DMF-managed filesystems, take half of each of the above sets of files and copy them to a scratch filesystem on that node (note: for this we use cxfscp, which uses direct access I/O)
- do all of this in 20 minutes

Note also:

1. The files are divided into six roughly equal groups. One group is in /arch1, one in /arch2, and one each in the top 16 and bottom 16 allocation groups of /arch3 and /arch4. No more detailed placement is attempted, on the argument that, given a stripe width of 6MB, every file is pretty well smeared out over all the LUNs of the stripe. For /arch1 and /arch2 this is ALL LUNs; for /arch3 and /arch4, the concat-followed-by-stripe layout means the only distinction worth worrying about is between the first and second half of the filesystem.

2. Twice as many files are placed on /arch3 and /arch4 as on /arch1 and /arch2, based on testing which suggested /arch3-4 are twice as fast as /arch1-2. Some of the HSM benchmark results indicate this may not be the case. Note that /arch1-2 are in production, with about 15 million files each; /arch3-4 have not yet been deployed for user files.

3. Each MDS (ac1 and ac2) has 96 processors (600MHz) and 96GB of memory. Each has 2 I-Bricks, 10 P-Bricks and one PX-Brick. Each P-Brick and PX-Brick is dual-NuMA-link attached to two of the 24 C-Bricks.

4. The T10K tape drives are direct-attached to 2Gbit HBAs loosely packed in the rack 3 P-Bricks.

5. The DMF process which does the dmget, dmatrc, has three threads. The first fills buffers from tape using direct access I/O with a 2MB I/O size. As a buffer fills, it is handed off to the "checksum" thread, which verifies the checksums in that buffer and reformats the data to omit the checksums, writing the result to another memory buffer. When that operation is complete for the target buffer, it is passed off to the third thread, which copies the buffer to disk using buffered I/O; at GFDL, DMF is configured to use a 2MB buffer size for this I/O. DMF always has at least two tape input buffers and two disk output buffers to alternate between.

6. The large-file gets are broken up into 21 streams, and each stream must do 12 dmgets; to complete in 1200 seconds, each iteration can therefore take no more than 100 seconds. Of that, about 60 seconds of the "test budget time" has been allocated to tape mount/dismount, including robot time, and about 40 seconds has been targeted for the data transfer. This implies a transfer rate of about 125MB/sec; we see at most about 100MB/sec. (See the sketch below.)
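To make the timing budget in note 6 easy to re-check, here is a small back-of-the-envelope sketch in Python. It is not part of the benchmark harness; the file size, stream count and 60/40 second budget split are simply the numbers quoted above, and the unit convention (decimal GB/MB versus binary) is an assumption, so the results are approximate.

    # Back-of-the-envelope check of the large-file timing budget (note 6).
    LARGE_FILES = 252        # large files, one per tape
    FILE_GB     = 5.3        # size of each large file
    STREAMS     = 21         # concurrent dmget streams
    WINDOW_SEC  = 1200       # the 20-minute test window
    MOUNT_SEC   = 60         # budget per get for mount/dismount and robot time

    gets_per_stream = LARGE_FILES // STREAMS        # 12 dmgets per stream
    sec_per_get     = WINDOW_SEC / gets_per_stream  # 100 seconds per iteration
    xfer_sec        = sec_per_get - MOUNT_SEC       # 40 seconds left for the data

    # ~132 MB/sec in decimal units, ~126 MiB/sec in binary units; the text
    # above rounds this to about 125MB/sec, against ~100MB/sec observed.
    implied_mb_per_sec = FILE_GB * 1000 / xfer_sec

    # Aggregate average rate for the large files alone: ~1.11 GB/sec.
    aggregate_gb_per_sec = LARGE_FILES * FILE_GB / WINDOW_SEC

    print(f"{gets_per_stream} gets per stream, {sec_per_get:.0f}s per get, "
          f"{xfer_sec:.0f}s of that for the transfer")
    print(f"implied transfer rate ~{implied_mb_per_sec:.0f}MB/sec, "
          f"aggregate ~{aggregate_gb_per_sec:.2f}GB/sec")

Note that under this budget the 60-second mount/dismount allowance, not the 40-second transfer, is the larger share of each 100-second iteration.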
To see how this plays out on the hardware, you should have the following documents, which may be found on ftp.gfdl.noaa.gov in /perm/bhs/GFDLconfig:

- GFDL.acio.061107.ps    AC1 & AC2 I/O Configuration (PS document)
- GFDL.2Gb.SAN.config    DMF-Managed Filesystem Layout at GFDL
- GFDL.hsmtest.text      This document, describing the HSM test

Given those documents, we make the following observations/claims:

1. Assuming a maximum tape bandwidth of 100MB/sec (about what we see when only a few tapes are moving), the 21 T10K drives impose a maximum load of 600MB/sec on any P-Brick in rack 3 (at most six drives feed any one P-Brick), and 300MB/sec per NuMA attach. Even at 120MB/sec per T10K (nominal, but more than we see), the maximum would be 720MB/sec per P-Brick and 360MB/sec per NuMA attach. The total large-file data movement is 252 x 5.3GB = about 1336GB; spread over 1200 seconds, that is an average of about 1.11GB/sec.

2. The DCM traffic comes through the four dual-port 2Gb HBAs in the PX-Brick. The total data moved for all of the small files is about 300GB; distributed as evenly as we can manage over the 1200 seconds, it adds about 0.25GB/sec of I/O bandwidth (average). Thus the total average I/O transfer rate for the dmgets is about 1.36GB/sec, and the total load on the four filesystems is about 2.72GB/sec once the copy on the CXFS client nodes is included. The HBAs and P-Bricks servicing the DCM, parenthetically, should not be taxed.

3. I/O operations to/from /arch1-2 should be fairly evenly spread over the HBAs in all four P-Bricks (101p24, 102p28, 101p20, 102p24); Topdisk pretty well confirms that. Similarly, I/O to the first half of /arch3 or /arch4 should be evenly distributed over 101p20 and 102p24, and I/O to the second half of the address space for those two filesystems should be spread over the HBAs in 101p24 and 102p28. Thus whenever we have I/O to one of the six target areas, we assume it divides evenly among the above-named P-Bricks. Assuming we have correctly divided the files on /arch3 and /arch4 between the two filesystem halves, we should be able to divide the 1.36GB/sec total given above by the 16 HBAs, giving an I/O rate per HBA of about 85MB/sec (680Mb/sec); the rate per P-Brick is about 340MB/sec, and per NuMA attach it is about 170MB/sec. If instead we take all tapes cranking at full rate as the basis for computing maximum transfers to disk (a rough guess at best), which would be 2.1GB/sec plus 0.25GB/sec for 2.35GB/sec total, the numbers become about 147MB/sec (1.18Gb/sec) per HBA, 588MB/sec per P-Brick and 294MB/sec per NuMA attach. (A sketch of this arithmetic follows this list.)

4. Total maximum inter-rack I/O traffic seems well within advertised limits.

5. Modelling the flows between racks 1+2 and rack 3 is much harder, and we are not making any claims in that department.
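As a companion to claim 3, the following sketch reproduces the per-HBA, per-P-Brick and per-NuMA fan-out arithmetic. The counts it uses (16 HBAs over the four named P-Bricks, four per P-Brick, each P-Brick dual-NuMA attached) are inferred from the arithmetic in claim 3 and should be checked against the configuration documents above; treat them as assumptions of the sketch.

    # Fan-out of the dmget write traffic over the HBAs serving /arch1-4 (claim 3).
    HBAS            = 16     # HBAs across 101p24, 102p28, 101p20, 102p24 (assumed)
    HBAS_PER_PBRICK = 4      # assumed: 16 HBAs spread over the four P-Bricks
    NUMA_PER_PBRICK = 2      # each P-Brick is dual-NuMA attached

    def fan_out(total_gb_per_sec, label):
        per_hba    = total_gb_per_sec * 1000 / HBAS
        per_pbrick = per_hba * HBAS_PER_PBRICK
        per_numa   = per_pbrick / NUMA_PER_PBRICK
        print(f"{label}: ~{per_hba:.0f}MB/sec per HBA, "
              f"~{per_pbrick:.0f}MB/sec per P-Brick, ~{per_numa:.0f}MB/sec per NuMA")

    # Average case: 1.11GB/sec of tape restores plus 0.25GB/sec of DCM traffic.
    fan_out(1.11 + 0.25, "average")      # ~85 / ~340 / ~170

    # Rough peak case: all 21 T10K drives at 100MB/sec plus the same DCM traffic.
    fan_out(21 * 0.100 + 0.25, "peak")   # ~147 / ~588 / ~294

The per-HBA figures can be compared directly against the roughly 200MB/sec line rate of a 2Gbit HBA, which is the comparison behind claim 4.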