GCC SL8500 #2 Tape Library Commissioning Report

SSA Group

1/31/2008

 

Introduction

 

This document describes the steps taken to commission a new automated tape library, a 10,000-slot StorageTek (STK) SL8500 #2, and a relatively new tape technology, LTO-4. The tests were performed with up to 17 STK/IBM LTO-4 tape drives and 100 Fujifilm LTO-4 tape cartridges. The model of operation is based on the document DRAFT SL8500 operation model, which also describes the tape library and how it operates. Following this model, the drives were uniformly distributed across all four rails of the library, and the library was configured to float the cartridges.

 

Limitations

1.      Enstore was run unmodified, so the tape drive selected for a volume mount request does not take the locations of the drive and tape into account. The implication is that the tape and drive may be located on different rails, in which case the tape must be passed up an elevator to the rail the drive is on, with arms on two rails participating in moving the cartridge to the drive. In float mode, when dismounted, the tape is placed in a slot on the same rail as the drive. This is not optimal for performance, but modifying Enstore to preferentially select a free drive on the same rail as the cartridge is thought to require a major Enstore code revision (a sketch of such rail-aware selection follows this list).

2.      Though 100 cartridges were scattered along the length of the library, yielding fairly representative average response times, only a small portion of the slot space was accessed because of the manner in which the library returns cartridges. The library software does not provide functionality to move a cartridge from one location to another, which prevented us from arranging full slot coverage.
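To illustrate the kind of rail-aware selection described in limitation 1, the sketch below prefers a free drive on the same rail as the cartridge. This is a minimal Python sketch, not Enstore code; the function and data layout are invented for illustration.

    # Hypothetical sketch of rail-aware drive selection; this is NOT
    # Enstore code, and all names here are invented for illustration.
    def pick_drive(cartridge_rail, free_drives):
        """Prefer a free drive on the same rail as the cartridge so the
        mount avoids an elevator move between rails.

        cartridge_rail -- rail number (1-4) where the cartridge sits
        free_drives    -- list of (drive_id, rail) tuples for idle drives
        """
        same_rail = [d for d, rail in free_drives if rail == cartridge_rail]
        if same_rail:
            return same_rail[0]      # no elevator move needed
        # Fall back to any free drive; the mount will use the elevator.
        return free_drives[0][0] if free_drives else None

    # Example: cartridge on rail 2, free drives on rails 1 and 2.
    print(pick_drive(2, [("drive17", 1), ("drive42", 2)]))  # -> drive42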

 

Test Goals and Description

 

The goal was to simulate on the order of one month's usage, which we decided would entail writing 100 TB and reading back 300 TB. We also wanted to perform accelerated mount/dismount tests and to read 100 TB of pre-written LTO-3 media as a proof of backward compatibility for reading. The tests executed were:

 

1.      Read about 100 TB of pre-written LTO-3 media. These tapes were written during the LTO-3 commissioning.

2.      Write about 70 TB to LTO-4 tape with up to 17 movers and 10 worker nodes. This test used file family widths of 10 and 1 to encourage mounts and dismounts. Read-after-every-write mode was set on each of the movers so that each file was immediately read back and its CRC verified.

3.      Write about 30 TB to LTO-4 tape using up to 17 movers and 10 worker nodes with the torture script. Each tape had its own file family, and each worker node rotated through the file families, guaranteeing a mount/dismount for each file (the rotation is sketched after this list). Read-after-every-write mode was set on each of the movers so that each file was immediately read back and its CRC verified. After the writes, a random read was made from the same tape to cause repositioning for a read while the tape was still mounted.

4.      One tape has the first 10 files on it read over and over again in a random order. We are tracking the number of passes.
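The essence of the file-family rotation used in test 3 is simply that consecutive files go to different file families, and therefore to different tapes. The snippet below is illustrative Python, not the actual torture script; write_to_enstore() is an invented stand-in for the encp copy.

    # Hypothetical sketch of the torture-script rotation in test 3; the
    # real script is not reproduced here, and write_to_enstore() is an
    # invented stand-in for an encp copy into the given file family.
    import itertools

    FILE_FAMILIES = ["torture_%02d" % i for i in range(17)]  # one per tape

    def write_to_enstore(path, family):
        print("write %s into file family %s" % (path, family))

    def torture_writes(files):
        """Rotate through the file families so consecutive files land in
        different families (hence different tapes), forcing a mount and
        dismount for every file written."""
        for path, family in zip(files, itertools.cycle(FILE_FAMILIES)):
            write_to_enstore(path, family)

    torture_writes(["file%03d.dat" % i for i in range(50)])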

 

 

The first figure below shows the total bytes transferred/day, and the second shows only the bytes written/day.

 

 

 

 

Results

 

Hardware & Media Durability

 

Plot 1 shows the mount count distribution for all of the media. The mean is around 900 mounts for the 100 tapes, so about 90,000 mount/dismount cycles were performed, corresponding to roughly 5,300 mount/dismount cycles per drive when spread evenly across the 17 drives.
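A quick check of the arithmetic (assuming the load was spread evenly across all 17 drives):

    # Mount-count arithmetic for the numbers quoted above; assumes all
    # 17 drives shared the load evenly.
    tapes, mean_mounts, drives = 100, 900, 17
    total_cycles = tapes * mean_mounts       # ~90,000 mount/dismount cycles
    per_drive = total_cycles / drives        # ~5,300 cycles per drive
    print(total_cycles, per_drive)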

 

 

 

There were no tape cartridge failures.

There was one drive failure. Sun said the electronics failed. We feel this was simply an early-life failure of the drive.

 

During the test, one robot arm required replacement. There have been no arm problems since that incident.

 

One of the robot arms would squeal along a small segment of its track. This is similar to what was seen during the commissioning of the first SL8500 and is addressed by a special alignment of the rails. Sun continues to work on improving the rail setup and the robot wheels to address the squealing.

 

Tape TST03 is undergoing a read torture test. This test repeatedly reads the first 10 files (average 4 GB each) on the tape in a randomly selected order. The tape is kept mounted in the drive. At the time of this writing, the tape has undergone 1,000 passes (10,000 reads) with no errors.
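The pass logic is simple enough to sketch; the Python below is illustrative only, with read_file() standing in for the actual encp read.

    # Minimal sketch of the TST03 read torture test; read_file() is an
    # invented stand-in for the real encp read.
    import random

    def read_file(path):
        print("reading %s" % path)

    def torture_reads(files, passes):
        """Read the files repeatedly, each pass in a fresh random order,
        keeping the tape mounted; returns the total number of reads."""
        reads = 0
        for _ in range(passes):
            for path in random.sample(files, len(files)):  # shuffled order
                read_file(path)
                reads += 1
        return reads

    # 1000 passes over 10 files -> 10,000 reads, matching the count above.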

 

Data loss and media/drive issues

There were no data loss issues during the testing. As noted earlier, there was one drive that failed; it broke on an unload.

 

One other less significant read incident occurred:

 

A read failed because of a “lost position” error in Enstore. The tape and mover were investigated, and both were returned to service. The tape read okay and the drive continued to work properly. Further investigation showed that we saw these same messages during the LTO-3 commissioning. These messages appear only during the torture read/write test. We feel that these messages can be forced by the torture test and indicate that we are approaching the mechanical operational limit of the tape drive. We do not see these messages during standard Enstore operation of the tape complex.

 


 

Performance

 

Mount/dismount

 

Mount latencies for the SL8500 and the 9310 complex (two 9310s connected via pass-thru) are shown below. The latency is the same order of magnitude for both, and banding is visible in both. The lower band for the SL8500 is for mounts where the cartridge is on the same rail as the drive. The upper band is due to the increased latency introduced by the elevator when a cartridge has to be moved to a different rail to be mounted in a drive there.

 

The upper band in the 9310 complex results from moving a cartridge through the pass-thru port when the cartridge and tape drive are in different silos.

 

When SL8500s are complexed with pass-thru ports, cartridges that need to be mounted in a drive in another ACS (i.e., another SL8500) will likely suffer the additional overhead of two elevator moves and a pass-thru move, which may significantly degrade the mount latency.

 

 

 

Mount Latency SL8500

 

 

 

Mount Latency stken 9310 complex

 

 

 

No attempt was made to stress the mount rate, but as the figure below indicates, we exercised the library up to a peak of about 2,750 exchanges/day.

 

 

 

 

 

Transfer rates

 

Back-to-back local transfers of five 2 GB files in a row completed in 3 minutes, an aggregate rate of 55 MB/s. This is below the maximum 80 MB/s streaming rate of the drive, but it is consistent with an 80 MB/s streaming rate once the per-file overhead (~8 seconds/file) is taken into account.
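As a back-of-the-envelope check (the 80 MB/s rate and ~8 s/file overhead are the figures quoted above):

    # Back-of-the-envelope check that 3 minutes for five 2 GB files is
    # consistent with 80 MB/s streaming plus ~8 s/file of overhead.
    n_files, file_mb = 5, 2000       # five 2 GB files
    stream_rate = 80.0               # MB/s while actually streaming
    overhead = 8.0                   # seconds of per-file overhead

    total_mb = n_files * file_mb
    model_time = total_mb / stream_rate + n_files * overhead  # ~165 s
    effective_rate = total_mb / (3 * 60.0)                    # ~55 MB/s measured

    print("predicted %.0f s; measured effective rate %.0f MB/s"
          % (model_time, effective_rate))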

 

The commissioning was performed from un-optimized worker nodes across two network hubs. The plots below show the performance during the commissioning. These results should be considered in light of the fact that no attempt was made to optimize the performance (e.g., disk striping on the worker nodes) other than using a large average file size of 4 GB.

 

All rates are in MB/s. Note that the blue reads are plotted over the red writes, so writes are obscured when they occur at the same rate as reads in the same time period.

 

 

Drive rate

 

Network rate

 

Overall Rate

 

Tape Library Room Readiness

 

The following items are considered of high importance for the tape library room readiness:

 

1.      Maintaining and monitoring a tolerable temperature and relative humidity during summer and winter

2.      Sealing against insect infiltration

3.      Fire suppression and detection for the room

4.      Fire suppression and detection for the library

5.      Alternate/generator power source for the library and climate control

 

All of these were addressed during the first SL8500 commissioning in GCC.

 

Appendix – Commissioning Hardware and Software Specifications

 

Worker nodes were CDF farm nodes and Enstore migration servers. encp requests from these nodes passed through the FCC hub router and the GCC tape library room router (see the network design note). encp versions 3.6 and 3.7 were used.

 

Enstore server and mover nodes (gccensrv1, gccensrv2, gccenmvr40-56) were all on the network local to the GCC tape library room.

 

 

SL8500:

-         10,000 slots

-         Redundant AC power and drive power (the front touch panel console is not redundant, but it is not necessary for operation)

-         17 IBM LTO-4 drives, which stream at either 60 MB/s or 120 MB/s, with a 128 MB internal buffer. Open mouthed.

-         ACSLS Sun V240: Control LAN name/IP fntt-gcc/192.168.89.248


Enstore (temporary gccen instance):

-         gccensrv1: SLF 4.4, Supermicro Dual Xeon 3.6GHz, 2MB L2 cache, 4GB RAM, HT on, 1G NIC; runs the accounting, alarm, drivestat, event relay, file clerk, volume clerk, info server, inquisitor, log server, and ratekeeper servers

-         gccensrv2: SLF 4.4, Supermicro Dual Xeon 3.6GHz, 2MB L2 cache, 4GB RAM, HT on, 1G NIC; runs the media changer and library manager

Mover plant:

-         17 movers (17 active): SLF 4.4, Supermicro Dual Xeon 2.8GHz, 2MB L2 cache, HT off, 1G NIC, 2G FC