GCC SL8500 #2 Tape
Library Commissioning Report
SSA Group
1/31/2008
Introduction
This
document describes steps taken to commission a new automated tape library, a
10,000-slot StorageTek (STK) SL8500 #2, and a
relatively new tape technology, LTO-4. The tests were performed with up to 17
STK/IBM LTO-4 tape drives and 100 Fujifilm LTO-4 tape cartridges. The model of
operation is based on the document DRAFT SL8500 operation model, which also has
a description of the tape library and how it operates. Following this model,
the drives were uniformly distributed across all four rails of the library, and
the library was configured to float the cartridges.
Limitations
1.
Enstore was run unmodified, so the tape drive selected for a volume mount
request does not take the locations of the drive and tape into account. The
implication is that the tape and drive may be located on different rails; the
tape then needs to be passed via an elevator to the rail that the drive is
located on, which involves the arms on two rails participating in moving the
cartridge to the drive. In float mode, the dismounted tape is placed in a slot
on the same rail as the drive. This is not optimal for performance, but
modifying Enstore to preferentially select a free drive on the same rail as the
cartridge is thought to require a major Enstore code revision.
2.
Though 100 cartridges were scattered along the length of the library, yielding
fairly representative average response times, only a small portion of the slot
space was accessed because of the manner in which the library returns
cartridges. The library software does not provide a way to move a cartridge
from one location to another, which prevented us from arranging full slot
coverage.
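The drive-selection preference discussed in limitation 1 could be sketched as
follows. This is a minimal illustration of the policy, not Enstore code; the
Drive model, rail numbering, and pick_drive helper are all hypothetical.

```python
# Hypothetical sketch of rail-aware drive selection: prefer a free drive on
# the same rail as the cartridge, falling back to any free drive (which then
# incurs an elevator move between rails).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Drive:
    name: str
    rail: int        # SL8500 rails are numbered 1-4
    busy: bool = False

def pick_drive(cartridge_rail: int, drives: List[Drive]) -> Optional[Drive]:
    """Return a free drive, preferring one on the cartridge's rail."""
    free = [d for d in drives if not d.busy]
    same_rail = [d for d in free if d.rail == cartridge_rail]
    if same_rail:
        return same_rail[0]          # same-rail mount avoids the elevator
    return free[0] if free else None # cross-rail mount, or no drive free
```

A policy like this would put most mounts in the lower (same-rail) latency band
discussed in the Performance section.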
Test Goals and Description
The
goal was to simulate on the order of one month’s usage, which we decided would
entail writing 100 Terabytes and reading back 300 TB. We also wanted to
perform accelerated mount/dismount tests and to read 100 Terabytes of
pre-written LTO-3 media as a proof of backward compatibility for reading. The tests
executed were:
1.
Read about 100 Terabytes of pre-written LTO-3 media. These were written during
the LTO-3 commissioning.
2.
Write about 70 Terabytes to LTO-4 tape with up to 17 movers and 10 worker
nodes. This test used file family widths of 10 and 1 to encourage mounts and
dismounts. Read-after-every-write mode was set on each of the movers so that
each file was immediately read back and its CRC verified.
3.
Write about 30 Terabytes to LTO-4 tape using up to 17 movers and 10 worker
nodes with the torture script. Each tape had its own file family and each
worker node rotated through the file families, guaranteeing a mount/dismount
for each file. Read-after-every-write mode was set on each of the movers so
that each file was immediately read back and its CRC verified. After the
writes, a random read was made of the same tape to cause tape repositioning
for a read while the tape was still mounted.
4.
One tape has its first 10 files read over and over again in a randomly
selected order. We are tracking the number of passes.
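The family-rotation pattern of test 3 can be sketched as below. The
torture_writes helper and the family/file names are illustrative stand-ins for
the actual torture script, which issued real encp transfers.

```python
# Illustrative sketch of the test-3 write pattern: each tape has its own file
# family, and each worker rotates through the families so that consecutive
# files land on different tapes, guaranteeing a mount/dismount per file.
from itertools import cycle

def torture_writes(file_families, files):
    """Assign each file the next family in rotation; return the schedule."""
    schedule = []
    for family, name in zip(cycle(file_families), files):
        schedule.append((family, name))  # each family switch forces a remount
    return schedule
```

Because adjacent files always belong to different families, every transfer in
the schedule implies a dismount of the previous tape and a mount of the next.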
The
first figure below shows the total bytes transferred per day; the second shows
only the bytes written per day.
Results
Hardware &
Media Durability
Plot
1 shows the mount count distribution for all of the media. The mean is around
900 mounts for 100 tapes, so about 90,000 mount/dismount cycles were performed.
This corresponds to roughly 5,300 mount/dismount cycles for each of the 17
drives.
There
were no tape cartridge failures.
There
was one drive failure; Sun said its electronics failed. We feel this was just
an early-life failure of the drive.
In
the test one robot arm required replacement. There have been no arm problems
after that incident.
One
of the robot arms would squeal along a small segment of its track. This is
similar to an issue seen during the commissioning of the first SL8500 and is
addressed by a special alignment of the rails. Sun continues to work on
improving the rail setup and robot wheels to eliminate the squealing.
Tape
TST03 is undergoing a read torture test. This test performs a read of the first
10 (average 4 GB) files on the tape, in a randomly
selected order, repeatedly. The tape is kept mounted in the drive. At the time
of this writing, the tape has had 1000 passes (10,000 reads) with no errors.
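The read-torture pattern can be sketched as below. The torture_pass helper and
read_file callback are hypothetical stand-ins for the real encp reads against
TST03; only the shuffle-and-verify structure is taken from the text.

```python
# Minimal sketch of the read-torture test: read the first 10 files in a
# freshly shuffled order on every pass, checking the CRC of each read.
import random
import zlib

def torture_pass(files, read_file, expected_crcs):
    """One pass: read every file exactly once in random order, verifying CRCs."""
    order = random.sample(files, len(files))  # new shuffled order per pass
    for name in order:
        crc = zlib.crc32(read_file(name))
        if crc != expected_crcs[name]:
            raise IOError(f"CRC mismatch reading {name}")
    return order
```

Running this repeatedly while the tape stays mounted reproduces the pattern of
the 1,000 passes (10,000 reads) described above.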
Data loss and
media/drive issues
There
were no data loss issues during the testing. As noted earlier, there was one
drive that failed. It broke during an unload.
One
other less significant read incident occurred:
A
read failed because of a “lost position” error in Enstore. The tape and mover
were investigated and both returned to service. The tape read okay and the
drive continued to work properly. Further investigation showed that we had
these same messages during the LTO-3 commissioning. These messages only appear
during the torture read/write test. We feel that these messages can be forced
by the torture test and indicate that we are approaching the mechanical
operational limit of the tape drive. We do not see these messages during
standard Enstore operation of the tape complex.
Performance
Mount/dismount
Mount
latency for the SL8500 and the 9310 complex (two 9310s connected via pass-thru)
are shown below. The latency is of the same order of magnitude for both, and
banding is visible in both. The lower band for the SL8500 is for mounts where
the cartridge is on the same rail as the drive. The upper band is due to the
increased latency introduced by the elevator when a cartridge has to be moved
to a different rail to be mounted in a drive there.
The
upper band in the 9310 complex results from moving a cartridge through the
pass-thru port when the cartridge and tape drives are in different silos.
When
SL8500s are complexed with pass-thru ports, cartridges that need to be mounted
in a drive in another ACS (i.e. another SL8500) will likely suffer the
additional overhead of two elevator moves and a pass-thru move, which may
significantly degrade the mount latency.
Mount
Latency SL8500
Mount
Latency stken 9310 complex
No
attempt was made to stress the mount rate, but as the figure below indicates,
we exercised the library to a peak of about 2,750 exchanges/day.
Transfer rates
Back-to-back
local transfers of five 2GB files in a row completed in 3 minutes, an
aggregate rate of 55MB/s. This is below the maximum 80 MB/s streaming rate of
the drive, but is consistent with 80 MB/s streaming once the per-file overhead
(~8 seconds/file) is taken into account.
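The consistency claim can be checked with a simple fixed-overhead model. The
numbers come from the text (using 1 GB = 1000 MB); the model itself is an
approximation, not a measurement.

```python
# Sanity check of the transfer-rate claim: five back-to-back 2 GB files in
# about 3 minutes, modeled as 80 MB/s streaming plus a fixed ~8 s per-file
# overhead (mount positioning, filemarks, etc.).
files = 5
file_mb = 2000           # 2 GB per file
overhead_s = 8           # fixed per-file overhead
stream_mb_s = 80         # native LTO-4 streaming rate

total_mb = files * file_mb                                  # 10,000 MB moved
predicted_s = files * overhead_s + total_mb / stream_mb_s   # 165 s, ~3 minutes
aggregate_mb_s = total_mb / predicted_s                     # ~61 MB/s
```

The predicted 165 s is close to the measured 3 minutes, and the predicted
aggregate rate is in the neighborhood of the measured 55 MB/s, supporting the
fixed-overhead explanation.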
The
commissioning was performed from un-optimized worker nodes across two network
hubs. The plots below show the performance during the commissioning. These
results should be considered in light of the fact that no attempt was made to
optimize performance (e.g. disk striping on the worker nodes) beyond using a
large average file size of 4GB.
All
rates are in MB/s. Note that the blue reads are plotted over the red writes,
so writes are obscured wherever they occur at the same rate as reads in the
same time period.
Drive
rate
Network
rate
Overall
Rate
Tape Library Room Readiness
The
following items are considered of high importance for the tape library room
readiness:
1.
Maintaining and monitoring a tolerable temperature and relative humidity during
summer and winter
2.
Sealing against insect infiltration
3.
Fire suppression and detection for the room
4.
Fire suppression and detection for the library
5.
Alternate/generator power source for the library and climate control
All
these have been addressed in the first SL8500 commissioning in GCC.
Appendix –
Commissioning Hardware and software specifications
Worker
nodes were CDF farm nodes and Enstore migration servers. encp requests from these nodes passed through the FCC
hub router and the GCC tape library room router (see the network design note). encp versions 3.6 and 3.7 were used.
enstore server and mover
nodes (gccensrv1, gccensrv2, gccenmvr40-56) were all on the network local to
the GCC tape library room.
SL8500:
-
10,000 slots
-
Redundant AC power and drive power (the front touch-panel console is not
redundant, but is not necessary for operation)
-
17 IBM LTO-4 drives, which stream at either 60MB/s or 120MB/s; 128MB internal
buffer.
-
ACSLS Sun V240: Control LAN name/IP fntt-gcc/192.168.89.248
-
Enstore
(temporary gccen instance):
-
gccensrv1: SLF 4.4, Supermicro Dual Xeon 3.6GHz, 2MB
L2 cache, 4GB RAM, HT on, 1G NIC; accounting, alarm, drivestat,
event relay, file clerk, volume clerk, info server, inquisitor, log server, ratekeeper
-
gccensrv2: SLF 4.4, Supermicro Dual Xeon 3.6GHz, 2MB
L2 cache, 4GB RAM, HT on, 1G NIC; media changer, library manager
Mover plant:
-
17 movers (17 active): SLF 4.4, Supermicro Dual Xeon
2.8GHz, 2MB L2 cache, HT off, 1G NIC, 2G FC