CCF/IA Report for the week ending Friday, 2006-07-14
Installs:
0 installs 0 decommissions
Replaced drives:
0 9940B 0 9940A 0 LTO1/LTO2 2 other
Replaced servers/movers/fileserver nodes:
1 server 0 mover 0 fileservers
0 Robot hardware maintenance service calls
0 Server/Mover/Fileserver maintenance (repairs/parts replacement)
Investigations/Interventions:
39 mover 2 server 5 tape drive 0 fileserver 39 tape 0 file 0 library
tape operations
367 tapes clobbered/recycled
792 tapes labeled/entered/removed
3 tape MIRs fixed
0 tape MIRs not fixed
0 tape drive firmware updates
3 tapes cloned
3 quota increase requests serviced
0 raid disk replacements/interventions
1 enstore service requests
0 new muon pages
2 off-hour calls/interventions
Monday
------
CDF Volume IAA453 went NOACCESS last Wednesday in mover 9940B20. The tape
was new and unlabeled. The label was written then the transfer failed
with "WRITE_ERROR Tape IAA453 at BOT, can not write TAPE volume=IAA453
location=1". The tape was checked with volume_assert which reported
READ_VOL1_READ_ERR. There was no sense data reported. The eod_cookie
was reset and the tape cleared.
CMS Stopped writing tapes when their available blanks dropped below 50.
They had 202 blanks last Wednesday but have written 159 in the last
week, far faster than their usual rate. 153 A tapes were clobbered and
converted to B tapes. 183 B tapes were recycled, and of those, 75 were
write-protected and needed their tabs flipped. That request was made
on Tuesday.
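The arithmetic behind the CMS stop can be checked with a short sketch. The
figures (202 blanks, 159 written, a threshold of 50) come from the note above.
The function and variable names are hypothetical and not part of enstore, and
the sketch assumes no blanks were added during the week.

```python
# Hypothetical helper for illustration only; not part of enstore.
def blanks_remaining(blanks_at_start, tapes_written):
    """Blanks left, assuming none were added in the meantime."""
    return blanks_at_start - tapes_written


BLANK_THRESHOLD = 50  # writing stops below this many blanks (from the report)

remaining = blanks_remaining(202, 159)
below_threshold = remaining < BLANK_THRESHOLD

print(remaining, below_threshold)
```

With these numbers the count falls to 43, which is under the threshold of 50
and is consistent with the stop described above.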
D0 Ticket 81719 was generated because an encp write failed with an error:
Filesystem is corrupt. We were all glad to learn that "This has been
fixed. The file transfer should work now."
STK Volume VOA069 went NOACCESS in mover 9940B36 when it failed to mount.
The alarm indicated that the tape was in 9940B17's drive. By the time
it was checked, the tape was home. It was cleared.
Mover 9940B22 went offline with the message "Tape thread running in
state HAVE_BOUND. Will offline the mover." There had also been an
ENCP_GONE NETWORK error. The mover was restarted.
Mover 9940B32 went offline with the message "Tape thread running in
state HAVE_BOUND. Will offline the mover." There had also been an
ENCP_GONE NETWORK error. The mover was restarted.
Volume VO6618 went NOACCESS in mover 9940B17 due to a failed MIR. The
MIR was rebuilt, and the mover and tape placed back in service.
Drive DE3EDLT has changed its approach from an occasional death grip to
failing to grab the tape at all. It was called in to ADIC.
Volume JL7528 went NOACCESS in mover DE2EDLT because the drive pushed
the tape out a little too far when ejecting and the robot bumped it
back in when it tried to put the tape away, apparently causing the
drive to go for the death grip. The tape was manually dismounted and
put away, and the mover and tape were placed back in service.
The syscollect.sh script was missing from some systems. Whatever it is.
We continue to see intermittent problems connecting to the configuration
server. For example, we might see one instance while checking a list
of 183 tapes for write protect status. The first retry usually works.
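Since the first retry usually works, a bulk check could simply wrap each
lookup in a bounded retry. The sketch below is only an illustration of that
idea: call_with_retry and FlakyConfigLookup are hypothetical names, the
stand-in lookup fails once on purpose, and nothing here is the actual enstore
client code.

```python
import time


def call_with_retry(operation, attempts=3, delay_seconds=0.0):
    """Call operation(); on TimeoutError, wait and retry up to `attempts` times.

    Returns (result, attempts_used). Re-raises the last error if all fail.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation(), attempt
        except TimeoutError as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


class FlakyConfigLookup:
    """Stand-in for a configuration server query (an assumption, not enstore).

    Times out a fixed number of times, then answers.
    """

    def __init__(self, failures_before_success=1):
        self.failures_left = failures_before_success

    def __call__(self):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TimeoutError("no response from configuration server")
        return {"status": ("ok", None)}


result, attempts_used = call_with_retry(FlakyConfigLookup())
print(result, attempts_used)
```

Capping the number of attempts keeps a genuinely dead server from stalling a
long list of tapes, and the attempt count could be logged to see how often the
problem occurs.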
A request for information and/or help was made to delete the old tape
LEGL01 which no longer exists. Chih answered the call.
FTT Testing has commenced for ftt v2_25.
Tuesday
-------
D0 Volume PRT465 went NOACCESS when it failed to mount in drive 9940B33
because it had failed to dismount and was still in drive 9940B27. There
was sense data indicating a host problem so the mover was rebooted. The
tape was manually dismounted and put away, and tape and movers cleared.
The ball went red late in the afternoon because testing consumed all of
the LTO-2s in the testlto2 library. For some reason this storage group
wasn't being ignored. That was fixed.
STK Primary was paged at 6:30 AM because minos's quota was exhausted.
Their quota was increased from 200 to 240.
Mover 9940B23 went offline due to a write error on the drive, an EBLANK
error on a selective CRC check on volume VOA203. The sense data
indicated a hardware problem. A service request was placed and Clarence
tested the drive but found no problem. The file in question was encped
without difficulty. The drive was put back in service.
Node stkendm2a went belly up. Its motherboard and CPUs were replaced.
It also got an increase in memory to 2 GB.
/pnfs/miniboone was exported to mbdata05 at Sam Zeller's request.
A ticket from Peter Cooper requesting some help for Selex's batch
processing needs was assigned to ISA because it mentioned "enstore".
It was reassigned to CSS.
The quota for SDSS was increased from 320 to 340.
The quota for exp-db was increased from 470 to 490.
185 tapes were write protected.
Volume VO6948 was cloned after having served over 6000 mounts (but no
hamburgers). Besides Dcache test files, it has about 200 test files
from various Dcache developers.
Mover DG4BDLT went offline due to 3 consecutive "Mount failed BAD"
errors. The system was rebooted and put back online.
Drive DE3FDLT learned a new trick from its partner and no longer grabs
tapes to load. The alarm for this problem indicated that the mover
might be stuck in a D state, but that wasn't the case when examined. It
was called in to ADIC.
Mover DG4CDLT went offline because volume JL5753 spent too long in state
ACTIVE. The tape had been put away. The mover was put back online.
Volume VOA181 got stuck writing file 68 in mover 9940B27 for 50 hours
due to a memory error. The tape was dismounted and put away, and the
mover restarted. The tape was subsequently written to capacity and the
files have all been marked deleted.
An inquiry was made concerning increasing the 10-cap limit on tab
flipping requests. The option to do so has been provided for.
There was also a request for a tab-flipping "unfinish" command to take
care of cases where a bad date parameter is used. In this instance, the
incorrect dates were rectified via PSQL commands. Other bookkeeping was
performed on some old work, too.
David dealt with a number of exception cases for 13 tapes needing tab
flipping and sent a very informative email about it. In summary:
VO3724 set full to force write protection
VO4178 3 active files, 3 deleted files
VO5666 blank (eod=None) - cleared ff/ffw
VO6223 blank (eod=None) - cleared ff/ffw
VO6631 blank (eod=1) - cleared sg
VO6635 blank (eod=1) - cleared sg
VO6727 blank (eod=None) - cleared ff/ffw
VO6876 set full to force write protection
VO7026 inconsistent state, 3 active files
VO7175 inconsistent state, 3 active files, 2 unknown files
VO7428 blank (eod=None), history of noaccess
VO7600 blank (eod=1) - cleared ff/ffw
VO9491 inconsistent state, many active files
Wednesday
---------
CDF Drive 9940B11 has a stored dump. It was called in.
The ball went red when 2 of the 3 9940 drives had trouble mounting
tapes due to competition from tab flipping. Volumes IA0213 and IA1030
went NOACCESS, both Dcache test tapes. They were cleared.
29 tapes were write protected. 210 tapes were write enabled.
David asks what we should do with the 10 9940A tapes in the group of
1478 tapes recycled by CDF. 3 have VO labels, so they should come out.
The others could remain as Dcache test tapes or be converted to B tapes.
CMS Volume VO2684 went NOACCESS after spending too long in SEEK in a number
of drives. The MIR was repaired and the tape put back in service.
STK The ball went red when 3 of the DLT drives went offline while 2 were
already down. Volumes JL6648 and JL6637 failed to dismount from drives
DE2EDLT and DE2FDLT, respectively. Volume JL6636 failed to mount in
drive DG4BDLT with indications that the initial rewind after load is
failing. The tapes and drives were all put back in service.
158 tapes were write protected.
4 SDSS volumes were recycled.
For some reason, the primary's attempts to acknowledge ticket 81947 for
the STKen red ball failed and the secondary was paged about 1:40 PM.
An inquiry is being made into the matter.
Mover DG4CDLT went offline with a mover/ftt exception and was restarted.
Node stkensrv2 locked up, probably due to a kernel panic around 2:15 PM.
No information was available via the log or console. It was rebooted.
The inquisitor apparently failed to come up afterward and was started
manually. "ERROR {'status': ('TIMEDOUT', 'configuration_server')}" was
all that the log file had in it.
Volume JL8319 got stuck in SETUP in mover DG4ADLT because the mover
couldn't contact stkensrv2. Enstore was restarted.
Volume JL6636 went NOACCESS in mover DG4DDLT due to a mount failure.
The mover was restarted and the tape was cleared.
Volume JL5760 went NOACCESS in mover DE2EDLT which had the tape in a
death grip. The tape was pushed back in, ejected, and put away. The
tape and drive were put back in service.
Drive DE3EDLT was locked up with a case of blinking lights. The brick
containing it and DE3FDLT was power-cycled and the movers restarted.
Volume VO9818 was cloned. It had 2056 mounts.
A request was made of the developers to add a "stop" feature to the
migration script.
Thursday
--------
CDF File /pnfs/cdfen/NULL/RT/RT00/RT0005/RT0005.3/j_115276.6685time was
reported as possibly being corrupt. "It was fixed."
Movers 9940B18, 9940B19, and 9940B25 went offline when they couldn't
mount tapes due to competition from tab flipping. Volumes IA9895,
IA9889, and IA2951 went NOACCESS in the respective drives. All cleared.
210 tapes were write enabled.
Volume IA7062 has a broken MIR, staying in SEEK for over 1/2 hour in
mover 9940B15. It will need to be fixed.
Mover 9940B14 went offline complaining that it couldn't eject volume
IA2706. The tape was actually put away. It also reported "Supposedly
a serious problem with tape drive positioning the tape: ftt.FTTError
FTT_SUCCESS." The sense data indicated a host problem. The node was
rebooted.
D0 One cartridgeful of LTO tapes was tab flipped somewhat unintentionally.
The intent was to flip 9940 tapes, but the wrong script was run.
STK There was an inquiry from e907 about how long it takes a file to get
from Dcache to tape. Sasha and Rob answered the question.
Tape VO7428 failed to mount in drive 9940B41 because it had its leader
snapped off. The tape had never had a file or even a label written to
it. It had gone NOACCESS on 3 previous attempts to use it. We suspect
the leader has been broken all along. Clarence said it had telltale
signs of having been dropped. The drive was tested to make sure it
wasn't the culprit. A new tape was put in for VO7428.
ADIC arrived and attempted to replace the DE3EDLT/DE3FDLT brick but
one of the drives in the replacement wasn't working. We're waiting for
another brick. Maybe we can build a wall to fill the empty spaces.
While we continue working with the old DE3EDLT/DE3FDLT brick, ADIC
made a slight reduction to the bow pressure on the gripper. It doesn't
appear to have helped any. A new call was placed because tape JL7232
couldn't be dismounted from DE2EDLT in several manual attempts. The
tape was reentered through the front.
Mover 9940B11 went offline with the message "Tape thread running in
state HAVE_BOUND. Will offline the mover." There had also been an
ENCP_GONE NETWORK error. The mover was restarted.
Mover 9940B16 went offline with the message "Tape thread running in
state HAVE_BOUND. Will offline the mover." There had also been an
ENCP_GONE NETWORK error. The mover was restarted.
Mover 9940B26 went offline with the message "Tape thread running in
state HAVE_BOUND. Will offline the mover." There had also been an
ENCP_GONE NETWORK error. The mover was restarted.
Mover 9940B40 went offline with the message "Tape thread running in
state HAVE_BOUND. Will offline the mover." There had also been an
ENCP_GONE NETWORK error. The mover was restarted.
Volume VO9817 was cloned. It had 2305 mounts.
Mover 9940B36 went offline with the message "Tape thread running in
state HAVE_BOUND. Will offline the mover." There had also been an
ENCP_GONE NETWORK error. The mover was restarted.
Volume VO2684 went NOACCESS in mover 9940B15 due to a MOVER_STUCK error.
There was no alarm and nothing in the system log and mover.out files.
The tape was ejected, put away, and cleared, and the mover rebooted.
The mover had to be started manually after the reboot.
27 exp-db tapes were recycled.
Friday
------
Cron jobs
---------
CDF
7/12 enstoreNetwork had a TCP_EXCEPTION at 13:47
7/12 enstoreSystem had a KeyError: servers at 04:52
7/13 enstoreNetwork had a KeyError: html_gen_host at 14:16
D0
7/12 enstoreNetwork had a TCP_EXCEPTION at 11:47
7/13 enstoreNetwork had a KeyError: html_gen_host at 09:31
STK
7/04 pageDcacheKftp was hung since last Tuesday
7/05 pageDcacheGridftp was hung since last Wednesday
7/06 pageDcacheCmsDccp was hung since last Thursday's downtime
7/08,09 The backup job failed twice over the weekend
7/08 The check_multiple job ran for almost 30 hours Saturday/Sunday
7/09 write-tabs got stuck Sunday at 01:30
7/10 pageDcacheCmsGridftp was hung since 6/29
7/10 offline_inventory had a KeyError: db_host at 23:50
7/10 STKquery had an error due to no response from the config server
7/11 pageDcacheCmsGridftp got hung again
7/11 offline_inventory had a KeyError: db_host at 12:50
7/10 STKquery had an invisible error at 9:40
7/11 STKquery had an error due to no response from the config server
7/12 delfile ran long a couple of times while stkensrv2 was down
7/12 enstoreSystem had a pair of KeyError: servers at 06:52 and 20:32
7/12 quickcheck got stuck for an hour or so when stkensrv2 went down
7/12 STKquery got stuck for an hour or so when stkensrv2 went down
Misc
----
D0en volumes active too long
Volume Count Movers
Tapes marked readonly due to write errors
Volume Count Movers
CDFen volumes active too long
Volume Count Movers
NUL045 2 cdfenmvr3a
Total 2
Tapes marked readonly due to write errors
Volume Count Movers
IAA467 1 cdfenmvr12a
IAA474 1 cdfenmvr23a
IAA481 1 cdfenmvr20a
STKen volumes active too long
Volume Count Movers
JL5753 2 stkenmvr30a
JL8319 2 stkenmvr28a
JL8320 2 stkenmvr18a
JL8324 2 stkenmvr18a
Total 8
Tapes marked readonly due to write errors
Volume Count Movers
VO3068 1 stkenmvr24a
VO3071 1 stkenmvr25a
VO3199 1 stkenmvr20a
VO6611 1 stkenmvr15a
VO7600 1 stkenmvr21a
VO9550 1 stkenmvr27a
VO9555 1 stkenmvr27a
VO9556 1 stkenmvr27a
VO9557 1 stkenmvr27a
VO9558 1 stkenmvr27a
VOA156 1 stkenmvr15a
VOA159 1 stkenmvr20a
VOA161 1 stkenmvr35a
VOA195 1 stkenmvr41a
VOB597 1 stkenmvr20a
VOB599 1 stkenmvr34a
VOB635 1 stkenmvr34a
VOB638 1 stkenmvr41a
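
A list like the one above is easier to scan when grouped by mover. The sketch
below tallies the STKen write-error rows per mover. The rows are copied from
the list, and the function name errors_per_mover is a hypothetical helper,
not an enstore tool.

```python
from collections import Counter

# STKen "Tapes marked readonly due to write errors" rows, copied from the report.
READONLY_TAPES = """\
VO3068 1 stkenmvr24a
VO3071 1 stkenmvr25a
VO3199 1 stkenmvr20a
VO6611 1 stkenmvr15a
VO7600 1 stkenmvr21a
VO9550 1 stkenmvr27a
VO9555 1 stkenmvr27a
VO9556 1 stkenmvr27a
VO9557 1 stkenmvr27a
VO9558 1 stkenmvr27a
VOA156 1 stkenmvr15a
VOA159 1 stkenmvr20a
VOA161 1 stkenmvr35a
VOA195 1 stkenmvr41a
VOB597 1 stkenmvr20a
VOB599 1 stkenmvr34a
VOB635 1 stkenmvr34a
VOB638 1 stkenmvr41a
"""


def errors_per_mover(report_text):
    """Sum the Count column per mover for 'Volume Count Mover' rows."""
    counts = Counter()
    for line in report_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip headers, totals, and blank lines
        _volume, count, mover = fields
        counts[mover] += int(count)
    return counts


counts = errors_per_mover(READONLY_TAPES)
for mover, total in counts.most_common():
    print(mover, total)
```

For this week's list the tally puts stkenmvr27a first with 5 of the 18 tapes
and stkenmvr20a next with 3. A cluster like that may be worth a look at the
drive, though the list alone does not establish a cause.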