CCF/IA Report for the week ending Friday, 2006-07-14

Installs:
    0 installs
    0 decommissions
Replaced drives:
    0 9940B
    0 9940A
    0 LTO1/LTO2
    2 other
Replaced servers/movers/fileserver nodes:
    1 server
    0 mover
    0 fileservers
0 Robot hardware maintenance service calls
0 Server/Mover/Fileserver maintenance (repairs/parts replacement)
Investigations/Interventions:
    39 mover
    2 server
    5 tape drive
    0 fileserver
    39 tape
    0 file
    0 library
tape operations
    367 tapes clobbered/recycled
    792 tapes labeled/entered/removed
    3 tape MIRs fixed
    0 tape MIRs not fixed
    0 tape drive firmware updates
    3 tapes cloned
3 quota increase requests serviced
0 raid disk replacements/interventions
1 enstore service requests
0 new muon pages
2 off-hour calls/interventions

Monday
------

CDF

Volume IAA453 went NOACCESS last Wednesday in mover 9940B20. The tape was
new and unlabeled. The label was written then the transfer failed with
"WRITE_ERROR Tape IAA453 at BOT, can not write TAPE volume=IAA453
location=1". The tape was checked with volume_assert which reported
READ_VOL1_READ_ERR. There was no sense data reported. The eod_cookie was
reset and the tape cleared.

CMS

Stopped writing tapes when their available blanks dropped below 50. They
had 202 blanks last Wednesday but have written 159 in the last week, far
faster than their usual rate. 153 A tapes were clobbered and converted to
B tapes. 183 B tapes were recycled, and of those, 75 were write-protected
and needed their tabs flipped. That request was made on Tuesday.

D0

Ticket 81719 was generated because an encp write failed with an error:
Filesystem is corrupt. We were all glad to learn that "This has been
fixed. The file transfer should work now."

STK

Volume VOA069 went NOACCESS in mover 9940B36 when it failed to mount. The
alarm indicated that the tape was in 9940B17's drive. By the time it was
checked, the tape was home. It was cleared.

Mover 9940B22 went offline with the message "Tape thread running in state
HAVE_BOUND. Will offline the mover." There had also been an ENCP_GONE
NETWORK error.
The mover was restarted.

Mover 9940B32 went offline with the message "Tape thread running in state
HAVE_BOUND. Will offline the mover." There had also been an ENCP_GONE
NETWORK error. The mover was restarted.

Volume VO6618 went NOACCESS in mover 9940B17 due to a failed MIR. The MIR
was rebuilt, and the mover and tape placed back in service.

Drive DE3EDLT has changed its approach from an occasional death grip to
failing to grab the tape at all. It was called in to ADIC.

Volume JL7528 went NOACCESS in mover DE2EDLT because the drive pushed the
tape out a little too far when ejecting and the robot bumped it back in
when it tried to put the tape away, apparently, causing the drive to go
for the death grip. The tape was manually dismounted and put away, and
the mover and tape were placed back in service.

The syscollect.sh script was missing from some systems. Whatever it is.

We continue to see intermittent problems connecting to the configuration
server. For example, we might see one instance while checking a list of
183 tapes for write protect status. The first retry usually works.

A request for information and/or help was made to delete the old tape
LEGL01 which no longer exists. Chih answered the call.

FTT

Testing has commenced for ftt v2_25.

Tuesday
-------

D0

Volume PRT465 went NOACCESS when it failed to mount in drive 9940B33
because it had failed to dismount and was still in drive 9940B27. There
was sense data indicating a host problem so the mover was rebooted. The
tape was manually dismounted and put away, and tape and movers cleared.

The ball went red late in the afternoon because testing consumed all of
the LTO-2s in the testlto2 library. For some reason this storage group
wasn't being ignored. That was fixed.

STK

Primary was paged at 6:30 AM because minos's quota was exhausted. Their
quota was increased from 200 to 240.

Mover 9940B23 went offline due to a write error on the drive, an EBLANK
error on a selective CRC check on volume VOA203.
The sense data indicated a hardware problem. A service request was placed
and Clarence tested the drive but found no problem. The file in question
was encped without difficulty. The drive was put back in service.

Node stkendm2a went belly up. Its motherboard and CPUs were replaced. It
also got an increase in memory to 2 GB.

/pnfs/miniboone was exported to mbdata05 at Sam Zeller's request.

A ticket from Peter Cooper requesting some help for Selex's batch
processing needs was assigned to ISA because it mentioned "enstore". It
was reassigned to CSS.

The quota for SDSS was increased from 320 to 340. The quota for exp-db
was increased from 470 to 490.

185 tapes were write protected.

Volume VO6948 was cloned after having served over 6000 mounts (but no
hamburgers). Besides Dcache test files, it has about 200 test files from
various Dcache developers.

Mover DG4BDLT went offline due to 3 consecutive "Mount failed BAD"
errors. The system was rebooted and put back online.

Drive DE3FDLT learned a new trick from its partner and no longer grabs
tapes to load. The alarm for this problem indicated that the mover might
be stuck in a D state, but that wasn't the case when examined. It was
called in to ADIC.

Mover DG4CDLT went offline because volume JL5753 spent too long in state
ACTIVE. The tape had been put away. The mover was put back online.

Volume VOA181 got stuck writing file 68 in mover 9940B27 for 50 hours due
to a memory error. The tape was dismounted and put away, and the mover
restarted. The tape was subsequently written to capacity and the files
have all been marked deleted.

An inquiry was made concerning increasing the 10-cap limit on tab
flipping requests. The option to do so has been provided for. There was
also a request for a tab-flipping "unfinish" command to take care of
cases where a bad date parameter is used. In this instance, the incorrect
dates were rectified via PSQL commands. Other bookkeeping was performed
on some old work, too.

David dealt with a number of exception cases for 13 tapes needing tab
flipping and sent a very informative email about it. In summary:

    VO3724  set full to force write protection
    VO4178  3 active files, 3 deleted files
    VO5666  blank (eod=None) - cleared ff/ffw
    VO6223  blank (eod=None) - cleared ff/ffw
    VO6631  blank (eod=1) - cleared sg
    VO6635  blank (eod=1) - cleared sg
    VO6727  blank (eod=None) - cleared ff/ffw
    VO6876  set full to force write protection
    VO7026  inconsistent state, 3 active files
    VO7175  inconsistent state, 3 active files, 2 unknown files
    VO7428  blank (eod=None), history of noaccess
    VO7600  blank (eod=1) - cleared ff/ffw
    VO9491  inconsistent state, many active files

Wednesday
---------

CDF

Drive 9940B11 has a stored dump. It was called in.

The ball went red when 2 of the 3 9940 drives had trouble mounting tapes
due to competition from tab flipping. Volumes IA0213 and IA1030 went
NOACCESS, both Dcache test tapes. They were cleared.

29 tapes were write protected. 210 tapes were write enabled.

David asks what we should do with the 10 9940A tapes in the group of 1478
tapes recycled by CDF. 3 have VO labels, so they should come out. The
others could remain as Dcache test tapes or be converted to B tapes.

CMS

Volume VO2684 went NOACCESS after spending too long in SEEK in a number
of drives. The MIR was repaired and the tape put back in service.

STK

The ball went red when 3 of the DLT drives went offline while 2 were
already down. Volumes JL6648 and JL6637 failed to dismount from drives
DE2EDLT and DE2FDLT, respectively. Volume JL6636 failed to mount in drive
DG4BDLT with indications that the initial rewind after load is failing.
The tapes and drives were all put back in service.

158 tapes were write protected. 4 SDSS volumes were recycled.

For some reason, the primary's attempts to acknowledge ticket 81947 for
the STKen red ball failed and the secondary was paged about 1:40 PM. An
inquiry is being made into the matter.

Mover DG4CDLT went offline with a mover/ftt exception and was restarted.

Node stkensrv2 locked up, probably due to a kernel panic around 2:15 PM.
No information was available via the log or console. It was rebooted. The
inquisitor apparently failed to come up afterward and was started
manually. "ERROR {'status': ('TIMEDOUT', 'configuration_server')}" was
all that the log file had in it.

Volume JL8319 got stuck in SETUP in mover DG4ADLT because the mover
couldn't contact stkensrv2. Enstore was restarted.

Volume JL6636 went NOACCESS in mover DG4DDLT due to a mount failure. The
mover was restarted and the tape was cleared.

Volume JL5760 went NOACCESS in mover DE2EDLT which had the tape in a
death grip. The tape was pushed back in, ejected, and put away. The tape
and drive were put back in service.

Drive DE3EDLT was locked up with a case of blinking lights. The brick
containing it and DE3FDLT was power-cycled and the movers restarted.

Volume VO9818 was cloned. It had 2056 mounts.

A request was made of the developers to add a "stop" feature to the
migration script.

Thursday
--------

CDF

File
/pnfs/cdfen/NULL/RT/RT00/RT0005/RT0005.3/j_115276.6685time
was reported as possibly being corrupt. "It was fixed."

Movers 9940B18, 9940B19, and 9940B25 went offline when they couldn't
mount tapes due to competition from tab flipping. Volumes IA9895, IA9889,
and IA2951 went NOACCESS in the respective drives. All cleared.

210 tapes were write enabled.

Volume IA7062 has a broken MIR, staying in SEEK for over 1/2 hour in
mover 9940B15. Will need to be fixed.

Mover 9940B14 went offline complaining that it couldn't eject volume
IA2706. The tape was actually put away. It also reported "Supposedly a
serious problem with tape drive positioning the tape: ftt.FTTError
FTT_SUCCESS." The sense data indicated a host problem. The node was
rebooted.

D0

One cartridgeful of LTO tapes was tab flipped somewhat unintentionally.
The intent was to flip 9940 tapes, but the wrong script was run.
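The configuration-server timeouts noted this week (the write-protect
status checks on Monday, the inquisitor's TIMEDOUT status on Wednesday)
were usually cleared by a single retry. A minimal sketch of that retry
pattern is given here. The names with_retries and flaky_query are
hypothetical, made up for illustration, and are not part of Enstore; the
stand-in query simply times out once and then succeeds.

```python
import time


def with_retries(call, attempts=3, delay=1.0, retry_on=(TimeoutError,)):
    """Run call(); on a listed exception, wait and retry, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)


# Hypothetical stand-in for a configuration-server query that times out once.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("configuration_server")
    return "write-protected"


result = with_retries(flaky_query, delay=0.0)
print(result, "after", calls["n"], "attempts")
```

Capping the number of attempts keeps a genuinely dead server (as when
stkensrv2 locked up) from stalling a script indefinitely.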

STK

There was an inquiry from e907 about how long it takes a file to get from
Dcache to tape. Sasha and Rob answered the question.

Tape VO7428 failed to mount in drive 9940B41 because it had its leader
snapped off. The tape had never had a file or even a label written to it.
It had gone NOACCESS on 3 previous attempts to use it. We suspect the
leader has been broken all along. Clarence said it had telltale signs of
having been dropped. The drive was tested to make sure it wasn't the
culprit. A new tape was put in for VO7428.

ADIC arrived and attempted to replace the DE3EDLT/DE3FDLT brick but one
of the drives in the replacement wasn't working. We're waiting for
another brick. Maybe we can build a wall to fill the empty spaces.

While we continue working with the old DE3EDLT/DE3FDLT brick, ADIC made a
slight reduction to the bow pressure on the gripper. It doesn't appear to
have helped any.

A new call was placed because tape JL7232 couldn't be dismounted from
DE2EDLT in several manual attempts. The tape was reentered through the
front.

Mover 9940B11 went offline with the message "Tape thread running in state
HAVE_BOUND. Will offline the mover." There had also been an ENCP_GONE
NETWORK error. The mover was restarted.

Mover 9940B16 went offline with the message "Tape thread running in state
HAVE_BOUND. Will offline the mover." There had also been an ENCP_GONE
NETWORK error. The mover was restarted.

Mover 9940B26 went offline with the message "Tape thread running in state
HAVE_BOUND. Will offline the mover." There had also been an ENCP_GONE
NETWORK error. The mover was restarted.

Mover 9940B40 went offline with the message "Tape thread running in state
HAVE_BOUND. Will offline the mover." There had also been an ENCP_GONE
NETWORK error. The mover was restarted.

Volume VO9817 was cloned. It had 2305 mounts.

Mover 9940B36 went offline with the message "Tape thread running in state
HAVE_BOUND. Will offline the mover." There had also been an ENCP_GONE
NETWORK error.
The mover was restarted.

Volume VO2684 went NOACCESS in mover 9940B15 due to a MOVER_STUCK error.
There was no alarm and nothing in the system log and mover.out files. The
tape was ejected, put away, and cleared, and the mover rebooted. The
mover had to be started manually after the reboot.

27 exp-db tapes were recycled.

Friday
------

Cron jobs
---------

CDF

    7/12     enstoreNetwork had a TCP_EXCEPTION at 13:47
    7/12     enstoreSystem had a KeyError: servers at 04:52
    7/13     enstoreNetwork had a KeyError: html_gen_host at 14:16

D0

    7/12     enstoreNetwork had a TCP_EXCEPTION at 11:47
    7/13     enstoreNetwork had a KeyError: html_gen_host at 09:31

STK

    7/04     pageDcacheKftp was hung since last Tuesday
    7/05     pageDcacheGridftp was hung since last Wednesday
    7/06     pageDcacheCmsDccp was hung since last Thursday's downtime
    7/08,09  The backup job failed twice over the weekend
    7/08     The check_multiple job ran for almost 30 hours Saturday/Sunday
    7/09     write-tabs got stuck Sunday at 01:30
    7/10     pageDcacheCmsGridftp was hung since 6/29
    7/10     offline_inventory had a KeyError: db_host at 23:50
    7/10     STKquery had an error due to no response from the config server
    7/11     pageDcacheCmsGridftp got hung again
    7/11     offline_inventory had a KeyError: db_host at 12:50
    7/10     STKquery had an invisible error at 9:40
    7/11     STKquery had an error due to no response from the config server
    7/12     delfile ran long a couple of times while stkensrv2 was down
    7/12     enstoreSystem had a pair of KeyError: servers at 06:52 and 20:32
    7/12     quickcheck got stuck for an hour or so when stkensrv2 went down
    7/12     STKquery got stuck for an hour or so when stkensrv2 went down

Misc
----

D0en volumes active too long

    Volume  Count  Movers

Tapes marked readonly due to write errors

    Volume  Count  Movers

CDFen volumes active too long

    Volume  Count  Movers
    NUL045  2      cdfenmvr3a
    Total   2

Tapes marked readonly due to write errors

    Volume  Count  Movers
    IAA467  1      cdfenmvr12a
    IAA474  1      cdfenmvr23a
    IAA481  1      cdfenmvr20a

STKen volumes active too long

    Volume  Count  Movers
    JL5753  2      stkenmvr30a
    JL8319  2      stkenmvr28a
    JL8320  2      stkenmvr18a
    JL8324  2      stkenmvr18a
    Total   8

Tapes marked readonly due to write errors

    Volume  Count  Movers
    VO3068  1      stkenmvr24a
    VO3071  1      stkenmvr25a
    VO3199  1      stkenmvr20a
    VO6611  1      stkenmvr15a
    VO7600  1      stkenmvr21a
    VO9550  1      stkenmvr27a
    VO9555  1      stkenmvr27a
    VO9556  1      stkenmvr27a
    VO9557  1      stkenmvr27a
    VO9558  1      stkenmvr27a
    VOA156  1      stkenmvr15a
    VOA159  1      stkenmvr20a
    VOA161  1      stkenmvr35a
    VOA195  1      stkenmvr41a
    VOB597  1      stkenmvr20a
    VOB599  1      stkenmvr34a
    VOB635  1      stkenmvr34a
    VOB638  1      stkenmvr41a
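The STKen read-only table lists one row per tape, so it does not show at
a glance whether the write errors cluster on particular movers. A small
sketch of a per-mover tally is given here, with the rows copied from this
report; the variable names are illustrative only and do not come from any
Enstore tool.

```python
from collections import Counter

# (volume, mover) rows copied from the STKen
# "Tapes marked readonly due to write errors" table in this report.
READONLY_STKEN = [
    ("VO3068", "stkenmvr24a"), ("VO3071", "stkenmvr25a"),
    ("VO3199", "stkenmvr20a"), ("VO6611", "stkenmvr15a"),
    ("VO7600", "stkenmvr21a"), ("VO9550", "stkenmvr27a"),
    ("VO9555", "stkenmvr27a"), ("VO9556", "stkenmvr27a"),
    ("VO9557", "stkenmvr27a"), ("VO9558", "stkenmvr27a"),
    ("VOA156", "stkenmvr15a"), ("VOA159", "stkenmvr20a"),
    ("VOA161", "stkenmvr35a"), ("VOA195", "stkenmvr41a"),
    ("VOB597", "stkenmvr20a"), ("VOB599", "stkenmvr34a"),
    ("VOB635", "stkenmvr34a"), ("VOB638", "stkenmvr41a"),
]

# Count how many read-only tapes each mover produced.
per_mover = Counter(mover for _volume, mover in READONLY_STKEN)

# Print movers from most to fewest read-only tapes.
for mover, count in per_mover.most_common():
    print(f"{mover:<12} {count}")
```

On this week's rows, stkenmvr27a accounts for 5 of the 18 tapes and
stkenmvr20a for 3.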

