Primary Report for the week ending Friday, 2005-04-29
*****************************************************
Summary:
Installs:
0 installs 0 decommissions
Replaced drives:
1 9940B 0 9940A 0 LTO1/LTO2 0 other
Replaced servers/movers/fileserver nodes:
0 server 0 mover 0 fileservers
1 Robot hardware maintenance service call
0 Server/Mover/Fileserver maintenance (repairs/parts replacement)
Investigations/Interventions:
2 mover 1 server 0 tape drive 0 fileserver 0 tape 1 file 1 library
Tape operations:
13 tapes clobbered/recycled
0 tapes labeled/entered/removed
3 tape MIRs fixed
0 tape MIRs not fixed
0 tape drive firmware updates
1 tape cloned
0 quota increase requests serviced
0 raid disk replacements/interventions
3 enstore service requests
1 new muon page
1 off-hour call/intervention
Sunday
------
George handled a New Muon chiller outage late Saturday night. Don Holmgren
is considering putting the Myrinet switches on a power controller.
Monday
------
CDF The system disk was almost full on cdfensrv3. The conserver log for
cdfensrv1 was renamed and compressed. We need to automate this.
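The rename-and-compress step could be automated with a small cron helper along
these lines (a sketch only; the example log path and the 10 MB default cap are
assumptions, not the actual conserver setup):

```shell
#!/bin/sh
# rotate_log LOG [MAX_BYTES]: rename, timestamp, and compress a log once
# it grows past a size cap. Sketch only; the conserver log path in the
# example and the 10 MB default cap are assumptions.
rotate_log() {
    log=$1
    max=${2:-$((10 * 1024 * 1024))}   # default cap: 10 MB
    if [ -f "$log" ] && [ "$(wc -c < "$log")" -gt "$max" ]; then
        stamp=$(date +%Y%m%d%H%M%S)
        mv "$log" "$log.$stamp"
        : > "$log"                    # recreate an empty log for the writer
        gzip -9 "$log.$stamp"
    fi
}

# Example (hypothetical path): rotate_log /var/consoles/cdfensrv1.log
```

Run daily from cron; the size check makes it a no-op when the log is small.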
Tape IA7031 was investigated for a "Too long in state MOUNT_WAIT"
alarm. It mounted and read the first file just fine, and is back in service.
Many spurious alarms were cleaned up after mover restarts; David
points out that this only seems to happen on CDF.
D0 /pnfs/sam/dzero/db2/datalogger/run179000/datalogger/all/all_2/all_0000179883_113.raw
(file 123 on tape PRR184) had a "CRC error reading tape" alarm.
The file read fine.
STK pageDcacheDccp had been stuck since last Thursday's downtime and
was consuming an entire CPU. It was killed.
Exported /e709 to fnsfo-stken, with Ron Soltz's approval.
A couple more exports are waiting for Ron's approval.
Tuesday
-------
CDF Read pools fcdfdata0{37,38,39,40}-{1,2,3} all went offline nearly
simultaneously around 3 PM. It turned out that they were being
reassigned to other duties, but there had been no announcement.
D0 Called in drive 9940B60. Sense data shows Hardware Errors.
Clarence couldn't find anything wrong with the drive, but
replaced it anyway. There haven't been any further complaints
from the mover, which had its kernel, packages, and Enstore code
updated during its outage.
One odd thing was that the fntt node was very slow at the time,
taking several seconds to close a terminal. Data was collected
to send to STK about the performance. The STKrobot_inventory
job began running long later in the week, no doubt as a result
of fntt's continuing slowness.
Mover D41BLTO got stuck around 11:15 PM last night because tape
PRO972L1 had a bad TOC. The TOC was fixed using the new
scheme, but the tape subsequently failed to mount in 2 drives
for testing, apparently because I failed to put the tape away
first. After rechecking it on Wednesday, it worked fine
in a couple of tests reading the last file on the volume.
This case is bad news, since the files were all written between
April 18th and 25th, meaning that the firmware downgrade didn't help.
This information has been sent to ADIC.
Mover D41ELTO (d0enmvr56a) has gotten stuck a couple of times this
week but cleared itself.
STK Cleaned up some odd lost+found cvs files on stkenmvr10a for
/cvs/hppc/db-2.7.7/build_unix.
Provided ftt_stats output from DLT drives to Valery.
The exp-db writes via Dcache are much improved, but they
continue to get a transient permission denied error about
once in every 50 files. It looks like that may be from
attempts to rewrite files, though. Rob Kennedy says this
is because the "cleaner" isn't enabled on STKen Dcache.
There was a complaint from minos about hangs on writes to
Dcache. Investigating.
Deleted a couple of zero-length files from r-stkendca7a-1.
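Zero-length pool files like these can be listed for review before deletion
with a helper along these lines (a sketch; the pool path in the example is
hypothetical):

```shell
#!/bin/sh
# find_empty DIR: list zero-length regular files under DIR so they can
# be reviewed before removal. Sketch only; the pool path in the example
# is hypothetical.
find_empty() {
    find "$1" -type f -size 0 -print
}

# Example (hypothetical path): find_empty /diska/r-stkendca7a-1
# Only pipe the output to rm after reviewing the list by hand.
```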
CDF Swap alarm on cdfenmvr23a.
Wednesday
---------
CDF Tape IA5387 was cloned.
CMS Tape VO7549 went noaccess when 9940B11 tried to write a file on
it. The error was "FTT_EIO: tape drive or IC problem". The
sense key indicated Medium Error, however. The tape was
deleted and re-added. The mover was put back online.
D0 The AML/2 log retrieval has been stuck since the 22nd. The FTPD
process was not running on the ADIC after the crash/reboot.
D41DLTO (d0enmvr55a) went to sleep. There are 159 S.M.A.R.T.
notices, so it's probably a disk that's about to fail. The disk
was replaced and the latest 7.3.1 kernel and Enstore code loaded.
The replaced disk is being tested.
STK 13 tapes were recycled for exp-db.
9940B quota for test raised from 11 to 20.
Thursday
--------
D0 PRO915L1 was cloned to fix the bad file to good file transition
problem at location 349 to 350. This was quite difficult.
STK stkensrv2 locked up. Rebooted.
The backup2Tape job failed to run because the previous run was
supposedly still active. But Wednesday's job had finished at 8:06 AM.
The tails of the log and histogram files show:
cat /diska/BackupToTape/enstorelog.files
Status= 0
Wed Apr 27 08:06:22 CDT 2005: Finished backup on stkensrv3
2005-04-27:07:30:01 10
2005-04-27:08:06:22 0
2005-04-28:07:30:00 10
2005-04-28:07:30:00 -2
The backup2Tape job was rerun manually and worked fine until the
last file, which it apparently tried to encp twice.
EEXIST: [ ERRNO 17 ] File exists:
/pnfs/eagle/backups/2005/04/28/15/diska-pnfs-backup-cms-cms.26.Z
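The spurious "still active" report suggests the job's lock check does not
notice a lock left behind by a dead run. A PID-based guard, sketched below,
would let a cron job reclaim such a stale lock (the lock-file path is an
assumption; backup2Tape's real locking mechanism is not known):

```shell
#!/bin/sh
# run_with_lock LOCKFILE CMD...: run CMD unless the PID recorded in
# LOCKFILE is still alive; a lock whose PID is dead is treated as stale
# and reclaimed. Sketch only; not backup2Tape's actual mechanism.
run_with_lock() {
    lock=$1; shift
    if [ -f "$lock" ] && kill -0 "$(cat "$lock")" 2>/dev/null; then
        echo "still active (pid $(cat "$lock"))" >&2
        return 1
    fi
    echo $$ > "$lock"       # claim the lock (or reclaim a stale one)
    "$@"
    status=$?
    rm -f "$lock"
    return $status
}

# Example (hypothetical): run_with_lock /tmp/backup2Tape.lock backup2Tape
```

Note that `kill -0` only tests process existence; a PID reused by an
unrelated process would still look "active", so this is a heuristic, not
a bulletproof lock.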
Friday
------
D0 There are a huge number of "too long in dismount" alarms. fntt was
rebooted to try to resolve the problem. It had been up 161 days.
Mover D41GLTO reported "Supposedly a serious problem with tape drive:
ftt.FTTError FTT_SUCCESS. Will terminate". FYI, this is the error you
get when a mover is put back online without having its tape drive
reconnected after a tape cloning session. Oops.
PRO985L1 has a bad Table Of Contents and will be repaired today.
PRO987L1 has been too long in seek twice and will be investigated.
CMS Much work is going into figuring out the CMS/eagle migration situation.
Migration should start soon.
STK Tape VO5024 went noaccess when 9940B15 tried to write a file to it.
The error was "FTT_EIO: tape drive or IC problem". Unlike VO7549,
it was not an empty tape. To be investigated.
VO5025 took too long to dismount last night and will be investigated.
This could be more fntt slowness troubles.
Perhaps 10 unauthorized network service alarms for processes running
/home/zalokar/enstore/bin/encp have been cleared this week.
A couple of nodes have had "failed to read volume label on startup"
alarms this week, yet to be investigated.
Cron jobs
---------
CDF 4/26 pageDcacheGridftp was stuck for 6 days
D0 4/27 aml2mirror has been failing since last Thursday
STK 4/25 pageDcacheDccp was stuck since last Thursday's downtime
4/25 pageDcacheCmsGridftp had been stuck for a week
4/28 backup2Tape failed to run -- spurious "still active"?
4/28 delfile stayed active for a cycle
4/28 burn-rate stayed active for a couple of cycles
4/28 quickcheck got stuck on stkensrv2
4/29 STKrobot_inventory is running too long
Other
-----
The latest from ADIC/IBM ...
Mike, we confirmed that the tape had an issue with the CM. I had a
discussion with IBM over the issue and forwarded the dump to them. In
our discussion the event seemed to be very similar to another event IBM
had been involved in. IBM confirmed that the footprint was very similar
and believes that the most recent code level we are testing here at ADIC
will correct this condition. I am hoping to confirm this in the next
day or two with the tape you provided. LTO-2 firmware 53Y2