Primary Report for the week ending Friday, 2005-04-29
*****************************************************
Summary:
Installs:
0 installs 0 decommissions
Replaced drives:
1 9940B 0 9940A 0 LTO1/LTO2 0 other
Replaced servers/movers/fileserver nodes:
0 server 0 mover 0 fileservers
1 Robot hardware maintenance service call
0 Server/Mover/Fileserver maintenance (repairs/parts replacement)
Investigations/Interventions:
2 mover 1 server 0 tape drive 0 fileserver 0 tape 1 file 1 library
Tape operations:
13 tapes clobbered/recycled
0 tapes labeled/entered/removed
3 tape MIRs fixed
0 tape MIRs not fixed
0 tape drive firmware updates
1 tape cloned
0 quota increase requests serviced
0 raid disk replacements/interventions
3 enstore service requests
1 new muon page
1 off-hour call/intervention
Sunday
------
George handled a New Muon chiller outage late Saturday night. Don Holmgren
is considering putting the Myrinet switches on a power controller.
Monday
------
CDF The system disk was almost full on cdfensrv3. The conserver log for
cdfensrv1 was renamed and compressed. We need to automate this.
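The rename-and-compress step could be automated with a small cron helper along
these lines (a sketch only; the example log path and the 10 MB default cap are
assumptions, not the actual conserver setup):

```shell
#!/bin/sh
# rotate_log LOG [MAX_BYTES]: rename, timestamp, and compress a log once
# it grows past a size cap. Sketch only; the conserver log path in the
# example and the 10 MB default cap are assumptions.
rotate_log() {
    log=$1
    max=${2:-$((10 * 1024 * 1024))}   # default cap: 10 MB
    if [ -f "$log" ] && [ "$(wc -c < "$log")" -gt "$max" ]; then
        stamp=$(date +%Y%m%d%H%M%S)
        mv "$log" "$log.$stamp"
        : > "$log"                    # recreate an empty log for the writer
        gzip -9 "$log.$stamp"
    fi
}

# Example (hypothetical path): rotate_log /var/consoles/cdfensrv1.log
```

Run daily from cron; the size check makes it a no-op when the log is small.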
Tape IA7031 was investigated for a "Too long in state MOUNT_WAIT"
alarm. It mounted and read the first file just fine, and is back in service.
Many spurious alarms were cleaned up after mover restarts; David
points out that this only seems to happen on CDF.
D0 /pnfs/sam/dzero/db2/datalogger/run179000/datalogger/all/all_2/all_0000179883_113.raw
(file 123 on tape PRR184) had a "CRC error reading tape" alarm.
The file read fine.
STK pageDcacheDccp had been stuck since last Thursday's downtime and
was consuming an entire CPU. It was killed.
Exported /e709 to fnsfo-stken, with Ron Soltz's approval.
A couple more exports are waiting for Ron's approval.
Tuesday
-------
CDF Read pools fcdfdata0{37,38,39,40}-{1,2,3} all went offline nearly
simultaneously around 3 PM. It turned out that they were being
reassigned to other duties, but there had been no announcement.
D0 Called in drive 9940B60. Sense data shows Hardware Errors.
Clarence couldn't find anything wrong with the drive, but
replaced it anyway. There haven't been any further complaints
from the mover, which had its kernel, packages, and Enstore code
updated during its outage.
One odd thing was that the fntt node was very slow at the time,
taking several seconds to close a terminal. Data was collected
to send to STK about the performance. The STKrobot_inventory
job began running long later in the week, no doubt as a result
of fntt's continuing slowness.
Mover D41BLTO got stuck around 11:15 PM last night because tape
PRO972L1 had a bad TOC. The TOC was fixed using the new
scheme, but the tape subsequently failed to mount in 2 drives
for testing, apparently because I failed to put the tape away
first. After rechecking it on Wednesday, it worked fine
in a couple of tests reading the last file on the volume.
This case is bad news, since the files were all written between
April 18th and 25th, meaning that the firmware downgrade didn't help.
This information has been sent to ADIC.
Mover D41ELTO (d0enmvr56a) has gotten stuck a couple of times this
week but cleared itself.
STK Cleaned up some odd lost+found cvs files on stkenmvr10a for
/cvs/hppc/db-2.7.7/build_unix.
Provided ftt_stats output from DLT drives to Valery.
The exp-db writes via Dcache are much improved, but they
continue to get a transient permission denied error about
once in every 50 files. It looks like that may be from
attempts to rewrite files, though. Rob Kennedy says this
is because the "cleaner" isn't enabled on STKen Dcache.
There was a complaint from minos about hangs on writes to
Dcache. Investigating.
Deleted a couple of zero-length files from r-stkendca7a-1.
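Zero-length pool files like these can be listed for review before deletion
with a helper along these lines (a sketch; the pool path in the example is
hypothetical):

```shell
#!/bin/sh
# find_empty DIR: list zero-length regular files under DIR so they can
# be reviewed before removal. Sketch only; the pool path in the example
# is hypothetical.
find_empty() {
    find "$1" -type f -size 0 -print
}

# Example (hypothetical path): find_empty /diska/r-stkendca7a-1
# Only pipe the output to rm after reviewing the list by hand.
```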
CDF Swap alarm on cdfenmvr23a.
Wednesday
---------
CDF Tape IA5387 was cloned.
CMS Tape VO7549 went noaccess when 9940B11 tried to write a file on
it. The error was "FTT_EIO: tape drive or IC problem". The
sense key indicated Medium Error, however. The tape was
deleted and re-added. The mover was put back online.
D0 The AML/2 log retrieval has been stuck since the 22nd. The FTPD
process was not running on the ADIC after the crash/reboot.
D41DLTO (d0enmvr55a) went to sleep. There are 159 S.M.A.R.T.
notices, so it's probably a disk that's about to fail. The disk
was replaced and the latest 7.3.1 kernel and Enstore code loaded.
The replaced disk is being tested.
STK 13 tapes were recycled for exp-db.
9940B quota for test raised from 11 to 20.
Thursday
--------
D0 PRO915L1 was cloned to fix the bad file to good file transition
problem at location 349 to 350. This was quite difficult.
STK stkensrv2 locked up. Rebooted.
The backup2Tape job failed to run because the previous run was
supposedly still active. But Wednesday's job had finished at 8:06 AM.
The tails of the log and histogram files show:
cat /diska/BackupToTape/enstorelog.files
Status= 0
Wed Apr 27 08:06:22 CDT 2005: Finished backup on stkensrv3
2005-04-27:07:30:01 10
2005-04-27:08:06:22 0
2005-04-28:07:30:00 10
2005-04-28:07:30:00 -2
The backup2Tape job was rerun manually and worked fine until the
last file, which it apparently tried to encp twice.
EEXIST: [ ERRNO 17 ] File exists:
/pnfs/eagle/backups/2005/04/28/15/diska-pnfs-backup-cms-cms.26.Z
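The spurious "still active" report suggests the job's lock check does not
notice a lock left behind by a dead run. A PID-based guard, sketched below,
would let a cron job reclaim such a stale lock (the lock-file path is an
assumption; backup2Tape's real locking mechanism is not known):

```shell
#!/bin/sh
# run_with_lock LOCKFILE CMD...: run CMD unless the PID recorded in
# LOCKFILE is still alive; a lock whose PID is dead is treated as stale
# and reclaimed. Sketch only; not backup2Tape's actual mechanism.
run_with_lock() {
    lock=$1; shift
    if [ -f "$lock" ] && kill -0 "$(cat "$lock")" 2>/dev/null; then
        echo "still active (pid $(cat "$lock"))" >&2
        return 1
    fi
    echo $$ > "$lock"       # claim the lock (or reclaim a stale one)
    "$@"
    status=$?
    rm -f "$lock"
    return $status
}

# Example (hypothetical): run_with_lock /tmp/backup2Tape.lock backup2Tape
```

Note that `kill -0` only tests process existence; a PID reused by an
unrelated process would still look "active", so this is a heuristic, not
a bulletproof lock.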
Friday
------
D0 There are a huge number of "too long in dismount" alarms. fntt was
rebooted to try to resolve the problem. It had been up 161 days.
Mover D41GLTO reported "Supposedly a serious problem with tape drive:
ftt.FTTError FTT_SUCCESS. Will terminate". FYI, this is the error you
get when a mover is put back online without having its tape drive
reconnected after a tape cloning session. Oops.
PRO985L1 has a bad Table Of Contents and will be repaired today.
PRO987L1 has been too long in seek twice and will be investigated.
CMS Much work is going into figuring out the CMS/eagle migration situation.
Migration should start soon.
STK Tape VO5024 went noaccess when 9940B15 tried to write a file to it.
The error was "FTT_EIO: tape drive or IC problem". Unlike VO7549,
it was not an empty tape. To be investigated.
VO5025 took too long to dismount last night and will be investigated.
This could be more fntt slowness troubles.
Perhaps 10 unauthorized network service alarms for processes running
/home/zalokar/enstore/bin/encp have been cleared this week.
A couple of nodes have had "failed to read volume label on startup"
alarms this week, yet to be investigated.
Cron jobs
---------
CDF 4/26 pageDcacheGridftp was stuck for 6 days
D0 4/27 aml2mirror has been failing since last Thursday
STK 4/25 pageDcacheDccp was stuck since last Thursday's downtime
4/25 pageDcacheCmsGridftp had been stuck for a week
4/28 backup2Tape failed to run -- spurious "still active"?
4/28 delfile stayed active for a cycle
4/28 burn-rate stayed active for a couple of cycles
4/28 quickcheck got stuck on stkensrv2
4/29 STKrobot_inventory is running too long
Other
-----
The latest from ADIC/IBM ...
Mike, we confirmed that the tape had an issue with the CM. I had a
discussion with IBM over the issue and forwarded the dump to them. In
our discussion the event seemed to be very similar to another event IBM
had been involved in. IBM confirmed that the footprint was very similar
and believes that the most recent code level we are testing here at ADIC
will correct this condition. I am hoping to confirm this in the next
day or two with the tape you provided. LTO-2 firmware 53Y2