Primary Report for the Week of April 15, 2005
*********************************************



Summary:

Pages:
 2 off hours

    Temperature alarm at LCC;  due to broken chiller (?).
    Cmspnfs1 was unpingable;  due to scheduled Network Switch maintenance.

 1 work hours

    Not enough LTO2 drives online

New Remedy tickets: 57681, 57610, 57444, 56501, 57648
Closed tickets: 56267, 56967, 57601, 56865, 57610, 56811, 56812, 57648

Server investigations: none

Mover investigations:
  One D0 Enstore 9940B drive, 9940B38
  Eight or more LTO2 drives were marked offline due to volumes with errors.
 
Tapes:  Two problem LTO2 volumes, PRO915L1 and PRO938L1.

Drives: 
  No broken drives
  Firmware changes to LTO2 drives (4APO -> 4772)

Robot Events:  None

Four mover PCs were swapped with spares and kernels were upgraded.

Monday 04/11

Numerous D0 Enstore nodes were running the selbit cronjob.
I attempted to fix the crons.

We are low on tapes in the Stken 9940A Common Pool

Wayne rebooted Stkendca11a:  Console reported a Kernel panic: Fatal exception

Wayne and Chih-Hao recycled 82 CDF 9940B tapes, but neglected to check the
write-protect tabs.  David ran flip_tab on them.

I tested volume PRO915L1 due to Too long in state SEEK errors.
The LTO2 drive had problems locating fm 349 and 350 on the tape.
I tried more tests on Tuesday.

LTO2 drives D41ALTO and D41BLTO were marked offline due to the volume PRO915L1.
I put the drives back online. 

George completed cloning 9940A volume VO7813

I responded to Joe Boyd about the number of LTO2 tapes in the ADIC.

Tuesday 04/12

George fixed corrupt files on stkendca6a and stkendca7a.

I was paged for a temperature alarm at 03:49 this morning.  I checked
and 3083 reported 88 degrees. I called dispatch to deploy the duty
mechanic. As I kept checking, the temperature remained at 88 degrees.
The duty mechanic reported that the chiller wouldn't restart and that
the next crew would repair it.

Because of a "positioning error FTT_EUNRECOVERED DRIVE volume=PRO938L1",
I ran ftt tests to skip forward and backward on the tape.  It seems
fine, but I wasn't sure how else to test the tape.

I tested PRO915L1 further:
dd: reading `/dev/rmt/tps0d1n': Input/output error on file at location 349
I sent e-mail to Adam Lyon reporting the bad file and set the file .bad..

Wednesday 04/13

ADIC FEs Mike and Vince loaded an older level of LTO2 firmware.
We are now using 4772 on all ULTRIUM-TD2 drives.
I caused a RED Ball about then.

Volume PRO938L1 had positioning errors around file 117; I cleared
the volume and set it to readonly.
 
I tested 9940B37 because of a Write sense error.  It seemed to test
fine but the log shows ftt traceback errors after I released it.
I have yet to check its cables, as Sasha suggested.

Investigated why 4 LTO2 drives went offline (on PRO915L1 file 350).
The file is readable but perhaps locating to 349 (which is .bad.)
is difficult.  Remind me to try encp again using 4772.

I spent some time troubleshooting a process accounting problem on kpasa.
We found the problem on Thursday.

Thursday 04/14

Ticket 57648: cmspnfs1 was unpingable due to the network outage.

Nothing happened to the queues on CDF Enstore at 06:30.
(And no one can prove I did it.)

PRS691 has been cloned.

VO3225 will be replaced.

Friday 04/15

George found DLT tape JL8296 in the problem box.
volser = JL8296  type = A attrib = E
        coordinate = T101271108
        use count = 5
        crash count = 0

2005-04-14 20:38:08 0319 <00443> ERROR: Touch sensor Robot 1 during PUT to rack.
2005-04-14 20:38:08 0319 <01164> AMU/P tells AMU/L to move a cartridge to the problem box.
2005-04-14 20:38:08 000000 <01077> ****> RQMA010319S1164................
2005-04-14 20:38:08 000000 <01150> <**** RQMA010319QCARY1..........P401010101ND...........JL8296
2005-04-14 20:38:10 0321 <00407> ERROR: Cartridge in gripper Robot 1. 

2005-04-15 07:02:07 0000 <01198> Check of EIF-Device No. E00 Segment 00 is complete.
2005-04-15 07:02:07 NTFY <01028> RQMA01A430IN........NTFY......................................000000111301E0000
2005-04-15 07:02:08 000000 <00922> STATUS: Problem box of P401 is empty now.
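
For picking the ERROR entries out of a longer AMU log, a minimal Python
sketch follows.  The line layout (date, time, a source field, a message
number in angle brackets, then free text) is an assumption inferred only
from the lines quoted above, and the names LINE_RE, LogEntry, parse_line
and errors_only are hypothetical helpers, not part of any ADIC or Enstore
tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred only from the log lines quoted in this report:
#   <date> <time> <source field> <message number in angle brackets> <text>
LINE_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<source>\S+) <(?P<msgno>\d+)> (?P<text>.*)$"
)


class LogEntry(NamedTuple):
    stamp: str   # e.g. "2005-04-14 20:38:08"
    source: str  # e.g. "0319", "000000" or "NTFY"
    msgno: str   # message number, kept as text to preserve leading zeros
    text: str    # remainder of the line


def parse_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None if the line does not fit the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["stamp"], m["source"], m["msgno"], m["text"])


def errors_only(lines):
    """Keep only the entries whose text starts with 'ERROR:'."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.text.startswith("ERROR:")]
```

Run over the lines above, errors_only would keep the touch sensor and
cartridge-in-gripper entries and drop the rest.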

George removed two files with CRC of 0 from stkendca6a and stkendca7a.

Received ADIC log report from Greg Arellano:

April 12  Bow not in back position on R1 for Volser JL7878 to drive D4B
April 13  Gripper Handling issue, Bow not in back position for Volser
          JL7906 to drive D2F, Cartridge Not Ejected from Volser
          PRO912L1 to drive D2A.
April 14  There were multiple Touch Sensor warnings during a put. These
          were on Drive D3 and the following Volsers PRO561L1, PRO941L1,
          PRO912L1, PRO919L1, PRO942L1, PRO944L1, PRO961L1 & PRO959L1.
April 15  HOC error for partner 01 & 03
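
To spot volumes that show up on more than one day of a report like this
(PRO912L1 is listed under both April 13 and April 14 above), a rough
Python sketch is given below.  The volser patterns are assumptions based
only on the labels seen in this report, and volsers_by_day and
repeat_offenders are hypothetical helper names.

```python
import re

# Assumed volser shapes, taken only from labels that appear in this report:
# "PRO" + 3 digits + "L1" (LTO2) and "JL" + 4 digits (DLT).
VOLSER_RE = re.compile(r"\b(?:PRO\d{3}L1|JL\d{4})\b")

# A new day starts on a line that begins with "April <day>".
DAY_RE = re.compile(r"^(April \d+)\s")


def volsers_by_day(report_lines):
    """Map each volser to the set of day labels it is listed under."""
    seen = {}
    day = None
    for line in report_lines:
        m = DAY_RE.match(line)
        if m:
            day = m.group(1)
        # Continuation lines inherit the most recent day label.
        for volser in VOLSER_RE.findall(line):
            seen.setdefault(volser, set()).add(day)
    return seen


def repeat_offenders(report_lines):
    """Return, sorted, the volsers listed under more than one day."""
    by_day = volsers_by_day(report_lines)
    return sorted(v for v, days in by_day.items() if len(days) > 1)
```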

Other:
If we wish/need to create an Enstore workgroup for the Fermi Linux
install, now is the time.  Send requests to linux-workgroups@fnal.gov
with the following info:
1. workgroup name
2. maintainer(s): names, e-mail addresses, and Kerberos principal names
3. How many systems are expected to use it.