From:	D0ISU1::SYJUN         5-MAR-1994 19:58:34.33
To:	FNALV::SCHELLMAN
CC:	
Subj:	

D0_UNIX_FARM

I. Tape Drive

1) General

Each physical machine (three logical machines, LM) has seven tape drives:
three for the three inspoolers, three for the three outspoolers, and one spare.
Inspoolers can share tape drives when drives are short, but each of the
three outspoolers must have its own tape drive.

2) Tape Drive Failure

A tape drive will fail after several months of use.  The usual symptom
of a bad tape drive is three or more mount failures in a row (intape or outtape
ABEND with error counts).  After 5 accumulated failures with actual error
counts, you will be paged from the farm.  Look at drives.fnsfx in the
~/proman/resource area (the third column is the actual error count
and the fourth is the accumulated error count).  When this happens,
first call the operators to have the drive cleaned.  After it has been
cleaned, try the drive again (set the STATUS to READY and the actual
error count to 0).  If it still fails after a while, do the following steps:
0)  Combine the inspoolers properly, if necessary, in the resource file in the
    ~/proman/resource area, based on inspool_list_#.fnsfx in the same area.
    - skip this step for one bad tape drive.
    - see the example of how to combine inspoolers.
1)  execute cps_umaint tape sgi82 broken noswap [ < comment ] - type why
    you believe the device is broken and ^D (end of file) when you are done,
    or edit a comment file first and use the option.
    This will update the tape drive state to broken for cps_tape [-lt]
    and send mail to farm-admin@fnsg01.fnal.gov and farm-user-fnsfe@fnsg01
2)  call operator for log.
    (our production names: sgid0sfe and sgid0sfd)
   
*   the person in farm-admin in charge of bad tape replacement/repair is
    fagan@large.fnal.gov - do not contact him directly.
*   whenever you send mail to farm-admin@fnsg01, send cc/farm-users-d0@fnsg01.

Other symptoms may be a hung inspooler or outspooler, and too many files in
the proman/sta area (this usually happens when a tape drive is changed to
single density mode).
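
The check of drives.fnsfx described above can be sketched as a small shell
function.  This is only a sketch: it assumes the file has whitespace-separated
columns with the drive name in column 1 (an assumption; the text only fixes
columns 3 and 4), and the function name suspect_drives is hypothetical.  The
thresholds 3 and 5 come from the text, but how they map onto the two columns
is a guess.

```shell
# Sketch only.  Assumed layout: column 1 = drive name (assumption),
# column 3 = actual error count, column 4 = accumulated error count.
suspect_drives() {
    # $1 = path to a drives file; lines without numeric counts are skipped
    awk '$3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ &&
         ($3 >= 3 || ($3 > 0 && $4 >= 5)) {
             print $1, "actual=" $3, "accumulated=" $4
         }' "$1"
}
```

For example, suspect_drives ~/proman/resource/drives.fnsfx would list the
drives worth asking the operators to clean first.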

3)  How to combine inspoolers in 3 LM:

a)  1 bad tape drive  - No action is necessary.
b)  2 bad tape drives - two inspoolers have to share one tape drive.
c)  3 bad tape drives - set one LM to UNAVAILABLE.
d)  4 bad tape drives - cases b) and c) simultaneously.
e)  5 bad tape drives - set two LM to UNAVAILABLE.
For b)-d), edit or use the proper inspool_list_#.fnsfe file and edit the proper
           inspool # in the resource file at ~/proman/resource.
For c), you may combine three inspoolers together.  Nonetheless, it is
        not recommended because it reduces system efficiency.
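
As a quick reminder, the table above can be written as a lookup.  This is a
sketch only; the function name inspool_action is hypothetical, and the entry
for 0 bad drives is an addition not in the table.

```shell
# Sketch only: maps the number of bad tape drives on one physical machine
# (3 LM, 7 drives) to the action listed in the table above.
inspool_action() {
    case "$1" in
        0|1) echo "no action" ;;   # 0 is not in the table; added here
        2)   echo "two inspoolers share one tape drive" ;;
        3)   echo "set one LM to UNAVAILABLE" ;;
        4)   echo "two inspoolers share one tape drive; set one LM to UNAVAILABLE" ;;
        5)   echo "set two LM to UNAVAILABLE" ;;
        *)   echo "not covered by the table: $1 bad drives" >&2; return 1 ;;
    esac
}
```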


II. Automatic Paging System

In critical situations, the system will automatically page you so
that problems can be fixed without delay.  You may add paging codes
if necessary.  The following is the current setup:

1) AUTOMATIC PAGING CODE
-----------------------------------------------------------------------------
Code   Corresponding Mail Message 
_____________________________________________________________________________
00001  Disk space; fnsfg:/usr/people/dzero %use > 95% 
00002  No tapedrive; All drives are busy; fatal error submitting the spooler
00003  Bad tapedrives: Error counts exceeded
00004  All drives are busy; fatal error submitting the spooler
00005  All drives are busy; fatal error submitting the output spooler
00006  All drives are busy; fatal error submitting the input spooler
00007  No available output tape; no blank tape for output tapes
00008  No current project defined
00009  Disk space; /dev/dsk/lvx %use > 89%
-----------------------------------------------------------------------------

2) What to do

00001 and 00009  : This means that the source and log file disk (dzero) is
                   almost full.  Use the df and du commands to check free disk
                   space and disk usage respectively.  This is the most
                   dangerous problem; you have to stop the system completely
                   and immediately, then start investigating.
00002  -  00004  : Check resource or drives.fnsfx at ~/proman/resource
00003,00005,00006: consult Tape Drive Failure
00007            : No action is necessary.  Just call the operator and
                   check the available output tapes.
00008            : Check resource and project area.                     
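
For use at the terminal when the pager goes off, the list above can be kept
as a small lookup function.  This is a sketch only; the function name
paging_action is hypothetical and the wording of each action is condensed
from the list above.

```shell
# Sketch only: paging code -> first action, condensed from the list above.
# Codes are compared as strings, so keep the leading zeros.
paging_action() {
    case "$1" in
        00001|00009) echo "stop the system; check disk with df and du" ;;
        00002|00004) echo "check resource and drives.fnsfx in ~/proman/resource" ;;
        00003)       echo "check drives.fnsfx; follow Tape Drive Failure" ;;
        00005|00006) echo "follow Tape Drive Failure" ;;
        00007)       echo "call the operator; check available output tapes" ;;
        00008)       echo "check resource and project area" ;;
        *)           echo "unknown paging code: $1" >&2; return 1 ;;
    esac
}
```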


III. Mail

During the shift, you will receive a lot of mail from the system to unix_proman
or from products and the farm-admin to farm-user.  Mail to unix_proman
is the subject here.  The other mail is more general and broad, at the
level of the whole fnal sgi farm, and includes reports from farm-users,
shutdown schedules, farm meetings, etc.

Most of the time, the various mail messages from the farm are enough to give
you diagnostics or symptoms of problems as well as the problems themselves.
There are mainly three categories of mail: warning, alarm, and critical
messages.  Most of them are well explained in the TPM user manual.
Here are some practical ways to handle the various mails as a monitor.

1)  P-server(s):
This is just a warning message.  Ignore it unless it is repeated
many times for one fnsfx_# that is doing no processing or is processing
very slowly.  In that case, there might be a real problem.

You may check the flow of files by looking at the inspool or raw area as well
as the proman/sta or proman/dst area.  tpm_disp will tell you how long a
tape on the LM in question has been processing.

2)  after_crash executed:
This mail is sent after a system shutdown or reboot, saying that
tpm_submit_job is not running, ....  This is an automatic procedure to
start the farm.  If for any reason this message is repeated many times,
check whether tpm_submit_job really is not running.  If tpm_submit_job
is not running or dereco_prl will not start, execute after_crash
manually in ~/proman/exe.  If tpm_submit_job is already running, you
will get another warning mail message telling you not to try to start the
system using after_crash.
 
3)  Bad Tape Drives and No tapedrive:
You will receive this message repeatedly when you are paged with code
00002-00005.  Follow the proper procedure under Tape Drive Failure.
Remember that there is a time lag before the system recovers after you fix it.
Once you have fixed it properly, you may turn off your pager to avoid being
paged repeatedly - the paging interval is 15 min.  Don't forget to turn it
back on.

4)  Paging modem -- can't get lock file:
This is a mail from operator@d0sgi, usually arriving with Bad Tape Drives or
No tapedrive messages but without paging.  Do the following:
rlogin to d0sgi9 as "operator" with password ......; then
rm /tmp/modempage.lock.  Usually, this is due to a minor hardware problem
in the automatic paging system.  Ask to have the modem turned off and on once.

5)  Bad file:
The inspooler skips the bad files on tape and then sends this mail saying
which file on which tape is bad.  Collect bad files until there are more than
15 or two weeks have passed, then send the list of bad files to dorota@fnalv.
The system won't know anything about the missing ("bad") file except that it
is missing; thus it will try to resubmit it via the jobcontrol mechanism.  So
you have to remove this file by hand from the corresponding tapes_moved.VSN
(not itapes_moved.*) in ~/proman/history unless you want it to be resubmitted.
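
Removing the entry by hand can be done in an editor; a sketch of the same
thing with grep follows.  It assumes tapes_moved.VSN is a text file with one
entry per line containing the file name (the actual format is not described
here), and the function name drop_bad_file is hypothetical.  It keeps a .bak
copy so the edit can be undone.

```shell
# Sketch only: assumes one entry per line in the history file.
drop_bad_file() {
    # $1 = history file (e.g. tapes_moved.VSN), $2 = bad file name
    # Note: every line containing $2 as a substring is removed.
    cp "$1" "$1.bak" || return 1
    # grep exits 1 when nothing is left to print, which is fine here
    grep -v -F -- "$2" "$1.bak" > "$1" || [ $? -eq 1 ]
}
```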

6)  Wrong Project:
This means that a proper project is not defined in the ~/proman/project area
or in the resource file.  A wrong format in the injobs file will produce the
same message.

7)  Tape in use on D0FS:
Log on to D0FS with the same username and password as the farm.
Type "show queue STAGE_IN_USE_ON_UNIX_FARM", then determine the entry
number concerned based on JOBNAME.  Remove the entry from the queue
by executing "delete/queue=entry#".  Log off from D0FS and change
the status of the tape from SUBMITTED to WAITING in the injobs file.
However, there are legitimate cases when the tape is actually being used
by another user.

Sometimes logon to D0FS is blocked by too many users.  Try to log on to D0FS
through a subnode such as D0RSEX, D0TSEX, D0RSUT, etc.

8)  DST transfer to D0FS failed for WNXXXX:
This means that a DST failed to transfer successfully to D0FS.  If this
message has a timestamp anywhere from midnight to 20 minutes past midnight,
don't worry about it - it will be fixed automatically.  If the timestamp
is any other time, go to ~/proman/exe and execute the script pick_up_all
(pick_up_trans will do the same thing but only for the node you are on).
The system automatically picks up failed tapes twice a day (at 5:00 and 20:00).
Running the script yourself will save you some disk space for the spoolers.

The system will pick up failed transfers if everything is OK except D0FS;
however, if something is wrong, it makes sense to have a look at
~/proman/logdb/19../.../..-../copy_results_log.WNxxxx for comments.
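
The timestamp rule above can be sketched as a test on the time of the mail.
This is only a sketch: it assumes a 24-hour HH:MM or HH:MM:SS time string,
treats 00:20 itself as inside the window (a guess), and the function name
in_midnight_window is hypothetical.

```shell
# Sketch only: succeeds if the time falls between midnight and 20 minutes
# past midnight, i.e. the failure will be fixed automatically.
in_midnight_window() {
    # $1 = time as HH:MM or HH:MM:SS, 24-hour clock (assumed format)
    case "$1" in
        00:[01][0-9]*|00:20*) return 0 ;;
        *)                    return 1 ;;
    esac
}
```

For example, in_midnight_window 00:12 succeeds, so nothing needs to be done;
in_midnight_window 14:30 fails, so go to ~/proman/exe and run pick_up_all.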

9)  Outtape ABEND:
This means that an output tape has failed on the mount.  If there were no
files written on it, then its status should be changed from USED to READY
in the blank file.  The system does this automatically, so you should not
have to worry about it.

10)  Intape ABEND:
This means that an input tape has failed on the mount.  The tape needs to
be resubmitted.  To do this, change the status from SUBMITTED to WAITING
in the injobs file, but check that it hasn't been resubmitted already, i.e.
it is not in the queue already.  In theory, the system will put such a tape
back 48 hours after the prior submission.  You can check to see if the tape
has already been submitted by using tpm_disp.  So if tape WM6000 failed
and I wanted to see if it had been resubmitted already, I would do:
               tpm_disp 100 | grep WM6000
The 100 after the tpm_disp tells it to display the last 100 jobs.  You can
pick that number or leave it out - the default is 40.
Make sure to note the tape drive if there are a lot of ABEND messages.
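
The status change in the injobs file can be done in an editor; a sketch with
awk follows.  It assumes one line per tape in which the VSN appears as a
separate word together with the status (the real injobs format is not
described here), and the function name mark_waiting is hypothetical.  It
prints the edited file on stdout instead of changing it in place, so the
result can be inspected first.

```shell
# Sketch only: assumes one line per tape, with the VSN as its own word.
mark_waiting() {
    # $1 = injobs file, $2 = tape VSN; the edited file goes to stdout
    awk -v vsn="$2" '{
        hit = 0
        for (i = 1; i <= NF; i++) if ($i == vsn) hit = 1
        if (hit) sub(/SUBMITTED/, "WAITING")   # only on the matching line
        print
    }' "$1"
}
```

For example, mark_waiting injobs WM6000 > injobs.new, then compare the two
files before replacing the original.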

11)  RECO hung:
This means that d0reco.x has been hung for more than 30 seconds on a node xxxx.
rlogin to the node in question and run the command ps -ef | grep dzero
You will see d0reco.x running and a cpu time next to it.  This time
should increase within 30 seconds, i.e., repeat the command and see if the
cpu time for d0reco.x increases.  If it is really hung, then you need to
kill the following: inreader.x, outwriter.x, and d0reco.x.
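
The "repeat the command and compare" check can be sketched as follows.  This
is only a sketch: it uses the POSIX form ps -o time= -p PID rather than the
ps -ef | grep dzero shown above, which may not be available on every node,
and the function names cpu_time_of and is_hung are hypothetical.

```shell
# Sketch only: a process counts as hung if its cpu time does not change
# between two samples.
cpu_time_of() {
    # prints the accumulated cpu time of process $1, or nothing if it is gone
    ps -o time= -p "$1" 2>/dev/null | tr -d ' '
}

is_hung() {
    # $1 = pid of d0reco.x, $2 = seconds between samples (default 30)
    before=$(cpu_time_of "$1")
    sleep "${2:-30}"
    after=$(cpu_time_of "$1")
    # a process that has already exited is not reported as hung
    [ -n "$before" ] && [ "$before" = "$after" ]
}
```

Take the pid from the ps listing and run is_hung with it; only if it succeeds
go on to kill inreader.x, outwriter.x, and d0reco.x.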

12)  Multiple RECOs running (on node fnsfxxx):
rlogin to the node in question and kill all d0reco.x, inreader.x, and
outwriter.x processes you see.

13)  Rsh got stuck in .......killed:
This just means that some remote command got stuck.  Don't worry about
this message unless it is repeated a lot.  In that case, there is a real
network problem.  If you cannot rlogin to the work nodes, call the operator
to check that every work node is on.

14)  Server node fnsfd_0 down - investigate:
This message most likely means that the node had a brief network glitch.
See if the sta files are growing (ls -l proman/sta in the relevant spooling
area).  If they are still growing, then there is nothing wrong.  Again, if
this message is repeated many times, there might be a real network problem.


15)  Check wrkshell on fnsfxxx:
This means that there is more than one wrkshell running on fnsfxxx.
Normally, the system sends this message as a notice after it has fixed the
problem.  Otherwise, go to the node in question, run
ps -ef | grep dzero and kill the daughter processes if any exist.

16) No action necessary:

No files in raw
WNxxxx returned to blanks
dbl3 server (CAL, VTX, DBM, FDC, MUO, CDC, SAM, RSM) failures
Bad Zebra construct
Job control failed
..................

IV. Miscellaneous

1)  Stale (hung for a long time) or Nothing (total = 0) in the proman/sta area:

Follow the steps described in the tpm user manual.  Sometimes you need to
kill the inspooler and remove files in $PRODAT/$FMLMODE.  You have to solve
problems case by case.  Be sure you know what you are doing before using
kill -9 and rm.

2)  Too many files in proman/sta area:

Check the disk space for the spool0x in question using df, and see
which area is occupying the most disk space using du.  Check the oldest
timestamp in the proman/sta area or the start time for the VSN concerned.
Decide whether the output tape drive for the LM is good or bad.  If
it is bad, kill the outspooler, set the tape drive to UNAVAILABLE in the
drives.fnsfx file, and follow the procedure for tape drive failure.
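
Counting the files and finding the oldest one in proman/sta can be sketched
as below.  This is only a sketch; the function name sta_summary is
hypothetical, and it relies on file modification times and on file names
without newlines.

```shell
# Sketch only: prints how many files are in a proman/sta directory and
# which one has the oldest modification time.
sta_summary() {
    # $1 = a proman/sta directory
    count=$(ls -1 "$1" | wc -l | tr -d ' ')
    oldest=$(ls -1tr "$1" | head -n 1)   # -t newest first, -r reverses it
    echo "files=$count oldest=${oldest:-none}"
}
```

Run it as sta_summary proman/sta from the spooling area in question, then
look at the oldest file with ls -l to read its timestamp.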

3)  How to force the output tape copying:

If there is not enough disk space, or there are too many dst files in the
proman/sta or proman/dst area, you may force the output files to be
copied by doing the following:
    go to $PRODAT/$FMLMODE in question, then
    echo "" > proman/sta/endit
This will cause the output tape to be finished and the dst files to be
copied.
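
The two commands above can be wrapped so that the endit file is only created
when the spooling area really has a proman/sta directory.  This is a sketch
only; the function name force_output_copy is hypothetical.

```shell
# Sketch only: same effect as the cd + echo shown above, with a check
# that the directory exists first.
force_output_copy() {
    # $1 = spooling area, i.e. the $PRODAT/$FMLMODE directory in question
    [ -d "$1/proman/sta" ] || { echo "no proman/sta under $1" >&2; return 1; }
    echo "" > "$1/proman/sta/endit"
}
```

Call it as force_output_copy "$PRODAT/$FMLMODE" for the LM in question.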

*  In any case, looking at the various log files in addition to the mail
   messages will give you useful hints for troubleshooting.