From:   D0ISU1::SYJUN        5-MAR-1994 19:58:34.33
To:     FNALV::SCHELLMAN
CC:
Subj:   D0_UNIX_FARM

I. Tape Drive

1) General

   Each physical machine (three logical machines, LM) has seven tape
   drives: three for the three inspoolers, three for the three
   outspoolers, and one spare. Inspoolers can share tape drives when
   drives are short, but three tape drives should always be assigned to
   the three outspoolers.

2) Tape Drive Failure

   A tape drive will fail after several months of use. The usual symptom
   of a bad tape drive is three or more mount failures in a row (intape
   or outtape ABEND with error counts). After 5 accumulated failures with
   actual error counts, you will be paged from the farm. Look at
   drives.fnsfx in the ~/proman/resource area (the third column is the
   actual error count and the fourth is the accumulated error count).

   When this happens, first call the operators to have the drive cleaned.
   After it has been cleaned, try the drive again (set the STATUS to
   READY and the actual error count to 0). If it still fails after a
   while, do the following steps:

   0) Combine the inspoolers properly, if necessary, in the resource file
      in the ~/proman/resource area, based on inspool_list_#.fnsfx in the
      same area.
      - Skip this step for one bad tape drive.
      - See the example of how to combine inspoolers.

   1) Execute

         cps_umaint tape sgi82 broken noswap [ < comment ]

      - Type why you believe the device is broken, then ^D (end of file)
        when you are done; or edit a comment file first and use the
        [ < comment ] option.
      This updates the tape drive state to broken for cps_tape [-lt] and
      sends mail to farm-admin@fnsg01.fnal.gov and farm-user-fnsfe@fnsg01.

   2) Call the operator to log the problem.
      (Our production names: sgid0sfe and sgid0sfd.)

   * The person in farm-admin in charge of bad tape replacement/repair is
     fagan@large.fnal.gov. Do not contact him directly.
   * Whenever you send mail to farm-admin@fnsg01, cc
     farm-users-d0@fnsg01.
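The error-count check described under Tape Drive Failure could be sketched roughly as follows. This is only an illustration under stated assumptions: the memo says that the third and fourth columns of drives.fnsfx hold the actual and accumulated error counts, but the rest of the layout (whitespace-separated fields, drive name in the first column, a status in the second) is assumed here, as are the sample rows and the drive names sgi80 and sgi81. The limits of 3 and 5 are one reading of "three or more in a row" and "5 accumulated failures".

```python
def flag_suspect_drives(text, actual_limit=3, accumulated_limit=5):
    """Return the names of drives whose error counts reach either limit.

    Assumed layout (hypothetical except for columns 3 and 4):
        name  status  actual_errors  accumulated_errors
    """
    suspects = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comment lines, and rows without four columns.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        try:
            actual = int(fields[2])       # third column: actual error count
            accumulated = int(fields[3])  # fourth column: accumulated errors
        except ValueError:
            continue  # header line or malformed row
        if actual >= actual_limit or accumulated >= accumulated_limit:
            suspects.append(fields[0])
    return suspects


# Hypothetical sample contents; the real file may look different.
sample = """\
sgi80 READY 0 1
sgi81 READY 3 4
sgi82 READY 1 5
"""

print(flag_suspect_drives(sample))
```

A drive reported by such a check is only a candidate; the cleaning and retry procedure above still decides whether it is declared broken.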
   Other symptoms may be a hung inspooler or outspooler, and too many
   files in the proman/sta area (this usually happens when a tape drive
   has changed to single density mode).

3) How to combine inspoolers in 3 LM:

   a) 1 bad tape drive  - No action is necessary.
   b) 2 bad tape drives - Two inspoolers have to share one tape drive.
   c) 3 bad tape drives - Set one LM to UNAVAILABLE.
   d) 4 bad tape drives - Apply cases b) and c) simultaneously.
   e) 5 bad tape drives - Set two LMs to UNAVAILABLE.

   For b)-d), edit or use the proper inspool_list_#.fnsfe file and edit
   the proper inspool # in the resource file in ~/proman/resource.
   For c), you may instead combine the three inspoolers together.
   However, this is not recommended because it lowers system efficiency.

II. Automatic Paging System

   In critical situations the system pages you automatically so that
   problems can be fixed without delay. You may add paging codes if
   necessary. The following is the current setup:

1) AUTOMATIC PAGING CODE

   -----------------------------------------------------------------------------
   Code    Corresponding Mail Message
   -----------------------------------------------------------------------------
   00001   Disk space; fnsfg:/usr/people/dzero %use > 95%
   00002   No tapedrive; All drives are busy; fatal error submitting the spooler
   00003   Bad tapedrives: Error counts exceeded
   00004   All drives are busy; fatal error submitting the spooler
   00005   All drives are busy; fatal error submitting the output spooler
   00006   All drives are busy; fatal error submitting the input spooler
   00007   No available output tape; no blank tape for output tapes
   00008   No current project defined
   00009   Disk space; /dev/dsk/lvx %use > 89%
   -----------------------------------------------------------------------------

2) What to do

   00001 and 00009  : This means that the source and log file disk/dzero
                      is full or nearly full. Use the df and du commands
                      to check free disk space and disk usage,
                      respectively.
                      This is the most dangerous problem; you must stop
                      the system completely and immediately, and start
                      investigating.
   00002 - 00004    : Check resource or drives.fnsfx in ~/proman/resource.
   00003,00005,00006: Consult Tape Drive Failure.
   00007            : No action is necessary. Just call the operator and
                      check the available output tapes.
   00008            : Check the resource file and the project area.

III. Mail

   During the shift you will receive a lot of mail, either from the
   system to unix_proman or from products and farm-admin to farm-user.
   Mail to unix_proman is the subject of this section. The other mail is
   more general, at the level of the whole FNAL SGI farm, and includes
   reports from farm users, shutdown schedules, farm meetings, etc.

   Most of the time, the various mail messages from the farm are enough
   to give you the diagnostics or symptoms of a problem, as well as the
   problem itself. There are mainly three categories of mail: warning,
   alarm, and critical messages. Most of them are well explained in the
   TPM user manual. Here are some practical ways for a monitor to handle
   the various mails.

1) P-server(s):

   This is just a warning message. Ignore it unless it is repeated many
   times for one fnsfx_# that is processing nothing or processing very
   slowly; in that case there may be a real problem. You can check the
   flow of files by looking at the inspool or raw area as well as the
   proman/sta or proman/dst area. tpm_disp will tell you how long a tape
   on the LM in question has been processing.

2) after_crash executed:

   This mail is sent after a system shutdown or reboot, saying that
   tpm_submit_job is not running, .... It is part of the automatic
   procedure that starts the farm. If for any reason this message is
   repeated many times, check whether tpm_submit_job really is not
   running. If tpm_submit_job is not running, or dereco_prl does not
   start, execute after_crash manually in ~/proman/exe. If
   tpm_submit_job is already running, you will get another warning mail
   telling you not to try to start the system with after_crash.
3) Bad Tape Drives and No tapedrive:

   You will receive this message repeatedly when you are paged with code
   00002-00005. Follow the proper procedure for tape drive failure.
   Remember that the system takes some time to recover after you fix the
   problem. Once you have fixed it properly, you may turn off your pager
   to avoid being paged repeatedly (the paging interval is 15 min).
   Don't forget to turn it back on.

4) Paging modem -- can't get lock file:

   This mail comes from operator@d0sgi, usually together with Bad Tape
   Drives or No tapedrive messages that arrive without a page. Do the
   following: rlogin to d0sgi9 as "operator" with password ......; then
   rm /tmp/modempage.lock. Usually this is due to a minor hardware
   problem in the automatic paging system. Ask to have the modem turned
   off and on once.

5) Bad file:

   The inspooler skips bad files on a tape and then sends this mail
   saying which file on which tape is bad. Collect the bad files until
   there are more than 15 of them or two weeks have passed, then send a
   list of the bad files to dorota@fnalv. The system knows nothing about
   the missing ("bad") file except that it is missing, so it will try to
   resubmit it through the jobcontrol mechanism. You therefore have to
   remove the file by hand from the corresponding tapes_moved.VSN (not
   itapes_moved.*) in ~/proman/history, unless you want it to be
   resubmitted again.

6) Wrong Project:

   This means that a proper project is not defined in the
   ~/proman/project area or in the resource file. A wrong format in the
   injobs file produces the same message.

7) Tape in use on D0FS:

   Log on to D0FS with the same username and password as on the farm.
   Type "show queue STAGE_IN_USE_ON_UNIX_FARM", then determine the entry
   number concerned based on JOBNAME. Remove the entry from the queue by
   executing "delete/queue=entry#". Log off from D0FS and change the
   status of the tape from SUBMITTED to WAITING in the injobs file.
   Note, however, that there are legitimate cases in which the tape is
   actually in use by another user. Sometimes logging on to D0FS is
   blocked because there are too many users.
   In that case, try to log on to D0FS through a subnode such as D0RSEX,
   D0TSEX, D0RSUT, etc.

8) DST transfer to D0FS failed for WNXXXX:

   This means that a DST failed to transfer successfully to D0FS. If the
   message has a timestamp anywhere from midnight to 20 minutes past
   midnight, don't worry about it; it will be fixed automatically. If
   the timestamp is any other time, go to ~/proman/exe and execute the
   script pick_up_all (pick_up_trans does the same thing, but only for
   the specific node you are on). The system automatically picks up
   failed tapes twice a day (at 5:00 and 20:00), but doing it by hand
   will save you some disk space for the spoolers. The system will pick
   up failed transfers if everything is OK except D0FS; however, if
   something else is wrong, it makes sense to look at
   ~/proman/logdb/19../.../..-../copy_results_log.WNxxxx for comments.

9) Outtape ABEND:

   This means that an output tape has failed on the mount. If no files
   were written on it, its status should be changed from USED to READY
   in the blank file. The system does this automatically, so you should
   not have to worry about it.

10) Intape ABEND:

   This means that an input tape has failed on the mount. The tape needs
   to be resubmitted. To do this, change the status from SUBMITTED to
   WAITING in the injobs file, but first check that it has not been
   resubmitted already, i.e. that it is not already in the queue. In
   theory, the system will put such a tape back 48 hours after the prior
   submission. You can check whether the tape has already been submitted
   by using tpm_disp. For example, if tape WM6000 failed and I wanted to
   see whether it had been resubmitted already, I would do:

      tpm_disp 100 | grep WM6000

   The 100 after tpm_disp tells it to display the last 100 jobs. You can
   pick that number or leave it out; the default is 40. If there are a
   lot of ABEND messages, make sure to note which tape drive is
   involved.

11) RECO hung:

   This means that d0reco.x has been hung for more than 30 seconds on a
   node xxxx.
   rlogin to the node in question and run the command

      ps -ef | grep dzero

   You will see d0reco.x running, with a cpu time next to it. This time
   should increase within 30 seconds, i.e., repeat the command and see
   whether the cpu time for d0reco.x increases. If it is really hung,
   you need to kill the following: inreader.x, outwriter.x, and
   d0reco.x.

12) Multiple RECOs running (on node fnsfxxx):

   rlogin to the node in question and kill all the d0reco.x, inreader.x,
   and outwriter.x processes you see.

13) Rsh got stuck in .......killed:

   This just means that some remote command got stuck. Don't worry about
   this message unless it is repeated a lot; if it is, there is a real
   network problem. If you cannot rlogin to each work node, call the
   operator to check that every work node is on.

14) Server node fnsfd_0 down - investigate:

   This message most likely means that the node had a brief network
   glitch. See whether the sta files are growing (ls -l proman/sta in
   the relevant spooling area). If they are still growing, nothing is
   wrong. Again, if this message is repeated many times, there may be a
   real network problem.

15) Check wrkshell on fnsfxxx:

   This means that more than one wrkshell is running on fnsfxxx.
   Normally the system sends this message as a notice after it has fixed
   the problem. Otherwise, go to the node in question, do
   ps -ef | grep dzero, and kill the daughter processes if any exist.

16) No action necessary:

      No files in raw
      WNxxxx returned to blanks
      dbl3 server (CAL, VTX, DBM, FDC, MUO, CDC, SAM, RSM) failures
      Bad Zebra construct
      Job control failed
      ..................

IV. Miscellaneous

1) Stale (hung for a long time) or Nothing (total = 0) in the
   proman/sta area:

   Follow the steps described in the TPM user manual. Sometimes you need
   to kill the inspooler and remove files in $PRODAT/$FMLMODE. You have
   to solve these problems case by case. Be sure you know what you are
   doing before you use kill -9 and rm.
2) Too many files in the proman/sta area:

   Check the disk space for the spool0x in question with df, and use du
   to see which area occupies the most disk space. Check the oldest
   timestamp in the proman/sta area, or the start time for the VSN
   concerned. Decide whether the output tape drive for the LM is good or
   bad. If it is bad, kill the outspooler, set the tape drive to
   UNAVAILABLE in the drives.fnsfx file, and follow the procedure for
   tape drive failure.

3) How to force the output tape copying:

   If there is not enough disk space, or there are too many dst files in
   the proman/sta or proman/dst area, you can force the output files to
   be copied by doing the following: go to the $PRODAT/$FMLMODE in
   question, then

      echo "" > proman/sta/endit

   This causes the output tape to be finished and the dst files to be
   copied.

   * In any case, looking at the various log files in addition to the
     mail messages will give you useful hints for troubleshooting.
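The disk check and the forced copy in item 3 above could be combined in a small helper along the following lines. This is a sketch under assumptions, not part of the farm software: the function names and the 89% limit are hypothetical choices (the limit merely echoes paging code 00009), and the percentage is computed from shutil.disk_usage, which may differ slightly from the %use column that df prints.

```python
import os
import shutil


def percent_used(path):
    """Approximate percentage of the file system holding `path` in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def request_output_copy(area, limit=89.0, used=None):
    """Create proman/sta/endit under `area` if disk use exceeds `limit`.

    `area` stands for the $PRODAT/$FMLMODE directory in question.
    `used` can be passed explicitly (for example, a value read from df);
    otherwise it is measured with percent_used().
    Returns the path of the endit file, or None if nothing was done.
    """
    if used is None:
        used = percent_used(area)
    if used <= limit:
        return None
    endit = os.path.join(area, "proman", "sta", "endit")
    with open(endit, "w") as handle:
        handle.write("\n")  # same content as: echo "" > proman/sta/endit
    return endit
```

A call such as request_output_copy(os.path.expandvars("$PRODAT/$FMLMODE")) would do nothing while disk use stays at or below the limit. Whether forcing a copy is appropriate still depends on the state of the output tape drive, as described in item 2.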