THE FARM SHIFTER'S GUIDE

by Cathy Cretsinger
University of Rochester
V1.0 - 27 June 1994

OVERVIEW

The purpose of this document is to provide guidance for the (non-expert) production-farm shifter on what to do during shift. The basic jobs of the farm shifter are: (1) looking for bad tapes, hardware failures, and other problems; (2) fixing and/or reporting the problems; (3) submitting special job requests when so instructed; and (4) providing weekly reports of farm activity. This document is divided into sections accordingly, dealing with an assortment of events one can expect to see on shift.

In order to keep the focus on "how-to" issues, I have not included a general description of the TPM farm management system. Documentation on that, in varying degrees of detail, exists elsewhere. (See the USR$ROOT:[DZERO.DOCS] area on D0FS.) Shifters are advised to learn about the system. I must emphasize that this is not a UNIX manual, although it does include a few UNIX commands that are essential to monitoring the farm status. Shifters who are not familiar with UNIX should acquire and read manuals on the subject.

Credit goes to Soon Yung Jun, Kirill Denisenko, and Mei Gui for their notes, which I have used extensively in preparing this document. However, responsibility for the content of this document rests with me.

CONTENTS

1. Looking for Problems
   1.1. Routine monitoring
        1.1.1. Display farm status
        1.1.2. Check farm configuration
        1.1.3. Check output files
        1.1.4. Look at injobs file
        1.1.5. Check whether processes are running
        1.1.6. Check tape drive status
        1.1.7. Check disk space
        1.1.8. Check D0FS buffer area
        1.1.9. Monitor CPU activity
   1.2. Email (automatic notification of problems)
        1.2.1. P-server(s)
        1.2.2. Intape ABEND
        1.2.3. Outtape ABEND
        1.2.4. Bad file
        1.2.5. Tape in use on D0FS
        1.2.6. DST transfer to D0FS failed for WNxxxx
        1.2.7. after_crash executed
        1.2.8. No tape drive
        1.2.9. Bad tape drive
        1.2.10. Wrong project
        1.2.11. Server node fnsfX_# down - investigate
        1.2.12. Disk space limit exceeded
        1.2.13. Rsh got stuck in _____; killed
        1.2.14. Check wrkshell on fnsfxxx
        1.2.15. RECO hung
        1.2.16. Multiple RECOs running on node fnsfxxx
        1.2.17. Many others
   1.3. Log files
        1.3.1. logdb
        1.3.2. pdbkd
        1.3.3. wrkdb
   1.4. The pager
2. Problem fixing
   2.1. Tape drive fails
   2.2. Combining inspoolers
   2.3. Releasing tapes from D0FS
   2.4. Marking tapes BAD
   2.5. Injobs has nothing WAITING
   2.6. DST does not get to D0FS
   2.7. Killing/Restarting a parallel server
   2.8. Multiple jobs INQUEUED on a VM
   2.9. Restarting D0RECO on a node
   2.10. Problem runs
   2.11. Disk space shortage on /dzero
   2.12. Disk space shortage in spooling area
3. Submitting special jobs
   3.1. Get tapes_moved file
   3.2. Add the request to the injobs file
   3.3. Edit the resource file
   3.4. What happens next
   3.5. Tracing the output
4. Making weekly reports
   4.1. Number of events processed (graph)
   4.2. The written report
        4.2.1. Dates and shifters' names
        4.2.2. Weekly summary
        4.2.3. Tape drives
        4.2.4. Special projects
        4.2.5. Current node configuration
        4.2.6. Number of tapes processed
        4.2.7. Last run number processed
        4.2.8. Tapes in injobs
        4.2.9. Tape ABENDS, bad files
        4.2.10. Miscellaneous
   4.3. Changing log directories

NOTATION AND FONT CONVENTIONS

I use a variety of fonts to convey different intentions. Filenames, directories, processes, etc. on the farm are set in Courier; commands you should type are set in bold Courier; machine names, account names, etc. are set in Helvetica, just to make life more interesting. (These fonts are not visible in the text version.)
Throughout, the notation VM is used to denote a "virtual machine" (a.k.a. "logical machine" - a parallel server process running on an I/O node on the farm); the generic name fnsfX_# is used to denote a specific virtual machine. Capital X is always to be replaced by a (usually lowercase) letter when typing commands; small x, N, and # are to be replaced by numerals. For example, fnsfe_0 denotes virtual machine 0 on the I/O node fnsfe.

1) LOOKING FOR PROBLEMS

1.1. Routine monitoring

Log into fnsfe, fnsfd, and fnsff (the farm I/O nodes assigned to D0) as dzero and run the diagnostics as indicated. Also, log into the DZERO account on D0FS and read the mail, which will contain special requests, problem reports, inquiries, etc. You should archive the mail in this account, as detailed in section 1.2.

1.1.1. Display farm status

This is a nice way to start because it gives you an overview of what each virtual machine has been doing. You should do it at least twice per day to make sure that tapes are moving through the system. The command is:

% tpm_disp   {if you have a 132-character terminal, you can use tpm_wide, which shows finish as well as start times}
ID    NODE    PROJECT       INVOL  OUTVOL DATE            STATUS
________________________________________________________________
12492 fnsfe_0 RECO_FULL_V12 WM5976 PWNODE Mar 16 16:31:10 ENDED
12493 fnsfe_2 RECO_FULL_V12 WM5948 PWNODE Mar 16 17:37:14 ENDED
12494 fnsfe_1 RECO_FULL_V12 WM5977 PWNODE Mar 16 21:34:46 ENDED
12495 fnsfe_0 RECO_FULL_V12 WM5978 PWNODE Mar 16 23:28:53 ENDED
12496 fnsfe_2 RECO_FULL_V12 WM5979 PWNODE Mar 17 02:13:19 ENDED
12497 fnsfe_1 RECO_FULL_V12 WM5980 PWNODE Mar 17 05:23:30 ENDED
12498 fnsfe_0 RECO_FULL_V12 WM5982 PWNODE Mar 17 06:12:49 ENDED
12499 fnsfe_2 RECO_FULL_V12 WM5985 PWNODE Mar 17 08:54:01 ENDED
12500 fnsfe_1 RECO_FULL_V12 WM5986 PWNODE Mar 17 08:57:49 ENDED
12501 fnsfe_0 RECO_FULL_V12 WM5987 PWNODE Mar 17 12:12:40 ENDED
12502 fnsfe_1 RECO_FULL_V12 WM5988 PWNODE Mar 17 13:55:59 ENDED
12503 fnsfe_2 RECO_FULL_V12 WM5989 PWNODE Mar 17 15:24:46 ENDED
12504 fnsfe_0 RECO_FULL_V12 WM5990 PWNODE Mar 17 16:51:20 ENDED
12505 fnsfe_1 RECO_FULL_V12 WM5991 PWNODE Mar 17 18:37:03 ENDED
12506 fnsfe_2 RECO_FULL_V12 WM5992 PWNODE Mar 17 20:41:01 ENDED
12507 fnsfe_2 RECO_FULL_V12 WM5999 PWNODE Mar 17 22:11:16 ENDED
12508 fnsfe_1 RECO_FULL_V12 WM5942 PWNODE Mar 18 16:49:51 ENDED
12509 fnsfe_2 RECO_FULL_V12 WM5982 PWNODE Mar 19 16:39:15 ENDED
12510 fnsfe_1 RECO_FULL_V12 WM5986 PWNODE Mar 19 16:58:26 FAILED_ON_INSP
12511 fnsfe_1 RECO_FULL_V12 WM5942 PWNODE Mar 20 16:40:54 INQUEUED
12512 fnsfe_2 RECO_FULL_V12 WM5986 PWNODE Mar 20 16:57:16 INQUEUED
%

This command shows you the status of the farm. (Note that you can use tpm_disp # to display more lines - # specifies how many - if you want to look at less recent jobs.) The ID is a sequential number assigned to each job. NODE shows which virtual machine (VM) the job ran/is running on. PROJECT tells what processing was/is done on the file. INVOL is the input tape number; OUTVOL is always PWNODE. DATE is obvious. Possible STATUS entries for parallel servers (the usual D0 configuration) are:

INQUEUED - Job has been submitted to a VM with multiple worker nodes.

ENDED - Inspooling finished; the status of processing and output has not been checked.

FAILED_ON_INSP - Input spooler failed. Check its log file (see section 1.3.1) for more information. For standard RECO (WMxxxx tapes with ALL stream data), the tape will be resubmitted automatically after 48 hours.
It is usually a good idea to wait for this, since the procedure that resubmits the tape also updates the corresponding tapes_moved file to remove any files that were processed successfully before the inspooler failed; this cuts down on reprocessing of data. For special jobs, you must update the injobs file manually (edit the file, changing the status from SUBMITTED to WAITING for that tape).

INTAPE_IN_USE - Some other job is holding the input tape, which causes the inspooler to fail. You should investigate and, if appropriate, release the tape (see section 2.3).

There are other status flags that will appear in sequential or stand-alone modes, which are normally not used. For more information on those, refer to Kirill Denisenko's "TPM Production Manager Operator's Manual and User's Guide."

One thing to look for here is abandoned jobs in the INQUEUED state. You can tell they have been abandoned because the VM has gone on to the next tape (there is another tape INQUEUED to that VM at a later time) but the status for the older tape never changed to any of the end conditions. If there are abandoned jobs, this may be a sign of a problem with the RECO code, or simply a side effect of errors, crashes, etc. See section 2.8 for more on this issue.

1.1.2. Check farm configuration

The relevant files are kept in the ~/proman/resources directory on fnsfe. (To get there, you can use the command res on any of the I/O nodes.) These files show which VMs are available, which worker nodes are assigned to each VM, and how the inspoolers are distributed. Look at them when you want to check on how the farm is presently configured. Here are a few useful files:

- The main resource file:

% re   (or tpm_rsrc)   {types the resource file}
fnsfe_0 FARMLET RECO_FULL_V12 4
fnsfe_1 FARMLET RECO_FULL_V12 5
fnsfe_2 FARMLET RECO_FULL_V12 6
fnsfd_0 UNAVAILABLE RECO_FULL_V12 1
fnsfd_1 UNAVAILABLE RECO_FULL_V12 3
fnsfd_2 UNAVAILABLE RECO_FULL_V12 2
%

This file (~/proman/resources/resource) shows what VMs are defined, the status (FARMLET if the VM is able to process data, UNAVAILABLE otherwise), what project each is presently assigned to run, and which inspooler process the VM is using (designated by the number at the end of each line). You will need to edit the resource file in order to: change the status of VMs; change the project assigned to one or more VMs; combine inspoolers. More on these operations (e.g., when to do them) appears in various sections below.

Note: When you edit the control files (especially the injobs file), it is best to copy the file to a dummy file, edit the dummy version, then copy or rename the dummy to the correct file name when you are satisfied with the changes. This is recommended because UNIX does not save old versions, which makes it very easy to overwrite a good file with junk.
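A minimal sketch of this pattern for the resource file (the dummy file name and the diff check are my own suggestions, not part of any required procedure; vi is just one choice of editor):

% res                          {go to ~/proman/resources}
% cp resource resource.tmp
% vi resource.tmp              {make your changes in the dummy copy}
% diff resource resource.tmp   {confirm that only the intended lines changed}
% mv resource.tmp resource

The same pattern applies to injobs and the other control files.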
- Inspooler definition files: The number representing the inspooler in the resource file refers to a file in ~/proman/resources called inspool_list_N.fnsfX, which contains the name(s) of the spooling disk(s) to be used by the VM(s) to which the number N is assigned in the resource file. These spooling disks are not assigned to any particular tape drives: when an inspooler is ready for a new tape, it grabs any available drive. Note: if tape drives are broken, you may have more than one VM assigned to the same inspooler. This is acceptable, provided that both VMs are running the same job. (See section 2.2 for more on assigning inspoolers.)

- Worker node assignment files: Detailed information about which worker nodes are in use is contained in a number of files called resource.fnsfX_#. (These are also in the resource directory.) Each such file contains a list of worker nodes assigned to that VM and their status. If you should ever have to disable sending events to a broken node, this is the file you would edit to do that, changing the node's status from READY to UNAVAILABLE.

- Tape drive files: These files identify the drives assigned to each physical machine (fnsfe, fnsfd, or fnsff). They are described in section 1.1.6.

- Injobs file: This file contains the list of input tapes to be processed. It is discussed in more detail in section 1.1.4.

- Blank tape file: The file blanks lists the existing output tapes and indicates whether they are USED or READY to be written on. Tapes are added to this file by the operators in the Feynman Computing Center as they are initialized; shifters rarely need to edit this file.

1.1.3. Check output files

You should check the output on each (working) virtual machine every two to three hours; this is the best way to make sure that data is flowing through the farm. Log in to each I/O node and use the following commands:

% sp0   (or sp1 or sp2)   {goes to the VM's spool directory - for use on fnsfe}
(Use spd0, spd1, spd2 on fnsfd, or spf0, spf1, spf2 on fnsff.)
% lps   {directory of output files; shorthand for ls -l proman/sta/}
total 1326911
-rw-r--r-- 1 dzero e740       120 Mar 22 13:01 12521_010_076089_10.ous
-rw-r--r-- 1 dzero e740       144 Mar 22 13:09 12524_003_076090_21.ous
-rw-r--r-- 1 dzero e740 275970240 Mar 22 13:09 12524_003_076090_21.sta
-rw-r--r-- 1 dzero e740 111252960 Mar 22 13:20 12524_004_076090_22.sta
-rw-r--r-- 1 dzero e740 292153680 Mar 22 13:01 ALL_076089_10.X_STA01REU1210_ALL00_NONEX00_4032213
%

This shows you the status of the output STA files. If you type lps again after a few minutes, you should see the file size increasing on all *.sta files that are not paired with *.ous files. This tells you that the jobs are running and producing output. The *.ios and *.ous files you will sometimes see paired with the *.sta files are normal; their functions are described in Kirill Denisenko's "TPM Production Manager Operator's Manual and User's Guide."

There are a number of things that can cause the file sizes to freeze for a while, most of which are perfectly normal and require no action on your part except continued vigilance. What follows are some things you should notice, with some hints on how to tell if there is a problem. This should not be regarded as an exhaustive list, merely some starting points.

1.) If the response to lps does not change with time, look for *.done files paired with some of the files. When this appears, the parallel server process (d0reco_prl_fnsfX_#) probably needs to be restarted. The system is designed to do this automatically and send email confirming that it happened, so the first thing to do is to leave things alone and check later. If things hang up for a long time (hours), you might consider killing the parallel server yourself, especially if you have not seen any mail messages indicating that the process was restarted. Instructions for this are in section 2.7.

2.) If you see "total 0" in response to lps, check that the inspool_fnsfX_N process is running. (Remember that N here corresponds to the number assigned to each VM in the resource file, not to the VM number; see section 1.1.2.)
If the inspooler is not running, check for mail messages complaining that no tape drives are available: you may need to combine inspoolers due to broken drives. (Tape drive diagnostics are covered in section 1.1.6; what to do with a broken drive is covered in sections 2.1 and 2.2.) If enough drives are available, make sure that the project assigned to the VM in the resource file has tapes waiting in injobs. If all these things are correct and the inspooler does not start for over half an hour, call an expert.

If the inspooler is running, check the subdirectories inspool/ and raw/ to see whether new raw data files are being read in. If they are not, check that the proper input tape has been mounted (see section 1.1.6).

You are likely to see "total 0" if the time stamp on the last INQUEUED job on the VM in question is quite recent; this is because a full file has to be read in before any processing can start. If the "total 0" has remained for a long time and all processes are running for that VM (see section 1.1.5), you should suspect RECO problems. Check on the worker nodes (rlogin to one or two of them and check that the appropriate processes are running, as described in section 1.1.5). If they appear to be okay, check the log files in the wrkdb area (see section 1.3.3) for RECO error messages.

3.) Occasionally, you may see .problem files in the area, as in the following example:

sfe/spool00/dzero/fnsfe_0% lps
total 2517150
-rw-r--r-- 1 dzero e740 257755680 Mar 28 03:49 ALL_076151_28.X_STA01REU1210_ALL00_NONEX00_4032804.problem
-rw-r--r-- 1 dzero e740 257755680 Mar 28 05:35 ALL_076151_28.X_STA01REU1210_ALL00_NONEX00_4032805.problem
-rw-r--r-- 1 dzero e740 257722920 Mar 28 06:49 ALL_076151_28.X_STA01REU1210_ALL00_NONEX00_4032806.problem
-rw-r--r-- 1 dzero e740 257755680 Mar 28 09:20 ALL_076151_28.X_STA01REU1210_ALL00_NONEX00_4032809.problem
-rw-r--r-- 1 dzero e740 257788440 Mar 28 12:19 ALL_076151_28.X_STA01REU1210_ALL00_NONEX00_4032812.problem

These are leftovers from an old attempt to diagnose ZEBRA problems in the files being processed. They do not affect the quality of the output, nor do they mean that a partition (file) has been lost, but they can cause the VM to hang up. If this happens, you should delete all files with names ending in .problem; you should see processing resume fairly quickly after that.

1.1.4. Look at injobs file

You should do this every day and make a note of the last tape waiting (see the sketch at the end of this section); this is how you can make sure that tapes are being added. The injobs file is ~/proman/resources/injobs.
A shorthand command has been defined:

% in   {shows tapes with status WAITING; equivalent to more injobs | grep WAITING}
WM5342 RECO_L0V_V12 WAITING
VGL063 D0GEANT_SGI_V11SHL WAITING 4
VG0661 D0GEANT_SGI_V11 WAITING 8
WM3233 RECO_FULL_V11 WAITING 1
WM3238 RECO_FULL_V11 WAITING 1
WM3420 RECO_FULL_V11 WAITING 1
%

You can also type the file, by first going to the resource directory (res), then typing:

% more injobs   {types the complete file}
[many lines deleted]
WM6010 EXPRESS_SGI PENDING
WM6011 RECO_FULL_V12 SUBMITTED
WM6013 RECO_FULL_V12 SUBMITTED
WM6014 RECO_FULL_V12 SUBMITTED
WM6015 RECO_FULL_V12 SUBMITTED
WM6016 RECO_FULL_V12 SUBMITTED
WM6017 RECO_FULL_V12 WAITING
WM6019 RECO_FULL_V12 WAITING
WM6020 RECO_FULL_V12 WAITING
WM6024 RECO_FULL_V12 WAITING
WM6025 RECO_FULL_V12 WAITING
WM6026 EXPRESS_SGI PENDING
WM6027 RECO_FULL_V12 WAITING
WM6028 RECO_FULL_V12 WAITING
WM6037 RECO_FULL_V12 WAITING
[more lines deleted]
%

You can also use more injobs | grep <tape> to check the status of a particular tape. As the examples show, each line of injobs has the format <tape> <project> <status> <retries>. The last field shows up only if it is larger than zero; it indicates how many times an inspooler has tried and failed to read the tape. The status choices are SUBMITTED, WAITING, PENDING, and BADTAPE. The first two are common. The job stays in WAITING until it is first in line and a VM is set up to run the job; then it is SUBMITTED. BADTAPE is sometimes set automatically, but you should set it manually (see section 2.4) if you notice a large number of retries (> 3) for some tape. PENDING is set manually (by you) when a tape is good, but for some reason (e.g., problems with the RECO code) the job needs to be held up. When you need to change the status of a tape, do so by editing injobs.

There is an automated process that adds new data tapes to be processed with the current version of RECO to the bottom of the injobs file. Another process removes old tapes from the file, presumably some weeks after they have been successfully processed and their output files properly disposed of and catalogued. The result is that this file is usually quite large. (If you see no tapes WAITING during the run, you should be highly suspicious. See section 2.5.)

Tapes processed for special requests (e.g., Monte Carlo, or older or non-standard versions of RECO) are added to the injobs file manually (by you - see section 3 for instructions on special jobs); WM tapes for regular reconstruction are added automatically. The injobs file is a queue: tapes are put in at the bottom and read off the top. Therefore, you should add tapes to the end of the file, not to the beginning. You should only move tapes to the beginning if OCPB has granted certain runs high-priority status; Lee Lueking will notify you when this happens.

Note: When you edit this file, it is wisest to copy it to a dummy file first and edit the dummy file; copy the dummy to injobs only when you are satisfied that your changes are correct and that you have not deleted any tapes that should remain. (If injobs is overwritten with junk, the amount of work required to reconstruct it is large, so this is important.)
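For the daily check described at the start of this section, here is a minimal sketch using standard grep (the second command is equivalent to filtering the output of the in alias shown above):

% grep -c WAITING injobs          {count how many tapes are WAITING}
% grep WAITING injobs | tail -1   {show the last tape WAITING, to compare with tomorrow}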
1.1.5. Check whether processes are running

When data is not going through a VM, it is useful to check whether the relevant processes are running. The list below shows you one way to check for these processes. Note that you must be logged into the I/O node you want to investigate or you will not see the processes. (For explanations of what they do, see Kirill Denisenko's manual.)

% pf tpm   {on fnsfe}
dzero  5550     1 0   Jun 16 ? 3:46 tpm_submit_job tpm_submit_job
dzero 22309 21019 0 17:21:13 ? 0:00 /usr/local/home/dzero/proman/tpm/check_rsh.csh -f /usr/local/home/dzero/proman/
dzero  7548     1 0 11:38:42 ? 2:20 /usr/local/home/dzero/proman/tpm/d0reco_prl_fnsfe_0 /usr/local/home/dzero/proma
dzero 18861     1 0 17:04:44 ? 0:11 /usr/local/home/dzero/proman/tpm/d0reco_prl_fnsfe_1 /usr/local/home/dzero/proma
dzero 12003     1 0 12:07:41 ? 2:29 /usr/local/home/dzero/proman/tpm/d0reco_prl_fnsfe_2 /usr/local/home/dzero/proma

This shows you that tpm_submit_job, the master control process, is running. You also see the three parallel server processes for the VMs on fnsfe and any active child processes of tpm_submit_job (in this case, check_rsh.csh). If tpm_submit_job is not running, you can expect to receive mail (see section 1.2.7).

% pf prl
dzero 11422 1 0 07:50:32 ? 0:59 /usr/local/home/dzero/proman/tpm/d0reco_prl_fnsfe_1 /usr/local/home/dzero/proma
dzero 24160 1 0 17:07:38 ? 7:51 /usr/local/home/dzero/proman/tpm/d0reco_prl_fnsfe_2 /usr/local/home/dzero/proma
dzero  5198 1 1 03:03:49 ? 2:25 /usr/local/home/dzero/proman/tpm/d0reco_prl_fnsfe_0 /usr/local/home/dzero/proma
%

This shows you that d0reco_prl_fnsfX_# (the parallel server process) is running for each VM currently in use. If you should need to kill any of these processes, you can get the process ID from this display (e.g., for d0reco_prl_fnsfe_1, the ID is 11422).

% pf inspool
dzero 24494 1 80 09:42:16 ? 11:16 /usr/local/home/dzero/proman/exe/inspool_fnsfe_4 -inl WM6127 -dev sgi84 -queid
dzero  6057 1 73 06:56:36 ? 17:19 /usr/local/home/dzero/proman/exe/inspool_fnsfe_6 -inl WM6240 -dev sgi85 -queid
dzero  7787 1 57 11:27:23 ?  3:52 /usr/local/home/dzero/proman/exe/inspool_fnsfe_5 -inl WM6128 -dev sgi80 -queid
%

This shows the inspool processes, which copy files from raw data tapes to the assigned spooling areas. Note that the number following fnsfX (4, 6, or 5 in the above example) does not refer to the VM, but to the inspool area assigned to that VM in the resource file. (See section 1.1.2.)

% pf insrv
dzero 14076  5198 70 03:59:03 ? 55:43 insrv_d0_fnsfe_0
dzero 12406 11422 39 08:00:54 ? 27:01 insrv_d0_fnsfe_1
dzero 12752 24160 55 01:17:52 ? 64:41 insrv_d0_fnsfe_2
%

This shows the insrv process running on each VM; the last part of the name tells you the VM. These processes serve raw events to the worker nodes for reconstruction. You may see multiple insrv processes for one VM; this is not a problem.

% pf outsrv
dzero 12407 11422 80 08:00:54 ? 24:48 outsrv_d0_fnsfe_1
dzero 14077  5198 71 03:59:03 ? 49:03 outsrv_d0_fnsfe_0
dzero 12753 24160 76 01:17:52 ? 65:50 outsrv_d0_fnsfe_2
%

This shows the outsrv process running on each VM. These processes pick up reconstructed events from the worker nodes and add them to the output files on the spooling disk (in the proman/dst and proman/sta subdirectories). You may see multiple outsrv processes for one VM; this is not a problem.

% pf outspool
dzero 20778 1  0 09:12:07 ?  5:43 /usr/local/home/dzero/proman/exe/outspool_fnsfe_2 -inl WN7814 -dev sgi83 -path
dzero  2922 1  0 22:23:14 ? 22:21 /usr/local/home/dzero/proman/exe/outspool_fnsfe_1 -inl WN7813 -dev sgi81 -path
dzero 14235 1 78 19:33:14 ? 25:01 /usr/local/home/dzero/proman/exe/outspool_fnsfe_0 -inl WN7811 -dev sgi82 -path
%

This shows you the outspooler processes, including which tape drive is used by which VM, and what tape is being written.
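A note on pf, which is used throughout this section: it is a local shorthand, and my assumption is that it amounts to ps -ef | grep (check the dzero account's .cshrc if you are curious). If pf is not defined on some node, the long form gives the same result:

% ps -ef | grep prl   {equivalent long form of pf prl}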
If any of the above processes (except tpm_submit_job) are missing, you can restart them by executing the script d0reco_prl.csh. It is a good idea to wait and watch for a while before doing that, since they usually get restarted automatically (usually within about 15 minutes, when the production manager process notices that they are dead).

On a worker node, the key processes to look for (using rlogin to get to the node and pf to look for the processes) are: wrkshell, d0reco.x (there might be a different name for special jobs), inreader.x, and outwriter.x. There may be multiple inreader and/or outwriter processes; this is no problem, but multiple wrkshell or d0reco processes indicate a problem.

1.1.6. Check tape drive status

You should do this at least once per day, and note the results in the production logbook, so that we can see if errors are mounting quickly on some drive. This will help to flag failing drives. Look at the file drives.fnsfX in the directory ~/proman/resources.

% more drives.fnsfe
sgi80 READY 0 27
sgi81 READY 0 40
sgi82 READY 0 45
sgi83 READY 0  6
sgi84 READY 0 13
sgi85 READY 0  6
sgi86 READY 0  1
%

The third column gives the current error count on the drive; the fourth is a cumulative total. If a problem is indicated (by the error count reaching 5 or by frequent ABEND errors on that drive in email messages), follow the tape-drive repair procedures below. Drives that reach 5 errors will be set to UNAVAILABLE in this file automatically. (Incidentally, drives on fnsfd have names beginning with sts rather than sgi, but all the procedures are the same.)

Once per day, copy each drives.fnsfX file to the corresponding sk_drives.fnsfX file. This backup file can be used to replace the drive file if it should be overwritten by blanks.

Current tape drive use can be determined using cps_tape -lt. (You must do this from the I/O node whose drives you are interested in.) The response to this command looks like:

TAPEDRIVE DEVTYPE      STATUS  ALLOC       TAPE   TAPE_STATUS
sgi80     exabyte_8500 working allocated   WM6128 mount pending
sgi81     exabyte_8500 working allocated   WN7813 mounted
sgi82     exabyte_8500 working unallocated WN7811 mounted
sgi83     exabyte_8200 working allocated   WN7814 mounted
sgi84     exabyte_8500 working allocated   WM6127 mounted
sgi85     exabyte_8500 working allocated   WM6240 mounted
sgi86     exabyte_8500 working unallocated WM3420 mounted

If a drive is not working, it will have status broken in this list. One thing to watch for here is a drive that stays in mount pending for over half an hour. If you see this, you should call the operators and ask them to check on it.
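The once-per-day backup copy mentioned above might look like this (a sketch; run it in the resource directory, and repeat for each machine's drive file):

% res
% cp drives.fnsfe sk_drives.fnsfe   {likewise for drives.fnsfd and drives.fnsff}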
1.1.7. Check disk space

The farm will send mail messages when the disk in a spooling area is over 89% full; this is an important warning, since processing will stop if the spooling disk is full. In addition, everything will stop if the user area is over 97% full. So, checking the disk space is useful, especially if something is slow or stopped. You can look at disk usage with the UNIX command df.

% df   {shows disk use}
Filesystem              Type   blocks     use    avail %use Mounted on
/dev/root               efs     31430   21627     9803  69% /
/dev/dsk/lv2            efs   7603592 5043005  2560587  66% /spool02
/dev/dsk/lv1            efs   7603592 5435444  2168148  71% /spool01
/dev/dsk/lv0            efs   7603592 5373084  2230508  71% /spool00
/dev/usr                efs    537327  501613    35714  93% /usr
/dev/dsk/dks0d1s7       efs   1642725  299369  1343356  18% /usr/people
fnsfg:/usr/people/dzero nfs   3800576 3053056   747520  80% /dzero
fnsfg:/dbl3             nfs   7603712 3033600  4570112  40% /dbl3
d0tsar:d0tsar$data1     nfs  15617024 1342464 14274560   9% /d0fs/disks/d0tsar_data1
fnsfg:/proman           nfs   3801600 2466304  1335296  65% /proman

In this example, you see that the spooling areas (/spoolXX in the right-hand column) are well below 89% in use. Note that the only spooling areas shown are those on the I/O node you are logged into. You should also pay attention to the disk mounts, particularly the nfs-mounted disks /dbl3, /proman, and /usr/people/dzero. If these are not mounted on an I/O node, its VMs will be unable to process data. (Likewise, a worker node where these disks are not mounted will be unable to process data.)

If you see that some area is getting quite full, you may want to track down how the space is being used. A useful tool for this task is the UNIX command du.

% du   {shows size of all subdirectories of the current area}
      2 ./fnsf112/proman/inspool
      1 ./fnsf112/proman/raw
      4 ./fnsf112/proman
    408 ./fnsf112/run
    413 ./fnsf112
      2 ./fnsf113/proman/inspool
      1 ./fnsf113/proman/raw
      4 ./fnsf113/proman
    332 ./fnsf113/run
    337 ./fnsf113
[lines deleted]
  68145 ./summaries
1738303 ./histograms
  78579 ./inspool
    724 ./tmp
1497886 ./raw
 707976 ./proman/dst
 649707 ./proman/sta
   3965 ./proman/linktemps
1361649 ./proman
[lines deleted]
4757416 .

This command shows the size of each subdirectory of the current directory. It is useful for tracking down files that have run amok and are eating disk space. The last line is a total size for the current directory and all subdirectories.

1.1.8. Check D0FS buffer area

Once per day, from the DZERO account on D0FS, set default to D0$DATA$BUFFER and do a directory. Look for DST files not paired with RCPs, or RCPs not paired with DSTs. If you find any, do the following:

1) If you have DSTs without RCPs that are over a day or two old, find out what output tapes have the corresponding STAs. Use the timestamp in the DST file name to select the correct subdirectory of ~/proman/pdbkd (see section 1.3.2), then search the STAT_*.RCP files in that subdirectory for the run and part number you need. Then look at the STAT_WNyyyy.RCP files found in the search to determine whether all files (or only some) on the tape are missing their RCPs.

If all files on an output tape lack RCPs, then do cd ~/proman/report on the farm and execute conv_prepare_rcp WNxxxx <directory> ~/proman. Here, <directory> should be replaced by the full pathname of the subdirectory of ~/proman/pdbkd in which the STAT RCP file for the tape you need resides. The last argument specifies a "linktemps" area to hold the RCPs until they are ready to be transferred; always use ~/proman for this. The RCPs will be generated and transferred to D0FS automatically; check later to confirm that the DSTs have moved out of the buffer area.

If only a few files from a tape need RCPs, execute conv_prepare_rcp_single instead, with the arguments as indicated above. The RCPs will appear in the directory ~/proman/linktemps; you should transfer the needed RCPs by hand when they are there (this takes some time).
Transfer the DST RCPs to the buffer area on D0FS, and send the STA RCPs to USR$ROOT:[FMD0.RCPFARMIN] on D0FS. Finally, send both the DST and STA RCPs to the PROD_DB area on D0FS: for data, the directory is USR$ROOT22:[PROD_DB.PROD_DB.SGIDATA]; for Monte Carlo, use USR$ROOT22:[PROD_DB.PROD_DB.SGIDATA.MC]. You should also try this method when you have a full tape for which conv_prepare_rcp failed.

2) If you have RCPs without DSTs, check the log files on the farm in ~/proman/report/log. Again, you will need to know the output tape for the corresponding STA. Then look at the log files. (Use ls *.WNxxxx to find the log files.) In particular, the file copied.WNxxxx is the script for the zftp transfer of the DSTs, which you can use to find out which directory had the DST before the transfer (it's also a nice example of how to do zftp transfers), and copy_log.WNxxxx is the output from the transfer, which you can check for errors. Check in the directory given in copied.WNxxxx to see if the DST is still there. If the file is there, zftp it manually to the buffer on D0FS; if not, delete the RCP from the buffer area. The DST file name must be in lowercase letters or zftp will not work. If the file is only in uppercase, rename it to lowercase before starting zftp.

Note: The above procedures for handling failed transfers were provided by Mark Galli (FNALV::MGALLI). Please direct questions about them to him.

You should pay particular attention to files in D0$DATA$BUFFER with names starting with WG_<code>_, where <code> is one of the following: `T' if the target directory specified in the RCP is an invalid choice; `N' if the file is an RCP whose corresponding DST never got to the buffer; `E' if the file already exists in the target directory; `F' if the RCP format is bad. Codes `D' (for done) and `C' (for copy) are dealt with by others; you may ignore such files. For the codes listed above, investigate the problem identified by the code, and if you solve the problem, rename the file to remove the WG_<code>_ prefix.

If duplicates of a file are produced, you will receive email on D0FS::DZERO. Investigate and reply with an indication of which files should be removed (usually the older version).

1.1.9. Monitor CPU activity

You can create a running display of CPU activity and other information from the farm nodes (I/O and worker nodes) on any terminal that runs X-windows and supports the TCP/IP network protocol. This is not necessary, but it can be helpful. To start the display, log into fnsfe or fnsfd and type:

% setenv DISPLAY <terminal>:0   {<terminal> is the name of your X-windows terminal}
% cps_xpsmon -p <partition>

with <partition> chosen to be sgid0sfX to get a display for I/O node fnsfX. It will take a few minutes for the monitor to pop up in a separate window. Once it does, you can control it from that window. The colors are frightening, but the information can be helpful.

1.2. Email (automatic notification of problems)

You will be notified of various conditions via automatic email from the farm. Responsibility for fixing these problems (or notifying an expert where appropriate) rests with the shifters. The mail messages are sent to unix_proman, a mailing list defined in a file named .mailrc and kept in the home directory on fnsfe. You can update the list by editing that file. Please leave dzero@d0fs on the list at all times. You should have your name (at an account you log into regularly) on the list while you are on shift; you may take it off when you are not on shift. In order to minimize confusion, do not add or delete names other than your own unless the person in question has asked you to.
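For reference, a mailing-list entry in .mailrc has the form sketched below; the shifter address here is hypothetical, and the real file will carry more names:

alias unix_proman dzero@d0fs yourname@yournode.fnal.gov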
In order to minimize confusion, do not add or delete names other than your own unless the person in question has asked you to. You should archive these and other messages in the MAIL folders of the DZERO account on D0FS. Here is the current list of folders, with an explanation of what goes in them: FARM_YYMMDD - Messages generated by the farm, or other messages directly related to farm status. The folder dated MMDD runs from Monday (MM/DD at 0:00) through Sunday (MM/DD+6 at 23:59). Please use the sender's time stamp, not the recipient's, to determine which week messages belong to. SPECIAL - Any message related to a special project: requests, questions, status messages you send out or receive, etc. 16 TAPE_PROBLEMS - Messages related to investigation of problems involving unreadable, overwritten, or useless input or output tapes. HINTS - Messages explaining how to do things on the farm, or copies of messages requesting hints on handling a problem. If a hint is of general interest, you may extract it to a file in the [.DOCS] subdirectory of the DOFS::DZERO account. (When doing this, be sure the file name is descriptive and contains the date of the message.) MINUTES - Any meeting minutes that get sent to DOFS::DZERO. DOUBLE_ENTRIES - Messages identifying files with two entries in the FATMEN catalog. You should respond with some indication of which ones should be deleted (see section 1.1.8). Keep the messages and any response you make to them in this folder. WRONG_FILES - Messages containing lists of WG* files in the buffer area on D0FS. Messages listing DST files in the buffer without RCP files go here also. Investigate these files as described in section 1.1.8. Messages that fit none of the above categories may be left in the MAIL folder, unless they are clearly irrelevant (e.g., cps/ocs updates for machines not used by the RECO farm), in which case they may be deleted. Here are some mail messages generated by the farm (with examples in some cases), and what to do when you get them. 1.2.1. P-server(s) Date: From: dzero@fnsfe.fnal.gov ( Dzero) To: dzero@d0fs Subject: P-server(s) Parallel servers on fnsfe_2 will be restarted This is probably the most common messagae. It signals that the parallel servers on one of the VMs were restarted. It can be ignored unless it happens too frequently, so you should archive these on d0fs and pay attention to the frequency. There is a likely problem if you see repeated messages from one VM (fnfsX_#) with little or no processing going on. If you are concerned, check the flow of files in the spooling area (see section 1.1.3). You can also use tpm_disp to see how long the VM has been running the current job. 1.2.2. Intape ABEND Date: From: dzero@fnsfe.fnal.gov ( Dzero) To: dzero@d0fs 17 Subject: Intape Abend The previous intape job ended ABEND 12512 Tape WM5986 -1 files 1869 sgi84 fnsfe_2 Mon Mar 21 16:45:27 CST 1994 This message signifies failure to mount an input tape for reading (-1 files), or failure during read of one of the files on the tape (file count > 0). Check whether the tape has already been resubmitted. If it is a special-project tape, you will probably have to resubmit it by changing its status from SUBMITTED to WAITING in the injobs file. For standard RECO, there is a procedure that automatically resubmits the tape after 48 hours. You only need to resubmit these tapes manually if you are in a hurry to get this particular data processed or if the farm has nothing else to do. 
In any case, add this tape to the list in the weekly report (see section 4) and archive the message. You should watch for 3 consecutive failures on the same tape drive: that signifies a failing drive. If a particular input tape has failed many times, you may want to declare it a bad tape. (See section 2.4.)

1.2.3. Outtape ABEND

Date:
From: dzero@fnsfe.fnal.gov ( Dzero)
To: dzero@d0fs
Subject: Outtape Abend

The previous outtape job ended ABEND
Tape WN7750 2 files /usr/local/home/dzero/proman/pdbkd/1994/mar/19-24 sgi81 fnsfe_1 Thu Mar 24 14:42:50 CST 1994

This signifies failure to mount an output tape for writing (file count 0) or a failure during writing (file count > 0). You should not have to do anything to the farm, but again you should watch for 3 consecutive failures on the same tape drive: that signifies a failing drive. Add this tape to the list in the weekly report and archive the message.

1.2.4. Bad file

Date:
From: dzero@fnsfe.fnal.gov ( Dzero)
To: dzero@d0fs
Subject: Bad file

Bad file ALL_062508_71.X_RAW01 on WM3233

This message tells you that the inspooler has skipped a file due to read errors. Add this file to the list in the weekly report, and archive the message. Be aware that for WMxxxx tapes, the control program will try to resubmit these bad files, since it knows they're missing. If the experts tell you that an attempt to recover some particular bad file is hopeless, you should remove the file name from whichever tapes_moved file in the ~/proman/history area lists it.

1.2.5. Tape in use on D0FS

Date: Wed, 6 Apr 94 10:54:15 -0500
From: dzero@fnsfe.fnal.gov ( Dzero)
To: dzero@d0fs
Subject: Tape in use on D0FS

Tape VGL127 is in use on the D0FS

This message occurs when a tape that is called for in injobs can't be mounted because some other process is using it. You will most likely see these tapes appear in tpm_disp with INTAPE_IN_USE status. Follow the procedure in section 2.3 for releasing tapes from D0FS.

1.2.6. DST transfer to D0FS failed for WNxxxx

Date: Tue, 5 Apr 94 16:12:50 -0500
From: dzero@fnsfe.fnal.gov ( Dzero)
To: dzero@d0fs
Subject: DST transfer to D0FS failed for WN7857

Tue Apr 5 16:12:46 CDT 1994

Normal operation calls for DST output to be written to a "linktemps" area on the farm and then transferred to D0FS. Sometimes this fails due to D0FS, network screwups, or other problems. If the time stamp on the message is between 00:00 and 00:20, do nothing and the problem should fix itself. Otherwise, go to the ~/proman/exe directory and execute pick_up_all (or pick_up_trans, which works only on the node you are actually logged into, whereas pick_up_all does all nodes). The system will retry the failed transfers twice a day (at 5:00 and 20:00).

1.2.7. after_crash executed

This is a message you get after the system reboots. It signifies that the automatic procedure (which is called after_crash) that restarts tpm_submit_job has been executed. A second message should follow, confirming that tpm_submit_job has started. If you get many of these messages in a row, check to make sure tpm_submit_job is actually running. If it isn't, or if the d0reco_prl processes do not start, go to ~/proman/exe and execute after_crash manually. In any case, archive the message.

1.2.8. No tape drive
Date: Wed, 6 Apr 94 17:19:54 -0500
From: dzero@fnsfe.fnal.gov ( Dzero)
To: dzero@d0fs
Subject: No tapedrive

All drives are busy; fatal error submitting the spooler; page D0
fnsfe_2 Wed Apr 6 17:19:46 CDT 1994

You get this message when an inspooler or outspooler needs to mount a tape, but there is no drive available. The most common cause is a bad tape drive, so check the appropriate drives.fnsfX file. Archive these messages.

This message has also been seen recently when the drives.fnsfd file has been overwritten by a blank file of zero lines. This problem is not yet understood, but you can tell when it has happened if you go to the resource directory and type (more) the file - you will see nothing at all. The short-term fix is, from fnsfe:

% res
% cp sk_drives.fnsfd drives.fnsfd

This replaces the blank file with one that has the needed drive information so that processing can continue.

1.2.9. Bad tape drive

Date: Wed, 6 Apr 94 17:19:29 -0500
From: dzero@fnsfe.fnal.gov ( Dzero)
To: dzero@d0fs
Subject: Bad Tape Drives

The following drives have their error counts exceeded: sgi83; page D0

You get this message when the error count in drives.fnsfX is 5 or more on a particular drive. Follow the procedure in section 2.1 for handling tape drive failures. Be sure to archive this message in the FARM_YYMMDD folder.

1.2.10. Wrong project; no current project defined

These signify an improper definition of a project. The problem could be in the injobs file, in the resource file, or in the ~/proman/project area where project definitions are specified. It can also happen on the worker nodes. You should check the projects requested in the injobs and resource files to make sure that they actually exist (compare them to the names on the files in the ~/proman/project area). Also make sure that there are tapes waiting in injobs for every project to which a VM is assigned. If these look okay, wait. The system has many automatic recovery features that may solve the problem. Normal processing should resume on the affected VM and/or worker node(s) shortly. If the problem persists, call in an expert (S. Kunori) to sort out the problem. Shifters should not edit project files or create new projects without explicit instruction to do so.

1.2.11. Server node fnsfX_# down - investigate

Usually, this represents a network glitch. Check the spooling area (sp#) to see if the STA files are growing. If they are, there is no real problem. If this message appears frequently, it is likely to be a symptom of a major network problem. If you are suspicious, contact the expert (S. Kunori) and ask him to take a look. Report any other symptoms you see.

1.2.12. Disk space limit exceeded

Date: Wed, 6 Apr 94 06:00:17 -0500
From: dzero@fnsfe.fnal.gov ( Dzero)
To: dzero@d0fs
Subject: Disk space limit exceeded

%90 of the disk space is used on /spool02 of the farmlet fnsfe_2; Check it. Also check mounts - maybe Xoper is frozen; otherwise page D0 primary.

This message signifies that too much space is in use in the spooling area named. Processing will slowly grind to a halt unless the disk use falls below 90%. See section 2.12 for advice. You will get a similar message if the /dzero area is over 95% used; this will also halt processing. See section 2.11 for advice on that problem.

1.2.13. Rsh got stuck in _______; killed

This means that a remote command got stuck and was cancelled. A few of these are no problem; if you get many, it may signal a serious network problem. Try to rlogin to the node(s) that sent the message.
If you can't do that, call the operator and ask to have the nodes checked (are they on?). Again, if you think there is a serious problem, call in the expert.

1.2.14. Check wrkshell on fnsfxxx

Date:
From: dzero@fnsfe.fnal.gov ( Dzero)
To: dzero@d0fs
Subject: Check wrkshell on fnsf128

More than one wrkshell on fnsf128

This message means that there are multiple wrkshell processes running on the worker node. Usually, when the problem is detected, the system cleans itself up and then sends the message - but sometimes this fails. You should check by logging into the node in question, doing ps -ef | grep dzero, and killing whatever daughter processes exist, i.e., every wrkshell, d0reco.x, inreader.x, and outwriter.x.

1.2.15. RECO hung

This means that d0reco.x has been hung for over 30 seconds on the specified node. You should restart RECO on that node (see section 2.9).

1.2.16. Multiple RECOs running on node fnsfxxx

This means that too many processes got started on this node. You need to kill every inreader.x, outwriter.x, and d0reco.x process you see on this node. (See section 2.9.)

1.2.17. Many others

These are messages for which no action is required, including:

No files in raw
WNxxxx returned to blanks
dbl3 server failures
Bad Zebra construct
Job control failed

If you see these, you should just archive them in the FARM_YYMMDD folder on D0FS::DZERO. [FYI: at present, the "returned to blanks" message is fictitious. Tapes flagged for returning to blanks are instead listed in a separate file (blanks_to_be.list) in the resource directory, so that our tape managers can examine them to diagnose our problems and prevent overwriting of tapes.]

1.3. Log files

The farm produces a number of log files which can be helpful in tracking down problems or determining the status of particular jobs. The log files are kept in subdirectories of the areas ~/proman/pdbkd, ~/proman/logdb, and /proman/wrkdb; the subdirectories have the format year/month/6-day-range (e.g., 1994/mar/19-24). (Go to one of these areas and do ls; you'll understand the format.) It is to your advantage as a shifter to be familiar with the contents of various log files, so you should look in these directories on your own to learn the details. Here, I provide only some general guidance as to the kinds of information you can find in these files.

1.3.1. logdb

These are the kinds of files you will see in this area:

sfe~/proman/logdb/1994/mar/19-24% ls
PRL_RECO_FULL_V12_fnsfe_2_4032112.log   inspool_12522.log
PRL_RECO_FULL_V12_fnsfe_2_4032207.log   inspool_12523.log
copy_results_log.WN7726                 inspool_12524.log
copy_results_log.WN7727                 inspool_12525.log
copy_results_log.WN7728                 outspool_WN7726_4031920.log
inspool_12512.log                       outspool_WN7727_4032200.log
inspool_12513.log                       outspool_WN7728_4032202.log
inspool_12518.log                       outspool_WN7729_4032203.log
inspool_12520.log                       outspool_WN7730_4032209.log
inspool_12521.log                       outspool_WN7731_4032210.log
sfe~/proman/logdb/1994/mar/19-24%

Here is what some of them contain:

- PRL_RECO_FULL_V12_<VM>_<time>.log records the activities of the d0reco_prl_fnsfX_# process; a new file is created every time this process is started.

- inspool_<job ID>.log records the copying of files from the input tape to the inspool directory. The job ID corresponds to the ID shown in tpm_disp (see section 1.1.1).

- copy_results_log.<tape> confirms copying of DSTs and RCPs to D0FS.

- outspool_<tape>_<time>.log records the copying of output STAs to tape <tape>.
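For example, to follow up the FAILED_ON_INSP job with ID 12510 from the tpm_disp example in section 1.1.1, you might do the following (a sketch; substitute the date range in which the job actually ran):

% cd ~/proman/logdb/1994/mar/19-24
% more inspool_12510.log   {look for the inspooler's error messages}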
1.3.2. pdbkd

These directories contain the RCP files for the production database. The STAF_<file>.RCP and DSTF_<file>.RCP files (one for each run partition) contain input and output file names, when and where RECO ran, the number of events processed and missed, and a few other things. The STAT_<tape>.RCP files list the files on each tape and which tapes the corresponding raw data files can be found on. They also note when and on which VM the job ran, as well as on which tape drive the output tape was written. You will find these files helpful in a number of different situations, particularly when you want to know which tape a given output file was on.

1.3.3. wrkdb

This directory is not on the home disk. It can be found by typing cd /proman/wrkdb. (Note that there is no ~ in this directory name.) The subdirectories contain the D0RECO log files for each worker node; the logfile is placed in the subdirectory corresponding to the most recent date when wrkshell was started on the node, which does not usually correspond to the date a particular file was processed. (To find the most recent starting date, rlogin to a worker node and type pf wrk. You will see the date this process was last started. Then log out and look in the subdirectory corresponding to that date. If pf wrk returns a time rather than a date, it means that wrkshell was started within the last 24 hours.) Looking at the logfiles is useful when you suspect RECO is crashing or hanging; you may be able to determine where the problem is and give some advice to the RECO experts who are responsible for fixing problems in the code.

1.4. The pager

The pager should be carried by one of the farm shifters, mainly so that the operators can reach you when they need to notify you of conditions calling for your attention. When the operators page you, respond as soon as you reasonably can. If you see "LO CELL" in the pager window, it is time to change the battery. New batteries may be in a supply cabinet (ask Sonya Wright), or you may procure one from the stockroom if you are authorized to do so (ask Lee Lueking what budget code to use).

2) PROBLEM FIXING

2.1. Tape drive fails

This happens to most drives after they have been operating for a few months. A drive is automatically marked UNAVAILABLE (bad) in drives.fnsfX when its error count reaches 5. Other signs of a failing drive include: ABEND errors (mount failures) on 3 or more consecutive attempts to mount a tape on that drive, a hung inspooler or outspooler process, or too many files in the ~/proman/sta area. (This last happens when a drive is switched to single density. Call the operators and have them check that.)

When a drive goes bad, the first thing to try is having the drive cleaned. Call the operators in Feynman and ask them to do it. When this is done, reset the error count to zero and the status to READY in drives.fnsfX. If the drive still has problems, you should do the following:

1. Identify the problematic drive from email messages and/or the drives.fnsfX file. From whichever I/O node the drive is on, execute cps_umaint tape sgi## broken noswap. (Note that ## represents the 2-digit number that is part of the drive name.) When prompted, enter a comment explaining why you think the drive is broken (ending it as the prompt directs when you are done) or create a comment file and select that; try to include enough detail to help the maintenance personnel track down the problem.
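For instance, to declare the drive from the section 1.2.9 example broken (the drive name is illustrative):

% cps_umaint tape sgi83 broken noswap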
After you have entered your comment, the tape drive state will be updated to broken in the cps database (when you do cps_tape -lt, you will see that it has changed) and mail will be sent to farm-admin@fnsg01.fnal.gov and to farm-user-fnsfe@fnsg01 announcing the update.

2. Recombine the inspoolers, if needed (see section 2.2).

3. Call the operator for a report. The production names for D0 are sgid0sfe and sgid0sfd.

FYI, the person who actually takes care of repairing and replacing bad drives and other hardware is Ken Stox, stox@dcdkc.fnal.gov. It is generally a good idea not to contact him directly; use farm-admin. Also, you should send to farm-users-d0@fnsgi1 a copy of any message to farm-admin.

When a drive is repaired, you will receive a mail message telling you that its status was updated to working. After receiving this message, you should do cps_tape -lt to make sure the drive is listed in the database as working. If it is, check the drives.fnsfX file (in the resource directory) to make sure that its status has been set to READY and its current error count to zero. If these have not been done, you should edit the file to make the changes. Then, you can change the inspooler configuration to reflect the new count of available drives.

2.2. Combining inspoolers

This needs to be done whenever fewer than 6 tape drives are available on a physical machine (fnsfe, fnsfd, or fnsff). Since each machine has 7 drives, you do not need to combine inspoolers unless 2 or more drives are broken. The principle is that two VMs assigned to the same project can share an inspooler (and therefore a tape drive), but each VM must have a separate drive for its outspooler. The following rules for combining inspoolers follow this principle:

6 or 7 drives available: Each VM has its own inspooler.
5 drives available (2 bad): Two VMs must share an inspooler.
4 drives available (3 bad): One VM should be set to UNAVAILABLE (you can have all three VMs sharing an inspooler, but this is less efficient).
3 drives available (4 bad): One VM must be set to UNAVAILABLE and the remaining two share an inspooler.
2 drives available (5 bad): Two VMs must be set to UNAVAILABLE.
Fewer than 2 working drives: Data processing is not possible.

In order to combine the inspoolers, find two VMs on the same physical machine that are running the same project. If all three are running different projects, you will have to change the project for one VM in the resource file so that two of them are running the same project. When you have decided which two VMs will share an inspooler, go to the ~/proman/resources directory (on fnsfe) and look at the inspool_list_N.fnsfX files there. You want to find one that combines the inspool areas of the two VMs you have decided to combine. If there isn't one, edit one of them (e.g., N=8 or N=9 for fnsfe) to match your choice. Then edit the resource file, changing the project (if needed) and changing the inspooler number to match N for the two VMs whose inspoolers you are combining.

Now you will need to get the proper inspoolers started. (If you skip this step, the system will eventually sort itself out, but that wastes a lot of time and generates an annoyingly large number of email messages.) Here are the steps to take: First, kill the inspooler processes on the VMs you have just combined (using kill -9 <process ID>). Then cancel the tape mount(s) requested by these inspoolers: Figure out which tape(s) have been mounted on which drive(s) for the inspoolers you have combined (pf inspool will help you here).
Then type cps_tapereply -f -c"fnsfe" sgi##. You will then have to log in to D0FS::DZERO and release the tape from the batch queue (see section 2.3). You should also do these things when you un-combine inspoolers after drives have been repaired.

Here is an example of how to tell which tapes have been requested by which inspoolers:

% pf inspool
dzero 18798 1 41 19:09:48 ? 3:52 /usr/local/home/dzero/proman/exe/inspool_fnsfe_5 -inl WM5915 -dev sgi85 -queid
dzero 19462 1  0 19:13:53 ? 1:29 /usr/local/home/dzero/proman/exe/inspool_fnsfe_8 -inl WM6248 -dev sgi86 -queid
%

This command shows you that inspoolers 5 and 8 are running on fnsfe, and that inspooler 5 has a tape on sgi85 while inspooler 8 has a tape on sgi86. (Incidentally, for this example, fnsfe_0 and fnsfe_2 were sharing inspool_fnsfe_8.) You can check the status of the mounts by typing

% cps_tape -lt
TAPEDRIVE DEVTYPE      STATUS  ALLOC       TAPE   TAPE_STATUS
sgi80     exabyte_8500 working allocated   WN7875 mounted
sgi81     exabyte_8500 working allocated   WN7876 mounted
sgi82     exabyte_8500 broken  unallocated WN7849 mounted
sgi83     exabyte_8200 broken  unallocated WN7877 mounted
sgi84     exabyte_8500 working allocated   WN7880 mounted
sgi85     exabyte_8500 working allocated   WM5915 mounted
sgi86     exabyte_8500 working allocated   WM6248 mounted
%

To cancel the mount for inspooler 5, you would type cps_tapereply -f -c"fnsfe" sgi85.

2.3. Releasing tapes held on D0FS

You need to do this when you receive a "Tape in use on D0FS" message, or when you have killed an inspooler. First, make sure the tape isn't being held for a legitimate reason. (Occasionally, someone else has cause to use the raw data tapes.) If it isn't, log into D0FS, look at the queue STAGE_IN_USE_ON_UNIX_FARM, and find the entry number of the job that is holding the tape you need. (The job name will be WMxxxx_WMxxxx_XX_UNIX.) Stop that job using DELETE/ENTRY. Back on fnsfe, change the status of that tape from SUBMITTED to WAITING in injobs. Note: you can try logging into specific D0FS nodes, like D0RSEX, D0TSEX, D0RSUT, etc., if login to D0FS is denied due to "too many users."

2.4. Marking tapes BAD

You should do this when you notice that a tape has accumulated a large number of error counts in the injobs file. Edit injobs, changing the status of the tape from WAITING to BADTAPE. Send a message to D0FS::COPYMAN identifying the bad tape; include the tapes_moved file for that tape. You should also rename the tapes_moved file for that tape in ~/proman/history to something like bad_tapes_moved.WMxxxx to make sure the automatic procedure doesn't resubmit it. Do not attempt to process the tape again unless told to do so by an expert.
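A sketch of the renaming step, assuming (as the bad_tapes_moved name suggests) that the history file carries the tape label as its extension; the tape label is illustrative, and his is the shorthand for cd ~/proman/history mentioned in section 2.5:

% his
% mv tapes_moved.WM3233 bad_tapes_moved.WM3233   {keep the file, but hide it from the resubmit procedure}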
2.5. Injobs has nothing WAITING

To make a long story short, this should not happen during the run unless there has been a lengthy shutdown of the accelerator. If you see only a few or no tapes WAITING in injobs, you should check whether tapes are being vaulted and entered in the catalog properly. Check by logging into D0FS::DZERO and looking in the directory USR$DISK51:[FMD0.MOVEDTAPES]. There you should see files called TAPES_MOVED., which list all the files on all tapes vaulted since the last file was created. You should check for recent dates on these files to verify that vaulting and cataloguing hasn't stopped. Next, on fnsfe, look in the history area (his or cd ~/proman/history gets you there) for tapes_moved. files; there should be one corresponding to each file you found on D0FS. Look at the most recent of these files to see the tape labels. Then look at the injobs file to check whether those tapes are in the queue. If you see a problem, send mail to D0FS::COPYMAN and to Shuichi Kunori so that they can investigate. They will advise you if tapes need to be added manually.

2.6. DST does not get to D0FS

The normal chain of events after a file is processed calls for the STA to be written to tape and the DST to be copied to the D0$DATA$BUFFER area on D0FS, along with an RCP file indicating where the DST in question really belongs. A server process on D0FS looks for these RCP files, reads them, copies the relevant DST to the correct area, then deletes the DST from D0$DATA$BUFFER. So if the DST does not show up in D0$DATA$DST (or whatever "target" directory area it should end up in), the first thing to do is check the buffer area. If the DST is still there, check that the RCP file is also there and okay. If so, make sure the target area has space in it. When a target area is full, inform Lee Lueking that files are backing up in the buffer; he is responsible for deciding how to create space.

If the DST hasn't reached the buffer area, it may be that the buffer area is full (if it isn't, see section 1.1.8). In this case, it's probable that a target area is full and has caused the backup into the buffer; contact Lee Lueking to get things sorted out.

The next thing to look for is a network problem. These are common, especially when D0FS goes down. Make sure D0FS is alive, then log into fnsfe and kill any uptest process you see that is over a day old. This should prod some file transfers into starting. If there are no stale uptest processes, you can try pick_up_all to start the transfer, or use zftp to transfer the file manually. (In the latter case, you will have to find the DST file on the farm and rename it using lowercase letters before the transfer will work.)

Note that if the DST is not copied, the STA (which has been written to tape) will not be catalogued. This may cause files to be reprocessed, generating duplicates. A process is run to check for this; in time, the farmers may be made responsible for removing the unwanted copy of any duplicate files, but this has not yet been implemented.

2.7. Killing/Restarting a parallel server

You should do this only when a particular VM has been frozen (no files growing) for a long time, or when combining inspoolers. If this is the case, see whether the d0reco_prl_fnsfX_# process is running. (Use pf prl to confirm that the process is running for the "hung" VM and to get the process ID.) If the process is running but the VM has been hung for a sufficiently long time, the following steps will kill the process and induce a restart.

First, be sure you are on fnsfe and that tpm_submit_job is sleeping. (You can check this by typing ps -ef | grep sleep and looking for sleep 800 among the processes. If you don't see sleep 800, check that tpm_submit_job is running; continue checking for sleep 800 until you see it.) To kill the process, edit the resource file, changing the status of the VM whose server you intend to kill to UNAVAILABLE. Then, from the I/O node on which the server in question is running, type kill -9 <pid>. If you want it to finish its present activity and end gracefully, or if you intend an immediate restart, do nothing else. However, if you want it to die quickly and not restart at once, you should also kill the inspool, outspool, insrv, and outsrv processes for that VM if the next "wake" cycle of tpm_submit_job doesn't do it.
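Schematically, the kill sequence looks like this {the PID 12345 is illustrative; get the real one from pf prl}:

% ps -ef | grep sleep     {repeat until you see sleep 800, i.e., tpm_submit_job is sleeping}
  {now edit the resource file: set the VM's status to UNAVAILABLE}
% pf prl                  {find the PID of d0reco_prl_fnsfX_# for the hung VM}
% kill -9 12345           {kill the parallel server on its I/O node}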
Then, cancel the mounts (see the example in section 2.2) and release the tapes from D0FS (section 2.3).

To restart the VM, edit the resource file again (on fnsfe, while tpm_submit_job is sleeping), changing the status back to FARMLET. The process will be restarted automatically when tpm_submit_job wakes up, so after half an hour or so, you should check the area again and make sure all is well.

2.8. Multiple jobs INQUEUED on a VM

This happens when RECO crashes. You should check the wrkdb logfiles (see section 1.3.3) and notify a RECO expert. If the problem is run-specific, you can set the tapes for that run to PENDING in injobs; the VM will go on to a different run. When the RECO problem is fixed, be sure to set the relevant tapes back to WAITING so they will be processed.

If there are multiple jobs INQUEUED on a VM, but no evidence of a RECO crash, you should do the following to clear up the tapes: For all but the most recently INQUEUED job on the VM in question, check with tpm_disp to make sure that the tapes have not been resubmitted and processed since then. If they have, do nothing about them. But if they have not (and if it has been over 48 hours since the first submission of a WMxxxx tape), check the status of the tapes in injobs; if it is SUBMITTED, change it back to WAITING and remove the tape from the queue on D0FS (see section 2.3) if it is sitting there.

2.9. Restarting D0RECO on a worker node

First, see whether RECO really is dead by doing rlogin to the worker node and typing ps -ef | grep dzero. You will see d0reco.x running and the elapsed CPU time. Repeat this command after 30 seconds and see whether the CPU time increases. If not, kill inreader.x, outwriter.x, and d0reco.x. (If you have another reason to be restarting RECO, you don't need to check whether the CPU time is increasing.) Kill the processes with kill -9 <pid>. They should restart automatically, but it is not a bad idea to check the worker node in question to make sure all is well.
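A sketch of the check {the worker node name fnsw101 and the PIDs are illustrative}:

% rlogin fnsw101
% ps -ef | grep dzero               {note the CPU time shown for d0reco.x}
% sleep 30; ps -ef | grep dzero     {if the CPU time has not increased, RECO is dead}
% kill -9 <inreader.x pid> <outwriter.x pid> <d0reco.x pid>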
2.10. Problem runs

This is not very common, but you should know what to do, just in case. Some runs cause RECO to crash on every event. The first symptom is that a particular VM stays hung (usually with "total 0") for a long time. Look in the inspool/ and raw/ areas (subdirectories in the spooling area); you will see files piling up if RECO is having problems. Next, check one of the worker nodes for that VM to see if d0reco.x is running. If RECO is crashing, you will see the process appear and disappear when you do pf d0reco repeatedly. You should then look at a log file for one of the worker nodes (these are in the wrkdb area; see section 1.3.3) to see what problems RECO is having. Send mail to the appropriate experts describing the problem.

If the problem is with the databases (DBL3), it will affect entire runs. It is a good idea to find all the raw data tapes corresponding to the affected run(s) and set them to PENDING in injobs until an expert informs you that the problem has been fixed. (The quickest way to get all the tape numbers corresponding to a particular run is to call the DAQ expert on shift in the control room and ask; they should know how to do this.)

2.11. Disk space shortage on /dzero

When the free disk space on /dzero falls below 5%, all processing halts. Contact an expert (S. Kunori) for advice on cleaning up the disk.

2.12. Disk space shortage in spooling area

When one of the spooling disks is 90% (or more) in use, processing on that VM will eventually grind to a halt. This problem is often a byproduct of unusually large STAs or outspool failures. When the disk space runs low, you will get mail messages. Check that the outspooler for the affected VM is running and that its tape has been mounted. Also, look at outtape ABEND messages for indications that the drive being used by the outspooler is failing. (See section 2.1 for further instructions on tape problems.) If the tape and drive are okay and the problem persists, you can try executing pick_up_all from the ~/proman/exe directory. This usually frees up enough space to get the disk use below 90%.

Note: RESIST the temptation to rm files from the dst and sta directories; this can cause confusion in the databases and on D0FS. You can try deleting the files from the inspool and raw areas; that may free up enough space to get things going.

3) SUBMITTING SPECIAL JOBS

Do this when you are instructed to do so by Lee Lueking or someone else with authority from the OCPB. (Remember that all special jobs requests must be approved by the OCPB before they are run.) You will be given a form describing each approved request, which you should keep until the job is completed, then return to Lee. This section describes the steps to take to process the data.

Note regarding regular RECO jobs: This procedure has been automated for WMxxxx tapes (raw D0 data) to be processed with the current version of D0RECO. Automated processes check for new WM tapes in the log of vaulted tapes kept by the operators at Feynman; these tapes are automatically entered in the FATMEN catalog, and a tapes_moved.WMxxxx file for the tape is generated. The tape is also appended to the end of the injobs file. So if someone requests a WM tape to be processed at high priority, all you need to do is move it to the top of injobs.

3.1. Get tapes_moved file

The person requesting the special job is responsible for providing a tapes_moved. file for each tape he or she wants processed; the request form should tell you where these file(s) are. You should copy the file(s) requested to the history area. So, go to directory ~/proman/history (use the shorthand command his if you like) and copy the files from wherever the form says they are. The tapes_moved. file should look like this:

sfe~/proman/history% more tapes_moved.VGL063
VGL063 1 TOP_W2E80A03N_VB300_G314SS3_A_07.X_RAW01
VGL063 2 TOP_W2E80A02N_VB300_G314SS3_A_04.X_RAW01
VGL063 3 TOP_W2E80A02N_VB300_G314SS3_A_06.X_RAW01
VGL063 4 TOP_W2E80A02N_VB300_G314SS3_A_08.X_RAW01
VGL063 5 TOP_W2E80A03N_VB300_G314SS3_A_02.X_RAW01
VGL063 6 TOP_W2E80A03N_VB300_G314SS3_A_04.X_RAW01
VGL063 7 TOP_W2E80A03N_VB300_G314SS3_A_08.X_RAW01
sfe~/proman/history%

Sometimes the file is empty. This is a way of telling the farm to process every file on the tape. Empty or not, the file does need to be present or the job will not run. Occasionally, someone who wants every file on a tape will not make a tapes_moved file, in which case you may create an empty file for that job after checking with the requester.
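For example {the source directory and the tape label VGL064 are illustrative}:

% his                                                    {shorthand for cd ~/proman/history}
% cp <directory given on the form>/tapes_moved.VGL063 .
% touch tapes_moved.VGL064    {empty file = process every file on the tape; create one only after checking with the requester}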
3.2. Add the request to the injobs file

Go to directory ~/proman/resources (use the shorthand command res if you like). You can either add the new tape request to injobs manually, by editing the file, or use the automated command procedure to do it for you. For the automated command procedure, type:

% injob_create

You will be prompted for the project name, which you should enter in all capitals (remember that UNIX is case-sensitive), and for the tapes to use, which you should enter one tape name per line (two consecutive <CR>s signal the end of the list).

3.3. Edit the resource file

In this same directory, you must edit the resource file to change one VM over to run the special project. Be sure the VM you switch is in FARMLET status and not UNAVAILABLE; otherwise the job will never run. If a VM is already assigned to this special project, you do not need to assign a second one unless the request is urgent.

3.4. What happens next

When the VM you have assigned to the special project finishes its current job, it will pick up the first tape in the injobs file for which the special project has been requested, and it will try to run the job. If the job fails the first time, you should edit the injobs file, changing the status of the relevant tapes from SUBMITTED to WAITING so that the VM will try again. As with any job, repeated failures may signify that the tape is bad. Do not switch the VM back to RECO_FULL_Vnn until the job has succeeded or it is decided that the tape is bad.

If the job goes through (gets to ENDED status in tpm_disp), you should trace the output to confirm that processing was completed. Section 3.5 describes one method of doing this. When the job has finished (all tapes processed or declared bad), send an email message to the requester (cc to Lee Lueking, OCPB chairman), saying that the job has finished or indicating why it could not be completed (bad tape, etc.). Also, write the job status, date, and your initials at the bottom of the request form and return it to Lee. It is up to the requester (not you) to check the quality of the output and inform the OCPB of any problems. You should only redo a job if Lee (or another OCPB authority) authorizes it. Be sure to change the VM back to RECO_FULL_Vnn in the resource file when the job is finished; this will guarantee the best use of our available CPU and I/O capacity.
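While a special job is in progress, a quick way to watch a tape's status in injobs {this assumes each tape's entry, with its status, sits on a single line, as the grep commands in section 4.2.8 suggest; the label is illustrative}:

% res                    {shorthand for cd ~/proman/resources}
% grep VGL063 injobs     {the entry should show WAITING, SUBMITTED, PENDING, or BADTAPE}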
3.5. Tracing a tape

When you run a special job, you should confirm that the output does indeed get through the farm and onto D0FS and/or FATMEN before you declare it finished. If you know that the job is ENDED according to tpm_disp, you can do the following to verify the output:

- Note the date of processing from tpm_disp.

- Go to the appropriate subdirectory of ~/proman/pdbkd (see section 1.3.2). Type grep <tape> STAT*.RCP, where <tape> is the name of the input tape you are tracing. The name of the output tape is given in the name of this RCP file. Check that all the files are listed. It is also a sensible idea to type out the STAT RCP files you find with grep to make sure the format is correct. If no matches are found, you can assume that the output file was not generated. (Check the VM the input tape ran on to see if it is still working on these files. If so, give it more time. Also, if the processing date is near a 6-day subdirectory's boundary, you may also want to check the next subdirectory of pdbkd.)

- Go to the ~/proman/report/log area and look at the copied.OUTVSN and copy_log.OUTVSN files, where OUTVSN is the output tape label you found in the previous step. You can use these files to confirm that every file on the tape was transferred. There are also logfiles for the copying of DST and STA RCP files that you may want to check.

- The final check is to look on D0FS (directory D0_DATA_MC for Monte Carlo) or in FATMEN (use the file names you found in the STAT RCP files to determine the generic name) to confirm that the file showed up there. It may take a day or so for things to be catalogued, so do allow some time before doing this test.

For more on tracing tapes, see the most recent version of D0FS::USR$ROOT5:[DZERO.DOCS]TAPE_TRACE(date).DOC.

4) MAKING WEEKLY REPORTS

In addition to the duties outlined above, several reports on farm progress are expected from the shifters on a regular basis. You should assemble these reports by 1 p.m. Monday so that they can be passed on to the person reporting for D0 at the all-experimenters' meeting. (Since the shift changes occur on Mondays, it is the responsibility of those whose shift is ending to make these reports for the final week of their shift.) The shifter(s) who prepared the report should present it at the production group meeting on Tuesday morning and post the text report to the PRODUCTION folder of D0NEWS. Finally, the shifters going to the farm users' meeting, every other Wednesday afternoon, should take the reports along, as they will help in recalling what problems have occurred. The report should include the following elements:

4.1. Number of events processed (graph)

The graph shows the daily progress of the farm over an interval of several months by showing, for every second day, the cumulative number of events processed between the start date (15 Jan 1994) and that day. The graph also gives the total number of events processed and the average number of events per day. Separate totals should be calculated for ALL stream data, special runs processed on the farm, and Monte Carlo; a combined total should be presented as well. The separate totals should be reported at the D0 production meeting on Tuesdays, while at the all-experimenters' meeting, only the combined total is needed.

The procedure to generate the plots using the weekly report is as follows:

1) The script official_report is submitted by the farm automatically every Sunday evening. It generates a file called ~/proman/report/log/off_rep.mm_dd_yy, where the date the report was generated is used in the name.

2) On Monday morning, check that this file exists, is readable, and has output all the way up to the day before. Then execute:

% nawk -f report.awk off_rep.mm_dd_yy > report.mm_dd_yy

This will generate a file called report.mm_dd_yy, which gives the total number of events processed each day.

3) Ftp the report file you generated in step 2 to the [.REPORT] subdirectory on the D0FS::DZERO account. Then you will need to do:

$ COPY <last week's report> <updated report>

where the first file is last week's report and the second file is the updated report to include the most recent week. (If you look in the [.REPORT] directory, this should become clearer.) Then:

$ APPEND REPORT.MM_DD_YY <updated report>

where the REPORT file is the one you just ftp'd and the second file is the one you just created with the COPY command. Edit the file to remove the overlapped days (everything between the beginning of the month and the end of last week). When you edit, notice that Sunday's total in last week's report was incomplete; you should use the one you got in the new report.

4) Use the REPORT.FOR program sitting in the D0FS area to generate a cumulative total for every second day. (I have found file protection violations when trying to use the FORTRAN compiler on D0FS, so you may need to copy all the relevant files to your own account to do this. Remember that you must leave copies of the data files on D0FS each week so that other shifters can find them.) Compile, link, and run the program. You will be prompted for a file name: give your input file name (the edited .DAT file from step 3). The output file is called TOTALS_OUT.DAT.

5) Run PAWX11. [I suspect this isn't set up on D0FS either, so you will have to run it from your own account.] You want to execute the macro TOTAL.KUMAC (copy it from D0FS to somewhere you can run PAW) to generate the plot. The macro will create a postscript version of the plot for you, which you can print. You will need one copy for the D0 representative at the all-experimenters' meeting, and one for you to show at the production meeting.

Note: This procedure causes some information to be lost at the end of the month, due to the way the official_report script works. Until this is changed, you will need to run the script yourself (as described in section 3.5 of Kirill's manual) soon after the first of each month to get the information for the last few days of the previous month.

Due to directory-size limits in UNIX, the weekly report program is not functioning properly. In place of the above, you can generate the plot using daily reports. This also has the advantage that information is not lost at the end of the month. Here are the steps to follow:

1) Go to the ~/proman/report directory and execute official_report_daily Mmm dd &, where Mmm dd is the month (3 letters) and date for which you want a report. Do this for each day for which a report is needed (Monday through Sunday); it is a good idea to run some of these during the week, and a shell loop (sketched after these steps) saves some typing. Each daily report job produces an output file, off_rep_daily.ddMmm1994, which has the same format as the weekly report file.

2) On D0FS, in the [.REPORT] subdirectory, copy last week's file to a new file for this week. Edit the new file by typing in the total number of events from each daily report file you generated in step 1.

3) Follow steps 4 and 5 from the weekly report procedure.
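If your login shell is csh, the loop mentioned in step 1 might look like this {the dates are illustrative; one background job is started per day}:

% cd ~/proman/report
% foreach d (14 15 16 17 18 19 20)
? official_report_daily Mar $d &
? end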
4.2. The written report

This is the file PRODUCTION_REPORT.MM_DD_YY (the date is the Monday the report period begins) in the [.LOG] subdirectory of the D0FS::DZERO account. It should include the following elements:

4.2.1. Dates covered and shifters' names

Indicate also who was responsible on which days.

4.2.2. Weekly summary

Monday morning, write a few sentences to highlight major problems, changes, plans, etc. Remember that the report will be made public, so avoid negative commentary on individuals or their systems.

4.2.3. Tape drives

Note when tape drives went bad, were replaced, repaired, or cleaned, and whether inspoolers had to be combined due to a shortage of drives. You should also include any drives that were reported bad during a previous week and fixed during the week your report covers.

4.2.4. Special projects

List all special requests for the week, along with a status report, making it clear whether each job is in the queue, in progress, finished, failed miserably (what is being done about that?), etc. If there were projects outstanding from the previous week, include them also.

4.2.5. Current node configuration

Monday morning, type the resource file and enter the project assigned to each VM. If a VM is unavailable, indicate that.

4.2.6. Number of tapes processed

Look at the official report output files (daily or weekly). Count the number of input tapes processed each day and add them up.

4.2.7. Last run number processed

Look through the official report output files again, and find the highest run number in the previous week's listings.
4.2.8. Tapes in injobs

This is a count of the backlog. To get the total, type:

% res
% more injobs | grep -c WAITING

The response will be a number: the total number of tapes waiting. To get the breakdown by project, add another "pipe" to the UNIX command; e.g.,

% more injobs | grep WAITING | grep -c FULL

will give you the count of tapes waiting for full (standard) D0RECO. (You can do the piping in either order, but be sure that the -c is in the last part; otherwise you won't get a sensible response.) Similar commands will get you the counts for the other projects; the individual project counts should add up to the total. Put all this information in the report.

4.2.9. Tape ABENDs, bad files

You should collect this information during the week from the e-mail messages. A good way to do this is to have two windows open to the D0FS::DZERO account when you are reading new mail there every day. Read mail in one window, and edit the weekly report file in the other. Whenever you come to one of these messages, use the mouse to transfer the information to the report file (which is being edited at the time); then move the message to the appropriate folder. If you do this daily, you will find it less tedious than sorting through a week's accumulation of farm mail on Monday in a frantic search for these messages.

4.2.10. Miscellaneous

Here you can make a note of various things that go wrong: system crashes, disk mounts lost, resource files that disappear, processes that get hung for days, past or upcoming system work, etc. Again, it is wise to avoid negative commentary. Simply state what happened and what (if anything) has been done to fix it.

4.3. Changing log directories

Because directories with over 500 files become a problem on our UNIX systems, it is important to change the log directory from time to time. At present, this is done on the first of each month. First, make sure that no copy jobs are running by doing:

% ps -ef | grep copy

This should show no copy jobs running {apart from the grep command itself}. Then rename (UNIX mv) ~/proman/report/log to ~/proman/report/logmmmyy, where mmm is the three-letter abbreviation for the month (Jan, Feb, etc.) and yy is the year. Finally, create a new ~/proman/report/log directory.
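The whole operation, as a sketch {assuming the change is made on 1 July 1994, so the old directory holds June's logs}:

% ps -ef | grep copy     {only the grep itself should appear}
% cd ~/proman/report
% mv log logjun94        {rename the old log directory out of the way}
% mkdir log              {create a fresh log directory for the new month}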