ESSENTIAL INFORMATION ON D0RECO_PRL AND ASSOCIATED PROCESSES

Version 1.1 - 19 June 1995, Cathy Cretsinger

Contents:
  1. Where It Lives
  2. Where It Runs
  3. What It Does
  4. How to Start It
  5. How to Stop It
  6. Log Files
  7. Other Files
  8. Troubleshooting
  9. And the expert is...


1) WHERE IT LIVES:

The area ~/dpm/pkg/prl includes:

  d0reco_prl scripts, etc:
      d0reco_prl.csh
      insrv and outsrv executables
      empty.ous       (used to handle input files with 0 events)
      rcp_prl_write   (generates the RCP for each completed file)

  Scripts used by tpm to control starting of d0reco_prl jobs
  (shifters should not run these scripts by hand):
      check_minishell.csh
      start_mu_shell.csh


2) WHERE IT RUNS:

One d0reco_prl process runs for each VM. Typically there is one insrv
and one outsrv, but there may be more. An insrv with no outsrv is a
sign of trouble (see section 8).


3) WHAT IT DOES:

A) The short answer:

d0reco_prl.csh is a c-shell script that controls actual processing of
input (raw) data files. It checks the inspool area for new files, then
starts insrv and outsrv processes, which transfer events to and from
worker nodes. When an sta/dst file pair has finished, it renames them
according to standard D0 filename conventions and generates a
corresponding RCP file.

insrv is a fortran program that ships events over the network to the
worker nodes via sockets.

outsrv is a fortran program that receives processed events from the
worker nodes via sockets. It keeps track of which events have or
haven't been processed; when the file is completed, it produces a list
of missing events.

B) The long answer:

What follows is the normal sequence of events. Knowing this may help
you track down problems.

-- d0reco_prl sees a file in the inspool subdirectory with a matching
   *.done file.

-- It moves the file to the raw subdirectory and sets up a working
   area, which tells the workers what project to use.

-- It starts insrv and outsrv, which begin transferring events to and
   from worker nodes.
   The worker nodes take some time to initialize d0reco, then events
   begin flowing back. A *.ios file appears in the main spooling area
   of the VM, a *.sta file appears in the proman/sta area, and a *.dst
   file appears in the proman/dst area. In the meantime, d0reco_prl
   looks for the next input file and submits it to insrv, if it has
   the same run number and project.

-- When there are no more events, insrv will either go on to the next
   file (if it has the same run number and project) or die. A short
   time later, outsrv will finish and create a *.ous file in the
   proman/sta area. This file tells d0reco_prl that the corresponding
   *.sta is finished.

-- d0reco_prl renames the *.sta and *.dst files to standard D0
   conventions. It calls rcp_prl_write to generate STAF_ and DSTF_ RCP
   files, with information about the files. Then it deletes the raw
   file and the corresponding *.done file from the raw subdirectory.


4) HOW TO START IT:

A d0reco_prl process is started for each VM by tpm_submit_job. In
order for it to start, the VM must be set to FARMLET status in
~/proman/resources/resource.

insrv and outsrv are started by d0reco_prl.

Shifters should not attempt to start any of these by hand.


5) HOW TO STOP IT:

In the appropriate spooling area, create a stop_prl file (no content
required). When d0reco_prl sees this file, it will kill insrv and
outsrv and exit. The stop_prl file will be deleted at exit time, so if
you do not want d0reco_prl to restart, you must set the VM to
UNAVAILABLE in ~/proman/resources/resource.

A global stop (all 3 VMs on a given I/O node) is possible using
~/proman/util/kill_d0reco_prl. This script puts stop_prl files in all
3 spooling areas of the I/O node from which it is run. You should see
insrv, outsrv, and d0reco_prl stop on that node, usually within 5
minutes.


6) LOG FILES:

The d0reco_prl log files are kept in ~/proman/logdb/yyyy/mmm/d1-d6
directories, reflecting the start date of the process. To find log
files for a specific VM (e.g.
fnsff_0), go to the appropriate directory and ls PRL*fnsff_0* to get a
listing. These log files contain a fair amount of helpful diagnostic
info.

The insrv and outsrv log files are kept in
~/proman/clidb/yyyy/mmm/d1-d6 directories. They contain very little
helpful info.


7) OTHER FILES:

Files created by d0reco_prl and insrv/outsrv can be found on the
spooling area /spool00/dzero/fnsff_0 (etc).

- subdirectory raw/ - contains files that are being processed by
  d0reco_prl; for each *.X_RAW file, there should be a corresponding
  *.done file that identifies the tape, file name, and project.

- spooling directory - contains a *.ios file for each file currently
  being processed.

- proman/sta/*.sta and proman/dst/*.dst files - are STA and DST
  formatted output files for files currently being processed.

- proman/sta/*.ous file - is a file that appears when processing is
  complete; it keeps the missing-event info.

- proman/sta/*.X_STA* files and proman/dst/*.X_DST* files - these are
  completed data files. STAs go to outspoolers; DSTs are moved to
  proman/buffer for transfer to D0FS after the outspooler has
  successfully exited.


8) TROUBLESHOOTING:

If the size of the files in proman/sta is changing, it is safe to
assume that d0reco_prl is working. The sta_diff column in mon_disp can
tell you at a glance.

Many things (some normal, some not) may hold up the writing of output
files. Normal holdups usually clear themselves in an hour or less, so
it's reasonable to do no investigating until a VM has been idle for
over an hour.

The following are some possible problems and solutions to them:

1. d0reco_prl is not running.

   Make sure the VM is set to FARMLET in the resource file. Wait for
   tpm_submit_job to go through a complete waking cycle; it should
   start d0reco_prl even if there are no tapes waiting for the project
   assigned in the resource file.

2. insrv and outsrv don't start

   First, make sure there's something in the VM's inspool directory.
   If not, check the inspooler.
   Next, check how full the VM's spooling disk is. If it is over 74%
   full, d0reco_prl will not start the servers. The preferred solution
   is to wait until some files move out of proman/buffer and/or
   proman/sta. If there are several completed files in proman/sta,
   check the outspooler.

   The decision to delete files is a drastic one. If it is deemed
   necessary, start by deleting anything in proman/dst that is over 4
   days old and has no matching file in proman/sta. Next, delete raw
   files from either the raw or inspool areas. Do *not* delete files
   from proman/buffer; there are other ways to get transfers started
   if the buffer is too full.

   If neither of these conditions applies, check the d0reco_prl log
   file; it may give you a hint as to what's wrong.

3. insrv and outsrv start, then die

   If this happens repeatedly, suspect a network problem. Check all
   the VM's worker nodes and make sure at least some are functional.
   Report any that you find unresponsive to farm-admin and helpdesk.

4. insrv or outsrv only

   It is okay if a VM has an outsrv but no insrv; do not do anything.
   The outsrv should stop on its own within about 20 minutes; new
   servers will start after that.

   However, if a VM has had an insrv but no outsrv for more than 5
   minutes, kill the insrv. Then d0reco_prl should start a new pair of
   servers within a few minutes.

5. servers running, but no output (*.sta)

   It can take a while for d0reco.x to initialize on worker nodes. If
   insrv and outsrv have only recently started, be patient. If it has
   been a while, check the worker nodes for reco crashes.

6. Asterisks in file names

   If you see a *.ios, *.sta, or *.ous with asterisks in the file
   name, such as 27000_***_088000_01.sta (yes, UNIX does allow * in a
   filename!), stop the d0reco_prl and delete all such files. If you
   see the sta, check for the ios and the dst as well; they will
   probably be bad too. When d0reco_prl restarts, watch to see that
   the asterisks don't recur.
   If they do, you should delete the raw files that appear to be
   causing the problem and make a note of it.


9) AND THE EXPERT IS...

Cathy Cretsinger (FNALD0::CRETSINGER) or
Shuichi Kunori (FNALV::KUNORI)

Contact these people if you have a problem that you can't figure out
from this document.
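

EXAMPLE: READ-ONLY HELPERS FOR THE SECTION 8 CHECKS

A few of the checks in section 8 can be sketched as small shell
helpers. This is only a minimal sketch and is not part of the
d0reco_prl package. The function names (disk_too_full,
disk_percent_used, server_state, find_asterisk_files) are hypothetical,
and the code assumes a POSIX sh with standard df, awk, and find rather
than the c-shell used by d0reco_prl.csh itself. Counting the insrv and
outsrv processes that belong to a VM is left to the shifter, since ps
output differs between systems. Nothing here kills a process or deletes
a file.

```shell
#!/bin/sh
# Hypothetical read-only helpers for the section 8 checks.
# None of these functions ship with d0reco_prl.

# Succeed if the percent-used figure $1 is over 74, the point at which
# d0reco_prl will not start the servers (problem 2).
disk_too_full() {
    [ "$1" -gt 74 ]
}

# Print the percent used on the filesystem holding directory $1.
# Assumes POSIX "df -P" output, where column 5 is the capacity (NN%).
disk_percent_used() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Classify a VM from its insrv count ($1) and outsrv count ($2).
# "insrv-only" is the case that needs action after 5 minutes;
# "outsrv-only" should clear on its own (problem 4).
server_state() {
    if [ "$1" -gt 0 ] && [ "$2" -eq 0 ]; then
        echo "insrv-only"
    elif [ "$1" -eq 0 ] && [ "$2" -gt 0 ]; then
        echo "outsrv-only"
    elif [ "$1" -eq 0 ]; then
        echo "none"
    else
        echo "paired"
    fi
}

# List files under directory $1 whose names contain a literal
# asterisk (problem 6). The bracket expression [*] matches one
# literal "*" instead of acting as a wildcard.
find_asterisk_files() {
    find "$1" -name '*[*]*' -print
}
```

For example, disk_too_full "$(disk_percent_used /spool00/dzero/fnsff_0)"
succeeds when that spooling disk is over the limit, and
find_asterisk_files /spool00/dzero/fnsff_0 lists candidate files for
problem 6 so they can be reviewed before anything is deleted.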