CMS Monte Carlo Production at FNAL
Shift Takers Manual for Fall Production, 2000
October 10, 2001


(1) What is the goal ?

    Production covers three areas: generation+simulation (cmsim), hit
  formatting (writeHits), and digitization with pileup (writeDigis).

      cmsim: generated+simulated events ---> .fz files
      writeHits: .fz files ---> Objectivity database
      writeDigis: Objectivity database ---> Objectivity database

(2) What are our resources ?

   enstore:
        We use the Storage Tek tape library in enstore for mass
      storage of .fz files and OBJY data files.  The enstore catalog of files
      on tape is managed with a filesystem lookalike called pnfs.
      /pnfs is "mounted" on gallo and velveeta and can be queried like
      a normal filesystem with commands like "ls", "mkdir", etc.  The
      exception is that to copy files to and from enstore you have to
      use the "encp" command instead of "cp".  Copying files to pnfs
      with encp means that they actually get written to tape.
        There is a known bug in pnfs: some files are listed twice.
      Each such file actually exists only once, so the bug is not
      serious; it will be fixed after production is over.
      The CMS production area in pnfs is under /pnfs/cms/production.
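
        For example, to copy a file off tape (the path and file name
      here are illustrative only):

             encp /pnfs/cms/production/some_dir/run1.fz /tmp/run1.fz

      Reversing the two arguments copies a local file into pnfs, which
      writes it to tape.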

        We create jobs and launch them on gallo.fnal.gov.  The .fz files
      come from the /pnfs/cms/production/Projects//results area.  The
      CMS production area on gallo where we create and launch jobs is
      /data/jetmet_production.  The .fz files are also staged on the
      popcrn nodes on which the jobs run.

        The Objectivity database is called a federation because it is
      actually a collection of database files.  The C federation is kept
      on gallo.fnal.gov and the D federation on velveeta.fnal.gov.  The
      scripts are kept on gallo.fnal.gov.  When running writeHits or
      writeDigis, it is important to monitor the usage of the /data
      disks on both gallo and velveeta.  The space requirements are
      about the same on each.
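
        A quick way to check both disks from gallo (these are the same
      commands used in section (3) below):

             df -k /data
             rsh velveeta df -k /data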

   popcrn31-40:
        These are the nodes of the production farm.  Each node has two
      processors and can run two jobs at a time.  popcrn31-40 (except
      37) are also pileup servers: they keep the minimum bias data and
      serve it to jobs running writeDigis that need pileup.  popcrn1-2
      are reserved for development.

  cmsprod account:


       Username: cmsprod
       Password: xxxxxxx

  fbsng (Farm Batch System Next Generation):
        This is the batch system used on the popcrn nodes.  Some useful
      commands:
            setup fbsng      ( Must be done before using fbsng )
            fbs lj           ( Lists all jobs currently running or queued )
            fbs nodes        ( Lists all running jobs by node )
            fbs submit x.jdf ( Submits a job. )
      Each fbs job supports multiple processes; typically the production
      scripts will generate 5-10 processes per job.
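
      A typical submission sequence, using only the commands above
      (x.jdf stands for whichever job description file you are
      submitting):

            setup fbsng
            fbs submit x.jdf
            fbs lj           ( verify the job shows up )
            fbs nodes        ( see where its sections landed )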

  production scripts:
        Hans' production scripts are kept in CVS module cms_production.
      The scripts we will use for production have already been checked
      out for you in the directory
              /data/jetmet_production/C_scripts or
              /data/jetmet_production/D_scripts

  backup scripts:
        Greg has written backup scripts for production.  Go to the
      directory of either the C or D federation and start backing up
      data and system files.  The syntax is as follows:

      Backup of data files, for both Hits and Digis:

      HITS:
      fb -v -i `pwd` -o jetmet_production -n  -w  -c 
      backup_data_files > keyword.txt

      There are five options in the case of Hits:
      1. Collections
      2. Events
      3. MCInfo
      4. Hits
      5. THits

      Make sure you repeat the backup command for each of these options.

      DIGIS:
      fb -v -i `pwd` -o jetmet_production -n  -w  -c 
      backup_data_files > keyword.txt

      There are three options in the case of Digis:
      1. Collections
      2. Events
      3. Digis

      Make sure you repeat the backup command for each of these options.

      Backup of system files:
      fb -v -i `pwd` -o  backup_system_files > keyword.txt
        
  Objectivity servers:
          Objectivity uses the AMS server to communicate across the
          network.  It also has a lock server to provide safe concurrent
          access to the database files.  The AMS server should be running
          on velveeta, gallo, and popcrn31-40 at all times during
          production.  Lock servers should be running on popcrn06 and
          popcrn07 at all times during production.  To check, log in as
          cmsprod and type:

                 oocheckams 
                 oocheckls 

        It doesn't matter which machine you run these checks from.  If
        either server needs to be started (or stopped) you can use the
        following:

                 setup systools
                 cmd ams-server start
                 cmd ams-server stop
                 cmd ams-lock start &
                 cmd ams-lock stop
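
        For example, to check the servers on a particular machine (this
        assumes oocheckams and oocheckls accept a host name argument; if
        your versions do not, run them while logged into that machine):

                 oocheckams velveeta
                 oocheckls popcrn06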

        In some circumstances, the lock server cannot be stopped this
        way.  In that case, call an expert, who can issue a "kill -9"
        and do the necessary cleanup.

  Web Site for more information:

        http://computing.fnal.gov/cms/Monitor/cms_production.html

(3) So I just got in and I'm on shift today.  What do I do ?

    This tutorial assumes that you are starting from the beginning.
In real life, you may start in the middle of any of these procedures.

    (a) Attend the production meeting every Monday and Thursday at 10:00 AM.
        Get instructions there on what samples need to be processed.
        Otherwise, wait for instructions from Greg.

    (b) If this is your first day on shift, make sure that your name
        is included in the FBS job description file so that you will be
        notified when jobs finish.  To do this:
           i) Log into gallo as cmsprod.
          ii) cd /data/jetmet_production/C_scripts/cms_prod_util,
              D_scripts/cms_prod_util, or B_scripts/cms_prod_util
         iii) emacs Templates/hits_template.jdf
          iv) add your email address to the EMAIL lines (there is more
              than one place in the file) and save the changes.
           v) Repeat for all digis_template*.jdf files in Templates.
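
        A quick way to confirm your address made it into every template
        (a sanity check, not part of the official procedure):

           grep EMAIL Templates/hits_template.jdf Templates/digis_template*.jdf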

    (c) For doing CMSIM:
           i) Log into gallo as cmsprod.
          ii) setup fbsng
         iii) cd /data/jetmet_production/C_scripts/cms_prod_util,
              D_scripts/cms_prod_util, or B_scripts/cms_prod_util
          iv) Check disk space of gallo:/data and velveeta:/data. Check
              for at least 50 GB space on each.  If there is not, ask
              for assistance.  Free space will be made.  When logged
              into gallo:
                             df -k /data
                             rsh velveeta df -k /data

              
           The following are the steps involved in CMSIM production.

           scripts/DeclareCMSIMJobs.sh -v -n 40 data_set_name [number]

           This command creates a directory for data_set_name under the
           cmsim directory, then a production directory under the data
           set name directory, and then the following directory structure
           under the production directory:

                declared
                created
                in_progress
                done
                params
                problems

           The command gets the list of all ntpl files for the given data
           set from the specified directory on gallo and creates an entry
           for each in the "declared" directory.  If it fails to create
           an entry in "declared", it reports an error message.  If you
           receive no error message, everything is OK.
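
           To confirm that entries were declared, a quick check (this
           assumes the job tree lives under /data/jetmet_production/cms_db,
           as in section (5)(b); data_set_name is a placeholder):

           ls /data/jetmet_production/cms_db/cmsim/data_set_name/production/declared | wc -l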
					 
           scripts/CreateCMSIMJobs.sh -v data_set_name

           This command creates a batch directory under the data set name
           directory, and then the following directory structure under
           the batch directory:

                asociations
                created
                declared
                finished
                jdf
                logs
                params
                running
                scripts
                submitted

           This command creates entries in the asociations, created, and
           declared directories for each ntpl file.  It also creates a
           script from cmsim_template for each ntpl file and puts them
           into the scripts directory.  Then it creates a job description
           file for each entry and puts them in the jdf directory.

           scripts/RunJob.sh -v -j cmsim data_set_name

           This command submits the jobs one by one, for the given data
           set, to the production farm nodes you specified in the
           command.  You can see entries for all jobs that have been
           submitted successfully in the batch/submitted directory.
           After a job completes successfully, its file entry is moved to
           the done directory.  If a job does not run successfully, its
           file entry is moved to the problems directory.  The full
           sequence is summarized below.
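
           Putting the three steps together, a complete CMSIM pass looks
           like this (data_set_name is a placeholder; use the scripts
           area you were assigned):

           cd /data/jetmet_production/C_scripts/cms_prod_util
           setup fbsng
           scripts/DeclareCMSIMJobs.sh -v -n 40 data_set_name
           scripts/CreateCMSIMJobs.sh -v data_set_name
           scripts/RunJob.sh -v -j cmsim data_set_name
           fbs lj           ( confirm the jobs are queued )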
  
    (d) For doing OOHits:
           i) Log into gallo as cmsprod.
          ii) setup fbsng
         iii) cd /data/jetmet_production/C_scripts/cms_prod_util or D_scripts/cms_prod_util
          iv) Check disk space of gallo:/data and velveeta:/data. Check
              for at least 50 GB space on each.  If there is not, ask
              for assistance.  Free space will be made.  When logged
              into gallo:
                             df -k /data
                             rsh velveeta df -k /data

           v) You will receive one or more data sets to process with
              writeHits for the day.
              
           The following are the steps involved in OOHit formatting. 

           scripts/DeclareHitsJobs.sh -v data_set_name

           This command creates a directory for data_set_name under the
           OOHit directory, then a production directory under the data
           set name directory, and then the following directory structure
           under the production directory:

                declared
                created
                in_progress
                done
                problems

           The command gets the list of all fz files for the given data
           set from tape and creates an entry for each in the "declared"
           directory, without the fz suffix.  If it fails to create an
           entry in "declared", it reports an error message.  If you
           receive no error message, everything is OK.

           scripts/CreateHitsJobs.sh -v data_set_name

           This command creates a batch directory under the data set name
           directory, and then the following directory structure under
           the batch directory:

                asociations
                created
                declared
                finished
                jdf
                running
                scripts
                submitted

           This command creates entries in the asociations, created, and
           declared directories for each fz file.  It also creates a
           script from hits_template for each fz file and puts them into
           the scripts directory.  Then it creates a job description file
           for each entry and puts them in the jdf directory.

           scripts/RunJob.sh -v -j OOHit data_set_name [number]

           This command submits the jobs one by one, for the given data
           set, to the production farm nodes you specified in the
           command.  You can see entries for all jobs that have been
           submitted successfully in the batch/submitted directory.
           OOHit formatting consists of three stages:

                 Staging
                 RunHits
                 ValidateHits

           In the first stage, each fz file is staged from enstore tape
           to the run-time area.  In the second stage, the hit formatting
           is done.  In the final stage, the hit run number is validated.
           The stages execute one after the other, and each depends on
           the exit code of the previous stage: if that exit code was not
           zero, the next stage is not executed.  After a job completes
           successfully, its file entry is moved to the done directory.
           If a job does not run successfully, its file entry is moved to
           the problems directory.  The full sequence is summarized
           below.
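
           Putting the three steps together, a complete OOHit pass looks
           like this (data_set_name is a placeholder):

           cd /data/jetmet_production/C_scripts/cms_prod_util
           setup fbsng
           scripts/DeclareHitsJobs.sh -v data_set_name
           scripts/CreateHitsJobs.sh -v data_set_name
           scripts/RunJob.sh -v -j OOHit data_set_name
           fbs lj           ( confirm the jobs are queued )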
  
  
                 
    (e) For doing OODigis:
           i) Log into gallo as cmsprod.
          ii) setup fbsng
         iii) cd /data/jetmet_production/C_scripts/cms_prod_util or
              D_scripts/cms_prod_util
              echo $PROD_RESOURCES
              setenv PROD_RESOURCES `pwd`/scripts
          iv) Check disk space of velveeta:/data. Check for at least 50 GB
              space.  If there is not, ask for assistance.  Free space will
              be made.  When logged into gallo:

                             df -k /data
                             rsh velveeta df -k /data

           v) You will receive one or more data sets to process with
              writeDigis for the day.
              The following are the steps involved in OODigitization.

              scripts/DeclareDigisJobs.sh -v data_set_name pileup_descriptor

              This command creates a directory for data_set_name under
              the OODigi directory, then a production directory under the
              data set name directory, and then the following directory
              structure under the production directory:

                     declared
                     created
                     in_progress
                     done
                     problems

              The command gets the list of all fz files for the given
              data set from tape and creates an entry for each in the
              "declared" directory, without the fz suffix.  If it fails
              to create an entry in "declared", it reports an error
              message.  If you receive no error message, everything is
              OK.

              scripts/CreateDigisJobs.sh -v data_set_name pileup_descriptor

              This command creates a batch directory under the data set
              name directory, and then the following directory structure
              under the batch directory:

                    asociations
                    created
                    declared
                    finished
                    jdf
                    running
                    scripts
                    submitted

              This command creates entries in the asociations, created,
              and declared directories for each fz file.  It also creates
              a script from the digis template (digis_template*.jdf) for
              each fz file and puts them into the scripts directory.
              Then it creates a job description file for each entry and
              puts them in the jdf directory.

              scripts/RunJob.sh -v -j OODigi data_set_name [number]

              This command submits the jobs one by one, for the given
              data set, to the production farm nodes you specified in the
              command.  You can see entries for all jobs that have been
              submitted successfully in the batch/submitted directory.
              After a job completes successfully, its file entry is moved
              to the done directory.  If a job does not run successfully,
              its file entry is moved to the problems directory.  The
              full sequence is summarized below.
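
              Putting the three steps together, a complete OODigi pass
              looks like this (data_set_name and pileup_descriptor are
              placeholders):

              cd /data/jetmet_production/C_scripts/cms_prod_util
              setup fbsng
              setenv PROD_RESOURCES `pwd`/scripts
              scripts/DeclareDigisJobs.sh -v data_set_name pileup_descriptor
              scripts/CreateDigisJobs.sh -v data_set_name pileup_descriptor
              scripts/RunJob.sh -v -j OODigi data_set_name
              fbs lj           ( confirm the jobs are queued )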
     
(4) What can I do while it's running ?

     Some run-time Sanity Checks

       i) Check that the jobs have been submitted OK with fbs lj.

      ii) Check on which nodes jobs are running with fbs nodes.

     iii) After a few minutes, check that the disk usage on
          velveeta:/data is growing.  Check this from gallo 
          using "rsh velveeta du -sk /data".
          Check it again shortly after and see if disk space is
          accumulating.
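
          A simple way to keep watching it (sh/bash syntax; the cmsprod
          login shell may be csh, in which case just rerun the command
          by hand):

               while true; do rsh velveeta du -sk /data; sleep 300; done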

      iv) Each job will have N sections, depending on the N you gave to
          the CreateHitsJobs/CreateDigisJobs script.  Count the number
          of sections listed in the "fbs nodes" output for each job.
          Did any sections croak ?

       v) Lots of fun: After a while, you can get basic statistics on the
          Web page.  Go to
               http://computing.fnal.gov/cms/Monitor/cms_production.html
          and follow the "Production Farms" link.  You can get various
          network traffic plots and CPU utilizations for all the popcrn
          nodes, gallo, and velveeta.  Also, you can get a summary of
          which nodes have jobs running on them.

      vi) Do you know your FBS job id numbers ?  Then you can check
          which event you are on with "python ~/bin/LogChecker.py "
          followed by the job id, as in the example below.  This command
          parses the log files as they are written in /data/fbs-logs on
          gallo and reports the last event number processed.
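
          For example, for FBS job 2118 (the job id here is only
          illustrative; use your own, from "fbs lj"):

               python ~/bin/LogChecker.py 2118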

(5) How do I know it finished OK ?

  (a) When all sections of the job have exited, you will receive several
      emails.  Check the output of the email labeled "main".  It looks
      like this (the example below is from a failed job!):


        Section Info:
        Job 2118 Section: main
        Exec: ['/data/cms_production_220301/cms_production/scripts/
                digis_dispatcher', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', '10', 'jm_sm_qq_qqh120_inv', '1034']
        Submit_Time: Sun Apr  8 13:29:10 2001
        Start_Time:  Sun Apr  8 13:29:16 2001
        End_time: Sun Apr  8 13:35:52 2001
        Exit Code:1
        Number of Process 10
        -----------------------------
        Process Info:
        -----------------------------
        Process 1
        Node: popcrn26
        Start Time: Sun Apr  8 13:29:16 2001
        End Time: Sun Apr  8 13:35:47 2001
        Exit Code:1
        Reason:Killed by BMGR
        CPU Time: 187
        -----------------------------
        Process 2
        Node: popcrn10
        Start Time: Sun Apr  8 13:29:16 2001
        End Time: Sun Apr  8 13:35:30 2001
        Exit Code:1
        Reason:Killed by BMGR
        CPU Time: 150
        -----------------------------
        Process 3
        Node: popcrn37
        Start Time: Sun Apr  8 13:29:16 2001
        End Time: Sun Apr  8 13:35:47 2001
        Exit Code:1
        Reason:Killed by BMGR
        CPU Time: 40
        -----------------------------

     Note the CPU time of each process.  If any stick out, there may
     have been a problem.  Also check the exit codes.  If any are non-zero, 
     there may have been a problem.  However, problems do arise that do not 
     touch the exit code, so beware.
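
     If you save the "main" email to a file, a quick way to spot
     non-zero exit codes (the file name here is hypothetical):

          grep "Exit Code" main_email.txt | grep -v "Exit Code:0"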

  (b) Check the job directories.

           i) Log into gallo as cmsprod.
          ii) cd /data/jetmet_production/cms_db
         iii) "ls cmsim/data_set_name/production/problems",
              "ls OOHit/data_set_name/production/problems", or
              "ls OODigi/data_set_name/production/problems"
          iv) If there are any entries here, then there was a problem.