CMS Monte Carlo Production at FNAL
Shift Takers Manual for Fall Production, 2000
October 10, 2001


(1) What is the goal ?

    Production covers three areas: generation+simulation (cmsim), hit
  formatting (writeHits), and digitization with pileup (writeDigis).

      cmsim: generated+simulated events ---> .fz files
      writeHits: .fz files ---> Objectivity database
      writeDigis: Objectivity database ---> Objectivity database

(2) What are our resources ?

   enstore:
        We use the Storage Tek tape library in enstore for mass
      storage of .fz files and OBJY data files.  The enstore catalog of files
      on tape is managed with a filesystem lookalike called pnfs.
      /pnfs is "mounted" on gallo and velveeta and can be queried like
      a normal filesystem with commands like "ls", "mkdir", etc.  The
      exception is that to copy files to and from enstore you have to
      use the "encp" command instead of "cp".  Copying files to pnfs
      with encp means that they actually get written to tape.
        There is a known bug in pnfs: some files are listed twice.
      Each such file actually exists only once, so the bug is not
      serious; it will be fixed after production is over.
      The CMS production area in pnfs is under /pnfs/cms/production.
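
        For example, to copy a file off tape (the path and file name
      here are illustrative only):

             encp /pnfs/cms/production/some_dir/run1.fz /tmp/run1.fz

      Reversing the two arguments copies a local file into pnfs, which
      writes it to tape.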

        We create jobs and launch them on gallo.fnal.gov.  The .fz files
      come from the /pnfs/cms/production/Projects//results area.  The
      CMS production area on gallo where we create and launch jobs is
      /data/jetmet_production.  The .fz files are also staged on the
      popcrn nodes on which the jobs run.

        The Objectivity database is called a federation because it is
      actually a collection of database files.  The C federation is kept
      on gallo.fnal.gov and the D federation on velveeta.fnal.gov.  The
      scripts are kept on gallo.fnal.gov.  When running writeHits or
      writeDigis, it is important to monitor the usage of the /data
      disks on both gallo and velveeta.  The space requirements are
      about the same on each.
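
        A quick way to check both disks from gallo (these are the same
      commands used in section (3) below):

             df -k /data
             rsh velveeta df -k /data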

   popcrn31-40:
        These are the nodes of the production farm.  Each node has two
      processors and can run two jobs at a time.  popcrn31-40 (except
      37) are also pileup servers: they keep the minimum bias data and
      serve it to jobs running writeDigis that need pileup.  popcrn1-2
      are reserved for development.

  cmsprod account:


       Username: cmsprod
       Password: xxxxxxx

  fbsng (Farm Batch System Next Generation):
        This is the batch system used on the popcrn nodes.  Some useful
      commands:
            setup fbsng      ( Must be done before using fbsng )
            fbs lj           ( Lists all jobs currently running or queued )
            fbs nodes        ( Lists all running jobs by node )
            fbs submit x.jdf ( Submits a job. )
      Each fbs job supports multiple processes; typically the production
      scripts will generate 5-10 processes per job.
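
      A typical submission sequence, using only the commands above
      (x.jdf stands for whichever job description file you are
      submitting):

            setup fbsng
            fbs submit x.jdf
            fbs lj           ( verify the job shows up )
            fbs nodes        ( see where its sections landed )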

  production scripts:
        Hans' production scripts are kept in CVS module cms_production.
      The scripts we will use for production have already been checked
      out for you in the directory
              /data/jetmet_production/C_scripts or
              /data/jetmet_production/D_scripts

  backup scripts:
        Greg has written backup scripts for production.  Go to the
      directory of either the C or D federation and start backing up
      data and system files.  The syntax is as follows:

      Backup of data files, for both Hits and Digis:

      HITS:
      fb -v -i `pwd` -o jetmet_production -n  -w  -c 
      backup_data_files > keyword.txt

      There are five options in the case of Hits:
      1. Collections
      2. Events
      3. MCInfo
      4. Hits
      5. THits

      Make sure you repeat the backup command for each of these options.

      DIGIS:
      fb -v -i `pwd` -o jetmet_production -n  -w  -c 
      backup_data_files > keyword.txt

      There are three options in the case of Digis:
      1. Collections
      2. Events
      3. Digis

      Make sure you repeat the backup command for each of these options.

      Backup of system files:
      fb -v -i `pwd` -o  backup_system_files > keyword.txt
        
  Objectivity servers:
          Objectivity uses the AMS server to communicate across the
          network.  It also has a lock server to provide safe concurrent
          access to the database files.  The AMS server should be running
          on velveeta, gallo, and popcrn31-40 at all times during
          production.  Lock servers should be running on popcrn06 and
          popcrn07 at all times during production.  To check, log in as
          cmsprod and type:

                 oocheckams 
                 oocheckls 

        It doesn't matter which machine you run these checks from.  If
        either server needs to be started (or stopped) you can use the
        following:

                 setup systools
                 cmd ams-server start
                 cmd ams-server stop
                 cmd ams-lock start &
                 cmd ams-lock stop
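
        For example, to check the servers on a particular machine (this
        assumes oocheckams and oocheckls accept a host name argument; if
        your versions do not, run them while logged into that machine):

                 oocheckams velveeta
                 oocheckls popcrn06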

        In some circumstances, the lock server cannot be stopped this
        way.  In that case, call an expert, who can issue a "kill -9"
        and do the necessary cleanup.

  Web Site for more information:

        http://computing.fnal.gov/cms/Monitor/cms_production.html

(3) So I just got in and I'm on shift today.  What do I do ?

    This tutorial assumes that you are starting from the beginning.
In real life, you may start in the middle of any of these procedures.

    (a) Attend the production meeting every Monday and Thursday at 10:00 AM.
        Get instructions there on what samples need to be processed.
        Otherwise, wait for instructions from Greg.

    (b) If this is your first day on shift, make sure that your name
        is included in the FBS job description file so that you will be
        notified when jobs finish.  To do this:
           i) Log into gallo as cmsprod.
          ii) cd /data/jetmet_production/C_scripts/cms_prod_util,
              D_scripts/cms_prod_util, or B_scripts/cms_prod_util
         iii) emacs Templates/hits_template.jdf
          iv) add your email address to the EMAIL lines (there is more
              than one place in the file) and save the changes.
           v) Repeat for all digis_template*.jdf files in Templates.
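
        A quick way to confirm your address made it into every template
        (a sanity check, not part of the official procedure):

           grep EMAIL Templates/hits_template.jdf Templates/digis_template*.jdf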

    (c) For doing CMSIM:
           i) Log into gallo as cmsprod.
          ii) setup fbsng
         iii) cd /data/jetmet_production/C_scripts/cms_prod_util,
              D_scripts/cms_prod_util, or B_scripts/cms_prod_util
          iv) Check disk space of gallo:/data and velveeta:/data. Check
              for at least 50 GB space on each.  If there is not, ask
              for assistance.  Free space will be made.  When logged
              into gallo:
                             df -k /data
                             rsh velveeta df -k /data

              
           The following are the steps involved in CMSIM production.

           scripts/DeclareCMSIMJobs.sh -v -n 40 data_set_name [number]

           This command creates a directory for data_set_name under the
           cmsim directory, then a production directory under the data
           set name directory, and then the following directory structure
           under the production directory:

                declared
                created
                in_progress
                done
                params
                problems

           The command gets the list of all ntpl files for the given data
           set from the specified directory on gallo and creates an entry
           for each in the "declared" directory.  If it fails to create
           an entry in "declared", it reports an error message.  If you
           receive no error message, everything is OK.
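
           To confirm that entries were declared, a quick check (this
           assumes the job tree lives under /data/jetmet_production/cms_db,
           as in section (5)(b); data_set_name is a placeholder):

           ls /data/jetmet_production/cms_db/cmsim/data_set_name/production/declared | wc -l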
					 
           scripts/CreateCMSIMJobs.sh -v data_set_name

           This command creates a batch directory under the data set name
           directory, and then the following directory structure under
           the batch directory:

                asociations
                created
                declared
                finished
                jdf
                logs
                params
                running
                scripts
                submitted

           This command creates entries in the asociations, created, and
           declared directories for each ntpl file.  It also creates a
           script from cmsim_template for each ntpl file and puts them
           into the scripts directory.  Then it creates a job description
           file for each entry and puts them in the jdf directory.

           scripts/RunJob.sh -v -j cmsim data_set_name

           This command submits the jobs one by one, for the given data
           set, to the production farm nodes you specified in the
           command.  You can see entries for all jobs that have been
           submitted successfully in the batch/submitted directory.
           After a job completes successfully, its file entry is moved to
           the done directory.  If a job does not run successfully, its
           file entry is moved to the problems directory.  The full
           sequence is summarized below.
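
           Putting the three steps together, a complete CMSIM pass looks
           like this (data_set_name is a placeholder; use the scripts
           area you were assigned):

           cd /data/jetmet_production/C_scripts/cms_prod_util
           setup fbsng
           scripts/DeclareCMSIMJobs.sh -v -n 40 data_set_name
           scripts/CreateCMSIMJobs.sh -v data_set_name
           scripts/RunJob.sh -v -j cmsim data_set_name
           fbs lj           ( confirm the jobs are queued )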
  
    (d) For doing OOHits:
           i) Log into gallo as cmsprod.
          ii) setup fbsng
         iii) cd /data/jetmet_production/C_scripts/cms_prod_util or D_scripts/cms_prod_util
          iv) Check disk space of gallo:/data and velveeta:/data. Check
              for at least 50 GB space on each.  If there is not, ask
              for assistance.  Free space will be made.  When logged
              into gallo:
                             df -k /data
                             rsh velveeta df -k /data

           v) You will receive one or more data sets to process with
              writeHits for the day.
              
           The following are the steps involved in OOHit formatting. 

           scripts/DeclareHitsJobs.sh -v data_set_name

           This command creates a directory for data_set_name under the
           OOHit directory, then a production directory under the data
           set name directory, and then the following directory structure
           under the production directory:

                declared
                created
                in_progress
                done
                problems

           The command gets the list of all fz files for the given data
           set from tape and creates an entry for each in the "declared"
           directory, without the fz suffix.  If it fails to create an
           entry in "declared", it reports an error message.  If you
           receive no error message, everything is OK.

           scripts/CreateHitsJobs.sh -v data_set_name

           This command creates a batch directory under the data set name
           directory, and then the following directory structure under
           the batch directory:

                asociations
                created
                declared
                finished
                jdf
                running
                scripts
                submitted

           This command creates entries in the asociations, created, and
           declared directories for each fz file.  It also creates a
           script from hits_template for each fz file and puts them into
           the scripts directory.  Then it creates a job description file
           for each entry and puts them in the jdf directory.

           scripts/RunJob.sh -v -j OOHit data_set_name [number]

           This command submits the jobs one by one, for the given data
           set, to the production farm nodes you specified in the
           command.  You can see entries for all jobs that have been
           submitted successfully in the batch/submitted directory.
           OOHit formatting consists of three stages:

                 Staging
                 RunHits
                 ValidateHits

           In the first stage, each fz file is staged from enstore tape
           to the run-time area.  In the second stage, the hit formatting
           is done.  In the final stage, the hit run number is validated.
           The stages execute one after the other, and each depends on
           the exit code of the previous stage: if that exit code was not
           zero, the next stage is not executed.  After a job completes
           successfully, its file entry is moved to the done directory.
           If a job does not run successfully, its file entry is moved to
           the problems directory.  The full sequence is summarized
           below.
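
           Putting the three steps together, a complete OOHit pass looks
           like this (data_set_name is a placeholder):

           cd /data/jetmet_production/C_scripts/cms_prod_util
           setup fbsng
           scripts/DeclareHitsJobs.sh -v data_set_name
           scripts/CreateHitsJobs.sh -v data_set_name
           scripts/RunJob.sh -v -j OOHit data_set_name
           fbs lj           ( confirm the jobs are queued )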
  
  
                 
    (e) For doing OODigis:
           i) Log into gallo as cmsprod.
          ii) setup fbsng
         iii) cd /data/jetmet_production/C_scripts/cms_prod_util or
              D_scripts/cms_prod_util
              echo $PROD_RESOURCES
              setenv PROD_RESOURCES `pwd`/scripts
          iv) Check disk space of velveeta:/data. Check for at least 50 GB
              space.  If there is not, ask for assistance.  Free space will
              be made.  When logged into gallo:

                             df -k /data
                             rsh velveeta df -k /data

           v) You will receive one or more data sets to process with
              writeDigis for the day.
              The following are the steps involved in OODigitization.

              scripts/DeclareDigisJobs.sh -v data_set_name pileup_descriptor

              This command creates a directory for data_set_name under
              the OODigi directory, then a production directory under the
              data set name directory, and then the following directory
              structure under the production directory:

                     declared
                     created
                     in_progress
                     done
                     problems

              The command gets the list of all fz files for the given
              data set from tape and creates an entry for each in the
              "declared" directory, without the fz suffix.  If it fails
              to create an entry in "declared", it reports an error
              message.  If you receive no error message, everything is
              OK.

              scripts/CreateDigisJobs.sh -v data_set_name pileup_descriptor

              This command creates a batch directory under the data set
              name directory, and then the following directory structure
              under the batch directory:

                    asociations
                    created
                    declared
                    finished
                    jdf
                    running
                    scripts
                    submitted

              This command creates entries in the asociations, created,
              and declared directories for each fz file.  It also creates
              a script from the digis template (digis_template*.jdf) for
              each fz file and puts them into the scripts directory.
              Then it creates a job description file for each entry and
              puts them in the jdf directory.

              scripts/RunJob.sh -v -j OODigi data_set_name [number]

              This command submits the jobs one by one, for the given
              data set, to the production farm nodes you specified in the
              command.  You can see entries for all jobs that have been
              submitted successfully in the batch/submitted directory.
              After a job completes successfully, its file entry is moved
              to the done directory.  If a job does not run successfully,
              its file entry is moved to the problems directory.  The
              full sequence is summarized below.
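
              Putting the three steps together, a complete OODigi pass
              looks like this (data_set_name and pileup_descriptor are
              placeholders):

              cd /data/jetmet_production/C_scripts/cms_prod_util
              setup fbsng
              setenv PROD_RESOURCES `pwd`/scripts
              scripts/DeclareDigisJobs.sh -v data_set_name pileup_descriptor
              scripts/CreateDigisJobs.sh -v data_set_name pileup_descriptor
              scripts/RunJob.sh -v -j OODigi data_set_name
              fbs lj           ( confirm the jobs are queued )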
     
(4) What can I do while it's running ?

     Some run-time Sanity Checks

       i) Check that the jobs have been submitted OK with fbs lj.

      ii) Check on which nodes jobs are running with fbs nodes.

     iii) After a few minutes, check that the disk usage on
          velveeta:/data is growing.  Check this from gallo 
          using "rsh velveeta du -sk /data".
          Check it again shortly after and see if disk space is
          accumulating.
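
          A simple way to keep watching it (sh/bash syntax; the cmsprod
          login shell may be csh, in which case just rerun the command
          by hand):

               while true; do rsh velveeta du -sk /data; sleep 300; done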

      iv) Each job will have N sections, depending on the N you gave to
          the CreateHitsJobs/CreateDigisJobs script.  Count the number
          of sections listed in the "fbs nodes" output for each job.
          Did any sections croak ?

       v) Lots of fun: After a while, you can get basic statistics on the
          Web page.  Go to
               http://computing.fnal.gov/cms/Monitor/cms_production.html
          and follow the "Production Farms" link.  You can get various
          network traffic plots and CPU utilizations for all the popcrn
          nodes, gallo, and velveeta.  Also, you can get a summary of
          which nodes have jobs running on them.

      vi) Do you know your FBS job id numbers ?  Then you can check
          which event you are on with "python ~/bin/LogChecker.py "
          followed by the job id, as in the example below.  This command
          parses the log files as they are written in /data/fbs-logs on
          gallo and reports the last event number processed.
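
          For example, for FBS job 2118 (the job id here is only
          illustrative; use your own, from "fbs lj"):

               python ~/bin/LogChecker.py 2118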

(5) How do I know it finished OK ?

  (a) When all sections of the job have exited, you will receive several
      emails.  Check the output of the email labeled "main".  It looks
      like this (the example below is from a failed job!):


        Section Info:
        Job 2118 Section: main
        Exec: ['/data/cms_production_220301/cms_production/scripts/
                digis_dispatcher', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', '10', 'jm_sm_qq_qqh120_inv', '1034']
        Submit_Time: Sun Apr  8 13:29:10 2001
        Start_Time:  Sun Apr  8 13:29:16 2001
        End_time: Sun Apr  8 13:35:52 2001
        Exit Code:1
        Number of Process 10
        -----------------------------
        Process Info:
        -----------------------------
        Process 1
        Node: popcrn26
        Start Time: Sun Apr  8 13:29:16 2001
        End Time: Sun Apr  8 13:35:47 2001
        Exit Code:1
        Reason:Killed by BMGR
        CPU Time: 187
        -----------------------------
        Process 2
        Node: popcrn10
        Start Time: Sun Apr  8 13:29:16 2001
        End Time: Sun Apr  8 13:35:30 2001
        Exit Code:1
        Reason:Killed by BMGR
        CPU Time: 150
        -----------------------------
        Process 3
        Node: popcrn37
        Start Time: Sun Apr  8 13:29:16 2001
        End Time: Sun Apr  8 13:35:47 2001
        Exit Code:1
        Reason:Killed by BMGR
        CPU Time: 40
        -----------------------------

     Note the CPU time of each process.  If any stick out, there may
     have been a problem.  Also check the exit codes.  If any are non-zero, 
     there may have been a problem.  However, problems do arise that do not 
     touch the exit code, so beware.
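
     If you save the "main" email to a file, a quick way to spot
     non-zero exit codes (the file name here is hypothetical):

          grep "Exit Code" main_email.txt | grep -v "Exit Code:0"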

  (b) Check the job directories.

           i) Log into gallo as cmsprod.
          ii) cd /data/jetmet_production/cms_db
         iii) "ls cmsim/data_set_name/production/problems",
              "ls OOHit/data_set_name/production/problems", or
              "ls OODigi/data_set_name/production/problems"
          iv) If there are any entries here, then there was a problem.