Photo Production Software

Introduction.

This document describes the operations needed to run the "Imaging" pipeline. These commands are supported by the GGG package and by the photo-prod package.

General structure.

Production is organized using Open Science Grid (OSG-0.8.0) software installed on a number of SDSS cluster nodes, together with the GGG package, which is used to create Condor submit and DAG files. To be able to use the software you will need a valid VOMS proxy. It is recommended to create the proxy with a long lifetime so that the software can run without frequent proxy renewals. Before starting any of the described commands, the user should source the setup.sh (new photo production) or setupSN.sh (old production) file stored in the root production directory (/data/dp30.a/data/photo-prod/ in our case).
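
For example, a long-lived proxy can be created and the environment set up as follows; this is only a sketch, and the VO name and proxy lifetime are placeholders (use the values appropriate for your VO and its server limits):

voms-proxy-init -voms sdss -valid 168:00    # request a one-week proxy; the "sdss" VO name is a placeholder
voms-proxy-info -timeleft                   # check how many seconds remain on the proxy
cd /data/dp30.a/data/photo-prod
source setup.sh                             # or setupSN.sh for the old production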

As required by the GGG package, the production is based on a list of jobs stored in a database. The production is managed by the JobManager Java class, whose behavior is governed by a set of tables in the photoProd database. Those tables are:

  1. pool - describes the grid sites available to the production and how many jobs each site can accept.

  2. jobs - the list of jobs together with their current status (its fields are described in detail below).

  3. frames - describes the processing status of a whole column.

  4. prod_stat - holds the production status flag ("running", "waiting") used to pause and resume the JobManager.

The location of the database, the database user, and the password are stored in the corresponding setup file; please look in the file to see what parameters are defined in it. Presently we are using a MySQL database located in the /data/dp30.a/data/prodDB subdirectory. It contains the databases for the "photo" production (photoProd, snProd) and for the "spectro" production (spectroProd). The MySQL server is not included in the sdssdp30 startup script and will not be restarted automatically after a system reboot, but there is a cron process that checks the server status every 15 minutes. If the script does not detect the server PID, the server is restarted.
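
The actual cron script is not reproduced here; a minimal sketch of what such a watchdog might look like (the mysqld_safe path and the --datadir value are assumptions) is:

#!/bin/sh
# Sketch only: restart the MySQL server if no mysqld process is found.
if ! pgrep -x mysqld > /dev/null; then
    /usr/bin/mysqld_safe --datadir=/data/dp30.a/data/prodDB &
fi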

The parameters stored in setup.sh have explanatory comments. Among the important parameters that influence the behavior of the JobManager are WAIT_TIME and BATCH_SIZE. The first provides the delay in seconds between submissions of job batches; this is used to make the load on NFS more even. The second defines the batch size, i.e. the number of jobs that will be grouped and processed by a single DAG manager. It is not recommended to create batches of more than 5 jobs.
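
For illustration only (the real values live in setup.sh, and whether the variables are exported is an assumption), the two parameters might look like:

export WAIT_TIME=60    # seconds between submissions of job batches, to even out the NFS load
export BATCH_SIZE=5    # jobs grouped under one DAG manager; 5 is the recommended maximum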



The JobManager runs in an endless loop until the "stop" flag is set or the program is aborted. It uses GGG components to create and submit jobs. The maximum number of jobs that can run simultaneously is defined by the "ncpu" parameter of the pool table. The JobManager tracks the status of the submitted jobs in the jobs table.

The main production directory /data/dp30.a/data/photo-prod contains a set of subdirectories:

  1. GGG - the GGG resources subdirectory as created by the GGG distribution file.

  2. scripts - the scripts actually used in the production. There is a copy of this directory in the photo-prod subdirectory, which is used for development and for backing up to CVS with the Eclipse development tool.

  3. site-info - the directory that contains the site description XML file.

  4. xdag_dir - the directory containing XML template files for creation of the Condor submit and DAG files. A copy of this directory exists in the photo-prod subdirectory.

  5. storage - the directory where a subdirectory named after the job ID is created for each job. These subdirectories contain the Condor submit files and the jobs' error and output files. The user should clean this directory periodically (see the cleanup sketch after this list).

  6. var - a subdirectory where a unique directory is created for each job. These contain the Condor output and error files needed to debug any Condor problems. The user should clean them periodically to save disk space (see the cleanup sketch after this list).
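
A possible cleanup sketch, assuming the per-job directories can simply be aged out by modification time (the 30-day cutoff is arbitrary; run it only when no production jobs are active):

cd /data/dp30.a/data/photo-prod
# Remove per-job subdirectories of storage/ and var/ older than 30 days.
find storage -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
find var -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +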





Database creation.

The production system provides several programs to facilitate creation of the initial database. The first of these, PoolDB, creates the pool table, which describes the grid sites used in the production. Its fields are:

  1. inPool - if 0 the site is included in the production; if -1 it is excluded. Removing a site from the pool can pause the production without stopping the JobManager.

  2. status - if 0 the site is OK. The status of the site can be checked by an independent program. Changing the status to a negative value can pause the production without stopping the JobManager.

  3. ncpu - the maximum number of jobs that can be submitted to the site.

  4. submitted - the number of currently submitted jobs.

  5. staging - the number of jobs that perform IO operations.

  6. running - the number of jobs that have "running" status in the jobs table. This does not mean that the job is actually running; it may be just queued and waiting for resources.

To create the pool table use the command:

java -jar photo-prod/bin/photo_fat.jar gov.fnal.sdss.utils.PoolDB
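
Once the pool table exists it can be inspected or adjusted directly in MySQL, for example to exclude a site temporarily via the inPool field. A sketch, assuming placeholder credentials ($DB_USER and $DB_PASS are not the real variable names; the actual ones are defined in the setup file) and a site-name column matching the SiteName used in the jobs table (check the real schema with DESCRIBE first):

mysql -u "$DB_USER" -p"$DB_PASS" photoProd -e "DESCRIBE pool; SELECT * FROM pool;"
mysql -u "$DB_USER" -p"$DB_PASS" photoProd -e "UPDATE pool SET inPool = -1 WHERE SiteName = 'FNAL_GPFARM';"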

The jobs table has the following structure:

+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| ID          | varchar(40) | YES  |     | NULL    |       |
| executable  | varchar(10) | YES  |     | NULL    |       |
| depen       | varchar(60) | YES  |     | NULL    |       |
| Status      | varchar(20) | YES  |     | NULL    |       |
| SiteName    | varchar(20) | YES  |     | NULL    |       |
| DagLogDir   | varchar(20) | YES  |     | NULL    |       |
| SubTime     | varchar(20) | YES  |     | NULL    |       |
| ElapsedTime | varchar(20) | YES  |     | NULL    |       |
| DagManID    | varchar(6)  | YES  |     | NULL    |       |
| StageDone   | varchar(10) | YES  |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+

The contents of the table are created by the JobDB program. A user can update the table at any time, provided the production (JobManager) is paused for the duration of the operation. To create or update the jobs table use the command:

"java -jar spectro-prod/bin/photo_fat.jar gov.fnal.sdss.utils.JobDB RUN RERUN COLUMN SF EF SF EF ..."

In the command above, each SF EF pair represents a range of photo events into which the whole column is split. The user can run the program several times with the same parameters; the program will not modify or recreate already existing jobs. The program also creates an entry in the frames table, which describes the processing status of the whole column. Two types of jobs are created in the jobs table: a prepFrames job and a set of Photo jobs. The prepFrames job has priority and is supposed to create a common data set to be used by all Photo jobs for this column. Only after the successful completion of the prepFrames job will the corresponding Photo jobs be submitted. During the production some parameters in the table change. The meaning of the parameters is as follows:

  1. ID - the job ID is composed of the RUN, RERUN, COLUMN, and frame range. For example, a prepFrames job ID looks like RUN-RERUN-COL, while a Photo job ID looks like RUN-RERUN-COL-SF-EF.

  2. executable - the name of the job, "prepFrames" or "Photo".

  3. depen - the ID of the prepFrames job that the Photo job depends on.

  4. Status - the current status of the job ("created", "running", "done", "fail"); see the query sketch after this list.

  5. SiteName - the name of the site to which the job is submitted ("FNAL_GPFARM").

  6. DagLogDir - the name of the subdirectory in the ./var/ directory where Condor log files are stored. This is a unique name created for each Condor DAG manager.

  7. SubTime - the time when the job was submitted.

  8. ElapsedTime - the time in milliseconds elapsed since the submission time.

  9. DagManID - the ID of the DAG manager processing this job. The user can see it using the condor_q command.

  10. StageDone - a field filled at job creation or completion. It contains the last successful step ("Linker", "created", "stage-in", "run", "stage-out", "clean"). This is not really used, as we submit a single script whose status is known only after the job completes.
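
Since the Status field is what the JobManager tracks, a quick way to monitor progress is to summarize it directly in MySQL (a sketch using the same placeholder credentials as in the pool table example above):

mysql -u "$DB_USER" -p"$DB_PASS" photoProd -e "SELECT Status, COUNT(*) FROM jobs GROUP BY Status;"
mysql -u "$DB_USER" -p"$DB_PASS" photoProd -e "SELECT ID, SiteName, DagManID, SubTime FROM jobs WHERE Status = 'fail';"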

Running The Production

During the photo production, data files are returned from the grid to the host specified in fpPlan.par in the form of tar.gz files. These files must be uncompressed and untarred to complete the processing cycle. Running many tar and gzip operations simultaneously can put an unacceptable load on the corresponding host. To avoid this, a special daemon, resServD, was developed. The daemon creates /tmp/resDir, where the processing pipeline puts tickets indicating transferred data files. The daemon scans the directory to detect these tickets and starts gzip and tar processes to unpack the data asynchronously with the main production pipeline.

Attention should be paid to check that the daemons are working on the corresponding hosts and that the /tmp/resDir directories exist. To check that the daemon process exists one can do:

ps -lfe | grep java

This should return something like:

java -jar /usr/local/resServ/servers_fat.jar

To restart the daemon one can do:

ksu; /etc/init.d/resServD stop; /etc/init.d/resServD start;
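
To verify the ticket directory on a given host (this sketch assumes the tickets are ordinary files placed directly in /tmp/resDir):

ls -ld /tmp/resDir                 # the directory must exist for the daemon to find tickets
ls /tmp/resDir | wc -l             # number of tickets still waiting to be unpacked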

To run the "photo" production one need to start the JobManager. Providing the database and all tables are created as described above the user should do following operations:



Before starting the JobManager, be sure that your .k5login file contains a line like user/cron/host@FNAL.GOV, where host is the host on which the JobManager is running.
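
The exact command used to start the JobManager is defined by the production setup; as a sketch only, assuming the JobManager main class follows the same package convention as PoolDB and JobDB (the class name and log file below are guesses, not confirmed by this document), it might look like:

cd /data/dp30.a/data/photo-prod
source setup.sh
nohup java -jar photo-prod/bin/photo_fat.jar gov.fnal.sdss.utils.JobManager > JobManager.log 2>&1 &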

It is useful to run the condor_q command to watch running jobs. To have the command repeat automatically use:

watch condor_q ${username}

where ${username} is your login name.



Resubmitting jobs.

Whenever you need to resubmit a job, check that the "Status" of the job is cleared. You can do this by setting the status flag to "waiting" in the prod_stat table, manually clearing the "Status" in the jobs table for the specified ID, and then changing the status flag in the prod_stat table back to "running" (a sketch of this manual procedure is given at the end of this section). There is also a command to resubmit all failed jobs. To use it you have to pause the production as described above and run:

java -jar photo-prod/bin/photo_fat.jar gov.fnal.eag.utils.Resubmit

This will clear all necessary fields in the jobs table for jobs whose status is "fail". Note that the package path of the Resubmit command uses eag instead of sdss.
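
A sketch of the manual per-job procedure in MySQL (the credentials and job ID are placeholders; the prod_stat column name and whether "clearing" the Status means setting it to NULL or to an empty string are assumptions, so check the actual schema first):

mysql -u "$DB_USER" -p"$DB_PASS" photoProd -e "UPDATE prod_stat SET status = 'waiting';"
mysql -u "$DB_USER" -p"$DB_PASS" photoProd -e "UPDATE jobs SET Status = NULL WHERE ID = '2345-40-3-11-100';"
mysql -u "$DB_USER" -p"$DB_PASS" photoProd -e "UPDATE prod_stat SET status = 'running';"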