
Generic Grid Gofer

Nikolay Kuropatkin

Introduction



This is the second edition of this document. Since the first release of the Generic Grid Gofer (GGG) package I have successfully run several productions. The SDSS “spectro” and “photo” pipelines were converted to the GRID environment using the package. The most recent job I have developed is a prototype of the DES simulation, which is included as an example job in the GGG package. It is an example of running a Java application with dynamic deployment of the Java Runtime Environment directly on the worker node of a Grid site. The included examples have been modified to correspond to OSG-0.4.0.

The main purpose of this document is to show a potential user how to start running a program on the GRID using the GGG. To do this we will consider the prerequisites for running a job on the GRID: what software the user needs on the submission site, what minimal set of tools he/she needs to know, and which web sites are useful for becoming familiar with the GRID paradigm. Finally, we will show how to install and use the GGG to help run a production. This last task is demonstrated with the help of a demo application.

Prerequisites

Let us consider a user who has a job that runs on his/her laptop. Now there is a need to run 10^N jobs, where N > 3, with the same executables but different sets of data. The natural choice for the user seems to be a GRID: there are many sites with many CPUs capable of running the program. We will talk about the OSG (Open Science Grid), which is referred to as the GRID in this document.

How to start.

With all the requirements mentioned in the previous section satisfied, a user can finally proceed with the job. Log in to the submission host and source the VDT setup.sh (source /opt/vdt/setup.sh). This creates the proper environment for working with the GRID. You can check that the environment is set with the command "which condor", which should return a string similar to "/opt/vdt/condor/sbin/condor". Create a VOMS proxy using the "voms-proxy-init" command (the system administrator should show you how to run the command). Check that you have a valid proxy with the command "voms-proxy-info". If this works, you can proceed with the first GRID job. Use the GridCat to find a suitable site and write down its Gatekeeper Host Name; we will refer to this name as <host>. It is recommended to use sites where OSG-0.4.0 or newer is installed. Run the command "globus-job-run <host>/jobmanager-fork -l /bin/date". This should return a string like "Tue Sep 13 15:49:05 EDT 2005". If it works, congratulations: you have learned how to run a simple command on a remote host. Having a valid VOMS proxy is a necessary step that you cannot skip.
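
The checks in this section can be collected into a small script. The following is a minimal sketch and not part of GGG: the helper name check_tools and the VDT_SETUP variable are hypothetical, and the /opt/vdt/setup.sh path is the one quoted above and may differ on your host. The script only verifies that the commands named in this section are on the PATH; it does not create a proxy or contact any site.

```shell
#!/bin/sh
# Sketch: verify that the submission-host tools named in this section are available.
# check_tools and VDT_SETUP are hypothetical names, not part of GGG or the VDT.

VDT_SETUP="${VDT_SETUP:-/opt/vdt/setup.sh}"   # path from the text; adjust for your host

# Source the VDT environment if the setup file exists on this machine.
if [ -r "$VDT_SETUP" ]; then
    . "$VDT_SETUP"
fi

# Print each missing command and return non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

if check_tools voms-proxy-init voms-proxy-info globus-job-run; then
    echo "Grid tools found; create a proxy with voms-proxy-init and then try:"
    echo "  globus-job-run <host>/jobmanager-fork -l /bin/date"
else
    echo "Source the VDT setup or ask your administrator about the tools listed above."
fi
```

A tool reported as missing usually means that the VDT setup was not sourced in the current shell.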

How to install GGG

The GGG package is distributed as a self-extracting jar file, GGGdist.jar. Run the command "java -jar GGGdist.jar" and follow the on-screen dialog. After a successful installation, the selected production directory will contain the following subdirectories:

The files in GGG/scripts, GGG/dist, GGG/site_info and GGG/xdag_dir are provided as examples only. You have to create these subdirectories in your base directory, or copy them there. The GGG looks for the corresponding files in the subdirectories of the base directory. This lets you modify the examples for your own purposes without destroying the distribution examples.

In the base directory you will find the setup.sh file. Modify it to reflect the environment variables specific to your software installation and production.

You also need to create two subdirectories in the base directory: storage and var. Make them writable by anybody. GGG uses these subdirectories during production to create jobs and to store the jobs' log files.
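
The directory preparation described in this section might be scripted as follows. This is a sketch under stated assumptions: prepare_base is a hypothetical helper, and BASE_DIR and GGG_DIR are assumed variables pointing at your production base directory and at the GGG directory created by the installer. Existing copies in the base directory are never overwritten.

```shell
#!/bin/sh
# Sketch: prepare a production base directory as described in this section.
# prepare_base, BASE_DIR and GGG_DIR are hypothetical names, not part of GGG.

prepare_base() {
    base="$1"   # production base directory
    ggg="$2"    # GGG directory holding the example subdirectories

    mkdir -p "$base" || return 1

    # Copy the example subdirectories so that the originals stay untouched.
    for sub in scripts dist site_info xdag_dir; do
        if [ -d "$ggg/$sub" ] && [ ! -e "$base/$sub" ]; then
            cp -r "$ggg/$sub" "$base/$sub" || return 1
        fi
    done

    # Create the working directories and make them writable by anybody.
    mkdir -p "$base/storage" "$base/var" || return 1
    chmod a+rwx "$base/storage" "$base/var"
}

# Run only when BASE_DIR is set, so that sourcing this file has no side effects.
if [ -n "${BASE_DIR:-}" ]; then
    prepare_base "$BASE_DIR" "${GGG_DIR:-$BASE_DIR/GGG}"
fi
```

After this step, edit setup.sh in the base directory by hand; the sketch deliberately leaves it alone.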

Site description files.

Having installed GGG, a user can start by creating the site description files that will be used in the production. To do this, use the GridCat to collect information about a site. Then use the following command:

java -jar GGG/bin/GGG_fat.jar gov.fnal.eag.site.SiteCreator

This opens an interactive window with a form to fill in. Click the "submit" button to create the site description file; the file is named after the site name you entered. The site creation process could be simplified by reading the information directly from the GridCat web page, but I leave that exercise for the future.

Since the release of OSG-0.4.0, a job submitted to a remote site gets its Grid environment set automatically, so only a few parameters from the site description file are actually used in the production. Nevertheless, on some sites the temporary disk directory (WNTMP) used by the demo program is not defined, so it is safer to provide it through the site description file.

The SiteCreator command above also shows how to use the "fat" jar file to run any main class in the package.

Software distribution.

To be able to run jobs on the GRID, a user needs to distribute his/her software to all selected sites. The software distribution is itself a Grid job. It should not be a problem to create a tar file with the necessary software and deploy it on the remote site using a job similar to the one in the example.
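
Creating and unpacking such a tar file could look like the sketch below. The helper names pack_software and unpack_software are hypothetical, as are any directory and file names you pass to them; the transfer to the remote site is not shown, since it depends on your deployment job.

```shell
#!/bin/sh
# Sketch: pack a software directory into a compressed tar file and unpack it again.
# pack_software and unpack_software are hypothetical helpers, not part of GGG.

pack_software() {
    src_dir="$1"    # directory holding the software to distribute
    tarball="$2"    # output file name, chosen by you
    if [ ! -d "$src_dir" ]; then
        echo "no such directory: $src_dir" >&2
        return 1
    fi
    # -C stores paths relative to the parent of src_dir, not absolute paths.
    tar -czf "$tarball" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

unpack_software() {
    tarball="$1"    # tar file produced by pack_software
    dest_dir="$2"   # directory on the target host to unpack into
    mkdir -p "$dest_dir" || return 1
    tar -xzf "$tarball" -C "$dest_dir"
}
```

Storing relative paths keeps the archive independent of where the software lived on the submission host, so the remote job can unpack it into whatever directory the site provides.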

If the software is reasonably small, a user can instead use dynamic deployment, as shown in the example program.

Warning: The example program is provided as an example only. It cannot be run as distributed, because critical data files were removed to reduce the size of the distribution file.



Create database.

To run the production automatically, the user needs to create the database. See the "technical description" for the structure of the database. I strongly recommend unpacking and studying the source of the JobDB.java, PoolDB.java and StatusDB.java programs in order to adapt them to your data model and create the database. Check that the information about the database in the setup.sh script is correct.

Submission host software.

Use the demo production examples found in the GGG/scripts and GGG/xdag_dir directories to create the corresponding files for your production. You will need three shell scripts and three XML files as a minimal set. The “PRE” and “POST” scripts run on the submission host and are mainly responsible for the bookkeeping. Debug them to be sure that they change the job status in the database correctly. The main script (“DemoApp.sh”) runs on the remote site; write it carefully so that it handles all possible errors and results, and records each step performed and its result in the corresponding log file.
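
One way to record every step of the main script together with its result is sketched below. This is not the DemoApp.sh shipped with GGG: run_step, main and LOG_FILE are hypothetical names, and the payload line is a placeholder for your own executable.

```shell
#!/bin/sh
# Sketch of a remote job wrapper that logs each step and its exit code.
# run_step, main and LOG_FILE are hypothetical names, not taken from DemoApp.sh.

LOG_FILE="${LOG_FILE:-job.log}"

# Run a named step, append its output and exit code to the log, return that code.
run_step() {
    step_name="$1"
    shift
    echo "$(date) START $step_name" >> "$LOG_FILE"
    "$@" >> "$LOG_FILE" 2>&1
    rc=$?
    echo "$(date) END $step_name rc=$rc" >> "$LOG_FILE"
    return "$rc"
}

# Stop at the first failing step and report which one failed via the return code.
main() {
    run_step "show host" uname -n || return 1
    run_step "show working directory" pwd || return 2
    # Placeholder: replace "true" with the real executable of the job.
    run_step "payload" true || return 3
    return 0
}
```

A real script would end by calling main and exiting with its return value, so that the “POST” script on the submission host can tell success from failure.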



Start production.

It is recommended to run a couple of jobs before starting automatic production. To do this, run the following command:

java -jar GGG/bin/GGG_fat.jar gov.fnal.eag.dag.Submitter njobs SiteName

Here the parameters are the number of jobs to submit and the name of the site to submit them to. Use the condor_q command to check that the jobs run well. Check that the bookkeeping scripts work correctly. If everything is fine, you are ready for production. To start production, run the command:

java -jar GGG/bin/GGG_fat.jar gov.fnal.eag.planner.JobManager YourJobDescription job_condor

Here YourJobDescription is the name of the job description file without its extension, and job_condor is the name of the condor submit file that you created for your job (most probably it will not need to be changed).
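
The test submission with the Submitter class can be wrapped in a small function that checks its arguments first. This is only a sketch: submit_test_jobs and the GGG_JAR variable are hypothetical, and the jar path is the relative path used in the commands of this section, so the function is meant to be run from the base directory.

```shell
#!/bin/sh
# Sketch: validate arguments before a test submission with the Submitter class.
# submit_test_jobs and GGG_JAR are hypothetical names, not part of GGG.

GGG_JAR="${GGG_JAR:-GGG/bin/GGG_fat.jar}"

submit_test_jobs() {
    njobs="$1"
    site="$2"

    # Accept only a non-empty string of digits as the job count.
    case "$njobs" in
        ''|*[!0-9]*)
            echo "usage: submit_test_jobs <njobs> <SiteName>" >&2
            return 2
            ;;
    esac
    if [ -z "$site" ]; then
        echo "usage: submit_test_jobs <njobs> <SiteName>" >&2
        return 2
    fi
    if [ ! -r "$GGG_JAR" ]; then
        echo "cannot read $GGG_JAR; run from the base directory" >&2
        return 3
    fi

    # Same command as in the text, followed by a look at the Condor queue.
    java -jar "$GGG_JAR" gov.fnal.eag.dag.Submitter "$njobs" "$site" || return 1
    condor_q
}
```

Checking the arguments up front avoids starting the Java process with a mistyped job count or a missing site name.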

The “Job Manager” submits the executable described in the job description file to a compute node on the remote site. This can be a simple script that reads information from the grid environment to determine where to create working directories, and then either runs the corresponding executables pre-installed on the remote site or copies those executables from a specified site using globus-url-copy and executes them.
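
A script of this kind might pick its working directory as sketched below. The variable names are assumptions: OSG_WN_TMP is assumed to be exported by the site, WNTMP is assumed to be passed in from the site description file under that name, and choose_workdir is a hypothetical helper. Check what your target sites actually define.

```shell
#!/bin/sh
# Sketch: choose and create a per-job working directory on the worker node.
# choose_workdir is a hypothetical helper; the variable names are assumptions.

choose_workdir() {
    # Prefer the site's temporary area, then a WNTMP value supplied by the job,
    # then the generic TMPDIR, and finally /tmp.
    tmp_root="${OSG_WN_TMP:-${WNTMP:-${TMPDIR:-/tmp}}}"
    # The process id keeps concurrent jobs on one node apart.
    work_dir="$tmp_root/ggg_job_$$"
    mkdir -p "$work_dir" || return 1
    echo "$work_dir"
}
```

The job script would then change into the printed directory before running or copying any executables, and remove it when the job finishes.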

It is easy to organize production monitoring using the information in the database, but this topic is more advanced and is not covered in this document.