Generic Grid Gofer
Nikolay Kuropatkin
Introduction
This is the second edition of the document. Since the first release of the Generic Grid Gofer (GGG) package I have successfully run several productions. The SDSS “spectro” and “photo” pipelines were converted to the GRID environment using the package. The most recent job I developed is a prototype of the DES simulation, which is included as an example job in the GGG package. It demonstrates running a Java application with dynamic deployment of the Java Runtime Environment directly on the worker node of a Grid site. The included examples have been updated to correspond to OSG-0.4.0.
The main purpose of the document is to show a potential user how to start running his or her program on the GRID using the GGG. To do this we consider the prerequisites a user must satisfy to run a job on the GRID: what software the user needs to have on the submission site, what minimal set of tools he or she needs to know, and what web sites are useful for becoming familiar with the GRID paradigm. Finally, we show how to install and use the GGG to help run a production. The last task is demonstrated with the help of a demo application.
Prerequisites
Let us consider a user who has a job that runs on his or her laptop. Now there is a need to run 10^N jobs, where N > 3, with the same executables but different sets of data. The natural choice for the user is a GRID: there are many sites with many CPUs capable of running the program. We will talk about the OSG (Open Science Grid), referred to simply as the GRID in this document.
The first prerequisite is to become a certified member of a Virtual Organization (VO) recognized by the GRID. Ask your system manager how to obtain a GRID certificate and how to become a member of a VO.
Second, a user can submit jobs to the GRID only from a host registered in the GRID with a valid host certificate, and where the basic GRID software (the VDT, http://www.cs.wisc.edu/vdt//index.html) is installed (the submission host). The VDT includes the Condor-G (http://www.cs.wisc.edu/condor/) and Globus (http://www.globus.org/) packages, with which the user should be familiar. Ask your system administrator where the VDT is installed.
Be familiar with monitoring facilities such as the GridCat (http://osg-cat.grid.iu.edu/). You will need them to find the status of the GRID and the sites to which to submit your jobs.
Ask the system administrator about the installed database (MySQL or PostgreSQL), and ask for permission to create databases. Alternatively, you can install one locally in your account.
Ask the system administrator to install Java SDK 1.5 or newer; this is needed to work with the GGG package. The present version of the VDT includes Java 1.4, but GGG requires version 1.5 or newer. Alternatively, you can install Java locally.
Install GGG in the directory where you are going to run the production (the base directory). How to install the GGG is discussed later.
How to start.
Having satisfied all the requirements mentioned in the previous section, a user can finally proceed with the job. Log in to the submission host and source the VDT setup.sh (source /opt/vdt/setup.sh). This creates the proper environment for working with the GRID. The user can check that the environment is set with the command "which condor"; this should return a string similar to "/opt/vdt/condor/sbin/condor". Create a VOMS proxy using the "voms-proxy-init" command (the system administrator should show you how to run it). Check that you have a valid proxy with the command "voms-proxy-info". If this works you can proceed with the first GRID job. Use the GridCat to find a suitable site and write down its Gatekeeper Host Name; we will refer to the name as <host>. It is recommended to use sites where OSG-0.4.0 or newer is installed. Run the command "globus-job-run <host>/jobmanager-fork -l /bin/date". This should return a string like "Tue Sep 13 15:49:05 EDT 2005". If it works, congratulations: you have learned how to run a simple command on a remote host. Having a valid "voms" proxy is a necessary step you have to pass.
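The pre-flight checks described above can be sketched as a small shell script. The /opt/vdt path and the client tool names below are assumptions based on a typical VDT installation; adjust them to your site:

```shell
#!/bin/sh
# Sketch of a pre-flight check before submitting a first Grid job.
# Assumes the VDT client tools should be on the PATH after sourcing
# the VDT setup script (path is an assumption; adjust to your site).

check_tool() {
    # Report whether a required command is on the PATH.
    if command -v "$1" >/dev/null 2>&1; then
        echo "OK: $1 found at $(command -v "$1")"
    else
        echo "MISSING: $1 -- did you source /opt/vdt/setup.sh?"
    fi
}

check_tool condor_q          # Condor-G client
check_tool voms-proxy-init   # VOMS proxy creation
check_tool voms-proxy-info   # VOMS proxy inspection
check_tool globus-job-run    # Globus job submission
```

If any tool is reported missing, fix the environment before attempting the globus-job-run test above.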
How to install GGG
The GGG package is distributed in the form of a self-extracting jar file, GGGdist.jar. Run the command "java -jar GGGdist.jar" and follow the on-screen dialog. After a successful installation the selected production directory will contain the following subdirectories:
GGG - the directory where the package resources are located.
The GGG directory will have the following subdirectory structure:
bin - where the executable GGG_fat.jar is located. We will discuss how to use it later.
doc - where the Java documentation of the package API is located.
scripts - where the demo application scripts are located. Use them as templates to develop your own production.
site_info - where examples of site description files are located. Use the "SiteCreator" program and information from the GridCat to create site descriptions for your production.
src - where the GGG_src.jar file is located. I recommend using Eclipse to import the file as a GGG project. This lets you easily modify programs for your needs and run the production from the IDE.
xdag_dir - where example XML files are located. Use them as templates to create your own production.
The files in GGG/scripts, GGG/dist, GGG/site_info and GGG/xdag_dir are provided as examples only. You have to create/copy those subdirectories into your base directory: the GGG looks for the corresponding files in subdirectories of the base directory. This lets you easily modify the examples for your own purposes without destroying the distribution examples.
In the base directory you will find the setup.sh file. Modify it to reflect the environment variables specific to your software installation and production.
You also need to create two subdirectories in the base directory – storage and var – and make them world-writable. GGG uses these subdirectories during production to create jobs and store the job log files.
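The base-directory preparation described in the last few paragraphs can be sketched as follows; run it from the base directory, assuming the GGG subdirectory created by the installer is present:

```shell
#!/bin/sh
# Sketch of base-directory preparation after installing GGG.

# Copy the example subdirectories so they can be modified freely
# without destroying the distribution examples.
for d in scripts site_info xdag_dir; do
    if [ -d "GGG/$d" ]; then
        cp -r "GGG/$d" .
    else
        echo "warning: GGG/$d not found, skipping"
    fi
done

# Create the work areas GGG uses during production and make
# them world-writable, as required above.
mkdir -p storage var
chmod 777 storage var
```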
Site description files.
Having installed GGG, the user can start by creating the site description files that will be used in the production. Use the GridCat to collect information about a site, then run the following command:
java -jar GGG/bin/GGG_fat.jar gov.fnal.eag.site.SiteCreator
This opens an interactive window with a form to fill in. Click the "submit" button to create the site description file, named after the site name you entered. The site creation process could be simplified by reading information directly from the GridCat web page, but I leave that exercise for the future.
Since the release of OSG-0.4.0, a job submitted to a remote site gets the Grid environment set automatically, so only a few parameters from the site description file are actually used in the production. Nevertheless, on some sites the temporary disk directory (WNTMP) used by the demo program is not defined, and it is safest to provide it through the site description file.
This command also shows how to use the "fat" jar file to run any main class in the package.
Software distribution.
To be able to run jobs on the GRID a user needs to distribute his or her software to all selected sites. The software distribution itself is also a Grid job. It should not be a problem to create a tar file with the necessary software and deploy it on a remote site using a job similar to the one in the example.
If the software has a reasonable size, a user can use dynamic deployment, as shown in the example program.
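As a sketch of the packaging step, assume a hypothetical myapp directory holding your executables; the gsiftp URL and <host> below are placeholders, and the actual Grid transfer is left commented out because it needs a live site:

```shell
#!/bin/sh
# Sketch: pack software for distribution to a Grid site.
# MYAPP_DIR is a hypothetical directory with your executables.

MYAPP_DIR=${MYAPP_DIR:-myapp}
TARBALL=myapp.tar.gz

mkdir -p "$MYAPP_DIR"          # ensure the directory exists for this sketch
tar czf "$TARBALL" "$MYAPP_DIR"

# The deployment itself is a Grid job. On a real site one would
# transfer the archive and unpack it with a fork-jobmanager job,
# e.g. (placeholders, not executed here):
#   globus-url-copy file://$PWD/$TARBALL gsiftp://<host>/path/$TARBALL
#   globus-job-run <host>/jobmanager-fork /bin/tar xzf /path/$TARBALL
echo "created $TARBALL"
```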
Warning: the example program is provided for illustration only. Users cannot run it, as critical data files were removed to reduce the size of the distribution file.
Create database.
To run the production in an automatic way the user needs to create a database. See the "technical description" for the structure of the database. I strongly recommend unpacking and studying the source of the JobDB.java, PoolDB.java and StatusDB.java programs in order to adapt them to your data model and create the database. Check that the database information in the setup.sh script is correct.
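Purely as an illustration, a minimal bookkeeping table might look like the sketch below. The table and column names are assumptions, not the GGG data model; the real schema should follow the "technical description" and the JobDB/PoolDB/StatusDB sources:

```shell
#!/bin/sh
# Sketch: write an illustrative bookkeeping schema to a file.
# Table and column names are assumptions for illustration only.

cat > create_jobs.sql <<'EOF'
-- One row per production job; status is updated by the PRE/POST scripts.
CREATE TABLE jobs (
    job_id     INT PRIMARY KEY,
    input_file VARCHAR(255),
    site_name  VARCHAR(64),
    status     VARCHAR(16),   -- e.g. created / submitted / done / failed
    updated_at TIMESTAMP
);
EOF

# Load it into MySQL (adjust for PostgreSQL), using the database
# parameters recorded in setup.sh, e.g. (not executed here):
#   mysql -u $DBUSER -p $DBNAME < create_jobs.sql
echo "wrote create_jobs.sql"
```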
Submission host software.
Use the demo production examples found in the GGG/scripts and GGG/xdag_dir directories to create the corresponding files for your production. You will need three shell scripts and three XML files as a minimal set. The “PRE” and “POST” scripts run on the submission host and are mainly responsible for the bookkeeping. Debug them to be sure that they change the job status in the database correctly. The main script (“DemoApp.sh”) runs on the remote site and should be written carefully to handle all possible errors and results, recording the performed steps and their outcomes in the corresponding log file.
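A minimal sketch of such a bookkeeping hook, written as a shell function; the jobs table, the DBUSER/DBNAME variables and the log file name are assumptions for illustration, and the real scripts should follow the templates in GGG/scripts:

```shell
#!/bin/sh
# Sketch of a "PRE"/"POST" bookkeeping hook run on the submission host.

update_status() {
    # update_status <job_id> <status>
    job_id=$1
    new_status=$2
    logfile=${LOGFILE:-pre_post.log}

    # Record the change in a log first, so a failed database update
    # can be reconciled later.
    echo "$(date) job $job_id -> $new_status" >> "$logfile"

    # Update the bookkeeping database (MySQL shown; needs a live
    # database, so it is commented out in this sketch):
    #   mysql -u "$DBUSER" "$DBNAME" \
    #     -e "UPDATE jobs SET status='$new_status' WHERE job_id=$job_id"
    echo "job $job_id marked $new_status"
}

update_status 42 submitted   # example invocation
```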
Start production.
It is recommended to run a couple of jobs before starting automatic production. To do this, run the following command:
java -jar GGG/bin/GGG_fat.jar gov.fnal.eag.dag.Submitter njobs SiteName
Here the parameters are the number of jobs to submit and the name of the site to submit them to. Use the condor_q command to watch that the jobs run well, and check that the bookkeeping scripts work correctly. If everything is fine you are ready for production. To start the production, run the command:
java -jar GGG/bin/GGG_fat.jar gov.fnal.eag.planner.JobManager YourJobDescription job_condor
Here YourJobDescription is the name of the job description file without the extension, and job_condor is the name of the condor submit file that you have created for your job (it will most probably not need changes).
The “Job Manager” submits the executable described in the job description file to a compute node on the remote site. This can be a simple script that reads information from the Grid environment to determine where to create working directories, and then either runs the corresponding executables pre-installed on the remote site or copies them from a specified site using globus-url-copy and executes them.
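Such a remote-site script might start like the following sketch. The WNTMP variable matches the site-description parameter discussed earlier, but its exact name in the site environment is site-dependent, so the sketch falls back to /tmp; the globus-url-copy lines are placeholders:

```shell
#!/bin/sh
# Sketch of the remote-site part of a job: pick a scratch area from
# the Grid environment, create a private working directory, run the
# payload, and clean up.

SCRATCH=${WNTMP:-/tmp}          # WNTMP from the site description, else /tmp
WORKDIR="$SCRATCH/ggg_job.$$"   # private per-process working directory

mkdir -p "$WORKDIR"
cd "$WORKDIR" || exit 1
echo "working directory: $WORKDIR"

# Here one would either run a pre-installed executable or fetch it
# first, e.g. (placeholders, not executed in this sketch):
#   globus-url-copy gsiftp://<host>/path/myapp.tar.gz file://$PWD/myapp.tar.gz
#   tar xzf myapp.tar.gz && ./myapp/run.sh

# Clean up the scratch area on the way out.
cd / && rm -rf "$WORKDIR"
```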
Organizing production monitoring using information from the database is straightforward, but this topic is more advanced and is not covered in this document.