##############################
SAM-Grid User Specifications
##############################

-------------------------
Abstract
-------------------------
This document is the first draft of the SAM-Grid user manual. It contains the instructions to
1) install and configure the products
2) describe the user's jobs via the SAM-Grid Job Description Language
3) submit jobs to the SAM-Grid
4) monitor the status of the jobs
5) troubleshoot

-------------------------
1.1 Introduction
-------------------------
SAM-Grid is a virtual project whose core is the D0-PPDG group at Fermilab and which includes off-site D0 collaborators under the aegis of various Grid projects. Its mission is to enable fully distributed computing for D0 and CDF, by:
- enhancing SAM as the distributed data handling system of the experiments;
- incorporating standard Grid tools and protocols;
- developing new solutions for Grid computing together with Computer Scientists.
Under this mission, the project strives to unite the D0 efforts from the various Grid activities (PPDG, EU DataGrid, GridPP and more), off-site analysis work, and other aspirations distributed throughout the D0 collaboration. The two main areas of work are Job Handling (including specification, brokering, scheduling, etc.) and Monitoring and Information Services.

--------------------------
1.2 Installing SAM-Grid
--------------------------
A site can join the SAM-Grid in three ways:
1) as a monitoring site AND/OR
2) as an execution site AND/OR
3) as a submission site.
Since the current focus of the SAM-Grid development is enabling distributed sam analysis jobs, the discussion below assumes the site runs a sam station.
NOTE: This infrastructure allows the monitoring of a sam station even if the station is not accessible for remote job submission.
Disclaimer: installing any of the jim packages will drive you through the installation of globus. The installation will be MUCH easier if the product area is NOT NFS shared; however, below you will also find instructions on how to install globus in that scenario.

------------------------------------
1.2.1 Summary of the activities as root
------------------------------------
In order to install the whole jim software suite, root access is needed for the following actions:

o Opening ports in the firewall of the head node:
  grid gatekeeper: 2119
    Open to all submission sites, or better to the world: opening to the world lets us add new submission sites without changing the configuration of all the execution sites, which is important for scalability.
  job-managers: any 500 (or more) adjacent ports; this limits the number of concurrent jobs submitted to 500.
    Open to the submission sites; the same consideration as above applies for opening to the world.
  grid MDS: 2135
    Open to samadams.fnal.gov, or better to all of FNAL, to enable possible failover mechanisms.
  If the site runs a sam station, these ports also need to be opened:
  sam: 4550-4555
    Open to FNAL.
  sam_dcache_cp: 25126
    Mainly to cdfendca3.fnal.gov.

o Installation and tailoring of products (the following summary assumes a local ups database):
  * Install the Globus Security Infrastructure (GSI), in particular:
    - tailoring the globus products (in particular globus_rm_server) and sam_gsi_config;
    - editing /etc/xinetd.d/globus-gatekeeper to restrict the port range for the job-managers, by adding the line
        env = GLOBUS_TCP_PORT_RANGE=min_port,max_port
      (see the sketch at the end of this section);
    - requesting a host certificate from your local CA and a sam service certificate from the Fermilab KCA;
    - establishing a cron job that updates the grid-mapfile after the KCA has released the sam service certificate for the installation;
    - starting up globus_rm_server.
  * Install jim_broker_client to enable the site to be a submission site, in particular:
    - tailoring jim_broker_client;
    - starting the jim_broker_client daemons.
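For reference, a typical /etc/xinetd.d/globus-gatekeeper entry might look like the sketch below. The paths and the port range 40000,40500 are illustrative assumptions; substitute the values appropriate for your installation.

service globus-gatekeeper
{
   socket_type = stream
   protocol    = tcp
   wait        = no
   user        = root
   server      = /opt/globus/sbin/globus-gatekeeper
   server_args = -conf /opt/globus/etc/globus-gatekeeper.conf
   env         = GLOBUS_TCP_PORT_RANGE=40000,40500
   disable     = no
}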
------------------------------------
1.2.2 Installing a Monitoring Site
------------------------------------
The SAM-Grid monitoring service is available on the web at http://samadams.fnal.gov:8080/prototype/ (prototypical release).
In order for a site to be monitored, there are 3 steps to follow:
- install globus MDS on at least one machine of the site
- configure/update MDS with the SAM-Grid schema/information hierarchy
- inform the SAM-Grid team of the availability of the new monitoring site, with the following details: the host and port where MDS is running, and the Jim-Site name chosen by the site administrator while tailoring jim_info_providers.
Our information system has been developed and tested with MDS 2.1. MDS can be installed directly from www.globus.org or, if the installation platform is Linux, via ups/upd. The latter is the easiest way, since many configuration steps are taken care of automatically at installation time. Instructions for both ways are listed below.

* ups/upd installation:
NOTE: Make sure to follow the instructions printed out at installation time.
-------------------
INSTALLATION OF THE GLOBUS INFORMATION SERVICES SERVER TOOLS
- as products:
> upd install globus_is_server -G-c
-------------------
CONFIGURATION OF GLOBUS SECURITY INFRASTRUCTURE
NOTE: these instructions must be followed only one time, otherwise you may overwrite your work.
If the "products" area is NFS mounted AND root on the execution node does NOT have write privilege to the product area, do as follows:
- As root, on every execution node:
1. mkdir /your/local/area/ (example: /local/globus/)
- As products, on the product installation machine:
1. setup globus_location
2. mkdir ${GLOBUS_LOCATION}/etc/globus_packages/setup/trusted_ca_setup
3. cd ${GLOBUS_LOCATION}/etc/globus_packages/setup/trusted_ca_setup
4. ln -s /your/local/area/globus_ssl_utils_setup.gpt globus_ssl_utils_setup.gpt
Note: if you don't do this, the following will NOT work.
For shared and NON-shared installations:
- As root, on every execution node:
1. setup globus_location
2. ${GLOBUS_LOCATION}/setup/globus/setup-gsi
This script overwrites the following files (plus the above link, if the installation is shared):
/etc/grid-security/globus-host-ssl.conf
/etc/grid-security/globus-user-ssl.conf
/etc/grid-security/grid-security.conf
- contact the SAM-Grid team to get the grid-mapfile needed to further configure GSI
-------------------
INSTALLATION OF THE GLOBUS INFORMATION SERVICES CLIENT TOOLS
- as products:
> upd install globus_is_client -G-c
Note that warnings are generally OK: the client and server packages share common code, which is not overwritten; a warning is issued instead.
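At this point a quick sanity check of the GSI setup is to print the subject of the host certificate. The sketch below assumes the certificate was installed in the default /etc/grid-security location:

> grid-cert-info -file /etc/grid-security/hostcert.pem -subject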
-------------------
* www.globus.org installation: TBD
-------------------
INSTALLATION OF MDS 2.1 (within Globus Toolkit 2.0)
http://www.globus.org/toolkit/download/previous-versions.html
-------------------

* after MDS installation:
-------------------
CONFIGURATION OF MDS: general note.
If the MDS installation is a shared one, and MDS will be started from a host other than the one it was originally installed from, then the following files in $GLOBUS_LOCATION/etc/ need to be local:
[1] grid-info-resource-register.conf
[2] grid-info.conf
The hostname in these files needs to be modified accordingly. These files should then be moved to a local area, with symbolic links created in $GLOBUS_LOCATION/etc/ pointing to them.
During the installation of jim_info_providers, as described below, the site administrator can configure this MDS installation to monitor 'all' the execution sites at a monitoring site. This can be done at the time of tailoring jim_info_providers (highly recommended).
-------------------
INSTALLATION OF JIM_INFO_PROVIDERS
1) - as user products
> upd install -G-c jim_info_providers -q GCC-2.95.2
> ups tailor jim_info_providers -q GCC-2.95.2
> ups update_mds jim_info_providers -q GCC-2.95.2
The following needs to be done if your installation of globus is NFS shared. If it is not, go to 3).
2) create a local globus area (like /local/globus) on the host that will run MDS, then
> export LOCAL_GLOBUS_LOCATION=/local/globus
> mkdir -p ${LOCAL_GLOBUS_LOCATION}/etc
> mv $GLOBUS_LOCATION/etc/grid-info-resource-register.conf \
     ${LOCAL_GLOBUS_LOCATION}/etc
> mv $GLOBUS_LOCATION/etc/grid-info.conf ${LOCAL_GLOBUS_LOCATION}/etc
> ln -s ${LOCAL_GLOBUS_LOCATION}/etc/grid-info-resource-register.conf \
        $GLOBUS_LOCATION/etc/grid-info-resource-register.conf
> ln -s ${LOCAL_GLOBUS_LOCATION}/etc/grid-info.conf \
        $GLOBUS_LOCATION/etc/grid-info.conf
> mkdir -p ${LOCAL_GLOBUS_LOCATION}/var
> chown sam ${LOCAL_GLOBUS_LOCATION}/var
> rm -r $GLOBUS_LOCATION/var
> ln -s ${LOCAL_GLOBUS_LOCATION}/var $GLOBUS_LOCATION/var
3) - as user sam
> ups start_mds jim_info_providers -q GCC-2.95.2
('ups stop_mds' is also available for jim_info_providers)
-------------------

------------------------------------
1.2.3 Installing an Execution Site
------------------------------------
The current focus of the SAM-Grid development is enabling distributed sam analysis jobs. At this time, the best way to join the SAM-Grid as an execution site is to first install a sam station (www-d0.fnal.gov/sam/).
-------------------
INSTALLATION OF THE GLOBUS INFORMATION SERVICES BUNDLE FROM ups/upd
- as user products
> upd install -G-c globus_is_server
> upd install -G-c globus_is_client
-------------------
INSTALLATION OF THE GLOBUS RESOURCE MANAGEMENT BUNDLE FROM ups/upd
- as user products
> upd install -G-c globus_rm_server
1) Set up GSI (if not done yet) as described above in section 1.2.2, in "CONFIGURATION OF GLOBUS SECURITY INFRASTRUCTURE".
If you are using the DOE certificate authority, follow steps 2 and 3 below to further set up GSI. Otherwise, follow your CA-specific instructions to request certificates, further configure GSI, and jump to step 4.
2) Install, configure and use the sam_gsi_config product.
As user products:
> upd install -G-c sam_gsi_config
As user root, execute:
> ups tailor sam_gsi_config
You'll be asked to choose among various VOs: you probably want either jimsam or jimcaf.
> setup sam_gsi_config
> sam_cert_request
The command above will drive you through the request of a sam service certificate (typically a 1-day response).
DO THE FOLLOWING COMMANDS ONLY AFTER YOU'VE RECEIVED THE SAM SERVICE CERTIFICATE.
> ups get_gridmap sam_gsi_config
Edit your crontab for root and add something like
0 * * * * . /usr/local/etc/setups.sh && ups get_gridmap sam_gsi_config > /dev/null 2>&1
This will keep your grid-mapfile up to date.
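For reference, the grid-mapfile maps certificate subjects to local accounts, one mapping per line; an entry looks like the following (the subject DN shown is purely illustrative):

"/DC=org/DC=doegrids/OU=Services/CN=sam/mystation.fnal.gov" sam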
3) Request a host certificate from a CA for your installation (typically a 1-day response). Follow the instructions for "Requesting a host or service certificate" at the link below:
http://www.doegrids.org/pages/cert-request.htm
4) As root:
> ups tailor globus_rm_server
Follow the displayed instructions, if any.
5) If you have a shared installation, or your products area is not local, follow the instructions below; otherwise go to step 6.
(i) setup globus_location
(ii) Create the following symbolic links, if you haven't done so already (/your/local/area must be writable by root):
> export LOCAL_GLOBUS_LOCATION=/your/local/area
> mkdir -p ${LOCAL_GLOBUS_LOCATION}/etc
> mkdir -p ${LOCAL_GLOBUS_LOCATION}/var
> chown sam ${LOCAL_GLOBUS_LOCATION}/var
> rm -r ${GLOBUS_LOCATION}/var
> ln -s ${LOCAL_GLOBUS_LOCATION}/var ${GLOBUS_LOCATION}/var
> cp $GLOBUS_LOCATION/etc/globus-job-manager.conf \
     ${LOCAL_GLOBUS_LOCATION}/etc/globus-job-manager.conf
> ln -s ${LOCAL_GLOBUS_LOCATION}/etc/globus-job-manager.conf \
        $GLOBUS_LOCATION/etc/globus-job-manager.conf
(iii) Modify /your/local/area/etc/globus-job-manager.conf to reflect the information for each host you are setting up as an execution site:
-globus-gatekeeper-host myhost.foo.bar
-globus-gatekeeper-subject "host/certificate/subject/string/for/myhost.foo.bar"
6) As root:
> ups start globus_rm_server
-------------------
INSTALLATION OF JIM_INFO_PROVIDERS
- as user products
> upd install -G-c jim_info_providers -q GCC-2.95.2
> ups tailor jim_info_providers -q GCC-2.95.2
- as user sam
> ups start jim_info_providers -q GCC-2.95.2
OR list jim_info_providers in sam_bootstrap
NOTE: If you are going to have a shared installation of jim_info_providers, i.e. multiple machines are going to run jim_info_providers but they are all going to use the same installation, then you will have to perform the last 2 steps ("ups tailor ..." and "ups start ...") from each machine.

------------------------------------
1.2.4 Installing a Submission Site
------------------------------------
Setting up a submission site requires the installation of the Globus Data Management bundle and the jim_broker_client package. Via ups/upd this is done by:
-------------------
INSTALLATION OF THE GLOBUS DATA MANAGEMENT BUNDLE FROM ups/upd
This bundle is used to enable GSI support.
> upd install globus_dh_client -G-c
Set up GSI (if not done yet) as described above in section 1.2.2, in "CONFIGURATION OF GLOBUS SECURITY INFRASTRUCTURE".
-------------------
INSTALLATION OF JIM_BROKER_CLIENT
- as user products
> upd install samgrid -G-c
- as user root
> ups tailor jim_broker_client
(enter the local directory for log & spool, and the email-id)
> ups start jim_broker_client
-------------------
In order to begin submitting jobs, the following setup needs to be done:
> setup samgrid
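Before the first submission it is worth verifying that your Grid credentials work. A typical grid-proxy-init session looks like the sketch below (the identity shown is illustrative, and the exact output may vary between Globus versions):

> grid-proxy-init
Your identity: /DC=org/DC=doegrids/OU=People/CN=Jane Doe
Enter GRID pass phrase for this identity:
Creating proxy .................................... Done
Your proxy is valid until: <expiry date>
> grid-proxy-info -subject -timeleft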
--------------------------------
1.3 Quick-Start: Example Jobs
--------------------------------
Example 1: The user has a simple SAM analysis job, on a predefined dataset definition, which needs to be submitted to the SAM-Grid. The executable file for the job doesn't need any arguments, and there are no requirements on the executing resources. The job belongs to the D0 experiment and the development universe. In order to keep track of the job submission, the user defines a log file.
The user creates a job description file (MyJob.jdf) in his/her writable working directory:

sam_dataset = gg-jw-test
executable = /home/user/sam_analysis/MyJob
cpu-per-event = 1s
job_type = sam_analysis
sam_universe = dev
sam_experiment = d0
log = MyJob.log
group = grid
instances = 1

The first step before submission is to authenticate oneself by running 'grid-proxy-init' and entering a password. By entering 'samg submit MyJob.jdf' the job is submitted to the SAM-Grid for execution:

Starting job(s)...
Global Job ID(s): user_sameggs.fnal.gov_143144_3125_0
Job(s) started successfully.

The job can be referenced at the monitoring site by its Global Job ID.

Example 2: This time the user has an executable that needs to be executed on the SAM-Grid, but the job is not of any specific type, i.e. the job is of type 'vanilla'. The executable takes two arguments, which the user passes through the "arguments" attribute. The job description file for this job is:

executable = /home/user/vanilla/MySecondJob
arguments = arg1 arg2
instances = 1

Again, the job is submitted by 'samg submit MySecondJob' and can be referenced at the monitoring site by its Global Job ID.

---------------------------
2.1 Job Description File
---------------------------
In order to submit a job to the SAM-Grid, you need to create a job description file (jdf). The jdf can contain a number of attributes, some of which are required. The jdf syntax is case-sensitive. The order of the attributes is not significant; however, each "instances" attribute must be located after the attributes that apply to those instances have been defined. In the case of multiple instances, only the attributes that change need to be written again in the jdf.
Example of a jdf:

sam_dataset = gg-jw-test
executable = /bin/ls
arguments = arg1 arg2
cpu-per-event = 1s
job_type = sam_analysis
station_name = samadams
sam_universe = dev
sam_experiment = d0
requirements = Memory >= 32
log = test.log
group = dzero
instances = 1
arguments = arg3 arg4
instances = 1

-----------------
2.2 Attributes
-----------------
The valid attributes are listed in the "attributes.config" file, which is located in the /etc directory of the SAM-Grid product home directory. They are grouped according to the way 'samg submit' handles them.
It is possible to have different types of jobs. "vanilla" jobs are used to directly execute the user's file. "sam_analysis" jobs are brokered to a SAM-Grid resource. In the future, "monte_carlo" jobs will also be supported.
The required attributes for sam_analysis jobs are "sam_dataset", "sam_universe", "sam_experiment", "executable", "cpu-per-event", and "instances". For further details, refer to the specifications of the SAM-Grid job description language presented in Appendix A.

-----------------
2.3 Submission
-----------------
Once you have created the jdf, you are able to submit the job, assuming that you have authenticated yourself by running 'grid-proxy-init' and entering your password. After this, the submission is done by entering 'samg submit <jdf_file>'. You will get the Global Job ID, which can be used for reference when monitoring the job.
For each instance of a job, output and error files are generated. In addition, you can specify a log file that will keep track of the job during submission. These files are especially important when troubleshooting.
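Putting the pieces together, a complete submission session typically looks like this (the output lines are those shown in Example 1 of section 1.3):

> setup samgrid
> grid-proxy-init
> samg submit MyJob.jdf
Starting job(s)...
Global Job ID(s): user_sameggs.fnal.gov_143144_3125_0
Job(s) started successfully.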
----------------
2.4 Brokering
----------------
TBD

----------------
2.5 Monitoring
----------------
TBD

---------------
3.1 Use Cases
---------------
TBD

---------------------
4.1 Troubleshooting
---------------------
When the job is not submitted, or the job does not seem to run as it should (and standard output doesn't help), one should look at the output, error, and log files. When, for example, there is a SAM-specific problem, you will see the details in the output file (.out). Please check the jdf and, if necessary, compare the problematic attribute against the attributes listed in the specifications.
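As a minimal first-pass checklist (assuming the log file was named MyJob.log, as in Example 1; the exact names of the per-instance output and error files depend on your submission):

> grid-proxy-info -timeleft     # is the grid proxy still valid (> 0 seconds)?
> cat MyJob.log                 # submission-time events
> cat <instance>.out            # per-instance output, e.g. SAM-specific errors
> cat <instance>.err            # per-instance error stream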
------------
Appendix A
------------
***** SAM-Grid JDL *****
-- Specifications for the SAM-Grid JDL --
---------
These are the specifications for the SAM-Grid job description language. The job description language distinguishes the different job types (sam_analysis, vanilla). In the specifications presented below, the attributes for "sam_analysis" are listed first and the "vanilla" attributes second.

-------------------------------------
A.1 SAM Analysis JDL Specifications
-------------------------------------
job_type = sam_analysis
    Specifies the job type.
executable = <pathname>
    The user's executable to be run. (required)
instances = <number>
    Run <number> instances of the job. The instances attribute may be declared many times if the user wants to change certain attributes between job instances. The attributes input, initialdir, arguments, priority, coresize, and image_size can be defined again for each instance. (required)
cpu-per-event = <time>
    The estimated CPU time used per event, expressed as <number>s|m|h (seconds|minutes|hours). (required)
sam_dataset = <dataset name>
    Name of the dataset definition to be used in the job. The dataset definition must be predefined. (required)
sam_universe = <universe>
    Specifies the universe for the job. This is required to match with the resource. (required)
sam_experiment = <experiment>
    Specifies the experiment that the job is dedicated to. This also is required to match with the resource. (required)
requirements = <boolean expression>
    The expression must evaluate to true on the matching machine. (optional, default is 1)
globusscheduler = <scheduler>
    Specifies the Globus resource to which the job should be submitted. Normally this will be defined by the matching resource; however, it may be declared by an advanced user if so preferred. (optional, default is the matched resource)
station_name = <station name>
    The station at which the job will be executed, assuming that the requirements are satisfied (including "sam_universe" and "sam_experiment"). If the user does not define the station name, brokering will determine it from the matching station; however, it may be declared if the user prefers a certain station. (optional)
arguments = <argument list>
    Parameters to be passed to the executable. The parameters must NOT be enclosed in double quotes (e.g. arguments = arg1 arg2 arg3). (optional)
input = <pathname>
    Any STDIN input the job needs while running is read from this file. (optional, default is /dev/null)
log = <pathname>
    The log filename. (optional)
dataset-version = <number|new|last>
    The particular dataset id number, or "new" or "last". (optional)
group = <group name>
    The group to which the job belongs. (optional)
file-cut = <number>
    Deliver only this number of files to the project. (optional)
keep-batch-script = True
    Do not remove the temporary batch script after execution. 'True' is the only valid value for this attribute, if declared. (optional)

-------------------------------------
A.2 Vanilla JDL Specifications
-------------------------------------
job_type = vanilla
    Specifies the job type.
executable = <pathname>
    The user's executable to be run. (required)
instances = <number>
    Run <number> instances of the job. The instances attribute may be declared many times if the user wants to change certain attributes between job instances. The attributes input, initialdir, arguments, priority, coresize, and image_size can be defined again for each instance. (required)
requirements = <boolean expression>
    The expression must evaluate to true on the matching machine. (optional, default is 1)
globusscheduler = <scheduler>
    Specifies the Globus resource to which the job should be submitted. Normally this will be defined by the matching resource; however, it may be declared by an advanced user if so preferred. (optional, default is the matched resource)
arguments = <argument list>
    Parameters to be passed to the executable. The parameters must NOT be enclosed in double quotes (e.g. arguments = arg1 arg2 arg3). (optional)
input = <pathname>
    Any STDIN input the job needs while running is read from this file. (optional, default is /dev/null)
log = <pathname>
    The log filename. (optional)
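As an illustration, a minimal vanilla jdf combining the attributes above might read as follows (the pathname and arguments are adapted from Example 2 of section 1.3; the log filename is illustrative):

job_type = vanilla
executable = /home/user/vanilla/MySecondJob
arguments = arg1 arg2
requirements = Memory >= 32
log = MySecondJob.log
instances = 1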
--------------------------------------------------------------------------
A.3 Possible Extensions for the JDL That Are Under Development (NOT PREFERRED)
--------------------------------------------------------------------------
rtfile = <pathname>
    The file in which to return the results of the sam job. (optional, SAM analysis only)
initialdir = <directory>
    Specifies the preexisting working directory for the job. If not specified, the user's current working directory is used. (optional)
priority = <number>
    Prioritizes the condor jobs (-20 to 20) owned by the user. (optional)
nice_user = <True|False>
    Normally, when a machine becomes available to Condor, Condor decides which job to run based upon user and job priorities. Setting nice_user to True tells Condor not to use your regular user priority, but to give this job last priority amongst all users and all jobs. (optional)
getenv = <True|False>
    If set to True, all current shell environment variables will be copied into the job ClassAd. Default is False. There is a limit of 10240 characters for the environment variables attached to the job ClassAd. (optional)
environment = <parameter list>
    Lists environment variables to be set in the job's environment before it is executed. The list should be in the form <parameter>=<value>;<parameter>=<value>. (optional)
image_size = <size>
    Tells Condor an estimate of the maximum virtual memory size that the program will occupy while running, expressed in kilobytes. If not specified, Condor will make a reasonably accurate estimate of the memory used. If image_size is underestimated, the program may crash. (optional)
coresize = <size>
    Should the user's program abort and produce a core file, coresize specifies the maximum size in bytes of the core file which the user wishes to keep. (optional)
rendezvousdir = <directory>
    Used to specify the shared-filesystem directory to be used for filesystem authentication when submitting to a remote scheduler. Should be a path to a preexisting directory. (optional)
x509directory = <directory>
    Used to specify the directory which contains the certificate, private key, and trusted certificate directory for GSS authentication. If this attribute is set, the environment variables X509_USER_KEY, X509_USER_CERT, and X509_CERT_DIR are exported with default values. (optional)
x509userproxy = <pathname>
    Used to override the default pathname for X509 user certificates. The default location for X509 proxies is the /tmp directory, which is generally a local filesystem. Setting this value allows Condor to access the proxy in a shared filesystem (e.g. AFS). Condor will use the proxy specified in the submit file first; if nothing is specified in the submit file, it will use the environment variable X509_USER_CERT, and if that variable is not present, it will search in the default location. (optional)
globusrsl = <RSL string>
    Used to provide any additional Globus RSL string attributes which are not covered by regular submit file parameters. (optional)
transfer_executable = <True|False>
    If transfer_executable is set to False, the remote machine is checked for the executable, but the executable is not transferred over. (optional)
brokering_algorithm = <value>
    Defines the method of ranking. The user selects one of a few available values to define the characteristic on which the ranking will be based. The values are still under consideration. (optional)

----------------------
TIPS & TRICKS
----------------------
If /var/local/jim_broker/condorg/log/NegotiatorLog shows an error like this
--xxx----
Over schedd resource limit (1) ... only consider startd ranks
--xxx----
and jobs get rejected, you need to clean up the accountant file at /var/local/jim_broker/condorg/spool/Accountantnew.log.
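One possible cleanup sequence is sketched below. It assumes a 'ups stop' action is available for jim_broker_client (the counterpart of the 'ups start' used in section 1.2.4), and it moves the accountant file aside rather than deleting it, so it can be restored if needed:

> ups stop jim_broker_client
> mv /var/local/jim_broker/condorg/spool/Accountantnew.log \
     /var/local/jim_broker/condorg/spool/Accountantnew.log.old
> ups start jim_broker_client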