JIM FAQ

Original Author: Gabriele Garzoglio
Most Recent Update:
$Author: garzogli $
$Revision: 1.4 $

$Date: 2004/10/25 22:33:42 $.

Symptoms:

  1. I have submitted a job and it has been idle in the queue for a long time.

I have submitted a job and it has been idle in the queue for a long time.

  1. A job can stay idle for as long as 10 minutes if the grid is particularly busy. If it stays idle longer than that, there is generally a problem.
  2. Has the job been matched to a remote site? Follow this procedure to find out:
    Check the status of the queue of the "submission site" where the job was submitted:
    1. Go to http://samgrid.fnal.gov:8080 and click on "Get information about the submission sites".
    2. Click on the submission site "Scheduler Name".
    3. Look at the row corresponding to the given job (it is identified by its global job id): the job has been matched if there is an entry in the "Execution Site" column.
    IF the job has NOT been matched, check if the site wanted by the job is advertised:
    1. Find out what site the job wants by following this procedure:
      1. From the queue status page, click on the Idle status link for the given job
      2. Click on Details: this shows the full description (classad) of the job
      3. Look at the "Requirements" attribute (first column): the second column should show a long expression that includes TARGET.station_name_ == "MyStationName", where MyStationName identifies the site where the job wants to run.
        IF you don't find TARGET.station_name_, the job is trying to use automatic brokering: contact an expert.
    2. Find out if the site is advertised
      1. Go to http://samgrid.fnal.gov:8080 and click on "Get information about the advertised sites."
      2. IF you do NOT see MyStationName in the list of "Station Name", the site is NOT advertised: ask the site administrator to restart jim_advertise. This is generally done using the commands "ups stop server_run" and "ups run server_run". If jim_advertise is running but no classad is collected, contact an expert.
        ELSE the site is advertised: click on the MyStationName link to see the description of the site (classad). There may be more than one description for a site, depending on the services the site offers. Compare the "Requirements" attribute of the job description with all the available site descriptions: in the job "Requirements" expression, "TARGET" refers to attributes of the site description. Check that every element of the "Requirements" expression matches an attribute of the site description.
        For example, if the job has the element TARGET.jobmanager_name_=="jobmanager-mcfarm", check that the site description has an attribute "jobmanager_name_" with value "jobmanager-mcfarm".
        • IF no site description matches the job, either there is a configuration problem at the site (e.g. the site should advertise that it offers an interface to mcfarm via jobmanager-mcfarm but does not), or the user expects the site to offer more services than are installed there. Clarify the situation with the user and the site administrators.
        • ELSE the matchmaker cannot find the right match: contact an expert.
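The comparison of the job "Requirements" with the site descriptions can be sketched in Python as below. This is only a rough aid under strong assumptions: it handles nothing but clauses of the form TARGET.name == "value", it compares strings literally, and it does not evaluate real classad expressions, so it is not a substitute for the matchmaker. The function names and the value "jobmanager-sam" are hypothetical; the attribute names come from the example above.

```python
import re

# Matches simple clauses such as TARGET.station_name_ == "MyStationName".
# Assumption: any other operator or sub-expression is ignored by this sketch.
CLAUSE = re.compile(r'TARGET\.(\w+)\s*==\s*"([^"]*)"')


def unmatched_clauses(requirements, site_ad):
    """Return (attribute, wanted, advertised) for each clause the site fails."""
    failures = []
    for attr, wanted in CLAUSE.findall(requirements):
        advertised = site_ad.get(attr)  # None if the site does not advertise it
        if advertised != wanted:
            failures.append((attr, wanted, advertised))
    return failures


def matching_sites(requirements, site_ads):
    """Return the site descriptions that satisfy every simple clause."""
    return [ad for ad in site_ads if not unmatched_clauses(requirements, ad)]


# Values copied by hand from the web pages (hypothetical data).
requirements = ('TARGET.station_name_ == "MyStationName" && '
                'TARGET.jobmanager_name_=="jobmanager-mcfarm"')
site_ads = [
    {"station_name_": "MyStationName", "jobmanager_name_": "jobmanager-sam"},
    {"station_name_": "MyStationName", "jobmanager_name_": "jobmanager-mcfarm"},
]

for ad in site_ads:
    print(ad["jobmanager_name_"], unmatched_clauses(requirements, ad))
```

An empty list for at least one site description means that, as far as these simple clauses go, the site advertises what the job asks for; a non-empty list for every description shows which attributes to raise with the site administrator.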
    ELSE the job has been matched but is still in the Idle state instead of the Running or Complete state (if it is in the Held state, look at the appropriate FAQ). This means that the submission site cannot receive services from the gatekeeper at the execution site. This may be caused by one of the following:
    1. The machine is not reachable because of a network problem or a firewall. Ask the site administrator to check the gatekeeper log file to see whether the gatekeeper has been contacted at all. The log file is in $GLOBUS_LOCATION/var/. The specific $GLOBUS_LOCATION in use can be found in /etc/xinetd.d/globus-gatekeeper. NOTE: some installations run more than one gatekeeper (one for LCG, one for SAM-Grid); in this case the SAM-Grid gatekeeper generally runs on port 2120 instead of the default 2119, so make sure the admin looks at the right xinetd.d file.
    2. there is a problem with the configuration of the Globus Security Infrastructure. These are the possible reasons:
      1. Either the submission site or the execution site has a problem with its Certificate Authority certificates: contact an expert.
      2. The execution site does not authorize the user to run on its resources. Find out the user identity (certificate subject) by looking at the job description:
        1. From the queue status page, click on the Idle status link for the given job
        2. Click on Details: this shows the full description (classad) of the job
        3. Look at the value of the attribute "cert_subject"
        4. Look at /etc/grid-security/grid-mapfile on the execution site machine: is the subject there?
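
For the last step, a site administrator could check the grid-mapfile with a small Python sketch like the one below. It assumes the usual grid-mapfile layout, one entry per line with the certificate subject in double quotes followed by one or more comma-separated local accounts, and it compares the subject literally. The function name, the subjects and the account names are hypothetical.

```python
import shlex


def mapped_accounts(mapfile_text, cert_subject):
    """Return the local accounts that the grid-mapfile text maps the subject to."""
    accounts = []
    for line in mapfile_text.splitlines():
        # shlex handles the quoted subject and skips '#' comments.
        fields = shlex.split(line, comments=True)
        if len(fields) >= 2 and fields[0] == cert_subject:
            accounts.extend(fields[1].split(","))
    return accounts


# Hypothetical file content; on the execution site it would be read with
# open("/etc/grid-security/grid-mapfile").read()
example = '''# hypothetical entries
"/DC=org/DC=example/CN=Jane Doe" samgrid
"/DC=org/DC=example/CN=John Roe" user1,user2
'''

subject = "/DC=org/DC=example/CN=Jane Doe"  # value of "cert_subject" from the job classad
print(mapped_accounts(example, subject))
```

An empty result suggests that the subject is not authorized at the execution site and that the site administrator needs to add it to the grid-mapfile.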