JIM FAQ
Original Author: Gabriele Garzoglio
Most Recent Update:
$Author: garzogli $
$Revision: 1.4 $
$Date: 2004/10/25 22:33:42 $.
Symptoms:
- I have submitted a job and it has
been idle in the queue for a long time.
- A job can stay idle for up to 10 minutes if the grid is
particularly busy. If it stays idle longer than that, there is
generally a problem.
- Has the job been matched to a remote site? Follow this
procedure to find out:
Check the status of the queue of the "submission site" where the job
was submitted:
- go to http://samgrid.fnal.gov:8080 and click on "Get information
about the submission sites"
- click on the submission site "Scheduler Name"
- look at the row corresponding to the given job (it is
identified by its global job id): the job has been matched if there is
an entry for the "Execution Site" column
IF the job has NOT been matched, check if the site wanted by the job
is advertised:
- Find out what site the job wants by following this
procedure:
- From the queue status page, click on the Idle status link for
the given job
- Click on Details: this shows the full description (classad) of
the job
- Look at the "Requirements" attribute (first column): the second
column should show a long expression, including
TARGET.station_name_ == "MyStationName": MyStationName identifies the
site where the job wants to run.
IF you don't find TARGET.station_name_, the job is trying to use
automatic brokering: contact an expert.
- Find out if the site is advertised
- go to http://samgrid.fnal.gov:8080 and click on "Get information
about the advertised sites."
- IF you do NOT see MyStationName in the "Station Name" list, the
site is NOT advertised: ask the site administrator to restart
jim_advertise. This is generally done using the commands
"ups stop server_run" and "ups run server_run". If jim_advertise is
running but no classad is collected, contact an expert.
ELSE the site is advertised: click on the MyStationName link to see
the description of the site (classad). There may be more than one
description for a site, depending on the services offered by the site.
Compare the "Requirements" attribute of the job description with all
the site descriptions available: in the job "Requirements" expression,
"TARGET" refers to attributes of the site description. Check that every
element of the "Requirements" expression matches an attribute of the
site description.
For example, if the job has the element
TARGET.jobmanager_name_ == "jobmanager-mcfarm", check that the site
description has an attribute "jobmanager_name_" with value
"jobmanager-mcfarm".
- IF there is no site description that matches the job, there is
either a configuration problem at the site (e.g. the site should
advertise that it offers an interface to mcfarm via jobmanager-mcfarm
but does not) or the user expects the site to offer more services than
were installed there. Clarify the situation with the user and the site
administrators.
- ELSE the matchmaker cannot find the right match: contact an
expert.
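The comparison of the job "Requirements" with a site description can be
sketched in Python for the simple case of equality elements. This is a
rough illustration, not the algorithm used by the matchmaker: it only
recognizes elements of the form TARGET.attribute == "value" and ignores
every other operator. The function name, the site description dictionary
and the value "jobmanager-fork" are hypothetical. Values are compared
ignoring case, on the assumption that this is how classad "==" treats
strings.

```python
import re

# Recognizes only simple elements of the form TARGET.attr == "value".
CLAUSE = re.compile(r'TARGET\.(\w+)\s*==\s*"([^"]*)"')


def unmatched_clauses(requirements, site_ad):
    """Return (attribute, expected, actual) for each element the site fails.

    site_ad is a dict holding the attributes of one site description.
    actual is None when the site does not advertise the attribute at all.
    """
    problems = []
    for attr, expected in CLAUSE.findall(requirements):
        actual = site_ad.get(attr)
        # Assumption: string equality in a classad ignores case.
        if actual is None or actual.lower() != expected.lower():
            problems.append((attr, expected, actual))
    return problems


# Hypothetical job requirements and site description, for illustration only.
requirements = (
    'TARGET.station_name_ == "MyStationName" && '
    'TARGET.jobmanager_name_=="jobmanager-mcfarm"'
)
site_ad = {
    "station_name_": "MyStationName",
    "jobmanager_name_": "jobmanager-fork",
}

print(unmatched_clauses(requirements, site_ad))
```

An empty list means that every equality element found a matching
attribute in that site description. Any other part of the expression
still has to be checked by hand.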
ELSE the job has been matched, but it is still in Idle state instead
of Running or Complete state (if it is in Held state, look at the
appropriate FAQ). This means that the submission site cannot receive
services from the gatekeeper at the execution site. This may be caused
by one of the following:
- the machine is not reachable because of a network problem or
because of a firewall. Ask the site administrator to check the log file
of the gatekeeper to see if the gatekeeper has been contacted at all.
The log file is in $GLOBUS_LOCATION/var/ . The specific
$GLOBUS_LOCATION used can be found from /etc/xinetd.d/globus-gatekeeper
(NOTE: on some installations we run more than one gatekeeper, one for
LCG and one for SAM-Grid: in this case the SAM-Grid gatekeeper generally
runs on port 2120 instead of the default 2119. Make sure the admin looks
at the right xinetd.d file).
- there is a problem with the configuration of the Globus
Security Infrastructure. These are the possible reasons:
- either the submission site or the execution site has a problem
with the Certificate Authority certificates: call an expert.
- the execution site does not authorize the user to run on its
resources. Find out the user identity (certificate subject) by
looking at the job description:
- From the queue status page, click on the Idle status link for
the given job
- Click on Details: this shows the full description (classad) of
the job
- Look at the value of the attribute "cert_subject"
- Look at /etc/grid-security/grid-mapfile on the execution site
machine: is the subject there?
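The grid-mapfile lookup in the last step can be sketched in Python as
follows. The sketch assumes the usual grid-mapfile layout, one entry per
line with the certificate subject in double quotes followed by the local
account name(s). The function name, the sample subject and the account
name "samgrid" are hypothetical.

```python
import re

# One entry per line: "certificate subject" account[,account...]
ENTRY = re.compile(r'^\s*"([^"]+)"\s+(\S+)')


def mapped_accounts(mapfile_lines, cert_subject):
    """Return the local accounts that cert_subject maps to.

    An empty list means the subject is not in the grid-mapfile, i.e. the
    execution site does not authorize this user.
    """
    accounts = []
    for line in mapfile_lines:
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        entry = ENTRY.match(line)
        if entry and entry.group(1) == cert_subject:
            accounts.extend(entry.group(2).split(","))
    return accounts


# Hypothetical grid-mapfile contents, for illustration only.
sample = [
    "# example entries",
    '"/DC=org/DC=example/OU=People/CN=Jane Doe" samgrid',
]

print(mapped_accounts(sample, "/DC=org/DC=example/OU=People/CN=Jane Doe"))
```

On the execution site machine, the same function can be applied to the
real file by passing open("/etc/grid-security/grid-mapfile") as the
first argument and the "cert_subject" value from the job description as
the second.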