Debugging problems during a MOP-based production running over several sites is a complex job and requires several different kinds of steps before the real nature of a problem is diagnosed. The solution is sometimes trivial but not obvious or easily found; sometimes it is a complex issue that needs expert attention.
The following activities are appropriate when running a MOP-based production.
1. Verify health of MOP master:
a. Do a condor_q and verify that you see a large number
of jobs in Running status.
b. do a "condor_q -l xxxx.x | grep Out", and
then open the run.out file spitted by this command for condor job
xxxx.x, verify that job seems to be progressing into various stages,
cmkin-->cmsim--> etc. Look for the last activity,most probably a statement
showing the latest event number being processed. Perform this step on several
jobs chosen randomly.
c. Perform step b on a cluster-by-cluster basis by doing a "condor_q -dag" and picking more than one job from a cluster; see the example below.
d. Look for Held processes; a few Held processes seem to be "normal" behaviour.
e. Run 'uptime' from time to time and verify that the load on the MOP master is not more than 3/4 at any time.
f. Look at the number of gahp_server processes (use pstree); they should go away on their own, but if they do not, kill the condor_gridmanager process.
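A sketch of this check, assuming production runs as the cmsprod user shown in the ps listing under item i below:
    pstree cmsprod | grep -c gahp_server      # count of gahp_server processes under the production user
    pkill -u cmsprod condor_gridmanager       # only if the gahp_servers never go away; the schedd restarts the gridmanager when needed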
g. Use the ganglia/MonALISA monitoring tools to see whether the system load and other parameters look healthy.
h. Do "df -h" and verify free disk space. There must
be several tens of GB free on /data and "enough" on other partitions.
i. Verify that the various condor processes, including condor_master, are running:
<gallo.fnal.gov> ps -elf | egrep condor_
040 S cmsprod   6509     1  1  60  0 - 1122 do_sel Nov13 ?  02:55:51 condor_master
000 S cmsprod   6510  6509  0  60  0 - 1114 do_sel Nov13 ?  00:01:59 condor_collector -f
000 S cmsprod   6511  6509  0  60  0 - 1061 do_sel Nov13 ?  00:02:02 condor_negotiator -f
000 S cmsprod   6512  6509  0  60  0 - 1310 do_sel Nov13 ?  00:11:22 condor_startd -f
000 S cmsprod   6513  6509  0  60  0 - 3095 do_sel Nov13 ?  00:23:57 condor_schedd -f
j. Look for produced ntuples in the /data/out_ntup/RUN-II area; all files, especially freshly arrived ones, should be of roughly similar size.
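A quick way to eyeball this, assuming the ntuples land directly in that directory:
    ls -lht /data/out_ntup/RUN-II | head -20      # newest files first; sizes should be roughly similar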
k. Look into the various log files (GridLog, condor logs, etc.) and check the following files if they exist:
dprintf_failure.MASTER
dprintf_failure.NEGOTIATOR
dprintf_failure.SHADOW
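The dprintf_failure files are written into the Condor log directory; a sketch of checking for them (condor_config_val reports the configured LOG directory):
    ls -l `condor_config_val LOG`/dprintf_failure.* 2>/dev/null      # no output means no dprintf failures recorded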
2. Verify health of MOP Worker site.
a. You should be able to run "globus-job-run <site> /bin/date" against all sites.
b. You should be able to run condor_q/condor_status on all sites using globus-job-run:
globus-job-run grinulix.phys.ufl.edu /vdt/bin/condor_status
globus-job-run t2cms0.sdsc.edu /vdt/vdt-1.1.4/bin/condor_status
globus-job-run tier2.cacr.caltech.edu /vdt/bin/condor_q
c. Verify that all sites have enough disk space in their respective partitions.
d. $MOP_DIR/mop_submitter/site-info contains all the site files; the files ending in Igt are the ones currently in use (for Fermilab it is velveeta.*). These files provide information about each site, and you can reference them for site verification.
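For example, assuming site-info is a directory as described above:
    ls $MOP_DIR/mop_submitter/site-info/          # all site files
    ls $MOP_DIR/mop_submitter/site-info/*Igt*     # the Igt files currently in use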
e. It is sometimes necessary to check that the worker nodes are up and running too, and that they have enough disk space left (especially in the /tmp area).
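A sketch of checking disk space at a site through the fork jobmanager (this reaches the gatekeeper node only; checking individual worker nodes may require logging in at the site):
    globus-job-run grinulix.phys.ufl.edu/jobmanager-fork /bin/df -h /tmp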
f. Look at the changing size of the .fz file(s) being produced; check back after a period of more than an hour.
g. Look into the various log files (GridLog, condor logs, etc.) and look for failure files.
3. Submitting new jobs:
The following are important points to remember:
i. In the absence of a scheduler, jobs are submitted manually.
ii. There is no need to keep a large number of jobs waiting: only a limited number of CPUs is available and the jobs take more than 24 hours, so submitting conservatively gives more liberty, more control, and the ability to diagnose problems without losing jobs.
So first look for free CPUs at the various sites. The following commands return the number of free CPUs at each site:
globus-job-run grinulix.phys.ufl.edu /vdt/bin/condor_status -total | grep Total | awk '{print $5}'
globus-job-run tier2.cacr.caltech.edu /vdt/bin/condor_status -total | grep Total | awk '{print $5}'
globus-job-run t2cms0.sdsc.edu /vdt/vdt-1.1.4/bin/condor_status -total | grep Total | awk '{print $5}'
FermiLab is using fbsng, so the following returns the number of jobs already in the queue; if this number is small or zero then it is OK to add more jobs.
globus-job-run velveeta /data/ANZAR/testfbs.sh
Otherwise log in (ssh) to velveeta and run 'fbs queues' to look at the queue.
The jobs are submitted while the cwd is /data/AZIZ/MOP/mop_master/impala_mcj/run-XX, where XX is a run number; choose the latest run number. You should also source /data/AZIZ/setup.csh to set up the proper environment for all activities.
Note: It is always better to submit jobs in batches of no more than 15-20 at a time.
RunJob.sh -m floridaIgt -v -j mcj eg02_BigJets 10
RunJob.sh -m caltechIgt -v -j mcj eg02_BigJets 10
RunJob.sh -m ucsdIgt -v -j mcj eg02_BigJets 10
RunJob.sh -m velveeta -v -j mcj eg02_BigJets 10
Note: I have tried using an auto-submit script that looks for free CPUs every hour and submits new jobs, but for (known!) reasons it is not very wise to use it.
If job submission fails with an error message like this:
<gallo.fnal.gov> RunJob.sh -m velveeta -v -j mcj eg02_BigJets 20
localFunctions.sh resource file for US-MOP
TARBALL=true, installed in _insert_MOP_REMOTE_DAR_ROOT_.
No further action defined.
Selecting the list of scripts
...selecting from batch/created area.
No jobs were selected.
This means that the MCRunJob tracking area is out of created batch jobs, and you will need to make some more.
cd to /data/AZIZ/MOP/mop_master/impala_mcj/py_script
run "./Linker.py script=eg02_BigJets_00002417.sav"
This will make 200 new jobs, and you can resume job
submission.
4. Large number of Held Processes
If there is a large number of Held processes, this needs
more attention.
1. Verify whether the Held processes belong to a single site.
i. In that case, do the steps specified under site health verification.
ii. Trace a couple of the processes down to the running jobs at the remote site.
iii. If a site is completely down, it is wise to just "Hold" the processes manually until the issue is resolved, or simply remove them all. Holding the dagman that runs the batch is enough to keep the whole set of jobs held; see the sketch below.
(condor_q -dag, condor_hold, condor_rm)
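A sketch of holding one batch, assuming its condor_dagman job shows up as cluster 2042 in condor_q -dag (use the real cluster id from your own queue):
    condor_q -dag            # find the condor_dagman cluster that owns the affected site's jobs
    condor_hold 2042         # holding the dagman running the batch is enough to hold that set of jobs (as noted above)
    condor_release 2042      # once the site is healthy again
    condor_rm 2042           # or remove the whole batch instead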
2. If processes are held for all sites (this should never happen), it obviously means a problem at the MOP master site.
i. Check the MOP master health status: track down running jobs, condor processes, disk space, etc.
5. Large number of Idle processes
1. This could be normal.
2. Track the clusters to their sites and verify that either the site also has a large Idle queue (this is OK) or all jobs at the site are just Idle (not OK).
3. If the master site does not have a gahp_server running, refer to the master site section.
6. Newly submitted jobs do not move
1. Wait for 5-10 minutes.
2. Look for gahp_server processes
i. A large number of gahp_server processes: wait; if there is no effect, kill the condor_gridmanager process and wait another 5-10 minutes.
ii. No gahp_server process: it most probably killed itself; wait a while and the system should start to move again.
iii. Worst case: submit some jobs manually to DAGMan, trying not to use the condor or other jobmanager at the target site. Use the fork jobmanager and verify that your job goes through (see the sketch at the end of this section).
iv. Sending a condor_restart may help (worst-case scenario!).
v. Restart condor_master (should be required only if the city is on fire!).
3. If nothing helps, it is better to call the experts.
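A minimal manual test through the fork jobmanager, bypassing the batch system at the target site (hostname taken from the examples above):
    globus-job-run tier2.cacr.caltech.edu/jobmanager-fork /bin/hostname     # should return quickly with the gatekeeper's hostname
    globus-job-run tier2.cacr.caltech.edu/jobmanager-fork /bin/date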
7. Jobs fail immediately after submission
A. Jobs fail in stage-in
i. Look into the stage-in.out/err files for the failing jobs:
a. Is globus-url-copy working?
b. There is no harm in trying a couple of copies manually (see the sketch after this list).
c. Is there space left on the site?
d. Does the source file exist on the MOP master? (This has happened when something messed up the MCRunJob tracking area.)
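A sketch of a manual copy test; the file names and the /tmp destination here are placeholders only:
    globus-url-copy file:///data/AZIZ/copytest.txt gsiftp://grinulix.phys.ufl.edu/tmp/copytest.txt      # master -> site
    globus-url-copy gsiftp://grinulix.phys.ufl.edu/tmp/copytest.txt file:///tmp/copytest.back           # site -> master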
B. Jobs held in stage-in
i. Verify that globus-url-copy works, the site is reachable, and the site is healthy.
C. Jobs fail in run stage
i. Look at the run.out/run.err files (this is the stdout/stderr of the running job).
ii. Look for *out/err files in the mop-tmp-bla-bla area for the failing jobs.
iii. You might need to manually HOLD the jobs, because the MOP master attempts restarts and overwrites the files, making debugging tough.
8. Jobs fail in the Run stage and keep on failing, while a few of them are still running (say 18 runs, and over time all 18 fail)
One or two machines are having problems!!
These machines drain the queue, so identify them by looking for all the jobs that failed on the same machine.
Fix the problem (disk space, etc.).
Stop submitting more jobs until the machine is verified to be OK.
9. Orphan processes
i. If the number of jobs queued at the MOP master seems smaller than the number of jobs running at a worker site, this might mean there are orphan processes taking up CPU.
ii. If possible, track them down and kill them (only if you are absolutely sure!).
iii. Otherwise let them drain over time (collect the .ntup files manually later).
iv. You can avoid some orphan processes by removing Held processes from the MOP master and, after tracking it down, killing the (once) related process at the worker site (tough!!).
10. Ever-running processes
i. If you notice jobs in the queue for a long time, 2-3 days or more, this might be because a writeAllHits process is caught in an infinite loop.
ii. Track the process down to the running job (ugh!). Kill any writeAllHits process that has been running for thousands of minutes; see the sketch below.
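A sketch of spotting a runaway writeAllHits on a worker node, assuming you can log in to that node (etime is the elapsed running time):
    ps -eo pid,etime,comm | grep writeAllHits      # look for entries with days of elapsed time
    kill <pid>                                     # kill only the runaway process, after double-checking the pid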
11. One of the stages fails while its prior stage returned successfully with MOP_RETURN_VAL=0
i. Make sure that all the correct site files, with correct entries, are present, especially for the failing stage.
12. The number of free CPUs at a remote site keeps oscillating (for example 20-10-0-20)
i. Most probably there are orphan jobs.
ii. Identify the jobs at the worker site.
iii. Kill them.
13. MOP log file
/tmp/moplog.$userid contains information about the jobs you are running.
===================================================================
Useful References:
http://home.fnal.gov/~anzar/MOP_MASTER/mop.html
http://www.cs.wisc.edu/~adesmet/cms_mop_master.html
http://grid.fnal.gov/test_beds/MOP_install.html
======================================================================
Last updated: 08/29/2003
Prepared by:
Muhammad Anzar Afaq
11/19/2002
Fermilab, CD-ODS/CD-CMS
anzar@fnal.gov
630 840 6856