Troubleshooting Activities

Please note that some of the MOP/McRunjob commands mentioned here are obsolete, but the rest of the debugging procedure is still the same.

Debugging problems during a MOP based production spread over several sites is a complex job and requires several different kinds of steps before the real nature of a problem is diagnosed. The solution is sometimes trivial yet not easily found; at other times it is a complex issue that needs expert attention.

The following activities have proven appropriate for running a MOP based production.

1. Verify health of the MOP master:
    a. Run condor_q and verify that you see a large number of jobs in the Running state (a combined health sweep is sketched at the end of this list).
    b. Run "condor_q -l xxxx.x | grep Out", then open the run.out file reported by this command for condor job xxxx.x, and verify that the job is progressing through the various stages (cmkin --> cmsim --> etc.). Look for the last activity, most probably a statement showing the latest event number being processed. Perform this step on several randomly chosen jobs.
    c. Perform step b on a cluster-by-cluster basis, by running "condor_q -dag" and picking more than one job from each cluster.
    d. Look for Held processes; a few Held processes seem to be "normal" behaviour.
    e. Run 'uptime' from time to time and verify that the load on the MOP master never goes much above 3-4.
    f. Look at the number of gahp_server threads (use pstree); they should go away on their own, but if they do not, kill the condor_gridmanager process.
    g. Use the ganglia/MonALISA monitoring tools to see whether the system load and other parameters look healthy.
    h. Run "df -h" and verify the free disk space. There must be several tens of GB free on /data and "enough" on the other partitions.
    i. Verify that the various condor processes, including condor_master, are running:
                     <gallo.fnal.gov>  ps -elf | egrep condor_
                    040 S cmsprod   6509     1  1  60   0    -  1122 do_sel Nov13 ?        02:55:51 condor_master
                    000 S cmsprod   6510  6509  0  60   0    -  1114 do_sel Nov13 ?        00:01:59 condor_collector -f
                    000 S cmsprod   6511  6509  0  60   0    -  1061 do_sel Nov13 ?        00:02:02 condor_negotiator -f
                    000 S cmsprod   6512  6509  0  60   0    -  1310 do_sel Nov13 ?        00:11:22 condor_startd -f
                    000 S cmsprod   6513  6509  0  60   0    -  3095 do_sel Nov13 ?        00:23:57 condor_schedd -f
    j. Look for produced ntuples in the /data/out_ntup/RUN-II area; all files, especially freshly arrived ones, should be of roughly similar size.
    k. Look into the various log files (GridLog, condor logs, etc.) and check the following files if they exist:
                    dprintf_failure.MASTER
                    dprintf_failure.NEGOTIATOR
                    dprintf_failure.SHADOW
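    A quick combined health sweep on the MOP master might look like the following (a sketch only; the thresholds are illustrative):
            condor_q | tail -1                                   # summary: total/idle/running/held jobs
            condor_q -hold | tail -5                             # a few Held jobs are "normal"; note the hold reasons
            ps -elf | grep gahp_server | grep -v grep | wc -l    # rough count of gahp_server processes
            uptime                                               # load should stay around 3-4 or below
            df -h /data                                          # several tens of GB should be free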

2. Verify health of the MOP worker sites.
    a. You should be able to run "globus-job-run <site> /bin/date" against all sites.
    b. You should be able to run condor_q/condor_status on all sites, using globus-job-run:
        globus-job-run grinulix.phys.ufl.edu /vdt/bin/condor_status
        globus-job-run  t2cms0.sdsc.edu /vdt/vdt-1.1.4/bin/condor_status
        globus-job-run tier2.cacr.caltech.edu /vdt/bin/condor_q
    c. Verify that all sites have enough disk space in their respective partitions (a remote check is sketched after this list).
    d. $MOP_DIR/mop_submitter/site-info contains all the site files; the files ending in Igt are currently in use (for Fermilab it is velveeta.*).
        These files provide information about each site, and one can reference them for site verification.
    e. It is sometimes necessary to check that the worker nodes are up and running too, and that they have enough disk space left (especially in the /tmp area).
    f. Look at the changing size of the .fz file(s) being produced; check back after a period of more than an hour.
    g. Look into the various log files (GridLog, condor logs, etc.) and look for failure files.
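    A sketch of the remote disk-space check mentioned in item c, run through the gatekeeper's default (fork) service; the host names are the ones listed above, and the partitions of interest will differ per site:
        globus-job-run tier2.cacr.caltech.edu /bin/df -h
        globus-job-run t2cms0.sdsc.edu /bin/df -h /tmp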

3. Submitting new jobs:
    The following are important points to remember:
        i. In the absence of a scheduler, jobs are submitted manually.
        ii. There is no need to keep a large number of jobs waiting: only a certain number of CPUs are available and jobs take more than 24 hours,
            so submitting in small batches gives more liberty, control and the ability to diagnose problems without losing jobs.

   So first look for free CPUs at the various sites.
    The following commands return the number of free CPUs at the various sites:
            globus-job-run grinulix.phys.ufl.edu /vdt/bin/condor_status -total | grep Total | awk '{print $5}'
            globus-job-run tier2.cacr.caltech.edu /vdt/bin/condor_status -total | grep Total | awk '{print $5}'
            globus-job-run  t2cms0.sdsc.edu /vdt/vdt-1.1.4/bin/condor_status -total | grep Total | awk '{print $5}'

    FermiLab uses fbsng, so the following returns the number of jobs already in the queue; if this number is small or zero then it is OK to add more jobs.
            globus-job-run velveeta /data/ANZAR/testfbs.sh
   Otherwise log in/ssh to velveeta and run 'fbs queues' to look at the queue.

    Jobs are submitted with the current working directory set to /data/AZIZ/MOP/mop_master/impala_mcj/run-XX, where XX is a run number; choose the latest run number.
    You should also source /data/AZIZ/setup.csh to set up the proper environment for all activities.

    Note: It is always better to submit jobs in batches of no more than 15-20 at a time.

    RunJob.sh -m floridaIgt -v -j mcj eg02_BigJets 10
    RunJob.sh -m caltechIgt -v -j mcj eg02_BigJets 10
    RunJob.sh -m ucsdIgt -v -j mcj eg02_BigJets 10
    RunJob.sh -m velveeta -v -j mcj eg02_BigJets 10
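    After a batch has been submitted, a quick check on the MOP master should show the new DAGs appearing (a sketch; the exact output depends on the condor version):
        condor_q -dag | tail -30          # the freshly submitted condor_dagman clusters and their jobs
        condor_q cmsprod | tail -1        # overall summary for the production account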

  Note: I have tried an auto-submit script that looks for free CPUs every hour and submits new jobs, but for (known!) reasons it is not very wise to use it.

If job submission fails with an error message like this:

<gallo.fnal.gov> RunJob.sh -m velveeta -v -j mcj eg02_BigJets 20
localFunctions.sh resource file for US-MOP
TARBALL=true, installed in _insert_MOP_REMOTE_DAR_ROOT_.
No further action defined.
Selecting the list of scripts
      ...selecting from batch/created area.
No jobs were selected.

This means that the MCRunJob tracking area has run out of created batch jobs, and you will need to make some more:
    cd to /data/AZIZ/MOP/mop_master/impala_mcj/py_script
    run "./Linker.py script=eg02_BigJets_00002417.sav"
    This will make 200 new jobs, and you can resume job submission.

4. Large number of Held Processes
    If there is a large number of Held processes, this needs more attention.
    1. Check whether the Held processes all belong to a single site:
            i. In that case, perform the steps specified in the site health verification.
            ii. Trace a couple of processes down to the running jobs at the remote sites.
            iii. If a site is completely down, it is wise to just "Hold" the processes manually until the issue is resolved, or just remove them all.
                Holding the dagman running the batch is enough to keep a whole set of jobs held (condor_q -dag, condor_hold, condor_rm); see the sketch at the end of this section.
     2. If processes are held for all sites (this should never happen), it obviously means a problem at the MOP master site.
            i. Check the MOP master health status; track down running jobs, condor processes, disk space, etc.
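     A sketch of holding a batch whose site has gone down (the cluster number 1234 is illustrative):
            condor_q -hold                   # list Held jobs and their hold reasons
            condor_q -dag                    # find the condor_dagman cluster that owns them
            condor_hold 1234                 # hold the dagman itself so it stops resubmitting
            condor_rm 1234                   # or remove the whole batch if it is beyond saving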

5. Large number of Idle processes
    1. This could be normal.
    2. Track clusters to sites and verify that either the site also has a large Idle queue (this is OK) or all jobs at the site are just Idle (not OK).
    3. Check whether the master site has no gahp_server running; refer to the MOP master section.

6. Newly submitted jobs do not move
1. Wait 5-10 minutes.
2. Look for gahp_server processes:
        i. A large number of gahp_server processes: wait; if there is no effect, kill the condor_gridmanager process and wait another 5-10 minutes.
        ii. No gahp_server process: it most probably killed itself; wait a while and the system should start to move again.
        iii. Worst case, submit some jobs manually to DAGMan; try not to use the condor or other batch jobmanager at the target site. Use the fork jobmanager and verify that your
              job goes through (see the sketch below).
        iv. Does sending a condor_restart help? (worst case scenario!)
        v. Restart condor_master (should be required only if the city is on fire!).
3. It is better to call the experts if nothing helps.
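A sketch of the manual test in 2.iii; jobmanager-fork and jobmanager-condor are the usual Globus service names, but a given site may have configured them differently:
        # fork service: bypasses the site's batch system entirely
        globus-job-run tier2.cacr.caltech.edu/jobmanager-fork /bin/date
        # condor service: goes through the site's batch system, for comparison
        globus-job-run tier2.cacr.caltech.edu/jobmanager-condor /bin/date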

7. Jobs fail immediately after submission
A. Jobs fail in stage-in
        i. Look into the stage-in.out/err files for the failing jobs:
                a. Is globus-url-copy working?
                b. There is no harm in trying a couple of copies manually (see the example after this section).
                c. Is there space left at the site?
                d. Does the source file exist on the MOP master? (This has happened when something messed up the MCRunJob tracking area.)
B. Jobs held in stage-in
        i. Verify globus-url-copy, that the site is reachable, and that the site is healthy.
C. Jobs fail in the run stage
      i. Look at the run.out/run.err files (this is the stdout/stderr of the running job).
      ii. Look for *out/err files in the mop-tmp-bla-bla area for the failing jobs.
      iii. You might need to manually HOLD jobs, because the MOP master attempts restarts and overwrites the files, making debugging difficult.
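      A sketch of the manual copy test in A.i.b (the file names and gsiftp paths are illustrative; for a real test use one of the input files from the failing job's stage-in script):
        echo hello > /tmp/guc-test.txt
        # push a small file from the MOP master to the worker site ...
        globus-url-copy file:///tmp/guc-test.txt gsiftp://tier2.cacr.caltech.edu/tmp/guc-test.txt
        # ... and pull it back
        globus-url-copy gsiftp://tier2.cacr.caltech.edu/tmp/guc-test.txt file:///tmp/guc-test.back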

8. Jobs fail in the run stage and keep on failing, while a few are still running (say 18 are running, and over time all 18 fail)
            One or two machines are having problems!!
            This drains the queue, so identify the problem machine by looking for all the jobs that failed on it.
            Fix the problem (disk space, etc.).
            Stop submitting more jobs until the machine is verified OK.

9. Orphan processes
    i. If the number of jobs queued at the MOP master seems smaller than the number of jobs running at a worker site, this might mean some orphan
        processes are taking up CPU (a comparison is sketched after this section).
    ii. If possible, track them down and kill them (only if you are absolutely sure!).
    iii. Let them drain over time (collect the .ntup files manually later).
    iv. Avoid some orphan processes by removing Held processes from the MOP master and, after tracking it down, killing the (once) related process at the worker site. (Tough!!)
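    A sketch of the comparison in item i; condor_q -globus lists Condor-G jobs with their target gatekeeper, and the grep patterns are approximate (the remote jobs may run under a different mapped account):
        # jobs the MOP master still thinks it has at the site
        condor_q -globus | grep t2cms0.sdsc.edu | wc -l
        # jobs actually running at the site
        globus-job-run t2cms0.sdsc.edu /vdt/vdt-1.1.4/bin/condor_q | grep " R " | wc -l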

10. Ever-running processes
        i. If you notice jobs in the queue for a long time (2-3 days or more), this might be because a writeAllHits process is caught in an infinite loop.
        ii. Track the process down to the running job (ugh!). Kill the writeAllHits process that has been running for thousands of minutes (see the sketch below).
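        A sketch of hunting a runaway writeAllHits on a worker node, assuming you can log into the node; <pid> is a placeholder:
            ps -eo pid,etime,time,comm | grep writeAllHits    # elapsed and CPU time per process
            kill <pid>                                        # kill the one running for thousands of minutes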

11. One of the stages fails, while its prior stage returned successfully with MOP_RETURN_VAL=0
    i. Make sure that all the correct site files with the correct entries are present, especially for the failing stage.

12. The number of free CPUs at a remote site keeps revolving (for example 20-10-0-20)
    i. Most probably there are orphan jobs.
    ii. Identify the jobs at the worker site.
    iii. Kill them.
 
13. MOP log file
/tmp/moplog.$userid contains information about the jobs you are running.
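For example, to follow it while submitting (substitute the user id the jobs run under, cmsprod in the examples above):
    tail -f /tmp/moplog.cmsprod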

===================================================================

Useful References:
http://home.fnal.gov/~anzar/MOP_MASTER/mop.html
http://www.cs.wisc.edu/~adesmet/cms_mop_master.html
http://grid.fnal.gov/test_beds/MOP_install.html

======================================================================

Last updated: 08/29/2003

Prepared by:
Muhammad Anzar Afaq
11/19/2002
Fermilab, CD-ODS/CD-CMS
anzar@fnal.gov
630 840 6856