GBS: Ganga-based Batch Submission: Design Manual V01-15-01
A fault tolerant batch submission framework layered on Ganga
Status: As of version V01-15-01 GBS is a running system,
although it has only been tested so far on an internal dummy application.
Major features (IOPool, IOItem and TaskConnector) are missing. As a
temporary, and possibly permanent, replacement for the IO items, Tasks
directly create the inputs for their Jobs.
n west
Last modified: Fri Jan 4 07:54:49 GMT 2008
GBS is a fault tolerant batch submission framework layered on Ganga. It is not
a book-keeping system, although it would be simple to interface it to
one. The core concept is that of a task: a set of jobs that need
to be run using the same script with a range of inputs. With GBS a
Task is created that oversees the submission of jobs and
their recovery in case of failure. Once complete, the results are
harvested and the Task data removed.
An initial assumption is that the system will be layered on Ganga and
analysis begins with 3 generic Use Cases:-
- "I want to be able to define a data set and run the same script
on every member of the set. Later, when new data is available, I
want to augment the set and run the new data through the
same script".
- "I want a system that's fault tolerant both at the job level
and at the system level, with a way to program error recovery. For
example at the job level I want to run an MC job on a range of run and
sub-run numbers, but if a job SegVs I want to be able to rerun with a
new seed. At the farm level, if there is a catastrophic failure I want
the system to keep resubmitting only a few jobs until things are back to
normal".
- "I want to have the output from one set of jobs provide the
input for one or more new sets of jobs".
Beyond the task orientated use cases there are also the operational ones:-
- "I want it to be easy to use. In particular it should present
clear comprehensible summaries, perhaps as a web page, and it should
be simple to suspend, resume, or adjust the job submission levels."
- "I want to be able to tests out new jobs and debug then before going into production".
- "When there are major problems I want to be able to kill all
running jobs and suspend submission of new ones. Once the problem is
fixed I want to be able to resume operations".
- "When things do end it tears it needs to be easy to apply fixes
globally."
All of the above are generic and experiment independent, but actual
implementation will be experiment specific which leads to the
conclusion that the required system is an extensible framework,
written in python, and which provides an API into which experiments
can plug specific algorithms. Measures of success are:-
- How natural is this API?
- How minimal is the experiment code needed to complete the
system?
So the good news is that GBS can be used by any experiment, but the
bad news is that it can be used by none until they supply plug-ins to
cover their needs, although this may mean no more than extending the
script that actually runs the job so that it can handle retries.
The system will be implemented as an OO set of Python scripts that
run as a regular cron job to maintain a steady level of running jobs
and react to the outcome of these jobs. It's based on a number of concepts:-
The first concept is that of a task:-
- A task is the execution of a series of jobs, each using the same
script on a pool of inputs to produce a pool of outputs.
- Each job may consume one or more inputs but only produces a single
output. This asymmetry is logical and not restrictive. A job ideally
only runs once so produces a single output. It is up to the
experiment to encode what that output is - it could be a list of file
names. On the other hand a downstream job may take its input from
multiple upstream jobs so in general jobs need multiple inputs.
- The type of a task is determined by the type of job it runs and
not the script. The same type of task could be configured to run with
different scripts, so long as they are compatible with the task
i.e. don't run MC scripts with a task designed for reconstruction
work!
- The number of different types of task is determined by the
experiment although GBS would provide a basic set to do most, and if
lucky, all of the work.
- Tasks have names, making it possible to have the same task running
multiple times, each working on a different set of input pools.
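To make the input/output asymmetry concrete, here is a minimal Python
sketch; the class and method names are purely illustrative, not the GBS
implementation:-

  class Job:
      # Illustrative only: many inputs in, exactly one output out.
      def __init__(self, inputs):
          assert inputs, "a job needs at least one input"
          self.inputs = inputs   # one or more input items
          self.output = None     # always exactly one output

      def Finish(self):
          # The experiment decides how to encode the single output; here
          # it is simply the input file names joined into one string.
          self.output = ",".join(self.inputs)

  j = Job(["sub1.root", "sub2.root"])  # a downstream job may need many inputs
  j.Finish()
  print(j.output)                      # one output string encoding both files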
The next concept is an input or output pool of I/O items that
represents the input and outputs of a job. These will contain
arbitrary strings that will provide input to the job as it configures
a Ganga job or records the outputs the job produces. These strings
could be file names for reconstruction or run numbers + seeds for MC.
I/O items must also carry a "book-keeping ID" (BKID). It
will probably be based on the triplet: (detector/sub-system, run
number, sub-run number). These can all be strings so, for example, the run
number could be 12340-12349 or the sub-run number could be *. The exact representation
can be left to implementation so long as it satisfies the following:-
- Within an I/O pool the BKID is unique (it may be used to
locate an I/O item in the pool).
- There is a simple relationship between the BKIDs of a job input and output.
- If a job output is to provide input to a downstream job they will have the same BKID.
- When tasks list themselves they will display the output BKIDs of the jobs they own.
Part of the model is that new items can be added to the input pool, so
when new data is available production can be restarted.
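Here is a sketch of how pools and BKIDs might look in Python; the
triplet fields come from above but the interfaces are assumptions made
for illustration:-

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class BKID:
      # Book-keeping ID; strings allow ranges like "12340-12349" or "*".
      subsystem: str
      run: str
      subrun: str

  class IOItem:
      def __init__(self, bkid, data):
          self.bkid = bkid  # unique within its pool
          self.data = data  # arbitrary string: file name, run number + seed, ...

  class IOPool:
      def __init__(self):
          self._items = {}  # BKID -> IOItem; the dict enforces uniqueness

      def Add(self, item):
          if item.bkid in self._items:
              raise ValueError("duplicate BKID in pool: %s" % (item.bkid,))
          self._items[item.bkid] = item

      def Find(self, bkid):
          return self._items.get(bkid)

  # New data arriving later is just more Add() calls on the input pool.
  pool = IOPool()
  pool.Add(IOItem(BKID("far", "12340-12349", "*"), "seed=1371"))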
Now I'll address the individual job error recovery model. It would be
hard to develop a general purpose system that could perform a post
mortem on the smoking remains of a failed job and harder still to work
out the correct procedure to recover, so the system doesn't attempt
to. Instead the idea is that the best place to diagnose faults and
plot a recovery course is the executing script itself. It passes back
a small text file which records success, recoverable failure or fatal
(requires human intervention). When the script next runs it will be
passed, along with the standard args, additional ones derived from
this text file so it can pick up where it left off. This does
require that the designer of the script puts thought into each step,
diagnosing problems and structuring it so as to be able to resume where
the previous incarnation left off. On the plus side the system is open
ended, recovery can be as smart as the script writer can imagine. Some
recovery can also be built into GBS itself, when the GBS Job receives the
text file back. So some basic rules will have to apply to the
communication file and it may well end up as an object, at least on
the GBS end.
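The exact rules for the communication file are still to be fixed; this
Python sketch just illustrates the round trip, assuming a simple
key=value layout with a status of SUCCESS, RECOVERABLE or FATAL:-

  SAMPLE_REPORT = "status=RECOVERABLE\nretry_args=--seed 1372\n"

  def parse_report(text):
      # Turn the script's small text file into a dict GBS keeps with the Job.
      report = {}
      for line in text.splitlines():
          if "=" in line:
              key, value = line.split("=", 1)
              report[key.strip()] = value.strip()
      return report

  def next_args(standard_args, report):
      # On a recoverable failure resubmit with the script's own hints
      # appended, so it can pick up where the last incarnation left off.
      if report.get("status") == "RECOVERABLE":
          return standard_args + report.get("retry_args", "").split()
      return standard_args

  report = parse_report(SAMPLE_REPORT)
  print(next_args(["--run", "12340"], report))  # adds '--seed', '1372'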
At the system failure level error recovery is based on the idea: run
often and make small changes. For example the cron could run every
half hour but only ever release say 10 new jobs up to some total
number so as not to needlessly overfill queues. So some catastrophic
failure, e.g. an SE goes away, would result in the failure of many jobs but
the system would only resubmit a few at a time effectively probing the
system until the failure was rectified. The system could easily be made
a little smarter, throttling back further if the failure rate was
high.
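A sketch of such a throttling rule; the numbers and the failure-rate
test are illustrative only:-

  def jobs_to_release(running, max_running, per_cycle=10, failure_rate=0.0):
      # Called once per (say half-hourly) cron cycle: release at most
      # per_cycle new jobs, never exceed max_running in total, and fall
      # back to a single probe job while the recent failure rate is high.
      quota = min(per_cycle, max_running - running)
      if failure_rate > 0.5:     # threshold is illustrative
          quota = min(quota, 1)  # probe gently until things recover
      return max(quota, 0)

  print(jobs_to_release(running=3, max_running=50))                    # 10
  print(jobs_to_release(running=3, max_running=50, failure_rate=0.9))  # 1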
This makes for a responsive system but does require that the system be
lightweight. One simple step in this direction is the principle: only
keep a Ganga job for the minimum amount of time: once it completes,
harvest its products and delete it. That way the Ganga Job Registry
is kept to a minimum size and it's this that principally determines
the time it takes to start Ganga.
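As a sketch, assuming a Ganga job exposes a status string, an outputdir
and a remove() method (assumptions about the Ganga GPI, with a stand-in
class so the example runs anywhere):-

  class FakeGangaJob:
      # Stand-in for a Ganga GPI job; the attribute names are assumptions.
      status = "completed"
      outputdir = "/tmp/ganga/42/output"
      def remove(self):
          print("deleted from the Job Registry")

  def harvest(ganga_job, record_outcome):
      # Keep a Ganga job only for the minimum time: once it finishes,
      # harvest its products and delete it so the registry stays small.
      if ganga_job.status in ("completed", "failed"):
          record_outcome(ganga_job.status, ganga_job.outputdir)
          ganga_job.remove()

  harvest(FakeGangaJob(), lambda status, outdir: print(status, outdir))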
The final concept is the ability to connect tasks together; the output IO items in
one pool feeding the inputs of the next. I wouldn't put any effort
into this unless and until the rest is working; there's not a lot of
point in connecting things that don't work! However I'll mention one
basic problem of connectors: knowing when a downstream job is ready to
run. If it takes a single input then it's easy, but what if it takes
all the sub-runs, produced separately: how does it know how many to
expect? The proposed solution is that as soon as a job is created,
and before it even gets submitted, it places an entry (a "promise") in
its output pool, so a downstream job can see all its potential input
well ahead of time and can wait for the upstream jobs to fulfil their
promises.
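A sketch of the promise mechanism; the pool interface is illustrative:-

  class OutputPool:
      def __init__(self):
          self._promised = set()   # BKIDs announced at job creation time
          self._fulfilled = {}     # BKID -> output data string

      def Promise(self, bkid):
          self._promised.add(bkid)  # made before the job is even submitted

      def Fulfil(self, bkid, data):
          self._promised.discard(bkid)
          self._fulfilled[bkid] = data

      def ReadyFor(self, bkids):
          # A downstream job that needs every sub-run can see the full
          # set of promises up front and wait until all are fulfilled.
          return all(b in self._fulfilled for b in bkids)

  pool = OutputPool()
  for sub in ("1", "2"):
      pool.Promise(("run12340", sub))
  print(pool.ReadyFor([("run12340", "1"), ("run12340", "2")]))  # False
  pool.Fulfil(("run12340", "1"), "f1.root")
  pool.Fulfil(("run12340", "2"), "f2.root")
  print(pool.ReadyFor([("run12340", "1"), ("run12340", "2")]))  # True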
GBS provides a complete set of objects, sufficient to do a very simple
production and these will be identified in the Identified Objects
section. On top of this an experiment is free to develop one or more systems by
creating new objects either inheriting and refining the GBS ones or
starting again from scratch. Where new generic features are
identified these can be moved back into GBS by creating new
objects. So eventually there may be a variety of objects and then there
has to be a way of organising them into valid combinations. This is
achieved using roles and models:-
- Role
A role is the part the object plays within GBS. The following roles are defined:-
- Model
A model is a valid combination of objects used
to implement a specific task. For each role the model identifies the
class to be used to create objects for that role and can optionally
provide a set of fixed arguments to be passed when constructing the
object. So although models will normally involve different
combinations of classes, it is possible for two models to use the same
combination of classes but just configure the objects differently.
This is a singleton dictionary that holds configuration data, e.g. the
name of the experiment and the directory where all the GBS data is
stored.
Provides a base for all persistable objects, with an interface for I/O and
holding name and type (Class).
This provides a single set of mappings from the set of roles
to the class objects that implement them and the way their objects
should be configured. A GBSModel provides a one line summary and
detailed description to help a user decide what model is appropriate
for a given task. If given a role it can create a suitable, configured
object.
GBS will provide at least one model, the default, but experiments are
free to add further ones and even replace the default.
This registers the various models. It can provide the user with the
current list, and access to individual ones so that the user can get a
detailed description. Once the model has been selected objects are
created via the GBSModelRegistry.
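Here is a sketch of how a model and the registry might fit together;
GBSModel and GBSModelRegistry are the names above, everything else is
assumed for illustration:-

  class GBSModel:
      def __init__(self, name, summary, bindings):
          self.name, self.summary = name, summary
          self._bindings = bindings  # role -> (class, fixed kwargs)

      def CreateObject(self, role, **extra):
          cls, fixed = self._bindings[role]
          kwargs = dict(fixed)
          kwargs.update(extra)       # per-call args on top of the fixed ones
          return cls(**kwargs)

  class GBSModelRegistry:
      _models = {}
      @classmethod
      def Register(cls, model):
          cls._models[model.name] = model
      @classmethod
      def Get(cls, name):
          return cls._models[name]

  class JobAnalyser:                      # two models could bind this same
      def __init__(self, max_retries=3):  # class but configure it with
          self.max_retries = max_retries  # different fixed arguments

  GBSModelRegistry.Register(GBSModel(
      "default", "basic production",
      {"JobAnalyser": (JobAnalyser, {"max_retries": 3})}))

  analyser = GBSModelRegistry.Get("default").CreateObject("JobAnalyser")
  print(analyser.max_retries)  # 3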
This singleton is used to log information about GBS itself,
principally as a diagnostic to fix bugs.
This is a singleton that gives the user access to the system. It can
list active Tasks, create new ones and request the destruction
of old ones (the Task will refuse if there are running jobs).
It can also give access to individual Tasks.
A Task is responsible for a single task; shepherding all its
jobs through to successful completion. It may accept more input.
It can monitor its jobs and decide how many new ones to submit.
It has a two phase life: in the first part it is configured. Almost
all its state can be changed at this stage. It is possible to submit
single test jobs, but these should only produce output that can be
deleted afterwards. Once it is O.K. it moves to execution phase and
can start running jobs. From now on only operational changes can be
made, for example:-
- The maximum number of jobs to run can be changed.
- The running state:-
- "Active" Will submit new jobs.
- "Passive" Jobs have been submitted but won't submit any more.
Other possible states are:-
- "Dormant" No running jobs, none will be submitted, but input pool not complete
- "Complete"
As jobs change the state of the output pools the Manager is
informed.
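The state names come from the lists above; the transition rules in this
sketch are only illustrative:-

  VALID_MOVES = {
      "Active":   {"Passive", "Dormant", "Complete"},
      "Passive":  {"Active", "Dormant"},
      "Dormant":  {"Active"},   # e.g. new input arrives
      "Complete": set(),        # terminal
  }

  class TaskState:
      def __init__(self):
          self.state = "Active"
      def MoveTo(self, new_state):
          if new_state not in VALID_MOVES[self.state]:
              raise ValueError("cannot go %s -> %s" % (self.state, new_state))
          self.state = new_state

  t = TaskState()
  t.MoveTo("Passive")  # stop submitting; jobs already out there run on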
These objects are configured to bridge between tasks. Once two
compatible tasks exist a connection can be made between them. As one
task changes its output pool it informs the Manager which in
turn informs its TaskConnectors.
This holds a list of IOItems. It can locate items from their
Book-keeping IDs (BKIDs). It can accept new IOItems.
This represents a single input to a job or the entire output from it.
It has a BKID and a data string that encodes the actual I/O, for
example file name or a list of them.
This is responsible for running a single script on a set of inputs to
produce an output. It gets created and given an application script when
a Task has IOItems in its input IOPool that are not associated
with a Job. It is offered the IOPool and can take as many
inputs as it requires. A Job could even be allowed to say: "I
want more and I won't run until you give them to me". Tasks
would first offer the input pool to existing, but unsatisfied,
Jobs, and only create new Jobs if there is anything
left. When requested, a Job submits a Ganga job to run the
application script and have it return the termination status in a text
file which it uses in co-operation with the application script to
recover from errors. Jobs, once created, live until the end of
the Task.
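A sketch of that offering order; Job.Take() and the one-input policy
here are hypothetical:-

  def offer_pool(unassigned, unsatisfied_jobs, make_job):
      # Existing but unsatisfied Jobs get first refusal; a Job may take
      # several items, or take none and keep waiting. New Jobs are only
      # created for whatever is left over.
      for job in unsatisfied_jobs:
          for item in job.Take(unassigned):
              unassigned.remove(item)
      while unassigned:
          job = make_job()
          taken = job.Take(unassigned)
          if not taken:             # the new Job is holding out for more input
              break
          for item in taken:
              unassigned.remove(item)

  class OneInputJob:
      def Take(self, items):        # hypothetical policy: one item per Job
          return [items[0]] if items else []

  items = ["run1", "run2", "run3"]
  offer_pool(items, [OneInputJob()], OneInputJob)
  print(items)  # [] - every input is now associated with a Job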
This object encapsulates the GBS end of job error recovery. Although
it could be part of Job, it is factored out to focus this
critical component into a single
interface. It gathers all the information about a job and, in
principle, its past history and determines what should be done next.
When the job gets results back from Ganga it creates a
JobAnalyser and passes itself to it. As a separate step it asks the
JobAnalyser to apply its conclusions. The pseudo code looks like:-
  analyser = CreateObject(current_model, "JobAnalyser")  # class chosen by the current model
  analyser.Analyse(self)  # gather the job's results and history and decide what to do
  analyser.Apply()        # act on that decision
This minimises the amount of experiment code that needs to get written
to tailor error handling to specific needs. A new JobAnalyser can be
written that inherits from the default one and can pass the Analyse()
call down to it to do all the spade work. Then, with all the
information to hand, it can fine tune the response before returning to
the calling Job.
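For example, in this sketch the default analyser's internals are
invented purely to show the inheritance pattern:-

  class DefaultJobAnalyser:
      # Stand-in for the GBS-supplied analyser.
      def Analyse(self, job):
          self.job = job
          self.action = "resubmit" if job.status == "RECOVERABLE" else "halt"
      def Apply(self):
          print("action:", self.action)

  class MyJobAnalyser(DefaultJobAnalyser):
      # Experiment-specific refinement: let the base class do the spade
      # work, then fine tune the response with all the information to hand.
      def Analyse(self, job):
          DefaultJobAnalyser.Analyse(self, job)
          if self.action == "resubmit" and job.retries > 5:
              self.action = "halt"  # too many retries: ask for human help

  class FakeJob:
      status, retries = "RECOVERABLE", 6

  analyser = MyJobAnalyser()
  analyser.Analyse(FakeJob())
  analyser.Apply()  # action: halt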