GBS: Ganga-based Batch Submission: Design Manual V01-15-01
A fault tolerant batch submission framework layered on Ganga
Status: As of version V01-15-01 GBS is a running system,
although it has only been tested so far on an internal dummy application.
Major features (IOPool, IOItem and TaskConnector) are missing. As a
temporary, and possibly permanent, replacement for the IO items, Tasks
directly create the inputs for their Jobs.
n west
Last modified: Fri Jan 4 07:54:49 GMT 2008
GBS is a fault tolerant batch submission framework layered on Ganga. It is not
a book-keeping system, although it would be simple to interface it to
one. The core concept is that of a task: a set of jobs that need
to be run using the same script with a range of inputs. With GBS a
Task is created that oversees the submission of jobs and
their recovery in case of failure. Once complete, the results are
harvested and the Task data removed.
An initial assumption is that the system will be layered on Ganga and
analysis begins with 3 generic Use Cases:-
- "I want to be able to define a data set and run the same script
on every member of the set. Later, when new data is available, I
want to augment the set and run the new data through the
same script".
- "I want a system that's fault tolerant both at the job level
and at the system level, with a way to program error recovery. For
example at the job level I want to run an MC job on a range of run and
sub-run numbers, but if a job SegVs I want to be able to rerun with a
new seed. At the farm level, if there is a catastrophic failure I want
the system to keep resubmitting only a few jobs until things are back to
normal".
- "I want to have the output from one set of jobs provide the
input for one or more new sets of jobs".
Beyond the task orientated use cases there are also the operational ones:-
- "I want it to be easy to use. In particular it should present
clear comprehensible summaries, perhaps as a web page, and it should
be simple to suspend, resume, or adjust the job submission levels."
- "I want to be able to tests out new jobs and debug then before going into production".
- "When there are major problems I want to be able to kill all
running jobs and suspend submission of new ones. Once the problem is
fixed I want to be able to resume operations".
- "When things do end it tears it needs to be easy to apply fixes
globally."
All of the above are generic and experiment independent, but actual
implementation will be experiment specific which leads to the
conclusion that the required system is an extensible framework,
written in python, and which provides an API into which experiments
can plug specific algorithms. Measures of success are:-
- How natural is this API?
- How minimal is the experiment code needed to complete the
system?
So the good news is that GBS can be used by any experiment, but the
bad news is that it can be used by none until they supply plug-ins to
cover their needs, although this may mean no more than extending the
script that actually runs the job so that it can handle retries.
The system will be implemented as an OO set of Python scripts that
run as a regular cron job to maintain a steady level of running jobs
and react to the outcome of these jobs. It's based on a number of concepts:-
The first concept is that of a task:-
- A task is the execution of a series of jobs, each using the same
script on a pool of inputs to produce a pool of outputs.
- Each job may consume one or more inputs but only produces a single
output. This asymmetry is logical and not restrictive. A job ideally
only runs once so produces a single output. It is up to the
experiment to encode what that output is - it could be a list of file
names. On the other hand a downstream job may take its input from
multiple upstream jobs so in general jobs need multiple inputs.
- The type of a task is determined by the type of job it runs and
not the script. The same type of task could be configured to run with
different scripts, so long as they are compatible with the task
i.e. don't run MC scripts with a task designed for reconstruction
work!
- The number of different types of task is determined by the
experiment although GBS would provide a basic set to do most, and if
lucky, all of the work.
- Tasks have names, making it possible to have the same task running
multiple times, each working on a different set of input pools.
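To make the input/output asymmetry concrete, here is a minimal Python
sketch; the class and method names are purely illustrative, not the GBS
implementation:-

  class Job:
      # Illustrative only: many inputs in, exactly one output out.
      def __init__(self, inputs):
          assert inputs, "a job needs at least one input"
          self.inputs = inputs   # one or more input items
          self.output = None     # always exactly one output

      def Finish(self):
          # The experiment decides how to encode the single output; here
          # it is simply the input file names joined into one string.
          self.output = ",".join(self.inputs)

  j = Job(["sub1.root", "sub2.root"])  # a downstream job may need many inputs
  j.Finish()
  print(j.output)                      # one output string encoding both files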
The next concept is an input or output pool of I/O items that
represents the input and outputs of a job. These will contain
arbitrary strings that will provide input to the job as it configures
a Ganga job or records the outputs the job produces. These strings
could be file names for reconstruction or run numbers + seeds for MC.
I/O items must also carry a "book-keeping ID" (BKID). It
will probably be based on the triplet: (detector/sub-system, run
number, sub-run number). These can all be strings so, for example, the run
number could be 12340-12349 or the sub-run number could be *. The exact representation
can be left to implementation so long as it satisfies the following:-
- Within an I/O pool the BKID is unique (it may be used to
locate an I/O item in the pool).
- There is a simple relationship between the BKIDs of a job input and output.
- If a job output is to provide input to a downstream job they will have the same BKID.
- When tasks list themselves they will display the output BKIDs of the jobs they own.
Part of the model is that new items can be added to the input pool, so
when new data is available production can be restarted.
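Here is a sketch of how pools and BKIDs might look in Python; the
triplet fields come from above but the interfaces are assumptions made
for illustration:-

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class BKID:
      # Book-keeping ID; strings allow ranges like "12340-12349" or "*".
      subsystem: str
      run: str
      subrun: str

  class IOItem:
      def __init__(self, bkid, data):
          self.bkid = bkid  # unique within its pool
          self.data = data  # arbitrary string: file name, run number + seed, ...

  class IOPool:
      def __init__(self):
          self._items = {}  # BKID -> IOItem; the dict enforces uniqueness

      def Add(self, item):
          if item.bkid in self._items:
              raise ValueError("duplicate BKID in pool: %s" % (item.bkid,))
          self._items[item.bkid] = item

      def Find(self, bkid):
          return self._items.get(bkid)

  # New data arriving later is just more Add() calls on the input pool.
  pool = IOPool()
  pool.Add(IOItem(BKID("far", "12340-12349", "*"), "seed=1371"))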
Now I'll address the individual job error recovery model. It would be
hard to develop a general purpose system that could perform a post
mortem on the smoking remains of a failed job and harder still to work
out the correct procedure to recover, so the system doesn't attempt
to. Instead the idea is that the best place to diagnose faults and
plot a recovery course is the executing script itself. It passes back
a small text file which records success, recoverable failure or fatal
(requires human intervention). When the script next runs it will be
passed, along with the standard args, additional ones derived from
this text file so it can pick up where it left off. This does
require that the designer of the script puts thought into each step,
diagnosing problems and structuring it so as to be able to resume where
the previous incarnation left off. On the plus side the system is open
ended, recovery can be as smart as the script writer can imagine. Some
recovery can also be built into GBS itself, when the GBS Job receives the
text file back. So some basic rules will have to apply to the
communication file and it may well end up as an object, at least on
the GBS end.
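The exact rules for the communication file are still to be fixed; this
Python sketch just illustrates the round trip, assuming a simple
key=value layout with a status of SUCCESS, RECOVERABLE or FATAL:-

  SAMPLE_REPORT = "status=RECOVERABLE\nretry_args=--seed 1372\n"

  def parse_report(text):
      # Turn the script's small text file into a dict GBS keeps with the Job.
      report = {}
      for line in text.splitlines():
          if "=" in line:
              key, value = line.split("=", 1)
              report[key.strip()] = value.strip()
      return report

  def next_args(standard_args, report):
      # On a recoverable failure resubmit with the script's own hints
      # appended, so it can pick up where the last incarnation left off.
      if report.get("status") == "RECOVERABLE":
          return standard_args + report.get("retry_args", "").split()
      return standard_args

  report = parse_report(SAMPLE_REPORT)
  print(next_args(["--run", "12340"], report))  # adds '--seed', '1372'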
At the system failure level error recovery is based on the idea: run
often and make small changes. For example the cron could run every
half hour but only ever release say 10 new jobs up to some total
number so as not to needlessly overfill queues. So some catastrophic
failure, e.g. an SE goes away, would result in the failure of many jobs but
the system would only resubmit a few at a time effectively probing the
system until the failure was rectified. The system could easily be made
a little smarter, throttling back further if the failure rate was
high.
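A sketch of such a throttling rule; the numbers and the failure-rate
test are illustrative only:-

  def jobs_to_release(running, max_running, per_cycle=10, failure_rate=0.0):
      # Called once per (say half-hourly) cron cycle: release at most
      # per_cycle new jobs, never exceed max_running in total, and fall
      # back to a single probe job while the recent failure rate is high.
      quota = min(per_cycle, max_running - running)
      if failure_rate > 0.5:     # threshold is illustrative
          quota = min(quota, 1)  # probe gently until things recover
      return max(quota, 0)

  print(jobs_to_release(running=3, max_running=50))                    # 10
  print(jobs_to_release(running=3, max_running=50, failure_rate=0.9))  # 1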
This makes for a responsive system but does require that the system be
lightweight. One simple step in this direction is the principle: only
keep a Ganga job for the minimum amount of time: once it completes,
harvest its products and delete it. That way the Ganga Job Registry
is kept to a minimum size and it's this that principally determines
the time it takes to start Ganga.
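As a sketch, assuming a Ganga job exposes a status string, an outputdir
and a remove() method (assumptions about the Ganga GPI, with a stand-in
class so the example runs anywhere):-

  class FakeGangaJob:
      # Stand-in for a Ganga GPI job; the attribute names are assumptions.
      status = "completed"
      outputdir = "/tmp/ganga/42/output"
      def remove(self):
          print("deleted from the Job Registry")

  def harvest(ganga_job, record_outcome):
      # Keep a Ganga job only for the minimum time: once it finishes,
      # harvest its products and delete it so the registry stays small.
      if ganga_job.status in ("completed", "failed"):
          record_outcome(ganga_job.status, ganga_job.outputdir)
          ganga_job.remove()

  harvest(FakeGangaJob(), lambda status, outdir: print(status, outdir))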
The final concept is the ability to connect tasks together; the output IO items in
one pool feeding the inputs of the next. I wouldn't put any effort
into this unless and until the rest is working; there's not a lot of
point in connecting things that don't work! However I'll mention one
basic problem of connectors: knowing when a downstream job is ready to
run. If it takes a single input then it's easy, but what if it takes
all the sub-runs, produced separately: how does it know how many to
expect? The proposed solution is that as soon as a job is created,
and before it even gets submitted, it places an entry (a "promise") in
its output pool, so a downstream job can see all its potential input
well ahead of time and can wait for the upstream jobs to fulfil their
promises.
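A sketch of the promise mechanism; the pool interface is illustrative:-

  class OutputPool:
      def __init__(self):
          self._promised = set()   # BKIDs announced at job creation time
          self._fulfilled = {}     # BKID -> output data string

      def Promise(self, bkid):
          self._promised.add(bkid)  # made before the job is even submitted

      def Fulfil(self, bkid, data):
          self._promised.discard(bkid)
          self._fulfilled[bkid] = data

      def ReadyFor(self, bkids):
          # A downstream job that needs every sub-run can see the full
          # set of promises up front and wait until all are fulfilled.
          return all(b in self._fulfilled for b in bkids)

  pool = OutputPool()
  for sub in ("1", "2"):
      pool.Promise(("run12340", sub))
  print(pool.ReadyFor([("run12340", "1"), ("run12340", "2")]))  # False
  pool.Fulfil(("run12340", "1"), "f1.root")
  pool.Fulfil(("run12340", "2"), "f2.root")
  print(pool.ReadyFor([("run12340", "1"), ("run12340", "2")]))  # True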
GBS provides a complete set of objects, sufficient to do a very simple
production and these will be identified in the Identified Objects
section. On top of this an experiment is free to develop one or more systems by
creating new objects either inheriting and refining the GBS ones or
starting again from scratch. Where new generic features are
identified these can be moved back into GBS by creating new
objects. So eventually there may be a variety of objects and then there
has to be a way of organising them into valid combinations. This is
achieved using roles and models:-
- Role
A role is the part the object plays within GBS. The following roles are defined:-
- Model
A model is a valid combination of objects used
to implement a specific task. For each role the model identifies the
class to be used to create objects for that role and can optionally
provide a set of fixed arguments to be passed when constructing the
object. So although models will normally involve different
combinations of classes, it is possible for two models to use the same
combination of classes but just configure the objects differently.
This is a singleton dictionary that holds configuration data, e.g. the
name of the experiment and the directory where all the GBS data is
stored.
Provides a base for all persistable objects, with an interface for I/O and
holding name and type (Class).
This provides a single set of mappings from the set of roles
to the class objects that implement them and the way their objects
should be configured. A GBSModel provides a one line summary and
detailed description to help a user decide what model is appropriate
for a given task. If given a role it can create a suitable, configured
object.
GBS will provide at least one model, the default, but experiments are
free to add further ones and even replace the default.
This registers the various models. It can provide the user with the
current list, and access to individual ones so that the user can get a
detailed description. Once the model has been selected objects are
created via the GBSModelRegistry.
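Here is a sketch of how a model and the registry might fit together;
GBSModel and GBSModelRegistry are the names above, everything else is
assumed for illustration:-

  class GBSModel:
      def __init__(self, name, summary, bindings):
          self.name, self.summary = name, summary
          self._bindings = bindings  # role -> (class, fixed kwargs)

      def CreateObject(self, role, **extra):
          cls, fixed = self._bindings[role]
          kwargs = dict(fixed)
          kwargs.update(extra)       # per-call args on top of the fixed ones
          return cls(**kwargs)

  class GBSModelRegistry:
      _models = {}
      @classmethod
      def Register(cls, model):
          cls._models[model.name] = model
      @classmethod
      def Get(cls, name):
          return cls._models[name]

  class JobAnalyser:                      # two models could bind this same
      def __init__(self, max_retries=3):  # class but configure it with
          self.max_retries = max_retries  # different fixed arguments

  GBSModelRegistry.Register(GBSModel(
      "default", "basic production",
      {"JobAnalyser": (JobAnalyser, {"max_retries": 3})}))

  analyser = GBSModelRegistry.Get("default").CreateObject("JobAnalyser")
  print(analyser.max_retries)  # 3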
This singleton is used to log information about GBS itself,
principally as a diagnostic to fix bugs.
This is a singleton that gives the user access to the system. It can
list active Tasks, create new ones and request the destruction
of old ones (the Task will refuse if there are running jobs).
It can also give access to individual Tasks.
A Task is responsible for a single task; shepherding all its
jobs through to successful completion. It may accept more input.
It can monitor its jobs and decide how many new ones to submit.
It has a two phase life: in the first part it is configured. Almost
all its state can be changed at this stage. It is possible to submit
single test jobs, but these should only produce output that can be
deleted afterwards. Once it is O.K. it moves to execution phase and
can start running jobs. From now on only operational changes can be
made, for example:-
- The maximum number of jobs to run can be changed.
- The running state:-
- "Active" Will submit new jobs.
- "Passive" Jobs have been submitted but won't submit any more.
Other possible states are:-
- "Dormant" No running jobs, none will be submitted, but input pool not complete
- "Complete"
As jobs change the state of the output pools the Manager is
informed.
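The state names come from the lists above; the transition rules in this
sketch are only illustrative:-

  VALID_MOVES = {
      "Active":   {"Passive", "Dormant", "Complete"},
      "Passive":  {"Active", "Dormant"},
      "Dormant":  {"Active"},   # e.g. new input arrives
      "Complete": set(),        # terminal
  }

  class TaskState:
      def __init__(self):
          self.state = "Active"
      def MoveTo(self, new_state):
          if new_state not in VALID_MOVES[self.state]:
              raise ValueError("cannot go %s -> %s" % (self.state, new_state))
          self.state = new_state

  t = TaskState()
  t.MoveTo("Passive")  # stop submitting; jobs already out there run on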
These objects are configured to bridge between tasks. Once two
compatible tasks exist a connection can be made between them. As one
task changes its output pool it informs the Manager which in
turn informs its TaskConnectors.
This holds a list of IOItems. It can locate items from their
Book-keeping IDs (BKIDs). It can accept new IOItems.
This represents a single input to a job or the entire output from it.
It has a BKID and a data string that encodes the actual I/O, for
example file name or a list of them.
This is responsible for running a single script on a set of inputs to
produce an output. It gets created and given an application script when
a Task has IOItems in its input IOPool that are not associated
with a Job. It is offered the IOPool and can take as many
inputs as it requires. A Job could even be allowed to say: "I
want more and I won't run until you give them to me". Tasks
would first offer the input pool to existing, but unsatisfied,
Jobs, and only create new Jobs if there is anything
left. When requested, a Job submits a Ganga job to run the
application script and have it return the termination status in a text
file which it uses in co-operation with the application script to
recover from errors. Jobs, once created, live until the end of
the Task.
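A sketch of that offering order; Job.Take() and the one-input policy
here are hypothetical:-

  def offer_pool(unassigned, unsatisfied_jobs, make_job):
      # Existing but unsatisfied Jobs get first refusal; a Job may take
      # several items, or take none and keep waiting. New Jobs are only
      # created for whatever is left over.
      for job in unsatisfied_jobs:
          for item in job.Take(unassigned):
              unassigned.remove(item)
      while unassigned:
          job = make_job()
          taken = job.Take(unassigned)
          if not taken:             # the new Job is holding out for more input
              break
          for item in taken:
              unassigned.remove(item)

  class OneInputJob:
      def Take(self, items):        # hypothetical policy: one item per Job
          return [items[0]] if items else []

  items = ["run1", "run2", "run3"]
  offer_pool(items, [OneInputJob()], OneInputJob)
  print(items)  # [] - every input is now associated with a Job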
This object encapsulates the GBS end of job error recovery. Although
it could be part of Job, it is factored out to focus this
critical component into a single
interface. It gathers all the information about a job and, in
principle, its past history and determines what should be done next.
When the job gets results back from Ganga it creates a
JobAnalyser and passes itself to it. As a separate step it asks the
JobAnalyser to apply its conclusions. The pseudo code looks like:-
  analyser = CreateObject(current_model, "JobAnalyser")  # class chosen by the current model
  analyser.Analyse(self)  # gather the job's results and history and decide what to do
  analyser.Apply()        # act on that decision
This minimises the amount of experiment code that needs to get written
to tailor error handling to specific needs. A new JobAnalyser can be
written that inherits from the default one and can pass the Analyse()
call down to it to do all the spade work. Then, with all the
information to hand, it can fine tune the response before returning to
the calling Job.
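For example, in this sketch the default analyser's internals are
invented purely to show the inheritance pattern:-

  class DefaultJobAnalyser:
      # Stand-in for the GBS-supplied analyser.
      def Analyse(self, job):
          self.job = job
          self.action = "resubmit" if job.status == "RECOVERABLE" else "halt"
      def Apply(self):
          print("action:", self.action)

  class MyJobAnalyser(DefaultJobAnalyser):
      # Experiment-specific refinement: let the base class do the spade
      # work, then fine tune the response with all the information to hand.
      def Analyse(self, job):
          DefaultJobAnalyser.Analyse(self, job)
          if self.action == "resubmit" and job.retries > 5:
              self.action = "halt"  # too many retries: ask for human help

  class FakeJob:
      status, retries = "RECOVERABLE", 6

  analyser = MyJobAnalyser()
  analyser.Analyse(FakeJob())
  analyser.Apply()  # action: halt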