A fault tolerant batch submission framework layered on Ganga
Status: As of version V01-15-01 GBS is a running system
although it has only been tested so far on an internal dummy application.
Major features (IOPool, IOItem and TaskConnector) are missing. As a
temporary, and possibly permanent, replacement for the IO items, Tasks
directly create the inputs for their Jobs.
Each child object's state is held in a file <child_name>.state inside a directory named after its parent <parent_name>, giving the following Storage Directory layout:-

```
<top-dir>/
    Manager.state
    Manager/
        <task name 1>.state
        <task name 1>/
            <job name 1>.state
            <job name 2>.state
            ...
        <task name 2>.state
        <task name 2>/
            <job name 1>.state
            <job name 2>.state
            ...
```
_DoMemberIO(self,ioh) is the member function that steers the I/O of its own member data. ioh is a GBSIOHelper object that simplifies the process of writing I/O. For every stored data member there has to be a single line in the _DoMemberIO function of the form:-
```python
<member variable> = ioh(<Description>, <type>, <member variable>)
```

where:

  * <member variable> is the member variable e.g. self.__model
  * <Description> is a short description of the variable e.g. "Model"
  * <type> is the type of persistency used and must be one of the following:
    * "s" Convert to/from string
    * "i" Convert to/from integer
    * "f" Convert to/from floating point
    * "p" Convert using pickle

The descriptions serve two purposes: they are written to the state file along with the data, making it easier to read and interpret these files, and when read back each description is checked to see that it is as expected and hence that the file is not corrupt.
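Purely for illustration (the member names and descriptions below are invented, not taken from the GBS code), a _DoMemberIO body exercising each of the four type codes might contain lines such as:

```python
# One ioh() line per persistency type.
self.__name    = ioh("Name",     "s", self.__name)     # string
self.__nJobs   = ioh("Num Jobs", "i", self.__nJobs)    # integer
self.__cpuTime = ioh("CPU Time", "f", self.__cpuTime)  # floating point
self.__options = ioh("Options",  "p", self.__options)  # arbitrary object via pickle
```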
Once all stored data members have been dealt with, the function should end by calling the same method in the direct antecedent in its inheritance chain. For example GBSTask:-
```python
def _DoMemberIO(self,ioh):
    self.__model = ioh("Model","s",self.__model)
    GBSObject._DoMemberIO(self,ioh)
```

I/O is initiated by the GBSObject's Read() and Write() methods, which are simply an interface to the generic DoIO(mode) method. DoIO(mode) constructs a GBSIOHelper and then calls the overloaded _DoMemberIO(ioh) method at the top of the inheritance tree to get its data transferred before calling the classes further down the tree.
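A minimal sketch of that flow is shown below. The Read(), Write(), DoIO() and _DoMemberIO() names come from the description above, but the GBSIOHelper constructor arguments, the Close() call and the GetStateFileName() accessor are assumptions made purely for illustration:

```python
class GBSObject(object):
    # Sketch only: Read()/Write() funnel into DoIO(mode), which builds a
    # GBSIOHelper and hands it to the most-derived _DoMemberIO().
    def Read(self):
        self.DoIO("read")

    def Write(self):
        self.DoIO("write")

    def DoIO(self, mode):
        ioh = GBSIOHelper(self.GetStateFileName(), mode)  # assumed signature
        self._DoMemberIO(ioh)  # most-derived override runs first, then chains
                               # back up through each base class in turn
        ioh.Close()            # assumed: flush and close the state file
```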
Note that this model doesn't support multiple inheritance; if any developer requires that they will have to be responsible for the I/O.
```python
def __init__(self,name,parent,model):
    self.__model = model
    GBSObject.__init__(self,name,parent)
```

It might seem more natural to initialise the base class first, the way that C++ does, but doing it this way means that when GBSObject's initialiser is called and has completed its initialisation, it can check whether the state file exists and call either the Read() or Write() method as appropriate to bring the object into synchronisation with its state file.
WriteFamily(self) is a member function that must first call Write(self) and then call WriteFamily() for each of its children, if any. Currently the only use for this method is as a way to refresh the entire Storage Directory to help with Schema Evolution - see the next section.
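A minimal sketch of that pattern, assuming a GetChildren() accessor that is not part of the documented interface:

```python
def WriteFamily(self):
    # Write this object's own state first ...
    self.Write()
    # ... then recurse into any children (GetChildren() is an assumed accessor).
    for child in self.GetChildren():
        child.WriteFamily()
```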
To signal that a data member has been added place a '+' before the name passed to ioh, for example:-
```python
self.newStuff = ioh("+New Stuff","s",self.newStuff)
```

On reading, if the entry isn't in the file the existing value is returned. Of course this requires that the caller has provided a sensible default, but it's good programming practice to do that in the class's __init__(self) method in any case. Writing proceeds as normal, i.e. in the above case 'New Stuff' is written out.
To remove an item, precede the name with a '-'. Of course in this case the caller isn't interested in any obsolete value that still remains in the file. For example:-
ioh("-Old Stuff","s",self.oldStuff)On reading if the value exists it will be returned but it will not be an error if it does not. On writing the item is not written to the file.
This system is best used as a migration strategy. The changes are made and the entire state rewritten:-
```python
man.WriteFamily()
```

Then, having checked that the state files really have migrated across, update the ioh call sequences to reflect the new state.
```python
print obj.__doc__
```

to instruct them on the object's use.
If needed this can be followed by (i.e. outside of __doc__) more detailed documentation of the class and its methods.
```python
    ###### GBSObject inherited responsibilities ######
```

Note: this header comment, and the others described below, need to be indented to the first level of indentation; if they start in the first column they cause doxygen to truncate its Class Reference page for the class.
Followed by:-
```python
    ###### User Callable Methods (Getters then Setters) ######
```

Followed by these methods organised alphabetically, Getters first and then Setters.
```python
    ###### Private Methods (not user callable) ######
```

Followed by these methods organised alphabetically.
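Purely as an illustration of this layout (the class and its methods below are invented for the example), a GBS class might be organised like this:

```python
class GBSMyTask(GBSTask):
    """One line summary of the class, followed by fuller usage instructions
    that users can see with: print obj.__doc__"""

    ###### GBSObject inherited responsibilities ######

    def __init__(self, name, parent, model):
        self.__numEvents = 0                       # defaults before base init
        GBSTask.__init__(self, name, parent, model)

    def _DoMemberIO(self, ioh):
        self.__numEvents = ioh("Num Events", "i", self.__numEvents)
        GBSTask._DoMemberIO(self, ioh)

    ###### User Callable Methods (Getters then Setters) ######

    def GetNumEvents(self):
        return self.__numEvents

    def SetNumEvents(self, n):
        self.__numEvents = n

    ###### Private Methods (not user callable) ######

    def __CheckNumEvents(self):
        pass
```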
Experiments wishing to add additional models should:-
  * __init__.py
  * register_user_models.py
For each model, each role is assigned a class object from which objects can be created. Further, for each role within the model a dictionary can be assigned which is passed to the object constructor (i.e. class object) to customise it for this model. So two models could use the same set of class objects but simply customise the objects created from them in different ways.
All object constructors require the following four arguments:-
```python
GetModelRegistry().CreateObject(model,"Task",name,self)
```
```python
def CreateObject(self,model,role,object_name,parent) :
    m = self.GetModel(model)
    return m.CreateObject(role,object_name,parent)
```
```python
def CreateObject(self,role,object_name,parent) :
    return self.__class_map[role](object_name,parent,self.__name,self.__ctor_list_map[role])
```
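As a sketch of how two models could share the same class objects but customise them differently, as described above: none of the names below (GBSModel, RegisterModel, MyTask, MyJob, the keyword arguments) are taken from the GBS code, they only illustrate the role-to-class-object and role-to-dictionary mappings.

```python
# Two hypothetical models sharing class objects but with different
# customisation dictionaries passed to the constructors.
test_model = GBSModel("MyExptTest",
                      class_map     = {"Task": MyTask, "Job": MyJob},
                      ctor_list_map = {"Task": {"queue": "short"},
                                       "Job":  {"max_events": 100}})

prod_model = GBSModel("MyExptProduction",
                      class_map     = {"Task": MyTask, "Job": MyJob},
                      ctor_list_map = {"Task": {"queue": "long"},
                                       "Job":  {"max_events": 100000}})

GetModelRegistry().RegisterModel(test_model)   # RegisterModel is an assumed method
GetModelRegistry().RegisterModel(prod_model)
```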
<date-time> <keyword> <associated data>
Keyword | Associated data | GBS Response |
---|---|---|
SUCCEEDED | The output file data for the corresponding IOItem | Marks the job as SUCCEEDED |
FAILED | A suitable diagnostic string | Marks the job as FAILED. Flags the job for user intervention |
RETRY | Retry information to be returned to the application script the next time the job is submitted. Convention: use RESTART to start again from scratch. | Attempts to re-run the job up to a limit and then marks it as FAILED, requiring user intervention |
HOLD | Retry information to be returned to the application script the next time the job is submitted. Convention: use RESTART to start again from scratch. | Places the job on hold. When released the job will retry. |
INFO | Information about job progress. | None, although useful if intervention is required. |
RETURN_FILE | Absolute file name to be returned. | Returns the file. See Runtime output sandbox names. To return multiple files have multiple RETURN_FILE lines. |
The file should contain exactly one SUCCEEDED, FAILED, HOLD or RETRY keyword and must have non-empty associated data, but there can be any (reasonable) number of INFO lines describing how the job is progressing, which can be used when diagnosing problems. A helper script, accessed via $GBS_LOG, is provided to simplify including the date and time, which can be a very useful diagnostic, in a standard format:-
```bash
$GBS_LOG INFO So far so good
...
$GBS_LOG INFO spoke too soon, I cannot run code, so signal retry from step 3
...
$GBS_LOG RETRY 3
```
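For orientation only, a helper of this kind might be no more than a few lines of shell. The sketch below is an assumption, not the actual GBS script; in particular the GLF path variable GBS_GLF_FILE is invented:

```bash
#!/bin/sh
# Hypothetical sketch of a $GBS_LOG helper: stamp the line with a standard
# UTC date-time, then append "<date-time> <keyword> <associated data>" to the GLF.
keyword=$1; shift
echo "`date -u '+%Y-%m-%d %H:%M:%S'` $keyword $*" >> $GBS_GLF_FILE
```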
The core GBS code makes no assumptions about the form of the associated data; that is left entirely as a matter for the application script, and possibly an experiment extension to the core job analysis code. However, there is one convention:-
For RETRY the associated data RESTART should be used to start again from scratch
This is only used when analysing global log files; for a RETRY that just requests RESTART it is assumed that the job did nothing useful and any CPU used was wasted, otherwise the CPU is considered useful.
The application script should exit without an error code; the error should be signalled via the GBS Log File.
The job submitted by the Job runs a job wrapper script that prepares a GLF and then calls the application script. The reason for the two stage approach is that:-
The job wrapper prepares an environment for the application script:-
Variable | Meaning |
---|---|
GBS_HOME | Home directory, i.e. working directory when the job was launched. |
GBS_LOG | Helper script to write to the GBS Log File, e.g.:- $GBS_LOG SUCCEEDED CandA.root CandS.root ntupleSR.root |
GBS_MODE | Either "Test" or "Production". The application should not write to production data areas if in "Test" mode. |
GBS_RETRY_COUNT | Will be 0 for the first attempt. |
GBS_RETRY_ARG_1 GBS_RETRY_ARG_2 ... GBS_RETRY_ARG_n | Retry args. There will be none if GBS_RETRY_COUNT = 0, otherwise the number will depend on how many retry args were returned from the script and how the Job responded to them. |
After this the wrapper passes the following arguments to the application script:-
So a template for an application script might look something like this:-
```bash
# Record first info.
$GBS_LOG INFO Starting execution of $0 in mode $GBS_MODE on `hostname`

# Load standard script args
first_arg=$1;  shift
second_arg=$1; shift
...
last_arg=$1;   shift

start_step=1   # Normally start at step 1

# Deal with recovery
if [ $GBS_RETRY_COUNT -gt 0 ]; then
  # Call recovery routine
  recover
fi

# Resume normal running at start_step
run $start_step

###############################################
recover() {
  # This is the recovery code.
  # Attempt to recover. If successful it may update the GLF and exit,
  # if that now completes the job, or it might modify start_step to
  # resume standard execution. The simplest possible recovery would
  # be to start again at step 1, in which case the recover routine
  # doesn't have to do anything.
  # If unsuccessful it should update the GLF and exit e.g.:-
  $GBS_LOG RETRY $recover_arg_1 ... $recover_arg_n
  exit 0
}

###############################################
run() {
  # This runs the standard application as a sequence of steps and
  # can resume at a supplied step
  start_step=$1

  if [ $start_step -le 1 ] ; then
    $GBS_LOG INFO Starting step 1
    # Do step 1
    # Were some fatal error to occur:-
    $GBS_LOG FAILED diagnostic info
    exit 0
    ...
  fi

  if [ $start_step -le 2 ] ; then
    $GBS_LOG INFO Starting step 2
    # Do step 2
    ...
  fi

  ...

  if [ $start_step -le n ] ; then
    $GBS_LOG INFO Starting step n
    # Do step n
    ...
  fi

  # If everything ends O.K., signal completion
  $GBS_LOG SUCCEEDED $output_file_1 ... $output_file_n
  exit 0
}
```
There are 3 separate retry limits:-
The situation is complicated by the fact that failures can occur at multiple levels which all have to be handled. The table below lists the "communication levels" at which the information is returned and the way GBS responds.
Communication Level | Description | GBS response |
---|---|---|
GANGA | Ganga communication failed or returned an unrecognised code. | Treat as LATE_UNHANDLED |
GRID | Ganga returned aborted or cancelled. | Treat ABORTED as LATE_UNHANDLED. Treat CANCELLED by placing the job on HOLD |
BATCH | Ganga returned failed, or returned O.K. but the GLF is either missing or does not contain the job wrapper start line. | Treat as EARLY |
WORKER | GLF exists with the job wrapper start line but no job wrapper end line. | If there are two or more entries in the GLF, use them to determine the minimum running time, otherwise set the running time to 0. Then classify as either EARLY or LATE_UNHANDLED |
APPLICATION | GLF exists with job wrapper start and end lines but the application either ends with a non-zero code or fails to write one of SUCCEEDED, FAILED, HOLD or RETRY to the GLF. | Use start and end times to classify as EARLY or LATE_UNHANDLED |
USER | As APPLICATION but the application ends O.K. and writes one of SUCCEEDED, FAILED, HOLD or RETRY to the GLF. | Pass back SUCCEEDED and FAILED, but handle HOLD and RETRY as either EARLY or LATE_HANDLED and subject them to limits. |
In those cases where the job is to be retried but communication failed to reach USER level, the Job uses the same retry information as for the previous attempt.
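The EARLY versus LATE distinction used above could be imagined along the following lines; the threshold value and function name are assumptions for illustration only, not documented GBS behaviour:

```python
EARLY_LIMIT_SECONDS = 300   # assumed cut: a job that dies this quickly counts as EARLY

def ClassifyFailure(min_running_time, handled):
    # handled is True when the failure was signalled through the GLF at USER
    # level (HOLD or RETRY); otherwise it is an unhandled failure.
    if min_running_time < EARLY_LIMIT_SECONDS:
        return "EARLY"
    return "LATE_HANDLED" if handled else "LATE_UNHANDLED"
```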
The JobAnalyser, on behalf of the Job it is servicing, has to analyse the results, decide what to do next, and then apply actions which consist of making the following changes to the Job.
Data Member | Action |
---|---|
__earlyFails __lateFailsHandled __lateFailsUnhandled | Increment as appropriate |
__statusCode | Set to one of: SUCCEEDED, FAILED, RETRY or HOLD |
__statusText | The reason it reached this status |
__currentRetryArgs | If the application signalled HOLD or RETRY and it was accepted, update with the data it passed, otherwise leave alone |
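A sketch of how these updates might look in code: only the Job data members come from the table above, while the class name GBSJob, the method name and the attributes of the analysis result object are assumptions made for illustration.

```python
class GBSJob(GBSObject):
    # ... other responsibilities omitted ...

    def _ApplyAnalysis(self, result):
        # Bump the appropriate failure counter, but only on failure.
        if result.status != "SUCCEEDED":
            if result.early:
                self.__earlyFails += 1
            elif result.handled:
                self.__lateFailsHandled += 1
            else:
                self.__lateFailsUnhandled += 1
        # Record the outcome and the reason it was reached.
        self.__statusCode = result.status    # SUCCEEDED, FAILED, RETRY or HOLD
        self.__statusText = result.reason
        # Keep retry information only if the application supplied it and it was
        # accepted; otherwise the previous retry args are reused on resubmission.
        if result.status in ("HOLD", "RETRY") and result.accepted:
            self.__currentRetryArgs = result.retryArgs
```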
When the Job is ready to submit the next attempt it:-