A fault tolerant batch submission framework layered on Ganga
Status: As of version V01-15-01 GBS is a running system
although it has only been tested so far on an internal dummy application.
Major features (IOPool, IOItem and TaskConnector) are missing. As a
temporary, and possibly permanent, replacement for the IO items, Tasks
directly create the inputs for their Jobs.
Each child object's state is held in a file <child_name>.state inside a directory named after its parent <parent_name>, giving the following Storage Directory layout:-

```
<top-dir>/
    Manager.state
    Manager/
        <task name 1>.state
        <task name 1>/
            <job name 1>.state
            <job name 2>.state
            ...
        <task name 2>.state
        <task name 2>/
            <job name 1>.state
            <job name 2>.state
            ...
```
_DoMemberIO(self,ioh) is the member function that steers the I/O of its own member data. ioh is a GBSIOHelper object that simplifies the process of writing I/O. For every stored data member there has to be a single line in the _DoMemberIO function of the form:-
```python
<member variable> = ioh(<Description>, <type>, <member variable>)
```

where:

  * <member variable> is the member variable e.g. self.__model
  * <Description> is a short description of the variable e.g. "Model"
  * <type> is the type of persistency used and must be one of the following:
    * "s" Convert to/from string
    * "i" Convert to/from integer
    * "f" Convert to/from floating point
    * "p" Convert using pickle

The descriptions serve two purposes: they are written to the state file along with the data, making it easier to read and interpret these files, and when read back each description is checked to see that it is as expected and hence that the file is not corrupt.
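Purely for illustration (the member names and descriptions below are invented, not taken from the GBS code), a _DoMemberIO body exercising each of the four type codes might contain lines such as:

```python
# One ioh() line per persistency type.
self.__name    = ioh("Name",     "s", self.__name)     # string
self.__nJobs   = ioh("Num Jobs", "i", self.__nJobs)    # integer
self.__cpuTime = ioh("CPU Time", "f", self.__cpuTime)  # floating point
self.__options = ioh("Options",  "p", self.__options)  # arbitrary object via pickle
```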
Once all stored data members have been dealt with, the function should end by calling the same method in the direct antecedent in its inheritance chain. For example GBSTask:-
```python
def _DoMemberIO(self,ioh):
    self.__model = ioh("Model","s",self.__model)
    GBSObject._DoMemberIO(self,ioh)
```

I/O is initiated by the GBSObject's Read() and Write() methods, which are simply an interface to the generic DoIO(mode) method. DoIO(mode) constructs a GBSIOHelper and then calls the overloaded _DoMemberIO(ioh) method at the top of the inheritance tree to get its data transferred before calling the classes further down the tree.
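A minimal sketch of that flow is shown below. The Read(), Write(), DoIO() and _DoMemberIO() names come from the description above, but the GBSIOHelper constructor arguments, the Close() call and the GetStateFileName() accessor are assumptions made purely for illustration:

```python
class GBSObject(object):
    # Sketch only: Read()/Write() funnel into DoIO(mode), which builds a
    # GBSIOHelper and hands it to the most-derived _DoMemberIO().
    def Read(self):
        self.DoIO("read")

    def Write(self):
        self.DoIO("write")

    def DoIO(self, mode):
        ioh = GBSIOHelper(self.GetStateFileName(), mode)  # assumed signature
        self._DoMemberIO(ioh)  # most-derived override runs first, then chains
                               # back up through each base class in turn
        ioh.Close()            # assumed: flush and close the state file
```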
Note that this model doesn't support multiple inheritance; if any developer requires that they will have to be responsible for the I/O.
```python
def __init__(self,name,parent,model):
    self.__model = model
    GBSObject.__init__(self,name,parent)
```

It might seem more natural to initialise the base class first, the way that C++ does, but doing it this way means that when GBSObject's initialiser is called and has completed its initialisation, it can check whether the state file exists and call either the Read() or Write() method as appropriate to bring the object into synchronisation with its state file.
WriteFamily(self) is a member function that must first call Write(self) and then call WriteFamily() for each of its children, if any. Currently the only use for this method is as a way to refresh the entire Storage Directory to help with Schema Evolution - see the next section.
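A minimal sketch of that pattern, assuming a GetChildren() accessor that is not part of the documented interface:

```python
def WriteFamily(self):
    # Write this object's own state first ...
    self.Write()
    # ... then recurse into any children (GetChildren() is an assumed accessor).
    for child in self.GetChildren():
        child.WriteFamily()
```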
To signal that a data member has been added place a '+' before the name passed to ioh, for example:-
```python
self.newStuff = ioh("+New Stuff","s",self.newStuff)
```

On reading, if the entry isn't in the file the existing value is returned. Of course this requires that the caller has provided a sensible default, but it's good programming practice to do that in the class's __init__(self) method in any case. Writing proceeds as normal, i.e. in the above case 'New Stuff' is written out.
To remove an item, precede the name with a '-'. Of course in this case the caller isn't interested in any obsolete value that still remains in the file. For example:-
ioh("-Old Stuff","s",self.oldStuff)On reading if the value exists it will be returned but it will not be an error if it does not. On writing the item is not written to the file.
This system is best used as a migration strategy. The changes are made and the entire state rewritten:-
```python
man.WriteFamily()
```

Then, having checked that the state files really have migrated across, update the ioh call sequences to reflect the new state.
```python
print obj.__doc__
```

to instruct them on the object's use.
If needed this can be followed by (i.e. outside of __doc__) more detailed documentation of the class and its methods.
```python
    ###### GBSObject inherited responsibilities ######
```

Note: this header comment, and the others described below, need to be indented to the first level of indentation; if they start in the first column they cause doxygen to truncate its Class Reference page for the class.
Followed by:-
```python
    ###### User Callable Methods (Getters then Setters) ######
```

Followed by these methods organised alphabetically, Getters first and then Setters.
```python
    ###### Private Methods (not user callable) ######
```

Followed by these methods organised alphabetically.
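Purely as an illustration of this layout (the class and its methods below are invented for the example), a GBS class might be organised like this:

```python
class GBSMyTask(GBSTask):
    """One line summary of the class, followed by fuller usage instructions
    that users can see with: print obj.__doc__"""

    ###### GBSObject inherited responsibilities ######

    def __init__(self, name, parent, model):
        self.__numEvents = 0                       # defaults before base init
        GBSTask.__init__(self, name, parent, model)

    def _DoMemberIO(self, ioh):
        self.__numEvents = ioh("Num Events", "i", self.__numEvents)
        GBSTask._DoMemberIO(self, ioh)

    ###### User Callable Methods (Getters then Setters) ######

    def GetNumEvents(self):
        return self.__numEvents

    def SetNumEvents(self, n):
        self.__numEvents = n

    ###### Private Methods (not user callable) ######

    def __CheckNumEvents(self):
        pass
```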
Experiments wishing to add additional models should:-
  * __init__.py
  * register_user_models.py
For each model, each role is assigned a class object from which objects can be created. Further, for each role within the model a dictionary can be assigned which is passed to the object constructor (i.e. class object) to customise it for this model. So two models could use the same set of class objects but simply customise the objects created from them in different ways.
All object constructors require the following four arguments:-
```python
GetModelRegistry().CreateObject(model,"Task",name,self)
```
```python
def CreateObject(self,model,role,object_name,parent) :
    m = self.GetModel(model)
    return m.CreateObject(role,object_name,parent)
```
```python
def CreateObject(self,role,object_name,parent) :
    return self.__class_map[role](object_name,parent,self.__name,self.__ctor_list_map[role])
```
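As a sketch of how two models could share the same class objects but customise them differently, as described above: none of the names below (GBSModel, RegisterModel, MyTask, MyJob, the keyword arguments) are taken from the GBS code, they only illustrate the role-to-class-object and role-to-dictionary mappings.

```python
# Two hypothetical models sharing class objects but with different
# customisation dictionaries passed to the constructors.
test_model = GBSModel("MyExptTest",
                      class_map     = {"Task": MyTask, "Job": MyJob},
                      ctor_list_map = {"Task": {"queue": "short"},
                                       "Job":  {"max_events": 100}})

prod_model = GBSModel("MyExptProduction",
                      class_map     = {"Task": MyTask, "Job": MyJob},
                      ctor_list_map = {"Task": {"queue": "long"},
                                       "Job":  {"max_events": 100000}})

GetModelRegistry().RegisterModel(test_model)   # RegisterModel is an assumed method
GetModelRegistry().RegisterModel(prod_model)
```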
<date-time> <keyword> <associated data>
Keyword | Associated data | GBS Response |
---|---|---|
SUCCEEDED | The output file data for the corresponding IOItem | Marks the job as SUCCEEDED |
FAILED | A suitable diagnostic string | Marks the job as FAILED. Flags the job for user intervention |
RETRY | Retry information to be returned to the application script the next time the job is submitted. Convention: use RESTART to start again from scratch. | Attempts to re-run the job up to a limit and then marks it as FAILED, requiring user intervention |
HOLD | Retry information to be returned to the application script the next time the job is submitted. Convention: use RESTART to start again from scratch. | Places the job on hold. When released the job will retry. |
INFO | Information about job progress. | None, although useful if intervention is required. |
RETURN_FILE | Absolute file name to be returned. | Returns the file. See Runtime output sandbox names. To return multiple files have multiple RETURN_FILE lines. |
The file should contain exactly one SUCCEEDED, FAILED, HOLD or RETRY keyword and must have non-empty associated data, but there can be any (reasonable) number of INFO lines describing how the job is progressing, which can be used when diagnosing problems. A helper script, accessed via $GBS_LOG, is provided to simplify including the date and time, which can be a very useful diagnostic, in a standard format:-
```bash
$GBS_LOG INFO So far so good
...
$GBS_LOG INFO spoke too soon, I cannot run code, so signal retry from step 3
...
$GBS_LOG RETRY 3
```
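For orientation only, a helper of this kind might be no more than a few lines of shell. The sketch below is an assumption, not the actual GBS script; in particular the GLF path variable GBS_GLF_FILE is invented:

```bash
#!/bin/sh
# Hypothetical sketch of a $GBS_LOG helper: stamp the line with a standard
# UTC date-time, then append "<date-time> <keyword> <associated data>" to the GLF.
keyword=$1; shift
echo "`date -u '+%Y-%m-%d %H:%M:%S'` $keyword $*" >> $GBS_GLF_FILE
```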
The core GBS code makes no assumptions about the form of the associated data; that is left entirely as a matter for the application script, and possibly an experiment extension to the core job analysis code. However, there is one convention:-
For RETRY the associated data RESTART should be used to start again from scratch
This is only used when analysing global log files; for a RETRY that just requests RESTART it is assumed that the job did nothing useful and any CPU used was wasted, otherwise the CPU is considered useful.
The application script should exit without an error code; the error should be signalled via the GBS Log File.
The job submitted by the Job runs a job wrapper script that prepares a GLF and then calls the application script. The reason for the two stage approach is that:-
The job wrapper prepares an environment for the application script:-
Variable | Meaning |
---|---|
GBS_HOME | Home directory, i.e. working directory when the job was launched. |
GBS_LOG | Helper script to write to the GBS Log File, e.g.:- $GBS_LOG SUCCEEDED CandA.root CandS.root ntupleSR.root |
GBS_MODE | Either "Test" or "Production". The application should not write to production data areas if in "Test" mode. |
GBS_RETRY_COUNT | Will be 0 for the first attempt. |
GBS_RETRY_ARG_1 GBS_RETRY_ARG_2 ... GBS_RETRY_ARG_n | Retry args. There will be none if GBS_RETRY_COUNT = 0, otherwise the number will depend on how many retry args were returned from the script and how the Job responded to them. |
After this the wrapper passes the following arguments to the application script:-
So a template for an application script might look something like this:-
```bash
# Record first info.
$GBS_LOG INFO Starting execution of $0 in mode $GBS_MODE on `hostname`

# Load standard script args
first_arg=$1;  shift
second_arg=$1; shift
...
last_arg=$1;   shift

start_step=1   # Normally start at step 1

# Deal with recovery
if [ $GBS_RETRY_COUNT -gt 0 ]; then
  # Call recovery routine
  recover
fi

# Resume normal running at start_step
run $start_step

###############################################
recover() {
  # This is the recovery code.
  # Attempt to recover. If successful it may update the GLF and exit,
  # if that now completes the job, or it might modify start_step to
  # resume standard execution. The simplest possible recovery would
  # be to start again at step 1, in which case the recover routine
  # doesn't have to do anything.
  # If unsuccessful it should update the GLF and exit e.g.:-
  $GBS_LOG RETRY $recover_arg_1 ... $recover_arg_n
  exit 0
}

###############################################
run() {
  # This runs the standard application as a sequence of steps and
  # can resume at a supplied step
  start_step=$1

  if [ $start_step -le 1 ] ; then
    $GBS_LOG INFO Starting step 1
    # Do step 1
    # Were some fatal error to occur:-
    $GBS_LOG FAILED diagnostic info
    exit 0
    ...
  fi

  if [ $start_step -le 2 ] ; then
    $GBS_LOG INFO Starting step 2
    # Do step 2
    ...
  fi

  ...

  if [ $start_step -le n ] ; then
    $GBS_LOG INFO Starting step n
    # Do step n
    ...
  fi

  # If everything ends O.K., signal completion
  $GBS_LOG SUCCEEDED $output_file_1 ... $output_file_n
  exit 0
}
```
There are 3 separate retry limits:-
The situation is complicated by the fact that failures can occur at multiple levels which all have to be handled. The table below lists the "communication levels" at which the information is returned and the way GBS responds.
Communication Level | Description | GBS response |
---|---|---|
GANGA | Ganga communication failed or returned an unrecognised code. | Treat as LATE_UNHANDLED |
GRID | Ganga returned aborted or cancelled. | Treat ABORTED as LATE_UNHANDLED. Treat CANCELLED by placing the job on HOLD |
BATCH | Ganga returned failed, or returned O.K. but the GLF is either missing or does not contain the job wrapper start line. | Treat as EARLY |
WORKER | GLF exists with the job wrapper start line but no job wrapper end line. | If there are two or more entries in the GLF, use them to determine the minimum running time, otherwise set the running time to 0. Then classify as either EARLY or LATE_UNHANDLED |
APPLICATION | GLF exists with job wrapper start and end lines but the application either ends with a non-zero code or fails to write one of SUCCEEDED, FAILED, HOLD or RETRY to the GLF. | Use start and end times to classify as EARLY or LATE_UNHANDLED |
USER | As APPLICATION but the application ends O.K. and writes one of SUCCEEDED, FAILED, HOLD or RETRY to the GLF. | Pass back SUCCEEDED and FAILED, but handle HOLD and RETRY as either EARLY or LATE_HANDLED and subject them to limits. |
In those cases where the job is to be retried but communication failed to reach USER level, the Job uses the same retry information as for the previous attempt.
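The EARLY versus LATE distinction used above could be imagined along the following lines; the threshold value and function name are assumptions for illustration only, not documented GBS behaviour:

```python
EARLY_LIMIT_SECONDS = 300   # assumed cut: a job that dies this quickly counts as EARLY

def ClassifyFailure(min_running_time, handled):
    # handled is True when the failure was signalled through the GLF at USER
    # level (HOLD or RETRY); otherwise it is an unhandled failure.
    if min_running_time < EARLY_LIMIT_SECONDS:
        return "EARLY"
    return "LATE_HANDLED" if handled else "LATE_UNHANDLED"
```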
The JobAnalyser, on behalf of the Job it is servicing, has to analyse the results, decide what to do next, and then apply actions which consist of making the following changes to the Job.
Data Member | Action |
---|---|
__earlyFails __lateFailsHandled __lateFailsUnhandled | Increment as appropriate |
__statusCode | Set to one of: SUCCEEDED, FAILED, RETRY or HOLD |
__statusText | The reason it reached this status |
__currentRetryArgs | If the application signalled HOLD or RETRY and it was accepted, update with the data it passed, otherwise leave alone |
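A sketch of how these updates might look in code: only the Job data members come from the table above, while the class name GBSJob, the method name and the attributes of the analysis result object are assumptions made for illustration.

```python
class GBSJob(GBSObject):
    # ... other responsibilities omitted ...

    def _ApplyAnalysis(self, result):
        # Bump the appropriate failure counter, but only on failure.
        if result.status != "SUCCEEDED":
            if result.early:
                self.__earlyFails += 1
            elif result.handled:
                self.__lateFailsHandled += 1
            else:
                self.__lateFailsUnhandled += 1
        # Record the outcome and the reason it was reached.
        self.__statusCode = result.status    # SUCCEEDED, FAILED, RETRY or HOLD
        self.__statusText = result.reason
        # Keep retry information only if the application supplied it and it was
        # accepted; otherwise the previous retry args are reused on resubmission.
        if result.status in ("HOLD", "RETRY") and result.accepted:
            self.__currentRetryArgs = result.retryArgs
```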
When the Job is ready to submit the next attempt it:-