Notes on the discussion about running the D0 framework/sam on a cluster of computers (Wed June 20, 2001).
Present: Gabriele Garzoglio, Lee Lueking, Ruth Pordes, Igor Terekhov, Vicky White
External input: Dave Fagan

1) Introducing the concept of phases in the job processing in sam: we discussed the possibility of implementing a sam command of the "submit" family, in which the user specifies an input dataset and a series of scripts logically chained together so that the output of one script becomes the input of the next (piping).

In this scenario, should the intermediate output files of each script be written into sam as a new dataset? If we suppose that the output of the chain is a single file of filtered events, the intermediate output files should be stored: in case of crashes, one wants to know how many intermediate files contributed to the single output file in order to compute the luminosity. A different approach would be to consider the whole process invalid if some nodes crashed, but this is quite radical, since the output may be of interest even if incomplete.

A drawback of storing information at every step is the need for a reliable network between the cluster and the station. In principle, since all the information for the various phases is processed within the cluster, such a reliable connection is not needed to process the whole chain. Glitches in the network could hold up jobs that would otherwise start immediately, without having to wait for the output of the previous script to be stored and then redelivered to them via sam. We concluded that implementing a mechanism for transferring the intermediate information at a deferred time would be desirable.

Some consideration was given to what information should be included as metadata, in order to be able to reconstruct the generator of each intermediate dataset.
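The chained-phase idea with deferred metadata transfer could be sketched roughly as follows. This is a toy model in Python, not sam code: the names (PhaseChain, flush_to_station, the metadata fields) are hypothetical illustrations of the design, chosen only to show how intermediate provenance could be queued locally and shipped to the station later.

```python
# Sketch of chained processing phases with deferred metadata storage.
# Each phase consumes the previous phase's output files (piping); for
# every phase we queue a provenance record locally, and flush the queue
# to the station only when a reliable connection is available.
# All names here are hypothetical, not the real sam interface.

class PhaseChain:
    def __init__(self, dataset):
        self.files = list(dataset)   # current inputs (file names)
        self.pending = []            # metadata awaiting deferred upload

    def run_phase(self, name, func):
        """Apply one processing step; record provenance for each output."""
        outputs = [func(f) for f in self.files]
        self.pending.append({
            "phase": name,               # process name/version for metadata
            "inputs": list(self.files),  # lets us count contributing files
            "outputs": outputs,          # e.g. for luminosity accounting
        })
        self.files = outputs             # pipe: outputs feed the next phase
        return outputs

    def flush_to_station(self, store):
        """Deferred transfer: ship queued metadata once the network is up."""
        store.extend(self.pending)
        self.pending = []


# Hypothetical usage: two phases piped together, then a deferred flush.
chain = PhaseChain(["raw1.dat", "raw2.dat"])
chain.run_phase("filter-v1", lambda f: f + ".filtered")
result = chain.run_phase("merge-v1", lambda f: f + ".merged")
station = []                    # stand-in for the station-side metadata store
chain.flush_to_station(station)
```

The point of the sketch is only the separation of concerns: the whole chain runs inside the cluster on local files, and the station sees the intermediate records in one deferred batch rather than at every step.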
Storing the names/versions of the processes leading up to that output dataset would be one piece of information; adding a data tier to characterize this new kind of dataset would be another; further information may be desirable as well.

2) The interaction of sam with Root: many D0 collaborators will use Root to do analysis; furthermore, the Root format is one of the two chosen to store the D0 thumbnails. Creating an infrastructure to manipulate Root's input/output via sam may therefore be of interest.

In principle, if the management of processing phases were already implemented (point 1 here), one could produce a series of root-tuples in phase one of the process; stage them back to a common location (i.e. using point 3 here); run a second phase that chains the root-tuples together and launches the Root analysis on them; and launch a phase-three script that sends the output of the Root analysis (a histogram?) to sam, storing it under a new data tier definition.

Of course, this approach is very rough; moreover, an infrastructure to chain Root files together without having the files physically present on disk is not available. Creating such an infrastructure would allow Root to start the analysis on the files already present (chaining them together), while a delivery system (sam) brings in the files needed to complete the analysis. Being able to start a Root analysis without having all the files present at the same time may also be dangerous because of deadlocks: if newly arriving files need to be chained to the others before they can be analyzed, two applications may never terminate, each waiting for files to be delivered into a cache not large enough to hold them all. Resource reservation, or some mechanism other than chaining the files, would be something to consider in this case.

More information on how Root handles chained files is necessary. Root makes available classes that add features to its i/o methods.
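The cache concern above can be made concrete with a toy model (pure Python, not Root code; in Root itself, chaining files is the role of the TChain class). The two functions below are hypothetical illustrations contrasting the two strategies: analysing each file as it is delivered versus waiting for the complete chain, where every file must sit in the cache at once.

```python
# Toy model of the cache/deadlock concern, not Root or sam code.

def analyse_incrementally(deliveries):
    """Chain and analyse each file as soon as it arrives.

    The cache only ever holds the file currently being processed, so
    delivery order and cache size do not matter.
    """
    analysed = []
    for f in deliveries:
        analysed.append(f)   # stand-in for "chain this file and analyse it"
    return analysed


def analyse_after_full_chain(deliveries, cache_limit):
    """Wait for the complete chain before analysing anything.

    Every delivered file must be held in the cache simultaneously, so
    if the chain is larger than the cache the job can never make
    progress -- the deadlock scenario raised in the notes.
    """
    cache = []
    for f in deliveries:
        if len(cache) >= cache_limit:
            raise RuntimeError("cache full before the chain is complete")
        cache.append(f)
    return cache             # only now could the analysis start
```

With two such applications sharing one cache, the second strategy is exactly where resource reservation (or abandoning whole-chain semantics) becomes necessary.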
A complete interface with sam could be implemented using this technique, in the style of the D0 framework integration with sam. Issues of software maintenance become important at this point, though, since Fermilab does not have control over Root development. Some consideration was also given to the use of Proof; here too, more information is needed to understand how Proof can be integrated with or used by sam. Gabriele Garzoglio will start gathering information about Root/Proof and their possible interaction with sam.

3) Creating a new adapter for sam to the Portable Batch System (PBS): a cluster of Linux machines with PBS installed is available and administered by Dave Fagan's group. It may be used as a sam station for analysis.

4) Staging of the output files from the jobs running on the cluster: in order to debug an analysis program, there is interest in gathering all the output files produced by the distributed jobs in a single location, where they can be checked conveniently. In this scenario there is no need to store the files permanently in sam, under the assumption that these test outputs are not very large.
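A first cut at the PBS adapter (point 3) might do little more than translate a sam batch job into a qsub command line. The sketch below uses only standard qsub options (-N job name, -q queue, -o/-e log files); the function itself and the way a station would call it are hypothetical, not existing sam code.

```python
# Hypothetical sketch of a sam -> PBS adapter: build the qsub command
# line for one job of a sam project. Only standard PBS qsub options
# are used; everything else here is illustrative.

def make_qsub_command(job_name, script, queue=None, workdir=None):
    """Return the qsub argument list for submitting `script` to PBS."""
    cmd = ["qsub", "-N", job_name]          # -N: job name shown by qstat
    if queue:
        cmd += ["-q", queue]                # -q: destination queue
    if workdir:
        cmd += ["-o", workdir + "/" + job_name + ".out",   # stdout log
                "-e", workdir + "/" + job_name + ".err"]   # stderr log
    cmd.append(script)
    return cmd
```

An adapter built this way would let the station treat PBS like any other batch system: the station supplies the job name, script, and staging directory (which also serves point 4, since the -o/-e logs land in one common location), and PBS does the rest.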