Notes on the discussion about running the D0 framework/sam on a cluster of computers (Wed June 20, 2001).
Present: Gabriele Garzoglio, Lee Lueking, Ruth Pordes, Igor Terekhov, Vicky White
External input: Dave Fagan

1) Introducing the concept of phases in the job processing in sam: we discussed the possibility of implementing a sam command of the "submit" family, in which the user specifies an input dataset and a series of scripts logically chained together so that the output of one script becomes the input of the next (piping).

In this scenario, should the intermediate output files of each script be written into sam as a new dataset? If we suppose that the output of the chain is a single file of filtered events, the intermediate output files should be stored: in case of crashes, one wants to know how many intermediate files contributed to the single output file in order to compute the luminosity. A different approach would be to consider the whole process invalid if some nodes crashed, but this is quite radical, since the output may be of interest even if incomplete.

A drawback of storing information at every step is the need for a reliable network between the cluster and the station. In principle, since all the information for the various phases is processed within the cluster, such a reliable connection is not needed to process the whole chain. Glitches in the network could hold up jobs that would otherwise start immediately, without having to wait for the output of the previous script to be stored and then redelivered to them via sam. We concluded that implementing a mechanism for transferring the intermediate information at a deferred time would be desirable.

Some consideration was given to what information should be included as metadata, in order to be able to reconstruct the generator of each intermediate dataset.
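The chained-phase idea with deferred metadata transfer could be sketched roughly as follows. This is a toy model in Python, not sam code: the names (PhaseChain, flush_to_station, the metadata fields) are hypothetical illustrations of the design, chosen only to show how intermediate provenance could be queued locally and shipped to the station later.

```python
# Sketch of chained processing phases with deferred metadata storage.
# Each phase consumes the previous phase's output files (piping); for
# every phase we queue a provenance record locally, and flush the queue
# to the station only when a reliable connection is available.
# All names here are hypothetical, not the real sam interface.

class PhaseChain:
    def __init__(self, dataset):
        self.files = list(dataset)   # current inputs (file names)
        self.pending = []            # metadata awaiting deferred upload

    def run_phase(self, name, func):
        """Apply one processing step; record provenance for each output."""
        outputs = [func(f) for f in self.files]
        self.pending.append({
            "phase": name,               # process name/version for metadata
            "inputs": list(self.files),  # lets us count contributing files
            "outputs": outputs,          # e.g. for luminosity accounting
        })
        self.files = outputs             # pipe: outputs feed the next phase
        return outputs

    def flush_to_station(self, store):
        """Deferred transfer: ship queued metadata once the network is up."""
        store.extend(self.pending)
        self.pending = []


# Hypothetical usage: two phases piped together, then a deferred flush.
chain = PhaseChain(["raw1.dat", "raw2.dat"])
chain.run_phase("filter-v1", lambda f: f + ".filtered")
result = chain.run_phase("merge-v1", lambda f: f + ".merged")
station = []                    # stand-in for the station-side metadata store
chain.flush_to_station(station)
```

The point of the sketch is only the separation of concerns: the whole chain runs inside the cluster on local files, and the station sees the intermediate records in one deferred batch rather than at every step.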
Storing the names/versions of the processes leading up to that output dataset would be one piece of information; adding a data tier to characterize this new kind of dataset would be another; further information may be desirable as well.

2) The interaction of sam with Root: many D0 collaborators will use Root to do analysis; furthermore, the Root format is one of the two chosen to store the D0 thumbnails. Creating an infrastructure to manipulate Root's input/output via sam may therefore be of interest.

In principle, if the management of processing phases were already implemented (point 1 here), one could produce a series of root-tuples in phase one of the process; stage them back to a common location (i.e. using point 3 here); run a second phase that chains the root-tuples together and launches the Root analysis on them; and launch a phase-three script that sends the output of the Root analysis (a histogram?) to sam, storing it under a new data tier definition.

Of course, this approach is very rough; moreover, an infrastructure to chain Root files together without having the files physically present on disk is not available. Creating such an infrastructure would allow Root to start the analysis on the files already present (chaining them together), while a delivery system (sam) brings in the files needed to complete the analysis. Being able to start a Root analysis without having all the files present at the same time may also be dangerous because of deadlocks: if newly arriving files need to be chained to the others before they can be analyzed, two applications may never terminate, each waiting for files to be delivered into a cache not large enough to hold them all. Resource reservation, or some mechanism other than chaining the files, would be something to consider in this case.

More information on how Root handles chained files is necessary. Root makes available classes that add features to its i/o methods.
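The cache concern above can be made concrete with a toy model (pure Python, not Root code; in Root itself, chaining files is the role of the TChain class). The two functions below are hypothetical illustrations contrasting the two strategies: analysing each file as it is delivered versus waiting for the complete chain, where every file must sit in the cache at once.

```python
# Toy model of the cache/deadlock concern, not Root or sam code.

def analyse_incrementally(deliveries):
    """Chain and analyse each file as soon as it arrives.

    The cache only ever holds the file currently being processed, so
    delivery order and cache size do not matter.
    """
    analysed = []
    for f in deliveries:
        analysed.append(f)   # stand-in for "chain this file and analyse it"
    return analysed


def analyse_after_full_chain(deliveries, cache_limit):
    """Wait for the complete chain before analysing anything.

    Every delivered file must be held in the cache simultaneously, so
    if the chain is larger than the cache the job can never make
    progress -- the deadlock scenario raised in the notes.
    """
    cache = []
    for f in deliveries:
        if len(cache) >= cache_limit:
            raise RuntimeError("cache full before the chain is complete")
        cache.append(f)
    return cache             # only now could the analysis start
```

With two such applications sharing one cache, the second strategy is exactly where resource reservation (or abandoning whole-chain semantics) becomes necessary.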
A complete interface with sam could be implemented using this technique, in the style of the D0 framework integration with sam. Issues of software maintenance become important at this point, though, since Fermilab does not have control over Root development. Some consideration was also given to the use of Proof; here too, more information is needed to understand how Proof can be integrated with or used by sam. Gabriele Garzoglio will start gathering information about Root/Proof and their possible interaction with sam.

3) Creating a new adapter for sam to the Portable Batch System (PBS): a cluster of Linux machines with PBS installed is available and administered by Dave Fagan's group. It may be used as a sam station for analysis.

4) Staging of the output files from the jobs running on the cluster: in order to debug an analysis program, there is interest in gathering all the output files produced by the distributed jobs in a single location, where they can be checked conveniently. In this scenario there is no need to store the files permanently in sam, under the assumption that these test outputs are not very large.
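A first cut at the PBS adapter (point 3) might do little more than translate a sam batch job into a qsub command line. The sketch below uses only standard qsub options (-N job name, -q queue, -o/-e log files); the function itself and the way a station would call it are hypothetical, not existing sam code.

```python
# Hypothetical sketch of a sam -> PBS adapter: build the qsub command
# line for one job of a sam project. Only standard PBS qsub options
# are used; everything else here is illustrative.

def make_qsub_command(job_name, script, queue=None, workdir=None):
    """Return the qsub argument list for submitting `script` to PBS."""
    cmd = ["qsub", "-N", job_name]          # -N: job name shown by qstat
    if queue:
        cmd += ["-q", queue]                # -q: destination queue
    if workdir:
        cmd += ["-o", workdir + "/" + job_name + ".out",   # stdout log
                "-e", workdir + "/" + job_name + ".err"]   # stderr log
    cmd.append(script)
    return cmd
```

An adapter built this way would let the station treat PBS like any other batch system: the station supplies the job name, script, and staging directory (which also serves point 4, since the -o/-e logs land in one common location), and PBS does the rest.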