HEPCAL comments
David Adams
August 4, 2003

Comments on HEPCAL RTAG Report

General:

Acronyms are used before they are defined. This is confusing. It would help to move the glossary to the last pages.

The document assumes that a replica of a dataset is restricted to a single SE. This simplifies distributed processing but is probably too restrictive. It will be useful to distribute large datasets and their processing over multiple sites. Once this threshold is crossed, the number of representations of a dataset increases dramatically because we can take variable-sized pieces of the dataset from each site.

There is not enough emphasis on the logical file, which is arguably the atomic unit of grid processing. I agree strongly that the dataset is the appropriate interface for users. However, efficient job scheduling requires matching CPU and data locations, and the latter is more naturally expressed in terms of the location of file replicas.

1 The glossary is in chapter 9 (not 7).

2.6 WMS not defined.

2.6.2 I like the initial definition of interactive. The connection to communication channels needs clarification. Presumably this is in HEPCAL II.

3.2 VO not defined.

3.3 WN not defined.

3.5 Need a section for files, including PFN vs. LFN. My thoughts on datasets may be found at http://www.usatlas.bnl.gov/~dladams/dataset and in particular the link "Datasets for the GRID". Another useful type of dataset is a well-defined part of another dataset, such as an event selection or content restriction.

3.5.1 Is a dataset restricted to files? It should be possible for an LDN to map to a list of LFN's (not PFN's). In the discussion of a dataset having references to data in other datasets, it might be useful to introduce the notion of a "complete dataset" for which all components can be resolved internally. RSL not defined. For the case of a dataset composed of files, we should distinguish:
1. a virtual dataset with multiple representations (each a set of LFN's; a representation can also be non-file, e.g. a DB or virtual);
2. a logical dataset with a mapping to a set of LFN's;
3. a physical dataset with a mapping to a set of PFN's.
(A sketch of this distinction follows these section comments.)

3.5.2 Algorithm needs definition.

3.5.4 The components are what I call "content".

3.5.5 Will most jobs really read from or write to catalogs? Might they have an overseer who takes this responsibility? Again, I think we should distinguish file replication from dataset replication. The former is one means to accomplish the latter.

3.8 Here and elsewhere, it would be useful to distinguish the file level from the dataset level. What about object replication, i.e. copying an object to a new LFN? Is it possible to find such an object using the original LFN and, for example, a mapping table in the new file? Datasets can be used to restrict the scope of the search to a practical scale.

4.1 I believe there is another level which is a more complete set of data output from reconstruction. Call it RECO. ESD is a summary of this. It may be too large to record for all events but might be interesting to store for a subset.

4.4 Presumably the goal of most queries on the dataset catalog is to identify a single dataset holding the data of interest. This dataset is then used as the input for analysis or production of another dataset.

4.5 The selected event ID's in conjunction with the original dataset form a new dataset.

4.6 May a snapshot of part of the conditions DB sensibly be considered a dataset? I think so.
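To make the three dataset levels proposed under 3.5.1 concrete, here is a minimal sketch in Python. It assumes a dataset is just a collection of file names and that a replica catalog can map an LFN to a PFN; the class and function names (VirtualDataset, LogicalDataset, PhysicalDataset, replica_lookup) are hypothetical and not part of any HEPCAL interface.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhysicalDataset:
    # Bound to concrete file replicas (PFN's), e.g. all on one SE.
    ldn: str
    pfns: List[str]


@dataclass
class LogicalDataset:
    # Defined by a set of logical files (LFN's); site-independent.
    ldn: str
    lfns: List[str]

    def materialize(self, replica_lookup: Callable[[str], str]) -> PhysicalDataset:
        # replica_lookup stands in for a replica-catalog query LFN -> PFN.
        return PhysicalDataset(self.ldn, [replica_lookup(lfn) for lfn in self.lfns])


@dataclass
class VirtualDataset:
    # Holds one or more representations; each could be a set of LFN's,
    # a database selection, or itself virtual (only the file case is shown).
    ldn: str
    representations: List[LogicalDataset]

In this view a logical dataset is one representation of a virtual dataset, and a physical dataset is one binding of a logical dataset to file replicas.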
UC#jobsplit A very important subset of job splitting, and likely the dominant case, involves splitting the input dataset into sub-datasets, processing these independently, and concatenating the results, e.g. merging the output datasets from each job. This splitting and merging of datasets deserves dedicated use cases.
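As an illustration of the split/process/merge pattern above, here is a minimal sketch, again assuming a dataset is simply a list of LFN's. The function names (split_dataset, merge_outputs, run_split_job) are hypothetical, and in a real grid workflow each call to process would be an independently scheduled job rather than a local function call.

from typing import Callable, List

Dataset = List[str]  # a dataset modeled as a list of logical file names (LFN's)


def split_dataset(dataset: Dataset, n_jobs: int) -> List[Dataset]:
    # Partition the input dataset into roughly equal sub-datasets.
    return [dataset[i::n_jobs] for i in range(n_jobs)]


def merge_outputs(outputs: List[Dataset]) -> Dataset:
    # Concatenate the per-job output datasets into a single result dataset.
    merged: Dataset = []
    for out in outputs:
        merged.extend(out)
    return merged


def run_split_job(dataset: Dataset, n_jobs: int,
                  process: Callable[[Dataset], Dataset]) -> Dataset:
    # Split, process each sub-dataset independently, and merge the results.
    sub_datasets = split_dataset(dataset, n_jobs)
    outputs = [process(sub) for sub in sub_datasets]
    return merge_outputs(outputs)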