HEPCAL comments
David Adams
August 4, 2003

Comments on HEPCAL RTAG Report

General:

Acronyms are used before they are defined. This is confusing. It would help to move the glossary to the last pages.

The document assumes that a replica of a dataset is restricted to a single SE. This simplifies distributed processing but is probably too restrictive. It will be useful to distribute large datasets and their processing over multiple sites. Once this threshold is crossed, the number of representations of a dataset increases dramatically because we can take variable-sized pieces of the dataset from each site.

There is not enough emphasis on the logical file, which is arguably the atomic unit of grid processing. I agree strongly that the dataset is the appropriate interface for users. However, efficient job scheduling requires matching CPU and data locations, and the latter is more naturally expressed in terms of the location of file replicas.

1 The glossary is in chapter 9 (not 7).

2.6 WMS not defined.

2.6.2 I like the initial definition of interactive. The connection to communication channels needs clarification. Presumably this is in HEPCAL II.

3.2 VO not defined.

3.3 WN not defined.

3.5 Need a section for files, including PFN vs. LFN. My thoughts on datasets may be found at http://www.usatlas.bnl.gov/~dladams/dataset and in particular the link "Datasets for the GRID". Another useful type of dataset is a well-defined part of another dataset, such as an event selection or content restriction.

3.5.1 Is a dataset restricted to files? It should be possible for an LDN to map to a list of LFN's (not PFN's). In the discussion of a dataset having references to data in other datasets, it might be useful to introduce the notion of a "complete dataset" for which all components can be resolved internally. RSL not defined. For the case of a dataset composed of files, we should distinguish:
1. a virtual dataset with multiple representations (each a set of LFN's; a representation can also be non-file, e.g. a DB or virtual);
2. a logical dataset with a mapping to a set of LFN's;
3. a physical dataset with a mapping to a set of PFN's.
(A sketch of this distinction follows these section comments.)

3.5.2 Algorithm needs definition.

3.5.4 The components are what I call "content".

3.5.5 Will most jobs really read from or write to catalogs? Might they have an overseer who takes this responsibility? Again, I think we should distinguish file replication from dataset replication. The former is one means to accomplish the latter.

3.8 Here and elsewhere, it would be useful to distinguish the file level from the dataset level. What about object replication, i.e. copying an object to a new LFN? Is it possible to find such an object using the original LFN and, for example, a mapping table in the new file? Datasets can be used to restrict the scope of the search to a practical scale.

4.1 I believe there is another level which is a more complete set of data output from reconstruction. Call it RECO. ESD is a summary of this. It may be too large to record for all events but might be interesting to store for a subset.

4.4 Presumably the goal of most queries on the dataset catalog is to identify a single dataset holding the data of interest. This dataset is then used as the input for analysis or production of another dataset.

4.5 The selected event ID's in conjunction with the original dataset form a new dataset.

4.6 May a snapshot of part of the conditions DB sensibly be considered a dataset? I think so.
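To make the three dataset levels proposed under 3.5.1 concrete, here is a minimal sketch in Python. It assumes a dataset is just a collection of file names and that a replica catalog can map an LFN to a PFN; the class and function names (VirtualDataset, LogicalDataset, PhysicalDataset, replica_lookup) are hypothetical and not part of any HEPCAL interface.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhysicalDataset:
    # Bound to concrete file replicas (PFN's), e.g. all on one SE.
    ldn: str
    pfns: List[str]


@dataclass
class LogicalDataset:
    # Defined by a set of logical files (LFN's); site-independent.
    ldn: str
    lfns: List[str]

    def materialize(self, replica_lookup: Callable[[str], str]) -> PhysicalDataset:
        # replica_lookup stands in for a replica-catalog query LFN -> PFN.
        return PhysicalDataset(self.ldn, [replica_lookup(lfn) for lfn in self.lfns])


@dataclass
class VirtualDataset:
    # Holds one or more representations; each could be a set of LFN's,
    # a database selection, or itself virtual (only the file case is shown).
    ldn: str
    representations: List[LogicalDataset]

In this view a logical dataset is one representation of a virtual dataset, and a physical dataset is one binding of a logical dataset to file replicas.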
UC#jobsplit A very important subset of job splitting, and likely the dominant case, involves splitting the input dataset into sub-datasets, processing these independently, and concatenating the results, e.g. merging the output datasets from each job. This splitting and merging of datasets deserves dedicated use cases.
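As an illustration of the split/process/merge pattern above, here is a minimal sketch, again assuming a dataset is simply a list of LFN's. The function names (split_dataset, merge_outputs, run_split_job) are hypothetical, and in a real grid workflow each call to process would be an independently scheduled job rather than a local function call.

from typing import Callable, List

Dataset = List[str]  # a dataset modeled as a list of logical file names (LFN's)


def split_dataset(dataset: Dataset, n_jobs: int) -> List[Dataset]:
    # Partition the input dataset into roughly equal sub-datasets.
    return [dataset[i::n_jobs] for i in range(n_jobs)]


def merge_outputs(outputs: List[Dataset]) -> Dataset:
    # Concatenate the per-job output datasets into a single result dataset.
    merged: Dataset = []
    for out in outputs:
        merged.extend(out)
    return merged


def run_split_job(dataset: Dataset, n_jobs: int,
                  process: Callable[[Dataset], Dataset]) -> Dataset:
    # Split, process each sub-dataset independently, and merge the results.
    sub_datasets = split_dataset(dataset, n_jobs)
    outputs = [process(sub) for sub in sub_datasets]
    return merge_outputs(outputs)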