1) Behavior change to TKey<T>::find
Problem to be addressed: Much user code uses the find method without thinking about whether there could be more than one of the objects requested, and relies on finding the "correct" one first, which is certainly not guaranteed by the edm.
Proposal A: Change the behavior of find so that if there is > 1 match, it throws an exception: TooManyMatches.
Proposal B: Eliminate the find method and require everyone to use findAll, with appropriate selection implemented.
Pros for A: Will reveal incorrect usage of find (although only at runtime) and give a message pointing users to the problem.
Pros for B: Incorrect usage of find will cease.
Cons for B: So will correct use.
Agreed: Proposal A is a better way to address the problem, and should be incorporated in the update.
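A minimal sketch of what Proposal A's behavior could look like (the free-function form and names such as findUnique and NoMatch are illustrative assumptions, not the actual edm interface):

```cpp
#include <stdexcept>
#include <vector>

// Thrown when a unique-find request matches more than one object,
// per Proposal A.
struct TooManyMatches : std::runtime_error {
  TooManyMatches() : std::runtime_error("TooManyMatches") {}
};

// Hypothetical sketch: scan for matches and refuse to guess when
// the match is not unique.
template <typename T, typename Pred>
const T& findUnique(const std::vector<T>& items, Pred match) {
  const T* found = nullptr;
  for (const T& item : items) {
    if (match(item)) {
      if (found) throw TooManyMatches();  // > 1 match: incorrect usage of find
      found = &item;
    }
  }
  if (!found) throw std::runtime_error("NoMatch");
  return *found;
}
```

The key point is that with more than one match the method refuses to pick one, surfacing the incorrect usage at runtime with a message pointing users to the problem.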
2) Change to signature of findAll.
Problem being addressed: findAll returns a collection, which in principle can be large. It is not efficient and not a good model for users to follow.
Proposal: Change findAll so that instead of returning a collection, it accepts a handle to a collection made by the user, and returns the size of the collection. The original proposal was to hard code the input collection as Vector; a suggested amendment would make the input collection templated. It was not clear from the discussion whether the templated version could handle Sets, but at least it would accommodate List and Deque.
Pros: The implementation would be more efficient, and a better model for users. As Walter pointed out, it would also allow a convenient way of filling a single collection from successive findAll's, something which would only be rather awkwardly implemented with the current signature.
Con: It requires user code to change, wherever findAll is used -- but it was pointed out that findAll is used in only a few (~3?) places currently anyway (although this may increase after the implementation of the change to find behavior, as users who should have been using findAll move to it). Therefore, change (2) should be implemented, if at all, at the same time as (1) so the new findAll users go to the new interface!
The meeting did not reach a consensus; some thought that the actual efficiency gain in the current program will be negligible and therefore not worth any interface change, not even a small one. The counter arguments were the extra functionality of new collections and more flexibility to fill them, that changing to efficient code is worth doing even if only as a model of the right procedure, and that the interface change is indeed a very small amount of pain given the few instances of use.
This is so small on the scale of proposed changes that I would suggest we leave the decision to the edm implementers.
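For concreteness, the templated variant of the proposal might look roughly like this (findAllInto and its exact shape are assumptions for illustration; the real signature is left to the edm implementers):

```cpp
#include <deque>
#include <list>
#include <vector>

// Hypothetical sketch of the proposed findAll signature: the caller
// owns the output collection; the method appends matches and returns
// how many were added.
template <typename T, typename Coll, typename Pred>
std::size_t findAllInto(const std::vector<T>& items, Coll& out, Pred match) {
  std::size_t added = 0;
  for (const T& item : items) {
    if (match(item)) {
      out.push_back(item);  // works for Vector, List, and Deque
      ++added;
    }
  }
  return added;
}
```

Because the caller owns the output collection, successive calls naturally append to a single collection, the convenience Walter pointed out. Note that a Set would need insert rather than push_back, which is consistent with the discussion being unsure whether the templated version could handle Sets.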
3) Splitting AbsChunk.
Problem being addressed: When we make subset data tiers, we will want to keep the provenance information for chunks whose data piece has been dropped for reasons of space, in order to allow us to reconstruct the algorithmic history of how the summary chunks were formed. This is an essential requirement for DSTs and microdsts.
Proposal: Split AbsChunk into two separately persistable pieces: one recording the provenance of the chunk, and one containing the actual chunk data.
Pros: This provides a clean and immediately obvious way of making subset data tiers with complete histories but incomplete data.
Cons: It makes existing data sets unreadable by the new edm, and requires chunk insertion to change for all chunks.
(There was considerable discussion of how to move to the new edm for the July Monte Carlo production release, but continue reco development using the old edm and make the reco switch for the September production release. If this could be accomplished, the only data which would have to be dropped completely are the MCC3 and earlier data, which have known geometry problems anyway and will certainly not need to be kept for physics analysis of real data. We could have a substantial new Monte Carlo sample available by September to ensure that reco algorithm development (and physics group studies) would not be disabled by the switchover, except for the time required to get a clean reco build after the interface changes. This time was estimated to be 2-4 weeks, depending on the level of pessimism of the estimator. If we cannot accommodate that scenario, the switchover would cost more.)
(Also note that the Event/Persistent Event split, proposed in the schema evolution discussion, has the same effect.)
(It was also pointed out that changing chunk insertion might have some additional good side effects, in the form of getting developers to revisit whether they are correctly filling their provenance information anyway [reviewers suspect that they frequently are not].)
4) Internal modifications
Problem being addressed: efficiency of operation of the event mapping and searching for chunks
Proposal: reorganize the maps in the Event
This reorganization can be more sweeping and have significantly greater efficiency benefits if it is not constrained to preserve backwards compatibility with old data. So, how much gets done here depends on the adoption of prior proposals. The users will not see interface changes directly from this proposal as it affects event internals. A point made here was that if the Event/Persistent event split is adopted, any future reorganization of these internals does NOT affect backwards compatibility with old data, which is one of the motivations for proposing that split.
5) Event merging
No changes proposed. Event merging can use existing edm features. However, designers of these packages should be urged in the edm usage talk to pay close attention to designing appropriate selectors for chunks that can result from an event merge.
6) deleteChunk
It was agreed that the method is needed; some (additional?) signatures were proposed but I did not note the details.
7) Completing the remaining placeholders in the Event.
EnvID was discussed at last week's meeting, and its design was agreed to.
GenMachine is the last remaining placeholder, and we did not have time for much discussion. This should be revisited and a final design agreed before the edm update.
Additional notes from Jim Kowalkowski:
genMachine: Marc and I were thinking that the ethernet address would be good; Harry indicated that this is difficult to work with and that the node name could be good enough. Repeating this information in each chunk may be wasteful (disk use), and some sort of indirection might be needed. Could the information in "uname -a" be used?
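On the "uname -a" question: the POSIX uname(2) call exposes the same information programmatically, so the node name could be read without shelling out. A sketch, assuming a POSIX platform (nodeName is an illustrative helper, not an edm function):

```cpp
#include <sys/utsname.h>
#include <string>

// Return the node name reported by uname(2), i.e. the same node
// name that appears in "uname -a"; empty string on failure.
std::string nodeName() {
  utsname u{};
  if (uname(&u) != 0) return {};
  return u.nodename;
}
```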
Herb wants ChunkID in the new "AbsChunk" class, in addition to having it in the History class, if the split happens. It was suggested that the History object can be cached in the chunk and that the current interface to AbsChunk can remain unchanged.
release versions in chunks: useful, but can be wasteful in terms of disk space. The same indirection could be used here as in the genMachine case. What should be recorded? The preco/release tag? The library version tag? A reference to another object owned by the event that has this information in it?
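The indirection idea for genMachine and release tags could work like string interning: store each distinct string once in an event-owned table and keep only a small index in each chunk. A sketch (the class name and interface are illustrative assumptions):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Event-owned table of distinct strings (node names, release tags).
// Chunks would store only the small index returned by intern(),
// avoiding repetition of the full string in every chunk on disk.
class StringPool {
 public:
  // Return the index of s, adding it to the table if not yet present.
  std::size_t intern(const std::string& s) {
    auto it = index_.find(s);
    if (it != index_.end()) return it->second;
    strings_.push_back(s);
    index_[s] = strings_.size() - 1;
    return strings_.size() - 1;
  }
  // Recover the full string from a chunk's stored index.
  const std::string& lookup(std::size_t i) const { return strings_[i]; }

 private:
  std::vector<std::string> strings_;
  std::unordered_map<std::string, std::size_t> index_;
};
```

With many chunks per event sharing a handful of release tags and node names, the per-chunk cost drops from a string to an integer index.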
Observations and questions:
- The History object format is fixed, not abstract; no user-derived classes are allowed.
- Can a History object be inserted into the event with data?
- Would an event method findAllSorted(...) be useful in addition to findAll(...)?
- Would a method junkAllUnreferencedHistory() be useful?
- Can a framework job be constructed that reads event objects off disk and converts them to the new "PersistentEvent" objects? This would involve messing with each chunk.
- With Option A (Event -> PersistentEvent), could the event be built correctly during construction? In other words, get rid of deleteChunk() and pass the list of chunk names, etc., to be dropped in the constructor.
- Might the split of data/history improve search performance? Probably; for that, look only at things stored in the history, since inheritance is not present.
- Delayed conversion of the history by D0om is most likely not important, because the history object would be small.