1) Behavior change to TKey<T>::find
Problem to be addressed: Much user code uses the find method without thinking about whether there could be more than one of the objects requested, and relies on finding the "correct" one first, which is certainly not guaranteed by the edm.
Proposal A: Change the behavior of find so that if there is > 1 match, it throws an exception: TooManyMatches.
Proposal B: Eliminate the find method and require everyone to use findAll, with appropriate selection implemented.
Pros for A: Will reveal incorrect usage of find (although only at runtime) and give a message pointing users to the problem.
Pros for B: Incorrect usage of find will cease.
Cons for B: So will correct use.
Agreed: Proposal A is a better way to address the problem, and should be incorporated in the update.
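A minimal sketch of what Proposal A's behavior could look like (the free-function form and names such as findUnique and NoMatch are illustrative assumptions, not the actual edm interface):

```cpp
#include <stdexcept>
#include <vector>

// Thrown when a unique-find request matches more than one object,
// per Proposal A.
struct TooManyMatches : std::runtime_error {
  TooManyMatches() : std::runtime_error("TooManyMatches") {}
};

// Hypothetical sketch: scan for matches and refuse to guess when
// the match is not unique.
template <typename T, typename Pred>
const T& findUnique(const std::vector<T>& items, Pred match) {
  const T* found = nullptr;
  for (const T& item : items) {
    if (match(item)) {
      if (found) throw TooManyMatches();  // > 1 match: incorrect usage of find
      found = &item;
    }
  }
  if (!found) throw std::runtime_error("NoMatch");
  return *found;
}
```

The key point is that with more than one match the method refuses to pick one, surfacing the incorrect usage at runtime with a message pointing users to the problem.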
2) Change to signature of findAll.
Problem being addressed: findAll returns a collection, which in principle can be large. It is not efficient and not a good model for users to follow.
Proposal: Change findAll so that instead of returning a collection, it accepts a handle to a collection made by the user, and returns the size of the collection. The original proposal was to hard code the input collection as Vector; a suggested amendment would make the input collection templated. It was not clear from the discussion whether the templated version could handle Sets, but at least it would accommodate List and Deque.
Pros: The implementation would be more efficient, and a better model for users. As Walter pointed out, it would also allow a convenient way of filling a single collection from successive findAll's, something which would only be rather awkwardly implemented with the current signature.
Con: It requires user code to change, wherever findAll is used -- but it was pointed out that findAll is used in only a few (~3?) places currently anyway (although this may increase after the implementation of the change to find behavior, as users who should have been using findAll move to it). Therefore, change (2) should be implemented, if at all, at the same time as (1) so the new findAll users go to the new interface!
The meeting did not reach a consensus; some thought that the actual efficiency gain in the current program will be negligible and therefore not worth any interface change, not even a small one. The counter arguments were the extra functionality of new collections and more flexibility to fill them, that changing to efficient code is worth doing even if only as a model of the right procedure, and that the interface change is indeed a very small amount of pain given the few instances of use.
This is so small on the scale of proposed changes that I would suggest we leave the decision to the edm implementers.
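For concreteness, the templated variant of the proposal might look roughly like this (findAllInto and its exact shape are assumptions for illustration; the real signature is left to the edm implementers):

```cpp
#include <deque>
#include <list>
#include <vector>

// Hypothetical sketch of the proposed findAll signature: the caller
// owns the output collection; the method appends matches and returns
// how many were added.
template <typename T, typename Coll, typename Pred>
std::size_t findAllInto(const std::vector<T>& items, Coll& out, Pred match) {
  std::size_t added = 0;
  for (const T& item : items) {
    if (match(item)) {
      out.push_back(item);  // works for Vector, List, and Deque
      ++added;
    }
  }
  return added;
}
```

Because the caller owns the output collection, successive calls naturally append to a single collection, the convenience Walter pointed out. Note that a Set would need insert rather than push_back, which is consistent with the discussion being unsure whether the templated version could handle Sets.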
3) Splitting AbsChunk.
Problem being addressed: When we make subset data tiers, we will want to keep the provenance information for chunks whose data piece has been dropped for reasons of space, in order to allow us to reconstruct the algorithmic history of how the summary chunks were formed. This is an essential requirement for DSTs and microdsts.
Proposal: Split AbsChunk into two separately persistable pieces: one recording the provenance of the chunk, and one containing the actual chunk data.
Pros: This provides a clean and immediately obvious way of making subset data tiers with complete histories but incomplete data.
Cons: It makes existing data sets unreadable by the new edm, and requires chunk insertion to change for all chunks.
(There was considerable discussion of how to move to the new edm for the July Monte Carlo production release, but continue reco development using the old edm and make the reco switch for the September production release. If this could be accomplished, the only data which would have to be dropped completely are the MCC3 and earlier data, which have known geometry problems anyway and will certainly not need to be kept for physics analysis of real data. We could have a substantial new Monte Carlo sample available by September to ensure that reco algorithm development (and physics group studies) would not be disabled by the switchover, except for the time required to get a clean reco build after the interface changes. This time was estimated to be 2-4 weeks, depending on the level of pessimism of the estimator. If we cannot accommodate that scenario, the switchover would cost more.)
(Also note that the Event/Persistent Event split, proposed in the schema evolution discussion, has the same effect.)
(It was also pointed out that changing chunk insertion might have some additional good side effects, in the form of getting developers to revisit whether they are correctly filling their provenance information anyway [reviewers suspect that they frequently are not].)
4) Internal modifications
Problem being addressed: efficiency of operation of the event mapping and searching for chunks
Proposal: reorganize the maps in the Event
This reorganization can be more sweeping and have significantly greater efficiency benefits if it is not constrained to preserve backwards compatibility with old data. So, how much gets done here depends on the adoption of prior proposals. The users will not see interface changes directly from this proposal as it affects event internals. A point made here was that if the Event/Persistent event split is adopted, any future reorganization of these internals does NOT affect backwards compatibility with old data, which is one of the motivations for proposing that split.
5) Event merging
No changes proposed. Event merging can use existing edm features. However, designers of these packages should be urged in the edm usage talk to pay close attention to designing appropriate selectors for chunks that can result from an event merge.
6) deleteChunk
It was agreed that the method is needed; some (additional?) signatures were proposed but I did not note the details.
7) Completing the remaining placeholders in the Event.
EnvID was discussed at last week's meeting, and its design was agreed to.
GenMachine is the last remaining placeholder, and we did not have time for much discussion. This should be revisited and a final design agreed before the edm update.
Additional notes from Jim Kowalkowski:
genMachine: Marc and I were thinking that the ethernet address would be good; Harry indicated that this is difficult to work with and that the node name could be good enough. Repeating this information in each chunk may be wasteful (disk use), and some sort of indirection might be needed. Could the information in "uname -a" be used?
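On the "uname -a" question: the POSIX uname(2) call exposes the same information programmatically, so the node name could be read without shelling out. A sketch, assuming a POSIX platform (nodeName is an illustrative helper, not an edm function):

```cpp
#include <sys/utsname.h>
#include <string>

// Return the node name reported by uname(2), i.e. the same node
// name that appears in "uname -a"; empty string on failure.
std::string nodeName() {
  utsname u{};
  if (uname(&u) != 0) return {};
  return u.nodename;
}
```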
Herb wants ChunkID in the new "AbsChunk" class, in addition to having it in the History class, if the split happens. It was suggested that the History object can be cached in the chunk and that the current interface to AbsChunk can remain unchanged.
release versions in chunks: useful, but can be wasteful in terms of disk space. The same indirection could be used here as in the genMachine case. What should be recorded? The preco/release tag? The library version tag? A reference to another object owned by the event that has this information in it?
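The indirection idea for genMachine and release tags could work like string interning: store each distinct string once in an event-owned table and keep only a small index in each chunk. A sketch (the class name and interface are illustrative assumptions):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Event-owned table of distinct strings (node names, release tags).
// Chunks would store only the small index returned by intern(),
// avoiding repetition of the full string in every chunk on disk.
class StringPool {
 public:
  // Return the index of s, adding it to the table if not yet present.
  std::size_t intern(const std::string& s) {
    auto it = index_.find(s);
    if (it != index_.end()) return it->second;
    strings_.push_back(s);
    index_[s] = strings_.size() - 1;
    return strings_.size() - 1;
  }
  // Recover the full string from a chunk's stored index.
  const std::string& lookup(std::size_t i) const { return strings_[i]; }

 private:
  std::vector<std::string> strings_;
  std::unordered_map<std::string, std::size_t> index_;
};
```

With many chunks per event sharing a handful of release tags and node names, the per-chunk cost drops from a string to an integer index.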
Observations and questions:
- The History object format is fixed, not abstract; no user-derived classes are allowed.
- Can a History object be inserted into the event with data?
- Would an event method findAllSorted(...) be useful in addition to findAll(...)?
- Would a method junkAllUnreferencedHistory() be useful?
- Can a framework job be constructed that reads event objects off disk and converts them to the new "PersistentEvent" objects? This would involve messing with each chunk.
- With Option A (Event -> PersistentEvent), could the event be built correctly during construction? In other words, get rid of deleteChunk() and pass the list of chunk names, etc., to be dropped in the constructor.
- Might the split of data/history improve search performance? Probably; for that, look only at things stored in the history, since inheritance is not present.
- Delayed conversion of the history by D0om is most likely not important, because the history object would be small.