Preliminary Report of the Working Group on Attribute Set Architecture

Clifford Lynch
9/30/96

Due to time constraints it was not feasible to take this report through a review cycle with the participants from the Working Group prior to the ZIG meeting in Brussels; this report should be considered highly preliminary, reflecting only the author's view of the meeting's outcome. It will be reviewed with attendees, after which a final version reflecting that review will be posted to the Z3950IW list.

Introduction

It has been clear for some time that there are a number of structural and scaling problems with attribute sets in Z39.50. At the Gainesville, Florida ZIG meeting I prepared a working paper, Defining and Maintaining Attribute Sets for Use with the Z39.50 Protocol: A Discussion Paper (note: this paper will be attached as an appendix to the final version of this document), which summarized many of these problems. This paper was subsequently updated on September 3, 1996 to reflect the discussion at the Florida ZIG meeting and posted to the Z3950IW list.

As a result of the Florida ZIG meeting, I was asked to convene a working group to make recommendations on how to resolve these problems. A call for participation was issued in August 1996 and the working group met at the Library of Congress in Washington, DC on September 9, 1996. This document summarizes the results of that meeting and the group's proposed plan for further work.

Attendees at the Attribute Set Working Group Meeting

Ray Denenberg, Library of Congress
Larry Dixon, Library of Congress
Manette Lazear, MITRE
Clifford Lynch, UC
Nassib Nassar, CNIDR
Paul Ober, NIST
Mark Piekenbrock, CAS
Cecilia Preston
Lou Reich, NASA
Les Wibberley, CAS
Joe Zeeman

(Note: there were several other people who had planned to attend, but were unable to make this meeting due to weather -- Hurricane Fran -- or other scheduling problems.)

Working Assumptions

The following assumptions guided our work:

1. Assume Z39.50 version 3; that is, attributes from multiple attribute sets can be intermixed in a single query.

2. Except for attributes to support protocol functions such as EXPLAIN, the ZIG should not be doing attribute set definition and maintenance; what is needed is an architecture and a set of guidelines to permit domain experts to define and manage attribute sets with only modest support from people with Z39.50 expertise.

3. While the generality of the current attribute set mechanisms is very useful, it's desirable to define some more constrained environment that permits reasonable extensibility while still allowing attribute set developers to build on a base of existing work. Further, it is useful to define some sort of "basic" or "core" set of attributes that are available to such attribute set developers.

4. In order to promote interoperability, it is desirable that attribute sets be defined much more tightly than has been the case in the past. Interactions among attributes from multiple sets need to have clearly defined semantics as well if the mixing of attributes from multiple attribute sets is to be really useful.

In addition to these assumptions, which were broadly supported by the ZIG, the working group developed a number of additional working assumptions which are documented later in this report.

Attribute Set Classes and Properties within an Attribute Set Class

The working group developed the idea that we should define an attribute set class. Attribute sets that were conformant to this class would have a common structure -- that is, they would contain the same attribute types -- and would have well-defined semantics when intermixed. There would be some idea of inheritance, of common behavior when attributes were omitted (i.e. default values) or repeated, and a common approach to extensibility for attribute sets that are conformant with this class.
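As a rough illustration of the attribute set class idea, the following Python sketch models a class as a fixed list of attribute types with default values, and resolves a query term whose attributes are drawn from more than one conformant set. Everything named here is hypothetical and chosen only for the example: the `AttributeSetClass`, `Attribute`, and `resolve` names, the two attribute types, the sets "set-a" and "set-b", and the default relation of "equals". None of it is a working group decision.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeSetClass:
    """Hypothetical model: a fixed list of attribute types plus default values."""
    name: str
    attribute_types: tuple
    defaults: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Attribute:
    set_name: str   # attribute set the value is drawn from
    type_name: str  # must be one of the class's attribute types
    value: str


def resolve(attr_class, attributes):
    """Group a term's attributes by type, then fill in defaults for omitted types."""
    resolved = {}
    for attr in attributes:
        if attr.type_name not in attr_class.attribute_types:
            raise ValueError(f"attribute type not in class: {attr.type_name}")
        resolved.setdefault(attr.type_name, []).append((attr.set_name, attr.value))
    for type_name, default in attr_class.defaults.items():
        resolved.setdefault(type_name, [("default", default)])
    return resolved


# Illustrative class with two types and an assumed default relation of "equals".
demo_class = AttributeSetClass(
    name="demo-class",
    attribute_types=("use", "relation"),
    defaults={"relation": "equals"},
)

# One term mixing attributes from two hypothetical sets, "set-a" and "set-b".
mixed = resolve(demo_class, [
    Attribute("set-a", "use", "title"),
    Attribute("set-b", "relation", "less-than"),
])

# A term that omits the relation attribute picks up the class default.
defaulted = resolve(demo_class, [Attribute("set-a", "use", "title")])
```

Because both sets share the class's list of attribute types, the resolver can treat their attributes uniformly; a type outside the class is rejected rather than silently accepted, which mirrors the idea that a new type would require a new class.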
The working assumption was that the class being defined should cover the vast majority of current applications for which attribute sets have been defined, but that it should not be extensible in terms of attribute types; a new type would require the definition of a new class, and the mapping of existing sets to that new class. This attribute class need not cover every existing application, however, and we recognized a tradeoff between complexity and universality which should not be resolved in favor of absolute universal coverage of existing applications.

Within this attribute class, there was a concept of a core attribute set that defined basic values for certain attribute types, and also probably a basic attribute set that provided a set of basic values for the USE attribute type (see below). These two attribute sets -- the basic attribute set and the core attribute set -- would provide a foundation for groups that wanted to define their own attribute sets.

One issue that was discussed at some length was the representation of values within query terms. The key issue here is to what extent to use ASN.1 datatyping, as opposed to representing these as strings with explicit qualifiers (attributes) which help to describe the format of the string. It seems clear that using basic ASN.1 types such as integer is useful. Beyond these, however, even dates turn out to be problematic due to the complexity of the basic ASN.1 date/time structure and the alternatives that are in popular use. We need to develop a clearer definition of when to use ASN.1 typing and when to use attribute-based qualifiers.

Attribute Types within the Attribute Set Class

The group developed a rough definition of the attribute types that would be contained within the Attribute Set Class under discussion as follows. Note that there are several unresolved issues in this list; in these cases design alternatives are discussed, and input is solicited.

1. Relation -- less than, greater than, equals, etc. Question: how to reflect datatyping in these -- automatic type conversion, overloading, operators for different datatypes, etc.? We probably need operators like lexically less than as well as the numeric operators.

2. Language -- the language of the value. Character-valued attribute based on tables of language codes.

3. Character set -- the character set of the value.

4. Content Authority -- used to indicate that a value is being specified relative to some authority list or source. Character-valued attribute.

5. Use -- the access point to which the value is being compared.

6. Scope of Use -- used to qualify USE type attributes; for example, place names within abstracts. Note: use and scope of use could be merged by permitting nested use attributes, essentially describing a path. Or scope of use could be eliminated completely by proliferating attributes such as "placename within abstract" as separate use attributes. The benefit to splitting them is that it is possible to constrain the scope of use attributes; not every use attribute need be applicable within any other use attribute.

7. Positioning -- left and right truncation, a requirement that the value completely match the element it is being compared to, etc. Note that there is some overlap here with regular expression mapping, which is not considered part of the positioning.

8. Expansion/interpretation -- this explains how to interpret the value within the term logically. It might include indications that the value is to be interpreted phonetically, that morphological variations are to be matched, that stemming is needed, that the term is a regular expression, etc. Case sensitivity or word sense disambiguation would also go here.

9. Format/structure -- this indicates how the value is to be interpreted syntactically. It would include the formats of names, the type of regular expression being used, and similar qualifications. Mainly used to qualify structured data that isn't being explicitly encoded in ASN.1 for the values. Need to discuss whether this should be character valued.

10. Query Management and Execution -- this type of attribute is used for two-way communication between client and server. Attributes of this type might include weights, counts, and stopword status.

Of these various attribute types, we believe that at least the following need to be repeatable: positioning, expansion/interpretation, format/structure. Scope of use needs to permit nesting (see above). We are not certain whether content authority needs to be repeatable.

Other observations, conclusions and assumptions

We did not examine issues involved in attribute sets for sort and scan. It seems highly desirable not to proliferate specialized attribute sets for these, however.

There needs to be a canonical method developed for mapping between use attributes and elements for attribute sets that want to exploit this correspondence.

It's going to be desirable to qualify operators with attributes. This may require yet one more attribute type. We did not explore this in much detail.

Range search problems should be solved at the operator level within the type 1 query, not through attributes at the term level. This is something that needs to be considered as part of the version 4 work on the standard.

Attributes should be used for two-way communication between clients and servers, and in particular to allow servers to provide more detail on how a search was interpreted. We have proposed a new attribute type for this. ZSTARTS is using this approach.

BIB-1 used too few attribute types, which was a major source of ambiguity. We have gone with a larger number of more narrowly defined attribute types.

A light-weight extensibility method is needed, at least for use attributes. This might be most simply handled by allocating some part of the use attribute name space to local fields.
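The ten attribute types and the repeatability rules described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a proposed encoding: the enum numbers simply follow the order of the list in this report and are not assigned type numbers; scope of use is treated as repeatable on the assumption that nesting is expressed as an ordered repetition; content authority is treated as non-repeatable only because the question is still open; and the "local:" prefix for local use values is a hypothetical way of reserving part of the use name space.

```python
from collections import Counter
from enum import Enum


class AttributeType(Enum):
    # Values follow the order of the list in this report (an assumption,
    # not assigned attribute type numbers).
    RELATION = 1
    LANGUAGE = 2
    CHARACTER_SET = 3
    CONTENT_AUTHORITY = 4
    USE = 5
    SCOPE_OF_USE = 6
    POSITIONING = 7
    EXPANSION_INTERPRETATION = 8
    FORMAT_STRUCTURE = 9
    QUERY_MANAGEMENT = 10


# Types the report lists as repeatable, plus scope of use, whose nesting is
# modeled here (as an assumption) by repeating the attribute in path order.
REPEATABLE = {
    AttributeType.POSITIONING,
    AttributeType.EXPANSION_INTERPRETATION,
    AttributeType.FORMAT_STRUCTURE,
    AttributeType.SCOPE_OF_USE,
}

# Hypothetical convention: use values with this prefix are local fields.
LOCAL_USE_PREFIX = "local:"


def repeated_violations(term_attributes):
    """Return the non-repeatable types that occur more than once in a term.

    term_attributes is a list of (AttributeType, value) pairs.
    """
    counts = Counter(attr_type for attr_type, _ in term_attributes)
    offenders = [t for t, n in counts.items() if n > 1 and t not in REPEATABLE]
    return sorted(offenders, key=lambda t: t.value)


def is_local_use(attr_type, value):
    """True when a use attribute value falls in the assumed local name space."""
    return attr_type is AttributeType.USE and value.startswith(LOCAL_USE_PREFIX)


term = [
    (AttributeType.USE, "title"),
    (AttributeType.POSITIONING, "left-truncation"),
    (AttributeType.POSITIONING, "complete-match"),
    (AttributeType.RELATION, "equals"),
    (AttributeType.RELATION, "less-than"),
]
violations = repeated_violations(term)  # only RELATION is repeated illegally
```

A check of this kind is one place where the semantics of repeated attributes, still to be defined for each type, would eventually be enforced.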
Next Steps

Assuming that this general approach seems reasonable after review by the ZIG, the attribute set group should meet about 2 more times before the end of the year to complete its work. This should include:

Finalizing definition of the attribute types.
Defining semantics of repeated or omitted attributes for each type.
Defining a core set (probably of non-USE attributes).
Defining a basic set of USE attributes.
Reality check by recoding STAS and perhaps other attribute sets.

The working group notes that there seems to be considerable interest in a BIB-2 attribute set, which it believes should be developed outside of the auspices of the ZIG. Assuming that this is done under NISO auspices, the final output of the attribute set architecture working group should be available before this committee commences work (presumably about January/February 1997 given the NISO ballot cycle).