Defining and Maintaining Attribute Sets for Use with the Z39.50 Protocol: A Discussion Paper

Clifford Lynch

January 31, 1996; revised February 1, 1996

Introduction

The proliferation of attribute sets for use with the Z39.50 protocol raises formidable interoperability and maintenance problems. This document reviews the history of attribute sets as a construct within the Z39.50 standard, identifies a number of the problems that have emerged, and offers various proposals and principles that might be used to address these problems. These are outlined for discussion at the Z39.50 implementors' group (ZIG) meeting in Gainesville, Florida in February 1996. The outcome of those discussions will provide guidance for the development of a revised statement of principles and design approaches, and (hopefully) clarify the possible need for revisions and clarifications in future versions of the Z39.50 standard, as well as help to identify what, if any, supplementary documents will need to be developed to provide guidance on attribute set design and maintenance.

The Evolution of Attribute Sets in Z39.50

Attribute sets define access points that are used in the construction of Z39.50 queries. In the early days of Z39.50 version 2 there was essentially one attribute set in wide use; this was called BIB-1 and was oriented towards the searching of databases of bibliographic records (though it has also been used with some success in conjunction with other, related, types of records, such as abstracting and indexing database records, and full text). It was assumed by the developers of the version 2 standard that as time went on additional attribute sets would be defined that met the needs of other communities dealing with non-bibliographic databases, such as databases of scientific or geo-spatial information.
The design of Z39.50 version 2 was not informed by much experience with the development, use and maintenance of additional attribute sets, and numerous problems became evident as new attribute sets emerged. Among the most serious of these problems:

1. Duplication. In Z39.50 Version 2 a query is constructed using a single attribute set. Thus, it is necessary to duplicate common attributes needed in a range of applications with unique attribute set requirements (such as the relational attribute of equality) in each specialized attribute set, since one attribute set could not explicitly "extend" or be used as a supplement to another as a protocol construct. One attribute set could not be viewed as extending another because the protocol did not address interrelationships among attribute sets; one attribute set could not supplement another because of the limitation of the type 1 query, since attributes had to be drawn from a single attribute set for a given query.

In practice, the ability of one attribute set to extend another was achieved by mapping the tag space of a base attribute set (normally BIB-1) directly into the tag space of another attribute set (as was done with STAS, the scientific and technical attribute set, for example); servers handled the semantic mappings by more or less ignoring the attribute set ID and assuming that if an attribute had the same tag (or perhaps the same tag modulo some offset) in two attribute sets it had the same semantics. This approach was undesirable for a number of reasons. Ongoing management of the tag space for the various attribute sets without duplication was a problem. The structure (classes of attributes) of the base set was imposed on the extending attribute set -- or perhaps more accurately, should have been imposed, since the meaning of a given class of attributes from one attribute set, translated into the context of a different attribute class structure in an extending set, was typically not explicitly defined.
Definitions of semantics for attributes in the extended attribute set were also compromised because the rules for inheritance of semantics were only implied, not explicit, and two independent groups (the maintainers of the base and the extended attribute sets) could both alter the meaning of attributes independently and asynchronously: as a case in point, if the meaning of a BIB-1 attribute changed or was clarified, would this automatically propagate into all attribute sets that embedded BIB-1?

As multiple attribute sets developed, the notions of base and extended attribute sets became more problematic; one might want to use the logical union of multiple attribute sets, including several which extended a common base set (perhaps with different embeddings of the base set into the extended attribute set tag spaces). Attribute sets moved away from a simple hierarchy of a base set and an extended set built on that base set to a series of non-orthogonal extensions to a common base set. This led to a situation where it seemed as if it would be desirable for every attribute set to embed all (or at least many) other attribute sets, in effect returning the protocol to the use of a single global "flat" attribute set rather than a set of independently developed and maintained attribute sets, with all of the centralized maintenance problems that such a single (enormous) logical attribute set implies.

2. Interoperability problems due to attribute set proliferation. Even if an extended attribute set essentially embedded a well-known attribute set as a base attribute set, a Z39.50 client or server had no way of knowing this without out-of-band implementor agreements. The net effect was to proliferate partially redundant attribute sets without any automatic way to permit clients and servers to recognize that they contained well-known, familiar subsets of attributes that were part of an embedded base set.

3. Ambiguities in the semantics of attributes.
The standard is unclear about the assumptions that the server should make when attributes are repeated or omitted. It is not even clear whether these semantics are specific to a given attribute set or whether they are more general properties of the type 1 query's semantics; put another way, it is not clear from the standard exactly what properties have to be specified in the definition of a new attribute set. Version 2 (and to some extent version 3) also suffers from problems related to ambiguities in the datatyping, representation and normalization of values for attributes.

4. Problems specific to the BIB-1 attribute set. The semantics of BIB-1, and the proper behavior of a server that supports BIB-1, are still not defined in rigorous detail in current versions of the standard; indeed, they are not defined at all in the version 2 standard. This led quickly to a series of glaring and sometimes appalling interoperability problems even among bibliographic clients and servers. It remains ambiguous as to whether the semantics of BIB-1 are really part of the Z39.50 standard, or whether they are currently informal implementor agreements which might be appropriately codified in some separate formal standard in the future. In addition to the lack of good definition documents, BIB-1 suffered from the lack of a good scope statement (and hence gave rise to many debates about what attributes were appropriate to include or exclude for BIB-1). BIB-1 also suffered from a lack of consultation with the broad community concerned with bibliographic records, their interchange formats, and their retrieval; it was developed by the Z39.50 implementors' group, which has expertise in protocol development but much less comprehensive expertise in matters related to bibliographic records, interchange formats and retrieval.

Version 3 of Z39.50 attempted to address some of these problems in light of growing experience with additional attribute sets.
A key change in version 3 of the protocol is that a single query can contain attributes selected from multiple attribute sets; this avoids (or at least reduces) the need for one attribute set to embed another. Z39.50 version 3 also includes the first attempt at defining an EXPLAIN facility for the interchange of certain Z39.50-related metadata; this could, at least in theory, be used to convey information about the relationships among multiple attribute sets, which had to be established "out of band" (that is, outside the scope of protocol mechanisms) in version 2.

There are still major problems with attribute sets in Version 3. BIB-1, while included in the Z39.50 version 3 specification, is still not as well defined as it should be, as discussed above. The introduction of the additional flexibility to intermix attributes from multiple attribute sets in version 3 actually introduced substantial new ambiguities into the standard. There is no useful guidance about the semantics of commingling attributes from multiple attribute sets in a single Z39.50 query; this problem is particularly serious when the attribute sets being intermixed are not "conformant" (i.e., the classes of attributes are not the same from one attribute set to another, as opposed to simply multiple attribute sets with identical attribute classes but, for example, different use attributes). Duplicated and omitted attribute semantics remain murky; the question of whether these are attribute-set specific, specific to the combination of attributes used in a given query, or consistently part of type 1 query semantics has still not been well addressed, and has now become a more serious issue, since it is effectively impossible for the developer of a single attribute set to define interactions between attributes in that set and attributes in other sets in a comprehensive fashion.
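
As an illustration only, the intermixing of attribute sets discussed above can be sketched in Python. This is not drawn from the standard: the class names, the set names "SET-A" and "SET-B", the registry of attribute types and the numeric values are all assumptions made for the example. It shows a query term whose attributes each name their own attribute set, and a check for whether the sets being mixed are "conformant" in the narrow sense of defining the same attribute types.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Attribute:
    set_id: str     # attribute set the attribute is drawn from
    attr_type: int  # attribute type (class) within that set
    value: int      # attribute value within that type


@dataclass(frozen=True)
class Term:
    attributes: Tuple[Attribute, ...]  # may come from several attribute sets
    value: str


# Hypothetical registry of the attribute types each set defines.
# "SET-A" and "SET-B" are invented names; all numbers are illustrative.
SET_TYPES: Dict[str, FrozenSet[int]] = {
    "BIB-1": frozenset({1, 2, 3, 4, 5, 6}),
    "SET-A": frozenset({1, 2, 3, 4, 5, 6}),
    "SET-B": frozenset({1, 2, 7}),
}


def sets_used(term: Term) -> Set[str]:
    """Return the identifiers of every attribute set the term draws on."""
    return {a.set_id for a in term.attributes}


def is_conformant_mix(term: Term, registry: Dict[str, FrozenSet[int]]) -> bool:
    """True when every set used in the term defines the same attribute types.

    Only the type numbers are compared; nothing here establishes that the
    types carry the same semantics in each set.
    """
    return len({registry[s] for s in sets_used(term)}) <= 1


same_classes = Term(
    (Attribute("BIB-1", 1, 4), Attribute("SET-A", 2, 3)), "z39.50"
)
different_classes = Term(
    (Attribute("BIB-1", 1, 4), Attribute("SET-B", 7, 1)), "z39.50"
)
```

In this sketch `same_classes` passes the check and `different_classes` does not. Even a passing check leaves open the questions about repeated and omitted attributes, which no single attribute set definition can settle on its own.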
Finally, it is unclear whether the attribute-value structure that is currently used in the type 1 query is sufficiently flexible to support the full range of requirements for Z39.50 application environments; this question calls for critical re-examination in light of the work in areas such as advanced natural language and IR queries.

While there is growing experience with attribute sets beyond BIB-1, it remains unclear whether future applications will call for a limited number of attribute sets or a massive proliferation of attribute sets. There is one emerging school of thought that says that attribute sets should be tied to database data elements for various application environments; within this viewpoint an attribute set (or at least a set of use attributes) becomes a natural dual to an SGML DTD, for example. Means of specifying these relationships have not been examined, and in some sense go beyond the current scope of Z39.50 but may be critical to the development of future Z39.50 applications.

All of these problems are making the version 3 capability to use multiple attribute sets within a single query very difficult to exploit. There is, I believe, a general sense within the Z39.50 implementor community that the current ad-hoc approach to attribute set definition and maintenance is rapidly approaching its limits and that some fundamental architectural modeling is needed to establish principles for the future development and management of attribute sets.

Proposed Assumptions and Design Principles for Attribute Sets

In addressing these attribute set problems, I propose the following principles and assumptions as a point of departure:

1. Work on attribute set architecture should assume the capabilities of version 3. Backwards compatibility and interoperability with version 2 and its limitation of a single attribute set per query should not confine our thinking.
It will take a long time for new attribute sets to become widely established, particularly if they need to be defined (or redefined) and documented according to guidelines we set out; by the time this happens, version 3 implementations will be widely available. It is likely that as we re-think attribute set definition, virtually all existing attribute sets (including, for example, STAS as well as BIB-1) will require revisions. There is likely to be a protracted and complex conversion that will require consideration of a number of limited-term interoperability and backwards compatibility issues, but backwards compatibility should not drive our architectural model.

2. Other than special purpose attribute sets needed for the operation of the protocol (such as attribute sets used with EXPLAIN databases) it should not be the role of the Z39.50 implementors' group to define specific attribute sets; this should be done by groups with content expertise in the areas in question, subject to broad architectural guidance from the ZIG about how to structure, register, manage and maintain an attribute set. The ZIG should not serve as a maintenance agency for attribute sets that do not play a structural role in the protocol. Further, it is desirable to create a situation in which various groups can independently develop and maintain attribute sets, with only minimal coordination (such as the assignment of attribute set ID) by the overall Z39.50 maintenance agency. One consequence of this principle is that we should plan to phase out the BIB-1 attribute set and replace it with a new attribute set for bibliographic applications developed by the bibliographic community under non-ZIG (perhaps NISO) auspices, or at least to transition the management of the current BIB-1 attribute set.
My personal view is that BIB-1 is so profoundly compromised and so widely implemented that it would be easier to start the development of a new successor attribute set that is clearly different from BIB-1 (though much of the work and experience on BIB-1 would be important input to the development of such a successor attribute set).

3. We should recognize that as intellectual constructs attribute sets may have value beyond the context of Z39.50, and that they may (but do not necessarily) have complex and close relationships to data element dictionaries, record interchange formats, document structure definitions, metadata interchange structures and other areas. This would appropriately be the domain of the attribute set definition groups to establish, although general guidelines in this area that are established by the ZIG (perhaps working in conjunction with other groups) may be valuable. Conversely, ongoing work in defining data or metadata element sets in other contexts may naturally give rise to Z39.50 attribute sets suitable for use in conjunction with collections of data or metadata for that application domain; facilitating such reuse should be a goal in developing an attribute set architecture.

4. We need to establish and very carefully document those properties of attribute sets and their semantics within type 1 queries that are properly part of the protocol, as opposed to those that are within the scope of the attribute set itself, as part of our guidance to attribute set developers. Within this effort we need to consider the issues that arise in intermixing of attributes from multiple sets within a query; the combinatorial explosion suggests to me that rules for intermixing attributes from multiple attribute sets, and perhaps for the semantics of duplicate and omitted attributes, need to be part of the type 1 query definition.
In cases where we don't know what to specify, we should be clear about what is undefined and recommend server actions that promote interoperability and predictable, comprehensible behavior across servers. As a complement to this effort, we should also specify explicitly what needs to be defined as part of an attribute set definition.

5. We should consider carefully whether the notion of an attribute set class is a useful one. An attribute set class is a template for attribute sets where each attribute set in the class has the same attribute types and these types have common general semantics, although the specific attribute values will vary from one set to the next within the class, and attribute value semantics would be defined as part of the individual attribute set definitions. Within a class, we should be able to define meaningful inheritance rules and relationships among attribute sets, and also rules for determining the semantics of attributes from multiple sets within a single query. I believe that this will be very difficult to extend from one class of attribute sets to the next (although concepts of subclasses may be of some use here as well). It seems to me that we should consider the experience with STAS, for example, in developing our thinking on attribute set classes, and not automatically assume that BIB-1 is the correct template for defining a primary or default attribute set class.

6. We should also consider whether there are some specific attribute types (in particular use attributes, relational attributes, and perhaps normalization attributes) which might be architectural primitives that transcend specific attribute set classes and which might deserve definition and discussion within the standard itself. We can partition the world of attributes both by attribute type and by source attribute set.

7.
As attribute set classes (and perhaps distinguished common attribute types) are established, one role of the registration of a new attribute set (and thus of the overall Z39.50 maintenance agency) may be to identify it as a member of an attribute set class, or to recognize that it is using distinguished common attribute types. Registration may also involve the definition of inheritance or mapping relationships, and this may have implications for the role of the maintenance agency for the standard (and/or for the attribute set in question).

8. We need to consider whether we want to develop an attribute set class or classes that make an explicit linkage between data elements in records and attributes in attribute sets; this is a potentially commonplace application scenario. It may be appropriate to add support elsewhere within Z39.50, such as within the EXPLAIN database, to facilitate use of such attribute sets.

9. We need to consider whether the type 1 query, in conjunction with appropriate attribute sets (or attribute set classes), has sufficient semantic and syntactic flexibility to represent queries that involve extensible abstract datatype systems and extensible query operator systems (such as Postquel or SQL3). An inability to support the mapping of such queries (within the broad constraints that limit Z39.50 to retrieval operations) should be regarded as a warning indicator about limitations of the Z39.50 protocol to support the next generation of applications.
It should be recognized that this type of extensible query language does raise some serious questions about semantic interoperability (since a client needs to know the semantics of the ADTs and operators for a given target database), which Z39.50 has tried to circumvent through the use of well-known attribute sets (the solution to this in the post-relational database world is standardized class libraries). While Z39.50 queries that are mappings from such post-relational query languages are going to have some interoperability limitations, the ability to perform the mapping does serve as a good benchmark for flexibility in our query definitions and attribute set architecture.

10. A clear definition of attribute set semantics (or attribute set class semantics) affords us an opportunity to make explicit the distribution of function between clients and servers in areas such as the normalization of values that accompany USE attributes. There are no "right" answers here, but one of the problems with the standard as it exists today is that it does not permit a client to be clear about the functions that it is taking responsibility for and those that it is delegating to the server within the processing of a Z39.50 query. We should use this opportunity to enrich and clarify the protocol functions that allow the identification of situations when responsibility for activities such as value normalization is being delegated to the server by the client; this should result in more predictable behavior as clients and servers interact, and improved interoperability as perceived by users of Z39.50 implementations.
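
To make the last principle concrete, the following minimal Python sketch shows one way a client might state explicitly whether it has normalized a value itself or is delegating that work to the server. Everything in it is hypothetical: the protocol as it stands defines no such marker, and the names `Normalization`, `SearchTerm`, `server_normalize` and `effective_value`, along with the normalization rule itself, are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Normalization(Enum):
    """Hypothetical marker for who is responsible for normalizing a value."""
    CLIENT_DONE = "client"  # client already normalized; server must not alter
    DELEGATED = "server"    # client hands normalization to the server


@dataclass(frozen=True)
class SearchTerm:
    value: str
    normalization: Normalization


def server_normalize(value: str) -> str:
    # Illustrative rule only: collapse runs of whitespace and fold case.
    # A real attribute set definition would have to specify its own rules.
    return " ".join(value.split()).lower()


def effective_value(term: SearchTerm) -> str:
    """Return the value the server would actually match against."""
    if term.normalization is Normalization.DELEGATED:
        return server_normalize(term.value)
    # The client took responsibility, so the value is used exactly as sent.
    return term.value


delegated = SearchTerm("  Moby   DICK ", Normalization.DELEGATED)
client_side = SearchTerm("moby dick", Normalization.CLIENT_DONE)
```

With an explicit marker of this kind, both terms above resolve to the same matching value, and neither side has to guess what the other has already done. Whether such a marker belongs in the type 1 query, in an attribute type, or elsewhere is exactly the kind of question this paper leaves open for discussion.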