Preliminary Report of the Working Group on Attribute Set Architecture

Clifford Lynch
9/30/96

Due to time constraints it was not feasible to take this report through a
review cycle with the participants from the Working Group prior to the ZIG
meeting in Brussels; this report should be considered highly preliminary,
reflecting only the author's view of the meeting's outcome. It will be
reviewed with attendees, after which a final version reflecting that review
will be posted to the Z3950IW list.

Introduction

It has been clear for some time that there are a number of structural and
scaling problems with attribute sets in Z39.50. At the Gainesville, Florida
ZIG meeting I prepared a working paper, Defining and Maintaining Attribute
Sets for Use with the Z39.50 Protocol: A Discussion Paper (note: this paper
will be attached as an appendix to the final version of this document),
which summarized many of these problems. This paper was subsequently
updated on September 3, 1996 to reflect the discussion at the Florida ZIG
meeting and posted to the Z3950IW list. As a result of the Florida ZIG
meeting, I was asked to convene a working group to make recommendations on
how to resolve these problems. A call for participation was issued in
August 1996 and the working group met at the Library of Congress in
Washington, DC on September 9, 1996. This document summarizes the results
of that meeting and the group's proposed plan for further work.

Attendees at the Attribute Set Working Group Meeting

Ray Denenberg, Library of Congress
Larry Dixon, Library of Congress
Manette Lazear, MITRE
Clifford Lynch, UC
Nassib Nassar, CNIDR
Paul Ober, NIST
Mark Piekenbrock, CAS
Cecilia Preston
Lou Reich, NASA
Les Wibberley, CAS
Joe Zeeman

(Note: several other people had planned to attend but were unable to make
this meeting due to weather -- Hurricane Fran -- or other scheduling
problems.)

Working Assumptions

The following assumptions guided our work:

1. Assume Z39.50 version 3; that is, attributes from multiple attribute
sets can be intermixed in a single query.

2. Except for attributes to support protocol functions such as EXPLAIN, the
ZIG should not be doing attribute set definition and maintenance; what is
needed is an architecture and a set of guidelines to permit domain experts
to define and manage attribute sets with only modest support from people
with Z39.50 expertise.

3. While the generality of the current attribute set mechanisms is very
useful, it's desirable to define some more constrained environment that
permits reasonable extensibility while still allowing attribute set
developers to build on a base of existing work. Further, it is useful to
define some sort of "basic" or "core" set of attributes that are available
to such attribute set developers.

4. In order to promote interoperability, it is desirable that attribute
sets be defined much more tightly than has been the case in the past.
Interactions among attributes from multiple sets need to have clearly
defined semantics as well if the mixing of attributes from multiple
attribute sets is to be really useful.

In addition to these assumptions, which were broadly supported by the ZIG,
the working group developed a number of additional working assumptions,
which are documented later in this report.

Attribute Set Classes and Properties within an Attribute Set Class

The working group developed the idea that we should define an attribute set
class. Attribute sets conforming to this class would have a common
structure -- that is, they would contain the same attribute types -- and
would have well-defined semantics when intermixed. The class would provide
some notion of inheritance, common behavior when attributes are omitted
(i.e. default values) or repeated, and a common approach to extensibility
for attribute sets that conform to this class. The working
assumption was that the class being defined should cover the vast majority
of current applications for which attribute sets have been defined, but
that it should not be extensible in terms of attribute types; a new type
would require the definition of a new class, and the mapping of existing
sets to that new class.  This attribute class need not cover every existing
application, however, and we recognized a tradeoff between complexity and
universality which should not be resolved in favor of absolute universal
coverage of existing applications.

Within this attribute class, the group envisioned a core attribute set
defining values for certain attribute types (probably the non-USE types),
and probably also a basic attribute set providing a basic set of values for
the USE attribute type (see below). These two attribute sets -- the basic
attribute set and the core attribute set -- would provide a foundation for
groups that wanted to define their own attribute sets.

One issue that was discussed at some length was the representation of
values within query terms. The key issue here is to what extent to use
ASN.1 datatyping, as opposed to representing values as strings with
explicit qualifiers (attributes) that help to describe the format of the
string. It seems clear that using basic ASN.1 types such as integer is
useful. Beyond these, however, even dates turn out to be problematic due to
the complexity of the basic ASN.1 date/time structure and the alternatives
that are in popular use. We need to develop a clearer definition of when to
use ASN.1 typing and when to use attribute-based qualifiers.
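
As an illustration only, the tradeoff can be sketched in Python. The
qualifier names ("yyyymmdd", "iso-8601-date") and the function
interpret_term are hypothetical and are not drawn from any attribute set;
the sketch simply contrasts a value that arrives natively typed (an
integer) with a date carried as a string plus a format qualifier.

```python
from datetime import date, datetime
from typing import Optional, Union

# Hypothetical format qualifiers mapped to strptime patterns. A real
# attribute set would have to define its own list of formats.
DATE_QUALIFIERS = {
    "yyyymmdd": "%Y%m%d",
    "iso-8601-date": "%Y-%m-%d",
}


def interpret_term(
    value: Union[int, str], format_qualifier: Optional[str] = None
) -> Union[int, str, date]:
    """Interpret a query term value under the two approaches discussed."""
    if isinstance(value, int):
        # Natively typed value: no qualifier is needed to interpret it.
        return value
    if format_qualifier in DATE_QUALIFIERS:
        # String value whose format is described by an attribute.
        pattern = DATE_QUALIFIERS[format_qualifier]
        return datetime.strptime(value, pattern).date()
    # Unqualified (or unrecognized) strings are passed through unchanged.
    return value
```

In this sketch the same date can arrive in two string formats and still be
compared consistently, at the cost of every server having to know the list
of qualifiers.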

Attribute Types within the Attribute Set Class

The group developed a rough definition of the attribute types that would be
contained within the Attribute Set Class under discussion as follows. Note
that there are several unresolved issues in this list; in these cases
design alternatives are discussed, and input is solicited.

1. Relation -- less than, greater than, equals, etc. Question: how to
reflect datatyping in these -- automatic type conversion, overloading,
operators for different datatypes, etc.?  We probably need operators such as
"lexically less than" as well as the numeric operators.

2. Language -- the language of the value. Character-valued attribute based
on tables of language codes.

3. Character set -- the character set of the value.

4. Content Authority -- used to indicate that a value is being specified
relative to some authority list or source. Character-valued attribute.

5. Use -- the access point to which the value is being compared.

6. Scope of Use -- used to qualify USE type attributes; for example, place
names within abstracts.

Note: use and scope of use could be merged by permitting nested use
attributes, essentially describing a path. Or scope of use could be
eliminated completely by proliferating attributes such as "placename within
abstract" as separate use attributes. The benefit of splitting them is that
it is possible to constrain the scope of use attributes; not every use
attribute need be applicable within any other use attribute.

7. Positioning -- left and right truncation, a requirement that the value
completely match the element it is being compared to, etc. Note that there
is some overlap here with regular expression matching, which is not
considered part of positioning.

8. Expansion/interpretation -- this explains how to interpret the value
within the term logically. It might include indications that the value is
to be interpreted phonetically, that morphological variations are to be
matched, that stemming is needed, that the term is a regular expression,
etc. Case sensitivity or word sense disambiguation would also go here.

9. Format/structure -- this indicates how the value is to be interpreted
syntactically. It would include the formats of names, the type of regular
expression being used, and similar qualifications.  Mainly used to qualify
structured data that isn't being explicitly encoded in ASN.1 for the
values. Need to discuss whether this should be character-valued.

10. Query Management and Execution -- this type of attribute is used for
two-way communication between client and server. Attributes of this type
might include weights, counts, and stopword status.

Of these various attribute types, we believe that at least the following
need to be repeatable: positioning, expansion/interpretation, and
format/structure. Scope of use needs to permit nesting (see above). We are
not certain whether content authority needs to be repeatable.
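
A minimal sketch of how the list above might be modelled, offered only as
an illustration. The names AttributeType, Attribute, REPEATABLE and
repetition_errors are hypothetical, the numbers are simply positions in the
list above rather than assigned protocol values, and treating scope of use
as repeatable is just one assumed way of representing nesting as an ordered
path. Content authority is treated as non-repeatable here only because the
question is still open.

```python
from dataclasses import dataclass
from enum import Enum


class AttributeType(Enum):
    """The ten attribute types, numbered as in the list in this report."""
    RELATION = 1
    LANGUAGE = 2
    CHARACTER_SET = 3
    CONTENT_AUTHORITY = 4
    USE = 5
    SCOPE_OF_USE = 6
    POSITIONING = 7
    EXPANSION_INTERPRETATION = 8
    FORMAT_STRUCTURE = 9
    QUERY_MANAGEMENT = 10


# Types that may occur more than once on a single term (assumption: scope
# of use nesting is modelled as repetition in path order).
REPEATABLE = {
    AttributeType.POSITIONING,
    AttributeType.EXPANSION_INTERPRETATION,
    AttributeType.FORMAT_STRUCTURE,
    AttributeType.SCOPE_OF_USE,
}


@dataclass(frozen=True)
class Attribute:
    attribute_set: str        # hypothetical set name, e.g. "core" or "basic"
    attr_type: AttributeType
    value: str                # hypothetical value label


def repetition_errors(attributes):
    """Return the non-repeatable types that occur more than once on a term."""
    seen = set()
    errors = []
    for attr in attributes:
        repeated = attr.attr_type in seen
        if repeated and attr.attr_type not in REPEATABLE:
            if attr.attr_type not in errors:
                errors.append(attr.attr_type)
        seen.add(attr.attr_type)
    return errors
```

Because each attribute carries the name of the set it comes from, a term
can mix attributes from several sets while the check above depends only on
the attribute type, which is the property the class is meant to guarantee.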

Other observations, conclusions and assumptions

We did not examine issues involved in attribute sets for sort and scan. It
seems highly desirable not to proliferate specialized attribute sets for
these, however.

There needs to be a canonical method developed for mapping between use
attributes and elements for attribute sets that want to exploit this
correspondence.

It's going to be desirable to qualify operators with attributes. This may
require yet one more attribute type. We did not explore this in much
detail.

Range search problems should be solved at the operator level within the
type 1 query, not through attributes at the term level.  This is something
that needs to be considered as part of the version 4 work on the standard.

Attributes should be used for two-way communication between clients and
servers, and in particular to allow servers to provide more detail on how a
search was interpreted. We have proposed a new attribute type, Query
Management and Execution, for this. ZSTARTS is using this approach.

BIB-1 used too few attribute types, which was a major source of ambiguity.
We have gone with a larger number of more narrowly defined attribute types.

A light-weight extensibility method is needed, at least for use attributes.
This might be most simply handled by allocating some part of the use
attribute name space to local fields.
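
One way such an allocation could work is sketched below, purely as an
illustration. It assumes use attributes are numeric and that values at or
above a boundary are reserved for local fields; the boundary value
LOCAL_USE_START and the functions is_local_use and define_use are
hypothetical, since the group has not decided how the name space would be
divided.

```python
# Hypothetical boundary; the report does not specify one.
LOCAL_USE_START = 5000


def is_local_use(value):
    """True if a use attribute value falls in the range reserved for local fields."""
    return value >= LOCAL_USE_START


def define_use(registry, value, name, local):
    """Add a use attribute, refusing values on the wrong side of the boundary."""
    if local != is_local_use(value):
        raise ValueError(f"value {value} is outside the permitted range")
    if value in registry:
        raise ValueError(f"value {value} is already defined")
    registry[value] = name
    return registry
```

Under this scheme a site could add its own fields without coordinating with
the maintainers of the attribute set, while shared values stay under their
control.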

Next Steps

Assuming that this general approach seems reasonable after review by the
ZIG, the attribute set group should meet about two more times before the end
of the year to complete its work. This should include:

Finalizing the definition of the attribute types.
Defining semantics of repeated or omitted attributes for each type.
Defining a core set (probably of non-USE attributes).
Defining a basic set of USE attributes.
Performing a reality check by recoding STAS and perhaps other attribute sets.

The working group notes that there seems to be considerable interest in a
BIB-2 attribute set, which it believes should be developed outside of the
auspices of the ZIG. Assuming that this is done under NISO auspices, the
final output of the attribute set architecture working group should be
available before this committee commences work (presumably about
January/February 1997 given the NISO ballot cycle).