Friday, January 23, 1998
Orlando, Florida
Shuttle launch! We started late this morning because most of us were out late last night watching the shuttle launch. It was marvelous! It was as if the sun were coming up at 10:00 at night. A wave of light and low rumbling passed over us, followed by darkness punctuated by a slowly moving, elongated fireball. The booster rockets dropped like tinsel from a Christmas tree, followed by the passing red lights of the tracer plane assigned to track them for recovery. Circles seemed to appear in the layers of cloud cover as the shuttle burned through them. The shuttle receded in the night sky in the shape of a star, twinkling until it disappeared. It was awesome.
October 19-21, Madrid, Spain. (Thank you, Mark Pearce and Makx Dekkers/EFILA.)
Denenberg: There is no urgency for this service, but the need has been expressed for years. The goal of this discussion is to determine whether the proposal is moving in the right direction and to specify requirements. The current proposal is a rough draft that definitely needs revision.
C Lynch commented that de-dup was a very useful and powerful service, both general and fairly unique. The draft proposal goes farther than de-duping in the sense that it also provides a way of clustering a result set. He strongly supports the inclusion of something like this in the standard because it opens up a new set of applications. It is useful to look at a service like this at a high level before getting bogged down in details. How much interoperability will we have with a client contacting multiple servers and trying to get some control over this? There will be much variation from server to server. Complex parameters and detailed structures are not likely to be very useful because most servers won't be parameterized that way. We may be able to handle this with just text strings. We should keep interoperability in mind as we sort through the parameters, set criteria and provide examples.
Hinnebusch: interoperability between different servers is very important, but this could be heavily used with captive clients. Maybe we should define an external for private transmission. Wibberley: as long as the model is general enough to accommodate different applications, there will be some interoperability and some private stuff. The proposal covers basic needs. (They de-dup now, but not using Z39.50. They would like to be able to perform their current de-dup functions within Z39.50.) He suggests the following additional requirements:
Be able to express what the de-dup action is.
Cluster records are very important. Davidson: in the Universe project, the client gets only the cluster record that represents everything it found in the union catalog. Stovel: the RLG database is already clustered using a sophisticated algorithm, so users already get a cluster record that includes a repeatable field indicating each member in the cluster. D Lynch: we do this with A&I stuff; so when we de-dup we send a cluster record and users can say which database they prefer. Denenberg: is the cluster record choice here substantially different from maintaining one record and including the others as metadata? D Lynch: the cluster record needs to behave like other records.
Hinnebusch is doing a project that clusters the records within result sets. De-duping is perhaps not the right term, because they are doing true clustering: the server does not say that the records are equal, but that they are members of the same class. If you do it as a single record, that's one solution, but you could potentially produce sets of result sets as the outcome of this.
Wibberley thinks there's a whole class of analysis and processing of result sets that could be addressed. Years ago we talked about filtering result sets, grouping, etc. The proposal could be generalized to something more than de-duping, but let's nail down de-duping first. He suggested a compromise on clustering and metadata: if we model this so that one of the options for de-dup was the creation of metadata which grouped stuff at the result set level, and then define a present using a cluster schema where GRS-1 is defined to carry metadata, we could bring the metadata back with the record. We need to use a schema that defines the cluster structure. Don't model the result set as a single cluster record; rather, each record in the result set is a cluster record with metadata. Draw on the metadata that was created as part of de-dup.
Loy proposed an alternative: one record at the result set level with result set level metadata. Denenberg is confused. Wibberley's goal is that the user can look at groupings of equivalent records and see the relationship between them. The model he proposes has metadata with each record in the result set, so that you could individually retrieve those records and look at their metadata. To get the cluster concept, we could retrieve by equivalence class and retrieve all records in that class. Hinnebusch paraphrases: every record in the result set indicates its equivalence class; his concern with this is hierarchy.
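The model Wibberley and Hinnebusch describe can be sketched in a few lines: each record carries metadata naming its equivalence class, so a client can group the result set by class or pull back every member of one class. The record shapes and field names below are hypothetical illustrations, not actual GRS-1 structures.

```python
# Sketch of the per-record equivalence-class model discussed above.
# Field names ("equivalence_class", etc.) are invented for illustration.
from collections import defaultdict

def group_by_class(result_set):
    """Group records by the equivalence-class id carried in each record."""
    clusters = defaultdict(list)
    for record in result_set:
        clusters[record["equivalence_class"]].append(record)
    return dict(clusters)

def members_of_class(result_set, class_id):
    """Retrieve every record belonging to one equivalence class."""
    return [r for r in result_set if r["equivalence_class"] == class_id]

result_set = [
    {"id": 1, "title": "Moby Dick", "equivalence_class": "c1"},
    {"id": 2, "title": "Moby-Dick", "equivalence_class": "c1"},
    {"id": 3, "title": "Walden",    "equivalence_class": "c2"},
]
clusters = group_by_class(result_set)
```

This keeps the same number of records in the result set while still letting the client present them as clusters, which is the compromise under discussion.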
Mazur assumed that this proposal addressed records in one language only. Hinnebusch: the client should tell the server what it wants if the server can de-dup across languages. USPTO has no way to get to Japanese information other than through abstracts. Hinnebusch and C Lynch: this proposal can handle that if the client can tell the server to do it and the server knows how to do it. Wibberley: we should be able to recognize equivalent records in different languages. The proposal allows you to specify criteria for de-duping. There are many regular expressions that indicate that there must be a certain level of equivalence to select for de-duping. The client and in some cases the user need to be able to specify the criteria for de-duping.
Debate about whether we can do this in a search or whether we need a separate service: Poul Henrik: relational databases already de-dup, so specifying selection criteria seems odd. C Lynch's counter-argument: one of the problems with de-duping in this kind of clustering is that it doesn't lend itself to simple selection criteria. He thinks having a function like this is a substantial advantage for Z39.50 over a simple distributed relational model. This almost subjective assessment of when things are duplicates is very useful. Poul Henrik: what's the difference between specifying selection criteria as another search, and the service proposed here? Just define the access point or search criteria. Levan: the problem is that the concept of a duplicate is in the eye of the beholder and cannot be accomplished algorithmically. The end user will not be able to construct this search. The selection criteria cannot be expressed in a Boolean query. D Lynch: it's not that there isn't an algorithm to do it; there isn't a Boolean query to do it.
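Levan's point, that a "duplicate" is a fuzzy algorithmic judgment rather than something a Boolean query can express, can be illustrated with a toy normalization-based matcher. The normalization rules here are invented for illustration; real de-dup algorithms are far more sophisticated.

```python
# Toy duplicate detector: the normalization rules are illustrative only.
import re

def dedup_key(record):
    """Normalize title and author so near-identical records compare
    equal -- a fuzzy judgment that no Boolean query can express."""
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    names = re.sub(r"[^a-z]+", " ", record.get("author", "").lower()).split()
    surname = names[0] if names else ""
    return (title, surname)

def detect_duplicates(records):
    """Return pairs of records judged to be duplicates of each other."""
    seen = {}
    pairs = []
    for r in records:
        key = dedup_key(r)
        if key in seen:
            pairs.append((seen[key], r))
        else:
            seen[key] = r
    return pairs

records = [
    {"title": "Moby-Dick; or, The Whale", "author": "Melville, Herman"},
    {"title": "MOBY DICK, or the Whale",  "author": "Melville, H."},
    {"title": "Walden",                   "author": "Thoreau, Henry David"},
]
pairs = detect_duplicates(records)
```

The two Melville records differ in punctuation, capitalization, and author form, so no exact-match Boolean predicate over the stored fields would pair them; only the server's normalization does.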
Denenberg is still unclear about the metadata associated with clusters. D Lynch: the real question is whether, if you have two equivalent records, there are two records in the result set or one when you de-dup. If there are two, this approach means that the records will be kind of sorted by cluster. Hinnebusch: we've already defined Use attributes for this, so you don't have to have them dispersed; you can control the dispersal. D Lynch: yes, but if you haven't done that, cluster records could be all over the result set. Hinnebusch: so do a search for all items in that result set in that equivalence class.
Denenberg: people want to get the same number of records that they started with, but they're kind of sorted by equivalence class. They might want to determine which ones to throw out. The proposal doesn't address that. How does the user move through records from one equivalence class to the next equivalence class if there are many records in the result set? Should the metadata indicate how many records are in the class to serve as a pointer to the end of the class? That won't help if records in the result set are not sorted by class.
Hinnebusch: maybe we need to address clustering before de-duping. Levan: this is an easy switch in the request: I want nothing but cluster records, or I want the same number of records as I started with. Why are we fighting? Bull did a duplicate detection system that removes duplicates but also enables users to see all of the records. Stovel: is there a requirement to de-dup just a segment of a result set that we should allow for? (??): from the end user's point of view, if there is de-dup and records are tossed out, what if the record that is kept refers to items remote from the user, when there may be a closer copy?
Hinnebusch: the input and output result set may be the same.
Wibberley: users can specify the criteria for the priority of ordering of duplicates, e.g., by database. You can also say, "I only want to retain the top three duplicates in an equivalence class." Send all requirements to Denenberg, who will revise the proposal for discussion at the next meeting -- provided that we keep focused on de-dup service and don't expand to clustering. Stovel wants some time to think about this before we finalize a service. Denenberg should get all requirements first, then we'll review. C Lynch encourages us to focus on abstract de-dup functional requirements before getting into details of how to do it in Z39.50. D Lynch: we don't want new requirements to come up at the next meeting; we want them up-front. Hinnebusch: that's unrealistic and historically not how the ZIG operates. Send requirements that may be related to de-dup or clustering; we'll sort them out later. The first pass may not address more esoteric functionality.
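Wibberley's example, ordering the duplicates within an equivalence class by a database-priority list and retaining only the top three, might look like the following sketch (the priority scheme and database names are illustrative):

```python
# Sketch of priority-ordered pruning within one equivalence class.
# Database names and the priority list are hypothetical examples.
def prune_class(members, db_priority, keep=3):
    """Order one equivalence class by database preference, then
    retain only the first `keep` duplicates."""
    rank = {db: i for i, db in enumerate(db_priority)}
    ordered = sorted(members, key=lambda r: rank.get(r["database"], len(rank)))
    return ordered[:keep]

members = [
    {"id": 1, "database": "OCLC"},
    {"id": 2, "database": "RLG"},
    {"id": 3, "database": "LOCAL"},
    {"id": 4, "database": "OCLC"},
]
kept = prune_class(members, db_priority=["LOCAL", "RLG", "OCLC"])
```

This also speaks to the earlier end-user concern: a priority list that favors local or nearby databases determines which member of the class survives the pruning.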
If we choose the path of (a) assigning new OIDs to existing definitions, there is no problem with the standard, but there are interoperability problems. If we (b) keep the same OIDs and modify the definition, there are fewer interoperability problems. Path (b) can be done as an implementors agreement (not a defect report, though that is more expedient). Denenberg plans to treat enhancements to public definitions as implementors agreements that are slated for integration into the next published version of the standard whenever that happens. Is path (b) an acceptable approach? Levan: yes. Stovel: we must look at this on a case-by-case basis. Denenberg agrees with Stovel.
Bull: it would be nice to have a mechanism for crafting future extensions into a service now. Denenberg: yes. Bull wants a machine bit for expansion in the new de-dup service. He wants a tiny external that may return a list of OIDs showing what services, attendant services, etc., are supported. (This is necessary because few implementations support Explain). Denenberg: that's beyond the scope of the current discussion.
Waldstein: we want the ability to be able to add more (the dot-dot-dot ASN.1 extension structure). Hammer: as someone who trains new programmers in Z39.50, he's worried about the future when all these implementors agreements are separated from the published standard. It could be an awful mess. Denenberg: is the concern the lack of formality? Hammer: in part. We want to know what we're working on (which may be distributed across many Z39.50 documents); also, will the protocol be a stack of hacks two years from now? Dekkers: it should be clear that the Maintenance Agency web page IS the standard.
Turner: the National Library of Canada serves as the Maintenance Agency for the ILL protocol, even though ISO hasn't made that official. They have the same issues for the ILL protocol and don't want to have to do an ISO ballot for every change. Her concern is that there be strong approval on the changes so that when we collect these changes and advance them at some future point, we don't want a problem when they finally go to ballot.
Stovel: can NISO be approached to publish something other than the standard? Will NISO publish our implementors agreements and make them available along with the printed standard? Denenberg: that may be worth looking into.
Poul Henrik: we have the same issue in Europe since EWOS ended. They've set up something called ISSS to address how to handle implementors agreements and publicly available (albeit pseudo official) standards. They refused to hand the EFILA copyright for ISSS over to ISO. The boundaries of the working groups are somewhat unclear. Maybe they should publish their documents as working papers for the ZIG.
Waldstein: maintenance and gathering of the documents is crucial for interoperability. Wibberley suggested and the ZIG agreed that the Maintenance Agency should (if possible) keep an HTML version of the standard with links to the implementors agreements at the appropriate places.
CIMI approved the proposal.
Hammer is concerned with Denenberg's proposal because of its handling of nested database schema. The notion of just inserting a schema in the middle of a stream of tags seemed simplistic, but he has no concrete examples because they don't yet do nested records. He suggested using compSpec and an element set name to ask for what you want, but is willing to go along with the proposal if no one objects.
Levan is seeing development in the metadata community that would call for exactly what's in this proposal. For example, Dublin Core uses schema (which they call "namespace") and they sometimes change schema in the middle of a record.
Denenberg: this proposal isn't intended to be a switching schema; its only purpose is to qualify a tagtype (append tag type with schema identifier). The intention is to say which tagset this tagtype is referring to. The prose at the bottom of the proposal should be rewritten to clarify the scope of functionality and ASN.1 comments should be added. Loy: if the proposal is approved, will Denenberg provide the ASN.1?
Denenberg considers the proposal approved because no one objected. He will eventually annotate this change and treat it as an implementors agreement as described above.
Waldstein is not aware of anything in a profile that a client can't discover via error messages. Levan: you can't discover all behavior through diagnostics.
Denenberg: let's distinguish between discovering what profile a server supports and negotiating behavior. If the simple behavior is to find out (discover) what profile a server supports, do that through Explain. If we're talking about dynamically negotiating behavior, Denenberg feels similar to Waldstein: do this through negotiating services and reacting to diagnostics. We've never really discussed wanting to negotiate any behavior except character sets. Do we want OIDs for profiles, or can we use names (Explain)? Do we also want some class of OIDs for behavior?
Levan's counter-proposal: it would be trivial to negotiate OIDs on the Init request (rather than investing in the overhead of Explain). If the server doesn't know about an OID, it ignores it. The server returns which OIDs it will respond to. This makes it lightweight and outside of the standard; we can add it to the standard later on. Wibberley: the mechanism Levan suggests is straightforward, but having been down that path of "just doing it," it eventually comes back as work to be redone. Why can't we define a negotiation record that would simply support a set of OIDs that could be negotiated?
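Levan's lightweight mechanism amounts to a set intersection: the client proposes OIDs on the Init request, and the server echoes back only the ones it recognizes and will honor. A sketch, with placeholder OID values (not real Z39.50 assignments):

```python
# Sketch of Init-time OID negotiation as Levan describes it.
def negotiate_oids(client_proposed, server_supported):
    """Server side: ignore any proposed OID it does not recognize and
    return only the ones it will honor, in the client's order."""
    supported = set(server_supported)
    return [oid for oid in client_proposed if oid in supported]

# OID values below are placeholders, not actual Z39.50 assignments.
client = ["1.2.840.10003.99.1", "1.2.840.10003.99.2", "1.2.840.10003.99.3"]
server = ["1.2.840.10003.99.2", "1.2.840.10003.99.3", "1.2.840.10003.99.9"]
agreed = negotiate_oids(client, server)
```

Because an unknown OID is simply dropped, an old server and a new client interoperate without either side failing, which is what makes the scheme lightweight.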
Denenberg: a reasonable temporary approach may be to put some text in the CIMI profile that specifically says what behavior you're agreeing to if you negotiate this OID. Dekkers: a temporary solution could be useful as an experiment. We don't yet assign OIDs to profiles; if we did that, we could experiment.
Bull: if we assign OIDs to profiles, the OIDs should specify the level of conformance since the goal is to negotiate behavior. CIMI has four levels of profile conformance. Zeeman is concerned about assigning OIDs to profiles. Levan: many profiles define default behaviors. Levan and Wibberley think OIDs for profiles is a good idea. Roby likes the idea of OIDs for each level of conformance.
Denenberg agreed to assign an OID class for profiles and the negotiation record. It is unclear whether the OID will specify the level of conformance. Wibberley: this essentially reintroduces application context and presentation back into the protocol.
Is the use of phrase closer to what web search engines do, or is the concept of freetext closer? Phrase implies that the ordering of terms is significant; freetext does not. Waldstein's position is to give total control to the server: return records in the order you think best.
What if you specify a Use attribute? Ignore it?
C Lynch: being silent about what a Use attribute means here is not helpful; it makes it too vague to deal with. There's a whole series of conventions (an unspecified search language) for web search engines. They treat quoted terms as phrases, and they support '+' and '-' as must-appear and must-not-appear operators. Is there an intent that these functions be supported if those conventions appear in the text that is passed, i.e., to emulate web search engines? If that is the approach, doesn't it make more sense to treat this something like CCL and define it as a query type?
Waldstein hadn't realized that these conventions were so well established. No, he is not trying to emulate them at that level.
Wibberley: if the goal is to emulate web search engines, .... AltaVista is close to a full-blown search engine. To make this more valuable, we should specify a Use attribute like Any. Stovel: we have a Use attribute for serverChoice; if we say that in combination with Relation and Structure attributes, this would make it clearer. M Taylor does not want the Use attribute restricted to a single option.
Hinnebusch is still unclear on the purpose of the proposal. Waldstein wants to be able to mount external resources and not have to have his client construct Booleans for a freetext search. He does not really care what AltaVista, Yahoo, etc. do, but he wants low-level functionality and servers that understand this combination of attributes.
D Lynch: to clarify the intent, this is not like today's web search engines, but a good old WAIS search. It is not a new query type. The WAIS profile already says how to do this. This may be the case, but Waldstein couldn't find his copy of the WAIS profile. Hinnebusch thinks the profile solves the problem but may require GRS-1. Waldstein does not want GRS-1 as a requirement. Waldstein will look at the WAIS profile. (The name "WAIS" is not in the public domain.) The ZIG cannot endorse this proposal until they review the WAIS profile.
The WAIS profile may have a list of required Use attributes. WAIS does imply relevance ranking.
The Schemas Working Group is defining schemas and working on a machine-readable way to recognize them.
The third group (to be created shortly) is the RDF Services Working Group, which will describe how you do search and retrieval of RDF records on the web. Levan may chair this working group. Z39.50 will not be the solution that this community adopts. It will probably be HTTP. A scope statement will be forthcoming, but we can assume that the mission is retrieval of RDF objects.
You must buy membership in W3C to participate in the development of the proposals and standards. There are about 20 players. Most active are Netscape, Microsoft and IBM. Significant people are involved (not just marketing people). The meetings are typically held via teleconference. Microsoft and Netscape each hosted face-to-face meetings. There is an active mailing list, deadlines on drafts, phone calls, etc. to keep work progressing.
Hinnebusch: where is this going? Dekkers: will this kill Z39.50? This will not be popular in the library community. It won't kill Z39.50, but it will make Z39.50 an application-area specific protocol. The instant this becomes available, the museum community will adopt it and Z39.50 will become secondary.
Bull: are the documents available now? Yes: www.w3c.org. Ask your W3C representatives for participation in RDF Services Working Group. The mission is to develop a searching mechanism for RDF objects; this is about assigning metadata to Internet objects in support of rating services (e.g., identify pornographic objects, etc.).
C Lynch: CNI is following this and finally joined W3C. The RDF work (initially assigning metadata to web objects) has been ambiguous about the scope of the search business. It's not obvious how far they intend to go. Does this necessarily have in scope the searching of databases produced by web indexes? Levan doesn't know the answer to that; it feels to him as if the model is going to be that all web servers will have a web query engine built into them and that users should be able to search rather than browse anything that has an RDF description.
Finnegan understands that this was meant to be a lightweight search of Dublin Core metadata. Levan: RDF has gotten more heavyweight. The model asks: if we had started with GRS instead of MARC records, what would we have done differently? RDF records, with metadata and namespace mechanisms in them, are GRS records. Lightweight searching will be inadequate.