Ticket #27 (new enhancement)

Opened 1 year ago

Last modified 9 months ago

Use namespace tags to include other conventions in netcdf files without repeating them in CF

Reported by: benno
Assigned to: cf-conventions@lists.llnl.gov
Priority: medium
Milestone:
Component: cf-conventions
Version:
Keywords:
Cc:

Description

Strictly speaking, this is not an enhancement to CF -- it is a way of encoding additional information in a netcdf file without extending CF. It certainly would affect discussions for extending CF, and hopefully will elicit some discussion in the CF community. In any case, creating a ticket was suggested.

My suggestions are based on an evolution of how metadata is encoded in netcdf files: rather than simply allowing multiple conventions in netcdf files by a comma-separated list in the Conventions attribute, specify prefixes so that attributes can be explicitly labelled as belonging to a different convention. In particular, were I to want to use attributes from the convention used by WCS to describe projections, I would use a conventions tag in the file like,

Conventions = "CF-1.0, wcs=http://www.opengis.net/wcs/1.1/"

This means the default convention is CF-1.0 (i.e. the usual usage), but any attribute starting 'wcs:' belongs to the convention defined by the namespace http://www.opengis.net/wcs/1.1/ (which is how wcs or any other XML convention standardizes its labels). This would change the WGS 84 example in ticket 18 slightly, for example, to

 dimensions:
   lat = 18 ; // dummy values
   lon = 36 ;
 variables:
   double lat(lat) ;     // conventional definition
   double lon(lon) ;     // conventional definition
   float temp(lat, lon) ;
     temp:long_name = "temperature" ;
     temp:units = "K" ;
     temp:grid_mapping = "crs" ;
     temp:wcs\:gridCRS = "crs" ;
   int crs ;
     crs:grid_mapping_name = "lat_long_wgs1984" ;
     crs:wcs\:GridBaseCRS = "urn:ogc:def:crs:EPSG:6.0:4326" ;   // Use EPSG ID 4979 for 3D CRS. ID 4326 refers to 2D CRS.
     crs:crs_name = "WGS 84" ;
     crs:geodetic_datum_name = "World Geodetic System 1984" ;
     crs:longitude_of_prime_meridian = 0.0 ;
     crs:ellipsoid_name = "WGS 84" ;
     crs:semi_major_axis = 6378137.0 ;
     crs:inverse_flattening = 298.257223563 ;
 Conventions = "CF-1.0, wcs=http://www.opengis.net/wcs/1.1/"

where gridCRS and GridBaseCRS are the tags used in WCS to characterize projections. In particular, it declares 'crs' to be both a CF convention grid_mapping and a WCS convention gridCRS, and it adds a WCS attribute to gridCRS called GridBaseCRS, which is part of the WCS convention.
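A minimal sketch of how software might read such a Conventions value is given here. It is only an illustration under stated assumptions: the function names parse_conventions and split_attribute are hypothetical, tokens are assumed to be comma-separated, and a token of the form prefix=URI is assumed to declare a prefix.

```python
def parse_conventions(value):
    """Split a Conventions string into (default conventions, prefix -> namespace URI).

    Assumption for this sketch: tokens are comma-separated and a token of the
    form prefix=URI declares a namespace prefix.
    """
    defaults = []
    prefixes = {}
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        prefix, sep, uri = token.partition("=")
        # Only treat the token as a declaration if the part before "="
        # looks like a simple prefix (this keeps URLs with "=" intact).
        if sep and prefix.isidentifier():
            prefixes[prefix] = uri
        else:
            defaults.append(token)
    return defaults, prefixes


def split_attribute(name, prefixes):
    """Return (namespace URI or None, local name) for an attribute name."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix], local
    # Unprefixed (or unknown prefix): belongs to the default convention.
    return None, name


defaults, prefixes = parse_conventions("CF-1.0, wcs=http://www.opengis.net/wcs/1.1/")
print(defaults, prefixes)
print(split_attribute("wcs:GridBaseCRS", prefixes))
print(split_attribute("grid_mapping", prefixes))
```

Software that knows nothing about prefixes would still see the whole string as a list of conventions, which is the backward-compatibility property the proposal relies on.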

This keeps the redundancy out of CF, but allows alternate specifications to be put in a netcdf file. It also makes use of standards that other people maintain, i.e. OGC maintains the wcs namespace and list of concepts.

There are a number of reasons for doing it this way, listing of which would probably hide the essence of the proposal and should be put elsewhere. The short version is redundant representations belong to different conventions, and this is a framework for writing down the mappings between the different conventions, and for tagging datasets with them. This would allow creation of a system that could deliver the metadata in alternate representations, so that WCS applications could use WCS-style information, and CF applications could use CF information, regardless of the actual source of the data being analyzed.

Note that recent and near-term changes to netcdf make this possible -- we could not have made this choice two years ago. But netcdf version 3.6 libraries support ':' characters in attributes, and version 3.7 will support it in ncgen/ncdump, i.e. the '\:' escapes shown above will work in CDL.
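As a rough illustration of the '\:' escaping, the following sketch converts between an attribute name as stored in the file and as written in CDL. The helper names are hypothetical, and the sketch handles only the colon; it makes no claim about other characters that ncgen/ncdump may escape.

```python
def cdl_escape(name):
    """Escape ':' in an attribute name for use in CDL text (colon only)."""
    return name.replace(":", "\\:")


def cdl_unescape(text):
    """Reverse cdl_escape: turn '\\:' back into ':'."""
    return text.replace("\\:", ":")


def cdl_attribute_line(variable, attribute, value):
    """Format one string-valued attribute as a CDL line (hypothetical helper)."""
    return '%s:%s = "%s" ;' % (variable, cdl_escape(attribute), value)


print(cdl_attribute_line("crs", "wcs:GridBaseCRS", "urn:ogc:def:crs:EPSG:6.0:4326"))
```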

Benno Blumenthal

Change History

03/29/08 17:29:52 changed by caron

I like the idea of namespaces, thanks for bringing this out. Defining them in the Conventions attribute is ok, although a separate global attribute for them might also be a good idea, to keep older programs that don't look through comma-separated tokens from barfing.

I don't like using the ":" char which needs to be escaped in CDL. Why not use another char that does not need escaping? For example, I think "@" is allowed in all versions of netcdf and hardly ever used:

crs:wcs@GridBaseCRS = "urn:ogc:def:crs:EPSG:6.0:4326"

instead of

crs:wcs\:GridBaseCRS = "urn:ogc:def:crs:EPSG:6.0:4326"

03/31/08 00:55:54 changed by bnl

I like this. A lot.

The escaping issue isn't obvious (to me). By pushing the escaping away from CDL are we pushing it somewhere else, and introducing a non-standard parsing scheme? (I appreciate Benno's comments against this in ticket:24 and will respond accordingly.) I guess it depends on where (in the software stack) one decodes wcs@GridBaseCRS ... but if it's not in the CF part of the stack then doing this could cause problems.

I would argue that we leave anything alone that follows the colon (which then belongs in another namespace), beyond netcdf/CDL escaping, which could (and should) be automatically dealt with in the CF stack ...

04/02/08 09:47:33 changed by jonathan

Dear Benno

Thanks for this idea. As my regular readers might expect I am more cautious about it than Bryan and John :-). I do think it is good to avoid redundancy, and to make use of other conventions where we can. However, I suspect that the situations where we could do that have quite exacting requirements. The convention we were depending on would have to be framed so that it could be used directly to supply a list of possible values for a netCDF attribute, and its intentions would have to be orthogonal to everything else in the CF convention so there weren't problems with overlaps and conflicts. Also, if it is supplying metadata that we regard as important and useful (and if it wasn't we wouldn't consider it anyway), I suppose we ought to have the same expectations of it in terms of intelligibility, clear definitions and self-descriptiveness that we do for CF metadata in general. Is that too demanding?

Cheers

Jonathan

04/08/08 07:59:29 changed by benno

Hello All,

My apologies for not responding earlier: for whatever reason, updates to this ticket are not coming through my subscription to the mailing list, so I was unaware. I suspect my mail server, but then again, I always suspect my mail server.

I share John's concern about the escaping, but I think keeping the colon (because that is what is used elsewhere) helps with clarity more than it hurts. And it is only a problem in older CDL -- the netcdf library already handles it, OPeNDAP has no problem with it. And I thought comma-separated (line-separated in Opendap) conventions in the Conventions tag were used elsewhere, but John would know best.

In response to Jonathan's comments, I would like to point out that namespaces address the metadata problem of delivering information to multiple audiences in an alternative way to the Microsoft approach of attempted domination. The example I cited above was WCS, redundant to part of CF. But alternatively, look at what Ted Habermann is doing to put Dublin Core and FGDC metadata into netcdf attributes http://galeon-wcs.jot.com/WikiHome/GALEON%20Phase2%20Main%20Page/Unidata%20OGC%20Interoperability%20Day%20Presentations/Metadata_Standards_and_netCDF_ppt___119021166608511_916828076687903?jot.downloadName=Metadata+Standards+and+netCDF.ppt. He has a convention that adds attributes like FGDC_Link, fgdc_publisher, dc_publisher, and an additional Metadata_Conventions tag "FGDC, DublinCore". But, of course, no software understands this but his, despite FGDC and Dublin Core being widely used metadata standards. And having both a Conventions tag and a Metadata_Conventions tag is confusing, since it is all metadata. This namespace convention would put a standard machine-readable framework around such a construction, so that software could understand the conventions that the metadata belongs to, and would even make the attributes a bit more human readable as well (since they would be perfectly consistent in the way they are connected to conventions).

Note that it is already possible to put attributes belonging to multiple conventions in a single netcdf file -- this proposal simply makes it possible to say which attribute belongs to which convention.

Benno

04/08/08 12:08:17 changed by benno

I would like to amend my last statement -- it did not come out the way I intended.

First of all, having a dominant standard is a great thing, the "Holy Grail" of metadata exchange, and whatever time spent that leads to such a thing is time well spent. I think it is really hard to achieve, and maintain, but the effort is essential. And data that is fully described by a widely-adopted standard will be the most universally and easily accessible.

But not all data are fully-described by standard metadata, possibly no data are (emphasis on the word "fully").

Netcdf already allows multiple conventions in a particular file (Russ supplied the reference which is at http://www.unidata.ucar.edu/software/netcdf/conventions.html). And netcdf already allows specifying a convention with a uri in the conventions tag (possibly predating namespaces in XML). I am just trying to make it clear which attribute goes with which convention in such a way that it is consistent with what the greater IT community is doing, in particular XML and RDF, formats that are used for a lot of metadata exchange.

Benno

(follow-up: ↓ 7 ) 04/09/08 04:09:43 changed by bnl

I guess I'm making the same point in other threads. But let me be direct.

  • Not everything we find important or useful can be encoded in the CF convention.
  • CF cannot and will not be the governing body for all the metadata folk want to put into their NetCDF files.
  • CF had better make it possible for CF to "play nicely" with metadata governed by other communities.

However, in supporting that, we do run the risk of data producers contradicting themselves in their attributes between conventions. C'est la vie. Is that not better than data providers not putting as much documentation as they can with their data?

So, the only question in my mind is how we do support something like this proposal, not whether we should do it? So I find myself disagreeing with Jonathan again. (I hasten to add that just because I'm disagreeing with Jonathan a lot right now, I have no less respect for his opinions, or the work that he's done that has got us here. Precisely the contrary, because I respect him so much I want to make sure that our points of disagreement are resolved).

(And given I support doing something like this, and can't improve it technically, I obviously support this one!)

(in reply to: ↑ 6 ) 04/09/08 08:31:50 changed by jonathan

Dear Bryan, Benno et al

  • Not everything we find important or useful can be encoded in the CF convention.
  • CF cannot and will not be the governing body for all the metadata folk want to put into their NetCDF files.
  • CF had better make it possible for CF to "play nicely" with metadata governed by other communities.

I agree with these statements and I am not opposed to the principle of making use of other conventions, which is the point of this ticket. I was expressing caution about it, especially this:

we do run the risk of data producers contradicting themselves in their attributes between conventions.

I wouldn't find that a satisfactory situation. I think that if that arises we would have to specify, as part of CF, how the conflict or overlap should be resolved. That's why I suggested that another convention should only be adopted if it was "orthogonal" to CF i.e. dealing with something that CF didn't. If that is too exacting, we ought instead to say how they should work together.

Hence, I think we should consider individual cases as specific proposals, and say explicitly in CF which other conventions can be used with CF, and how they can be used, as a result of considering them individually (in trac tickets, just like any extension to CF). It's a bit less safe because subsequently the "other" convention might develop in a way which threw up problems when used with CF, so we would have to keep an eye out for that. Would you agree with such an approach?

Cheers

Jonathan

04/09/08 12:58:28 changed by bnl

I don't think conflicted duplication is satisfactory, but it's better to be in the position of trying to resolve conflicting information than having nothing to work with ...

... and if we tried to make all other conventions orthogonal to CF we would be on a hiding to nothing, and if we tried to identify only those parts of other conventions that we like and only allow those, then we would also be on a hiding to nothing (in both cases, in workload terms).

I guess the point we really should be careful about, and maybe which is exercising you, is where within a piece of the CF convention itself, we mandate using an external attribute. That's a rather different case and each needs to be explored in its own right. Yes, that's where this proposal originated, but this proposal itself isn't addressing that use case. I think this proposal is rather more what it says in line one:

Strictly speaking, this is not an enhancement to CF -- it is a way of encoding additional information in a netcdf file without extending CF.

and I think most of the arguments about orthogonality and duplication are mostly relevant to cases where

It certainly would affect discussions for extending CF, ...

04/10/08 22:21:10 changed by graybeal

This is naive, but why isn't it a simple matter to say 'in case of conflicts, any netCDF/CF specification is the winner; between external conventions, the last statement is considered the winner' and be done with it?
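The rule suggested above could be sketched roughly as follows. This is only one reading of it, under loud assumptions: each statement is a hypothetical (convention, concept, value) triple in file order, and a mapping that tells us two attributes from different conventions describe the same concept is assumed to exist already (the ticket does not define one).

```python
def resolve(statements, primary="CF"):
    """Resolve conflicting statements about the same concept.

    statements: iterable of (convention, concept, value) in file order.
    Rule sketched here: a statement from the primary (netCDF/CF) convention
    always wins; between external conventions, the last statement wins.
    """
    resolved = {}
    for convention, concept, value in statements:
        current = resolved.get(concept)
        if current is not None and current[0] == primary and convention != primary:
            # A CF statement is already recorded; external ones cannot override it.
            continue
        resolved[concept] = (convention, value)
    return resolved


# Hypothetical statements; the values are illustrative only.
statements = [
    ("wcs", "crs", "urn:ogc:def:crs:EPSG:6.0:4326"),
    ("CF", "crs", "lat_long_wgs1984"),
    ("other", "crs", "something else"),
]
print(resolve(statements))
```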

I think this is a useful proposal. I'm not sure how realistic it is to keep CF pure and clean with respect to conflicts that could be introduced by unsavory programmers. But I admire you for trying!

04/11/08 15:55:42 changed by caron

A technical point about using the ":" in the attribute name. The only real problem with this is that older software might have problems with it, since technically the ":" is not part of the set of allowable chars (yet). It does have to be escaped in the CDL, but CDL is just a representation; there's no problem in the file format itself. So if keeping the ":" by analogy with XML namespaces is more important than possible trouble with old code, I'm OK with it.

With regard to possible conflicts, in my opinion there's nothing to be done about it. The point/effect of this proposal is to allow arbitrary other semantics. I would just say, from the CF POV, something like "all CF semantics have to be correct and consistent, and shall not be modified by non-CF attributes".

04/11/08 15:59:56 changed by caron

Oh one more point: it's conceivable that CF itself might in the future want to use this mechanism, in which case we would claim a namespace (and probably a prefix) and then I would agree with all of Jonathan's concerns and requirements for consistency, clarity, lack of conflicts, etc. for anything in that namespace.

(follow-ups: ↓ 13 ↓ 14 ↓ 15 ) 04/12/08 13:30:28 changed by caron

Ok, if you insist, maybe a few more thoughts.

1. I find this a bit cleaner:

 Conventions = "CF-1.0";
 namespaces ="wcs=http://www.opengis.net/wcs/1.1/"

than:

 Conventions = "CF-1.0, wcs=http://www.opengis.net/wcs/1.1/"

2. I find myself thinking about using a cf namespace for cf attributes, which I think adds a lot of readability:

   float temp(lat, lon) ;
     long_name = "temperature" ;
     units = "K" ;
     cf:grid_mapping = "crs";
     wcs:gridCRS = "crs" ;

(note I am using modified CDL, by not including the variable prefix and not escaping the namespace ":" ).

If this proposal is accepted, I would propose that a cf namespace be established for all existing cf attributes, optional use of course, so that

cf:grid_mapping = "crs"

and

:grid_mapping = "crs"

are equivalent.
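One way to picture this equivalence is to expand every attribute name to a full name before comparing. The sketch below assumes a placeholder CF namespace URI (no URI has been agreed in this ticket) and a hypothetical expand function.

```python
# Placeholder only: no CF namespace URI has been agreed in this ticket.
CF_URI = "http://example.org/cf/"


def expand(name, prefixes, default_uri):
    """Expand an attribute name to namespace URI + local name."""
    prefix, sep, local = name.partition(":")
    if not sep:
        return default_uri + name      # unprefixed: default namespace
    if prefix == "":
        return default_uri + local     # ":grid_mapping" style
    if prefix in prefixes:
        return prefixes[prefix] + local
    # Unknown prefix: treat the whole name as an unprefixed attribute.
    return default_uri + name


prefixes = {"cf": CF_URI, "wcs": "http://www.opengis.net/wcs/1.1/"}
print(expand("cf:grid_mapping", prefixes, CF_URI) == expand("grid_mapping", prefixes, CF_URI))
print(expand("wcs:gridCRS", prefixes, CF_URI))
```

With CF as the default namespace, the prefixed and unprefixed spellings expand to the same name, so the cf: prefix stays optional.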

3. Benno, I assume that these proposed namespaces are the same as XML namespaces? Eg, tools match on the namespace URI, and the prefix is arbitrary? Or is it intended to be something else, like URN namespaces (which I know little about) ?

(in reply to: ↑ 12 ) 04/14/08 08:02:07 changed by benno

Replying to caron:

John is again precisely on topic, and brings up some excellent points. My short response is that these are precisely the same as XML namespaces (which are the same as RDF namespaces), and yes, there needs to be a CF namespace as well. Since I think Conventions and namespaces are representing exactly the same thing in this context (a set of attributes/terms that are used to describe things in a defined context), I originally proposed continuing to use Conventions (you might want to use namespaces for something else, see below).

But let me explain.

There were a bunch of details I left out of the original proposal, because I thought the details would obscure the essence of the thing, and lose most of the audience, and kill the proposal. But John is quite right, the goal is to be exactly analogous to XML namespaces, mainly so that there can be a clean general semantic translation from netcdf to RDF/XML and back.

My starting point is two sections of the netcdf documentation talking about the Conventions tag. First is http://www.unidata.ucar.edu/software/netcdf/conventions.html, which states,

If present in a netCDF file, `Conventions' is a global attribute that is a character array for the name of the conventions followed by the file. Originally, these conventions were named by a string that was interpreted as a directory name relative to the directory /pub/netcdf/Conventions/ on the host ftp.unidata.ucar.edu.

This web page is now the preferred and authoritative location for registering a URI reference to a set of conventions maintained elsewhere. The FTP site will be preserved for compatibility with existing references, but authors of new conventions should submit a request to support-netcdf@unidata.ucar.edu for listing on this page.

It may be convenient for defining institutions and groups to use a hierarchical structure for general conventions and more specialized conventions. For example, if a group named XXX agrees upon a set of conventions for required attributes, attribute names, and netCDF representations for certain discipline-specific data structures, they may describe the agreed-upon conventions in a document associated with the name "XXX", and files that followed these conventions would contain the global attribute

    :Conventions = "XXX" ;

Later, if another group agrees upon some additional conventions for a specific subset of XXX data, for example time series data, the description of the additional conventions might be associated with the name "XXX/Time_series", and files that adhered to these additional conventions would use the global attribute

    :Conventions = "XXX/Time_series" ;

It is possible for a netCDF file to adhere to more than one set of conventions, even when there is no inheritance relationship among the conventions. In this case, the value of the `Conventions' attribute may be a single text string containing a list of the convention names, separated by blank space or commas, such as

    :Conventions = "XXX, YYY" ;

So netcdf as it currently stands uses a comma-separated list in the Conventions tag for multiple conventions.

The user guide has some additional important detail (http://www.unidata.ucar.edu/software/netcdf/guidef/guidef-13.html).

If present, 'Conventions' is a global attribute that is a character array for the name of the conventions followed by the dataset, in the form of a string that is interpreted as a directory name relative to a directory that is a repository of documents describing sets of discipline-specific conventions. This permits a hierarchical structure for conventions and provides a place where descriptions and examples of the conventions may be maintained by the defining institutions and groups. The conventions directory name is currently interpreted relative to the directory pub/netcdf/Conventions/ on the host machine ftp.unidata.ucar.edu. Alternatively, a full URL specification may be used to name a WWW site where documents that describe the conventions are maintained.

For me, this is really important. First of all, it already allows a URL to specify the convention, so specifying an XML namespace here is already supported. Secondly, it specifies how to construct a URI for a convention that is not specified as a URL, namely the convention CF-1.0 has the URI ftp://ftp.unidata.ucar.edu/pub/netcdf/Conventions/CF-1.0/. More particularly, if one had an attribute called attributename it would have a uri ftp://ftp.unidata.ucar.edu/pub/netcdf/Conventions/CF-1.0/attributename. This would let us make statements about this attribute in XML or RDF.
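The construction described above can be sketched as follows. The function names are hypothetical; the base directory is the one quoted from the user guide, and a full URL is assumed to be passed through unchanged.

```python
# Base directory quoted from the netcdf user guide.
CONVENTIONS_BASE = "ftp://ftp.unidata.ucar.edu/pub/netcdf/Conventions/"


def convention_uri(convention):
    """Build a URI for a convention name; full URLs are returned unchanged."""
    if "://" in convention:
        return convention
    return CONVENTIONS_BASE + convention.strip("/") + "/"


def attribute_uri(convention, attribute):
    """Build a URI for an attribute within a convention.

    Assumes the convention URI ends with a separator such as '/'.
    """
    return convention_uri(convention) + attribute


print(convention_uri("CF-1.0"))
print(attribute_uri("CF-1.0", "attributename"))
print(attribute_uri("http://www.opengis.net/wcs/1.1/", "GridBaseCRS"))
```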

Unfortunately, I don't think that is exactly the URI we want to use to represent CF, and we (the netcdf community) have strayed from that ftp directory to a web page which does not make it clear (to me) how to construct a URI to represent a netcdf convention. But we could fix this, right? Some kind of netcdf convention registry that is explicit/machine readable.

As for this proposal, it changes the current netcdf Conventions attribute value by allowing prefixes so that the attributes can be explicitly attached to their conventions. The convention without a prefix tag would contain all the unprefixed attributes.

So John is quite right, the prefixes are determined by the Conventions tag. One of the beauties of this scheme (and why XML adopted it), is that software that does not know about namespaces just sees a list of conventions, and a whole bunch of attributes, which is pretty much the situation that the software was in originally, particularly if CF is used as the default (unprefixed) convention.

There is a subtlety in this: these namespaces only apply to attribute names, not to attribute values. This is precisely the same as a RDF/XML file: the namespace prefixes are only used on the attributes, not on the URIs that identify objects (though you can define XML entities for that). The one exception to that statement is that the base uri is used to abbreviate URIs for objects local to the file. Which is not to say that you cannot define a namespace that would apply to attribute values, but that is not what the Conventions tag is about, and it is not what this proposal is about.

I can't overemphasize how important I think this is. But I am afraid if I start talking about all the wonderful things RDF would let us write down, and how much it would help CF/netcdf, I will get a large off-topic conversation attached to this proposal, which would not help. This proposal is a first step towards clean metadata exchange between netcdf and XML/RDF. Establishing URI's is the second.

Thanks again, John,

Benno

P.S. Is your 'modified CDL' currently implemented? It certainly is clear.

(in reply to: ↑ 12 ) 05/07/08 07:58:40 changed by pbentley

Replying to caron:

1. I find this a bit cleaner:

 Conventions = "CF-1.0";
 namespaces ="wcs=http://www.opengis.net/wcs/1.1/"

than:

 Conventions = "CF-1.0, wcs=http://www.opengis.net/wcs/1.1/"


The possibility of using XML-type namespaces in netCDF was also on my mind when I was formulating tickets 9 and 18. Like you, John, I'm not too keen on the idea of overloading the Conventions attribute to encode namespace declarations. Instead, I'd envisaged using a separate netCDF variable specifically for this purpose, e.g.

char namespace_declarations ;
   :cf = "http://www.cfconventions.org/" ;
   :wcs = "http://www.opengis.net/wcs/1.1/" ;
   :dc = "http://www.dublincore.org/" ;
   ...

Which is not too dissimilar from the way we store CRS metadata in a separate grid mapping variable. However, I also like your suggestion of encoding namespace declarations in a single global attribute, though I think this may be hard for human-readers to decode if there are more than a couple of namespaces. Obviously this is not an issue if it is only ever parsed by software.

Even though I have some reservations about the way in which we're proposing to crowbar XML-style techniques into netCDF's simple name-value metadata mechanism, I realise that this is what we have to work with for the time being. Hence this proposal would get my vote in principle.

Regards,

Phil

(in reply to: ↑ 12 ) 05/29/08 17:22:43 changed by benno

Replying to caron:

1. I find this a bit cleaner:

  Conventions = "CF-1.0";
  namespaces ="wcs=http://www.opengis.net/wcs/1.1/"

than:

  Conventions = "CF-1.0, wcs=http://www.opengis.net/wcs/1.1/"

I've been pondering "cleaner" for some time, as well as Phil's idea, which reverses the usual role of attributes and variables. I would put both these ideas in the class of "don't mess with Conventions", and I have been thinking about implementation reasons why one should leave Conventions alone. Before I give those reasons, I'd like to add another possibility for "don't mess with Conventions" schemes: simply borrow xml namespaces literally, i.e. an xmlns attribute sets the default namespace for the container it is in, and xmlns:xyz attribute sets the namespace for prefix xyz:. So to continue this example, we would say

   xmlns:wcs = "http://www.opengis.net/wcs/1.1/";

Now suppose for the moment that we have all decided on a URI for CF-1.0. It could be a constructed URI, i.e. there is a simple rule for constructing it that we all agree to, i.e. ftp://ftp.unidata.ucar.edu/pub/netcdf/Conventions/CF-1.0/. On the other hand, it could be a registry URI, meaning that we have agreed on a registry for Convention URIs, e.g. urn:cfns:cf-1.0: or whatever.

This proposal was that

 Conventions = "CF-1.0" ;

would set the default namespace, i.e. the CF1.0 URI. Equivalently, we could say

  xmlns = "urn:cfns:cf-1.0:" ; 

However

  1. that is not what Conventions means at present (Conventions="CF-1.0" current meaning is that some of the attributes might belong to the CF-1.0 convention)
  2. If we do use a registry for setting Convention-equivalent URIs, then any code that does the translation to prefixed attributes has to consult the registry.

So as far as the default namespace is concerned, we have for a CF-1.0 statement

  1. current Conventions, which has some of the attributes belonging to CF-1.0
  2. proposed Conventions, which has all the unprefixed attributes belonging to CF-1.0, and,
  3. xmlns, which explicitly states the URI for CF-1.0 that we have all agreed to.

From an implementation point of view, unless we agree on a construction method for Conventions to URIs, xmlns is very different from the proposed Conventions in that no external lookup is required to convert to explicit URIs.

The proposal as written, i.e. using Conventions to set the default URI, prevents us from writing clever programs that use the CF standard to figure out which attributes in the file actually belong to CF-1.0 when the Conventions tag says the attributes are CF-1.0 (because the new standard says to put them all in CF-1.0, since it is the default namespace). It is not clear whether this is a major issue or not, but it might be.
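To illustrate why no external lookup is needed with the xmlns variant, here is a minimal sketch that reads the declarations straight from a file's global attributes. The global attributes are represented as a plain dict, collect_xmlns is a hypothetical name, and urn:cfns:cf-1.0: is only the "or whatever" example URI from above, not an agreed identifier.

```python
def collect_xmlns(global_attrs):
    """Return (default namespace URI or None, prefix -> URI) from global attributes."""
    default_uri = global_attrs.get("xmlns")
    prefixes = {}
    for name, value in global_attrs.items():
        if name.startswith("xmlns:"):
            # Everything after "xmlns:" is the declared prefix.
            prefixes[name[len("xmlns:"):]] = value
    return default_uri, prefixes


# Hypothetical global attributes of a file using the xmlns variant.
global_attrs = {
    "xmlns": "urn:cfns:cf-1.0:",
    "xmlns:wcs": "http://www.opengis.net/wcs/1.1/",
    "title": "example",
}
print(collect_xmlns(global_attrs))
```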

John's namespaces tag could also handle the default namespace, but he might consider the proposal sullied at this point (or at least less than clean) ...

Another reason one might want to not use Conventions is that one might want to tag variable names (Conventions being explicitly about attributes).

One more can of worms arises if one would like to set namespaces for the values of attributes, i.e. to conveniently point to a concept. This requires knowing which attribute values are URIs and not plain strings. That is possible in opendap because it has a URL type, but not if one starts with a netcdf file: one needs to be careful to translate from string to URL when appropriate, and that is not possible in netcdf unless you

  1. want to say if it looks like a URI it is a URI, or
  2. are willing to look at the description (i.e. ontology) of the standard and thus deduce that it is a URI. Which would be external information, thus relatively messy.
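The first option ("if it looks like a URI it is a URI") might be sketched as a simple heuristic like the one below. The set of accepted schemes is an assumption made for this sketch, not part of any convention, and the heuristic can certainly misclassify values.

```python
from urllib.parse import urlparse

# Assumption for this sketch: only these schemes count as "looks like a URI".
KNOWN_SCHEMES = {"http", "https", "ftp", "urn"}


def looks_like_uri(value):
    """Heuristic: True if a string attribute value looks like a URI."""
    if not isinstance(value, str):
        return False
    text = value.strip()
    if not text or " " in text:
        return False
    try:
        scheme = urlparse(text).scheme
    except ValueError:
        return False
    return scheme.lower() in KNOWN_SCHEMES


for value in ["urn:ogc:def:crs:EPSG:6.0:4326", "http://www.opengis.net/wcs/1.1/", "WGS 84", "K"]:
    print(value, looks_like_uri(value))
```

The second option would replace the fixed scheme list with information taken from the description of each convention, which is exactly the external lookup noted as messy.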

Benno