From rsg1.er.usgs.gov!usgsnews@usgs.gov!news.er.usgs.gov!stc06.ctd.ornl.gov!fnnews.fnal.gov!muir.math.niu.edu!mp.cs.niu.edu!vixen.cso.uiuc.edu!newsfeed.internetmci.com!EU.net!Portugal.EU.net!news.rccn.net!news99.sunet.se!newsfeed.tip.net!news.seinf.abb. Fri Dec 1 16:44:06 1995
Article: 11865 of comp.text.sgml
Path: rsg1.er.usgs.gov!usgsnews@usgs.gov!news.er.usgs.gov!stc06.ctd.ornl.gov!fnnews.fnal.gov!muir.math.niu.edu!mp.cs.niu.edu!vixen.cso.uiuc.edu!newsfeed.internetmci.com!EU.net!Portugal.EU.net!news.rccn.net!news99.sunet.se!newsfeed.tip.net!news.seinf.abb.se!nooft.abb.no!Norway.EU.net!nntp-oslo.UNINETT.no!nntp-trd.UNINETT.no!nac.no!ifi.uio.no!usenet
From: Erik Naggum
Newsgroups: comp.text.sgml
Subject: Re: Content Model Blues
Date: 01 Dec 1995 18:01:42 +0000
Organization: Naggum Software; +47 2295 0313
Lines: 106
Message-ID: <19951201T180141Z@archana.naggum.no>
References: <4981lf$qcj@agate.berkeley.edu>
NNTP-Posting-Host: naggum.no
X-Newsreader: Gnus v5.1

[Joe Wells]

| What is a bit confusing to me is why a declared content of RCDATA was
| not used instead of a content model with an exclusion exception. Is
| there some disadvantage to a declared content of RCDATA? (Yes, I know
| about the problems with CDATA, I am asking about *R*CDATA.)

the problem with RCDATA is that what looks like start-tags inside it are not errors, but are instead regarded as valid data, while any end-tag will terminate it and be an error (unless it is the right one, of course). CDATA and RCDATA introduce a different parsing context based on the semantics of the elements involved. we already know that parsing HTML is too hard for browser writers, so it doesn't take much to predict that this will cause numerous errors.

what is needed is a mechanism to forbid any sub-elements in some element, not syntactically (CDATA and RCDATA), but semantically, but we don't have that in SGML.
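
the RCDATA behaviour described above can be illustrated with a toy scanner. this is only a minimal sketch in Python, not a conforming SGML parser: the function name `scan_rcdata` and the regular expression are assumptions made for illustration, and entity references (which RCDATA does recognize) are ignored.

```python
import re

# Assumption for illustration: an end-tag is "</", a name, optional
# whitespace, and ">". Real SGML delimiters are configurable.
END_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9.-]*)\s*>")


def scan_rcdata(text, element):
    """Scan the content of an RCDATA element named `element`.

    Returns (data, error). Anything that looks like a start-tag is kept
    as plain data; the first end-tag of any kind ends the element, and
    it is reported as an error unless it names `element`.
    """
    match = END_TAG.search(text)
    if match is None:
        return text, "no end-tag found"
    data = text[:match.start()]
    closed = match.group(1).upper()
    if closed != element.upper():
        return data, f"end-tag for {closed} terminates {element.upper()}"
    return data, None


if __name__ == "__main__":
    # The start-tag look-alike survives as data.
    print(scan_rcdata("a <B> b</TITLE>", "TITLE"))
    # The stray end-tag cuts the element short and is an error.
    print(scan_rcdata("see <B>bold</B> here</TITLE>", "TITLE"))
```

in the second call the scanner never looks at `<B>` as markup, yet `</B>` still ends the element, which is the asymmetry complained about above.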
however -- note that this is necessary only when inclusions are involved, as PCDATA adequately fills this purpose in their absence.

CDATA and RCDATA should be avoided. since inclusions should also be avoided at all cost, this is not _actually_ a problem. please _forget_ that CDATA and RCDATA are in the language. please _forget_ that you can specify inclusion exceptions. think _very_ carefully before you allow exclusion exceptions. exceptions are just that, not the rule.

we have had a discussion about context-sensitivity in SGML a while back (I still have an unfinished article on the topic in the queue), and some like to pretend that SGML presents a context-free grammar. sure, at some irrelevant abstraction, it does, but it isn't the grammar of _SGML_ that exhibits this property. in particular, one of the great HTML proponents, Dan Connolly, goes so far as to ridicule those who argue against the foolish and misguided belief that SGML is context-free.

if SGML were context-free, the following questions would be easy to answer:

Q: given an element FOO consisting of PCDATA, is this valid?

    my cat ate my TV/video remote control

A: only if there is no enclosing NET-enabling start-tag.

Q: given an element HEADER allowing data, is this valid?

    Message-ID:

A: only if HEADER is declared RCDATA or CDATA.

Q: given a leaf element TEXT, with only PCDATA contents, is this valid?

    my vet worried about Xyzzy's eating habits

A: only if A appears in an enclosing element's inclusion exceptions.

Q: can you move an element (start-tag, contents, end-tag) from one place in a document to another and _know_ that it will still be a valid document?

A:

for most users, and for programmers used to more well-defined languages, these are not the obvious answers. although C++ programmers are getting used to good variable names being usurped by other packages' new class definitions or even a new keyword in the latest release of the compiler, these are not situations we like to consider "healthy design".

locality of errors is a very important part of the sense of control that we like to bring to computers. SGML offers, for all its good design and considerable value, ways to create systems that break down if some butterfly flaps its wings somewhere far away. sort of like the WordPerfect 6.1 "enhancement" that automatically converts (c) to © (copyright sign), even in lists and references.

if those who argue that SGML is really context-free would at least stay away from the constructs that break the context-free parts of SGML, we could, perchance, arrive at some consensus that SGML can be parsed without carrying a truckload of context around with us. if HTML is going to be the Great Benefit to SGML that some appear to think it already is, please do us all a favor and _avoid_ the parts of SGML that are broken. thank you.

SGML was never intended to enforce semantic restrictions or constraints on the data, but rather to provide as much comfort as possible as early as possible, so that whatever was deferred to later would be much easier. in practice this means: if some element is (too) hard to specify in a content model, leave it more open and handle the complexity elsewhere.
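
as a rough illustration of "leave it more open and handle the complexity elsewhere": suppose the content model merely allows TITLE, AUTHOR and DATE in any order and number, and the application checks the counts itself. everything in this Python sketch is hypothetical (the element names, the `LIMITS` table and the function `check_children` are invented for the example); it shows one possible division of labour, not a recommended design.

```python
from collections import Counter

# Hypothetical application-level rules: name -> (minimum, maximum),
# where a maximum of None means "no upper limit".
LIMITS = {
    "TITLE": (1, 1),
    "AUTHOR": (1, None),
    "DATE": (0, 1),
}


def check_children(children):
    """Check child element names that a loose content model let through.

    `children` is the list of child element names in document order.
    Order is deliberately ignored; only the counts are checked.
    Returns a list of problem descriptions, empty if all is well.
    """
    counts = Counter(children)
    problems = []
    for name, (lowest, highest) in LIMITS.items():
        found = counts[name]
        if found < lowest:
            problems.append(f"{name}: expected at least {lowest}, found {found}")
        elif highest is not None and found > highest:
            problems.append(f"{name}: expected at most {highest}, found {found}")
    for name in counts:
        if name not in LIMITS:
            problems.append(f"{name}: not expected here")
    return problems


if __name__ == "__main__":
    print(check_children(["AUTHOR", "TITLE", "DATE", "AUTHOR"]))
    print(check_children(["TITLE", "TITLE"]))
```

the point of the split is that the parser only has to accept or reject a simple repetition, while the counting lives in ordinary application code where it is easy to report and recover from.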
the various SGML parsers' error recovery is bad enough already if it is not to be relied upon to count elements in random order.

please consider this: there is _no_ value in restricting a content model unless the users will perform validation _prior_ to letting the data loose on an unsuspecting application. in the case of HTML, we cannot expect that users will perform validation in any meaningful sense of the word, and we have therefore lost the single most valuable aspect of SGML: the ability of an application to trust its data (at least part of the way) and therefore to reduce the error recovery code.

therefore, it is only of academic interest whether HTML will or will not "allow" some order of elements. it might even be counter-productive in a strong sense: without SGML's content models and error handling, applications _had_ to be written defensively, but with more trustworthy data (or at least apparently trustworthy data), this may cause nothing but a false sense of security, and another round of "what's the point with SGML?"

SGML provides (or enables) some great benefits to information technology, but it also has some serious gotchas that we should not be proud of and should resist all temptations to use. since they are easily locatable, it is not hard to avoid them. unfortunately, SGML is defined with some of them "built in" to the core language, and no way to request their absence, so something else must be used to disable the bad features. common sense and mature restraint should do the trick.

#
--
suppose we actually were immortal. what is the opposite of living your life as if every day were your last?