From rsg1.er.usgs.gov!usgsnews@usgs.gov!news.er.usgs.gov!stc06.ctd.ornl.gov!fnnews.fnal.gov!muir.math.niu.edu!mp.cs.niu.edu!vixen.cso.uiuc.edu!newsfeed.internetmci.com!EU.net!Portugal.EU.net!news.rccn.net!news99.sunet.se!newsfeed.tip.net!news.seinf.abb. Fri Dec  1 16:44:06 1995
Article: 11865 of comp.text.sgml
Path: rsg1.er.usgs.gov!usgsnews@usgs.gov!news.er.usgs.gov!stc06.ctd.ornl.gov!fnnews.fnal.gov!muir.math.niu.edu!mp.cs.niu.edu!vixen.cso.uiuc.edu!newsfeed.internetmci.com!EU.net!Portugal.EU.net!news.rccn.net!news99.sunet.se!newsfeed.tip.net!news.seinf.abb.se!nooft.abb.no!Norway.EU.net!nntp-oslo.UNINETT.no!nntp-trd.UNINETT.no!nac.no!ifi.uio.no!usenet
From: Erik Naggum <erik@naggum.no>
Newsgroups: comp.text.sgml
Subject: Re: Content Model Blues
Date: 01 Dec 1995 18:01:42 +0000
Organization: Naggum Software; +47 2295 0313
Lines: 106
Message-ID: <19951201T180141Z@archana.naggum.no>
References: <4981lf$qcj@agate.berkeley.edu> <JBW.95Nov26144724@bigbird.bu.edu> <JBW.95Nov26213240@bigbird.bu.edu>
NNTP-Posting-Host: naggum.no
X-Newsreader: Gnus v5.1

[Joe Wells]

|   What is a bit confusing to me is why a declared content of RCDATA was
|   not used instead of a content model with an exclusion exception.  Is
|   there some disadvantage to a declared content of RCDATA?  (Yes, I know
|   about the problems with CDATA, I am asking about *R*CDATA.)

the problem with RCDATA is that what looks like start-tags inside it are
not errors, but are instead regarded as valid data, while any end-tag will
terminate it and be an error (unless it is the right one, of course).
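
as a rough illustration of that rule, here is a minimal Python sketch of
how a scanner might treat the content of an element declared RCDATA.  the
function name scan_rcdata and the END_TAG pattern are hypothetical, made
up for this example; the sketch ignores entity references, empty end-tags
and NET, and is not how any particular SGML parser is implemented.

```python
import re

# "</" counts as an end-tag open delimiter only when a name start
# character follows it (simplified name syntax, assumed for this sketch)
END_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9.-]*)\s*>")

def scan_rcdata(element, text):
    """Scan the text following <element>, assuming it is declared RCDATA.

    Returns (data, ok).  Anything that looks like a start-tag is kept
    as data.  The first end-tag ends the element, and ok is False
    unless that end-tag names `element` itself.
    """
    match = END_TAG.search(text)
    if match is None:
        return text, False              # the element is never closed
    data = text[:match.start()]
    ok = match.group(1).upper() == element.upper()
    return data, ok

# the apparent start-tag is plain data, and </HEADER> closes the element
print(scan_rcdata("HEADER", "Message-ID: <JBW@bigbird.bu.edu></HEADER>"))
# the first end-tag seen is </B>, which ends HEADER and is an error
print(scan_rcdata("HEADER", "a <B>bold</B> move</HEADER>"))
```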

CDATA and RCDATA introduce a different parsing context based on the
semantics of the elements involved.  we already know that parsing HTML is
too hard for browser writers, so it doesn't take much to predict that this
will cause numerous errors.  what is needed is a mechanism to forbid any
sub-elements in some element, not syntactically (CDATA and RCDATA), but
semantically, but we don't have that in SGML.  however -- note that this
is necessary only when inclusions are involved, as PCDATA adequately fills
this purpose in their absence.

CDATA and RCDATA should be avoided.  since inclusions should also be avoided
at all cost, this is not _actually_ a problem.

please _forget_ that CDATA and RCDATA are in the language.  please _forget_
that you can specify inclusion exceptions.  think _very_ carefully before
you allow exclusion exceptions.  exceptions are just that, not the rule.

we have had a discussion about context-sensitivity in SGML a while back (I
still have an unfinished article on the topic in the queue), and some like
to pretend that SGML presents a context-free grammar.  sure, at some
irrelevant abstraction, it does, but it isn't the grammar of _SGML_ that
exhibits this property.  in particular, one of the great HTML proponents,
Dan Connolly, goes so far as to ridicule those who argue against the
foolish and misguided belief that SGML is context-free.

if SGML were context-free, the following questions would be easy to answer:

Q:  given an element FOO consisting of PCDATA, is this valid?
    <FOO>my cat ate my TV/video remote control</FOO>
A:  only if there is no enclosing NET-enabling start-tag.

Q:  given an element HEADER allowing data, is this valid?
    <HEADER>Message-ID: <JBW.95Nov26213240@bigbird.bu.edu></HEADER>
A:  only if HEADER is declared RCDATA or CDATA.

Q:  given a leaf element TEXT, with only PCDATA contents, is this valid?
    <TEXT>my vet worried about <A HREF="...">Xyzzy</A>'s eating habits</TEXT>
A:  only if A appears in an enclosing element's inclusion exceptions.

Q:  can you move an element (start-tag, contents, end-tag) from one place in
    a document to another and _know_ that it will still be a valid document?
A:  <fill in your answer here>
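
the first three questions can be restated as a small Python sketch, in
which the verdict on a piece of content depends on an explicit context
argument.  the Context class, its fields and the problems function are
all hypothetical, invented for this illustration.  they encode only the
three answers given above, under a simplified tag syntax, and do not
model the termination rules inside RCDATA or CDATA.

```python
import re
from dataclasses import dataclass

# anything that looks like a start-tag or an end-tag (simplified syntax)
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")

@dataclass(frozen=True)
class Context:
    declared_content: str = "PCDATA"     # "PCDATA", "RCDATA" or "CDATA"
    net_enabled_ancestor: bool = False   # an open element began with a NET-enabling start-tag
    inclusions: frozenset = frozenset()  # names included by enclosing elements

def problems(content, ctx):
    """List what goes wrong with the content of a data-only element in ctx."""
    if ctx.declared_content in ("RCDATA", "CDATA"):
        return []   # tag-like text is data here; end-tag handling is not modeled
    names = {m.group(1).upper() for m in TAG.finditer(content)}
    found = [f"{name} is not allowed here"
             for name in sorted(names - ctx.inclusions)]
    # a "/" in the remaining text is the NET delimiter if NET is enabled
    if ctx.net_enabled_ancestor and "/" in TAG.sub("", content):
        found.append("'/' is read as NET")
    return found

text = "my vet worried about <A HREF=\"...\">Xyzzy</A>'s eating habits"
print(problems(text, Context()))                            # A is not allowed
print(problems(text, Context(inclusions=frozenset({"A"})))) # no problems found
```

the same string gets a different verdict for each Context, which is the
point of the questions: the content alone does not decide validity.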

for most users, and for programmers used to more well-defined languages,
these are not the obvious answers.  although C++ programmers are getting
used to good variable names being usurped by other packages' new class
definitions or even a new keyword in the latest release of the compiler,
these are not situations we like to consider "healthy design".  locality of
errors is a very important part of the sense of control that we like to
bring to computers.  SGML offers, for all its good design and considerable
value, ways to create systems that break down if some butterfly flaps its
wings somewhere far away.  sort of like the WordPerfect 6.1 "enhancement"
that automatically converts (c) to © (copyright sign), even in lists and
references.

if those who argue that SGML is really context-free would at least stay
away from the constructs that break the context-free parts of SGML, we
could, perchance, arrive at some consensus that SGML can be parsed without
carrying a truckload of context around with us.  if HTML is going to be the
Great Benefit to SGML that some appear to think it already is, please do us
all a favor and _avoid_ the parts of SGML that are broken.  thank you.

SGML was never intended to enforce semantic restrictions or constraints on
the data, but rather to provide as much comfort as possible as early as
possible, so that whatever was deferred to later would be much easier.  in
practice this means: if some element is (too) hard to specify in a content
model, leave it more open and handle the complexity elsewhere.  the various
SGML parsers' error recovery is bad enough already without also being
relied upon to count elements in random order.

please consider this: there is _no_ value in restricting a content model
unless the users will perform validation _prior_ to letting the data loose
on an unsuspecting application.  in the case of HTML, we cannot expect that
users will perform validation in any meaningful sense of the word, and we
have therefore lost the single most valuable aspect of SGML: the ability of
an application to trust its data (at least part of the way) and therefore
to reduce the error recovery code.  therefore, it is only of academic
interest whether HTML will or will not "allow" some order of elements.  it
might even be counter-productive in a strong sense: without SGML's content
models and error handling, applications _had_ to be written defensively,
but with more trustworthy data (or at least apparently trustworthy data),
this may cause nothing but a false sense of security, and another round of
"what's the point with SGML?"

SGML provides (or enables) some great benefits to information technology,
but it also has some serious gotchas that we should not be proud of and
should resist all temptations to use.  since they are easily locatable, it
is not hard to avoid them.  unfortunately, SGML is defined with some of
them "built in" to the core language, and no way to request their absence,
so something else must be used to disable the bad features.  common sense
and mature restraint should do the trick.

#<Erik 3026829701>
-- 
suppose we actually were immortal.  what is the opposite of living your
life as if every day were your last?