what's in a DTD

Michael Sperberg-McQueen
My apologies to those who haven't been swimming in this particular
alphabet soup as long as I have; my note last week about formal
grammatical specifications of SGML and the TEI encoding scheme should
have made more allowances for the variety of prior knowledge among us.

Those of you who read and understood chapter 3 of the Guidelines may
wish to tune out now for about three to five paragraphs ...

'DTD' stands for 'document type definition' -- which is the SGML term
for (a) the formal specification of what elements may occur within a
document, their allowable combinations, and the 'attributes' they may or
must carry, together with (b) (informally specified) rules saying what
the elements mean, when they are to be used, etc.  'DTD' is commonly
used, however, to mean 'document type declaration' -- which is the SGML
term for just the formal part of the document type definition (part (a)
in the preceding definition).  Because the standard does not specify
clearly what must be or may be in a document type *definition* which
isn't in a *declaration*, the distinction appears to be rather
metaphysical, and I am not always consistent in my usage of the
abbreviation.

The DTD contains declarations of SGML *elements*, SGML *attributes*, and
the *entities* to which one refers in the course of the document (or in
the course of the document type definition itself).  Optionally it may
also contain other declarations for SGML objects, but these other
objects are not used by the TEI DTDs.  The formal declaration of any
element specifies formally what forms the content of that element may
take; the element declarations thus resemble productions in a BNF-style
formal grammar of a language, the DTD itself resembles the grammar of a
language, and the set of documents which conform to a given DTD
resembles the set of strings which constitute a language.  A phonebook
entry might be defined as containing exactly one name, one
address, and one phone number, in that order:

    <!ELEMENT entry    (name, address, phone) >

For fuller explanation, see the Guidelines themselves, or Lou Burnard's
introduction to SGML, found in the TEI-L file server under the name EDJ2
MEMO (send a note to LISTSERV @ UICVM -- *not* repeat *not* to the list
itself -- containing the single line GET EDJ2 MEMO to get a copy of this
file).

'BNF' stands either for 'Backus Normal Form' or for 'Backus/Naur Form',
for John Backus and (possibly) Peter Naur, who worked on the committee
which developed Algol-60.  It is a formalism invented by Backus for the
specification of legal syntax in formal languages, and became widely
known after its use in defining Algol.

BNF is a specific technique for defining what are called 'context-free'
languages.  A BNF production defines a single term (given on the left)
as any of a series of alternative sequences of terms (given on the
right); each alternative sequence contains zero or more terms, which may
either be defined in the BNF itself ('non-terminals') or undefined
(primitives or 'terminal symbols').  E.g.

    phonebook-entry ::=  name address phone-number
    phone-number    ::=  digit digit digit '-' digit digit digit digit
    digit           ::= '0' | '1' | '2' | '3' | ... | '9'
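
For readers who find running code easier to follow than productions,
here is a minimal sketch (in Python) of a recognizer written by hand
from the phone-number and digit productions above.  The function names
are my own invention for illustration; a parser generator would
normally produce something equivalent from the BNF itself.

```python
# Hypothetical hand-written recognizer for two of the productions above.
# Each function corresponds to one non-terminal; it returns the position
# just past the matched text, or None if the text does not match.

DIGITS = "0123456789"


def match_digit(text, pos):
    """digit ::= '0' | '1' | '2' | ... | '9'"""
    if pos < len(text) and text[pos] in DIGITS:
        return pos + 1
    return None


def match_phone_number(text, pos=0):
    """phone-number ::= digit digit digit '-' digit digit digit digit"""
    for _ in range(3):
        pos = match_digit(text, pos)
        if pos is None:
            return None
    if pos >= len(text) or text[pos] != "-":
        return None
    pos += 1
    for _ in range(4):
        pos = match_digit(text, pos)
        if pos is None:
            return None
    return pos


def is_phone_number(text):
    """True if the whole string can be derived from phone-number."""
    return match_phone_number(text) == len(text)
```

Calling is_phone_number on a string tells you whether that string
belongs to the (very small) language the grammar defines.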

OK, techies back with us now?  Fine.

The salient points, for the non-technical reader, are these:  both BNF
and the SGML DTD are methods of providing formal, machine-enforceable
specifications of legal sequences of things (characters, words, tokens,
in the BNF case; in the case of SGML, of elements).  They are roughly
similar in purpose, and fairly similar in notation, but the differences
in notation make a difference for some problems in software development.

One crucial difference should be pointed out.  (Warning:  technical
material ahead.  If your eyes glaze over when someone mentions formal
language theory, you may wish to tune out before you nod off and maybe
hit your head on your keyboard ...)

BNF grammars are usually written to allow programs to assign structure
to data streams in which that structure is not explicitly marked.  In
SGML, the beginning and end of each element are explicitly marked
already (unless one is using some kind of markup minimization, which
would mean one was not using the TEI interchange format).  One may wish
to *validate* the structure specified in the document, and for that one
needs the DTD; but if one just wishes to *represent* the structure found
in the document, then one doesn't need the DTD.  One just needs to
recognize the start- and end-tags and build one's tree accordingly.
(For this, a BNF of the grammar of legal SGML tags may be used with a
parser generator ...)

Let's take a simple example.  A BNF might be written to allow the
processing of phonebook data looking something like this:

    Smith, John Q., 123 Southmoor, 323-4567
    Fabbro, Giovanni Q., 321 Wisconsin, 232-7654
    ...

and you need the BNF or some equivalent to recognize which parts of the
data are names, addresses, and phone numbers, which names, addresses and
phone numbers fit with each other into entries, and so on.
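
A rough sketch of what "some equivalent" might look like in Python
follows.  It assumes (my assumption, not a rule given anywhere above)
that an address never contains a comma, so the last two comma-separated
fields are the address and the phone number and everything before them
is the name.  The function name parse_entry is hypothetical.

```python
import re

# phone-number ::= digit digit digit '-' digit digit digit digit
PHONE = re.compile(r"\d{3}-\d{4}")


def parse_entry(line):
    """Assign structure to one unmarked phonebook line.

    Assumes the address holds no comma, so the name is whatever
    precedes the last two ", " separators.
    """
    parts = line.strip().rsplit(", ", 2)
    if len(parts) != 3:
        raise ValueError(f"not a phonebook entry: {line!r}")
    name, address, phone = parts
    if not PHONE.fullmatch(phone):
        raise ValueError(f"not a phone number: {phone!r}")
    return {"name": name, "address": address, "phone": phone}


entry = parse_entry("Smith, John Q., 123 Southmoor, 323-4567")
```

Note that all the knowledge about where one field stops and the next
begins lives in the program, not in the data.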

The fully marked-up form of this data in SGML might be something like
this:

    <phonebook>
    <entry>
        <name>Smith, John Q.,</name>
        <address>123 Southmoor,</address>
        <phone>323-4567</phone>
    </entry>
    <entry>
        <name>Fabbro, Giovanni Q.,</name>
        <address>321 Wisconsin,</address>
        <phone>232-7654</phone>
    </entry>
    </phonebook>

Since the names, addresses, phone numbers, and entries are already
explicitly marked here, a processing program can assign the right
structure to the data even without a DTD.  A DTD is needed only to
verify that the document is legal (e.g. to answer the question "is it
legal to omit the address?").  If you want to validate TEI documents
without using SGML-conformant software, you will need to worry about the
DTD and how to parse the specifications it contains and match them to
the document.  If you only want to process TEI documents, you may get by
with a lot less.  The DTD will be useful, in that case, primarily as a
check to see what combinations of tags you are likely to see, so your
program can be prepared to handle them correctly.  (Of course, you
will want to validate the documents formally at some point, otherwise
you are asking for unpleasant surprises.)
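
To make the distinction concrete, here is a minimal sketch in Python.
The first function builds a tree from the tags alone, consulting no
DTD; the second checks an entry against a content model transcribed by
hand from a declaration like <!ELEMENT entry (name, address, phone)>.
It assumes fully tagged text with no attributes, no minimization, no
entity references and no comments, and all the names in it are
hypothetical.  The second entry in the sample is invented to show a
failing case.

```python
import re

# A start-tag, an end-tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>|([^<]+)")


def build_tree(source):
    """Build a tree from fully tagged text without consulting a DTD."""
    root = {"tag": None, "children": [], "text": ""}
    stack = [root]
    for closing, tag, text in TOKEN.findall(source):
        if text:
            stack[-1]["text"] += text.strip()
        elif closing:
            if stack[-1]["tag"] != tag:
                raise ValueError(f"unexpected </{tag}>")
            stack.pop()
        else:
            node = {"tag": tag, "children": [], "text": ""}
            stack[-1]["children"].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"unclosed <{stack[-1]['tag']}>")
    if len(root["children"]) != 1:
        raise ValueError("expected exactly one top-level element")
    return root["children"][0]


def entry_is_valid(entry):
    """Check one entry against the content model (name, address, phone)."""
    tags = [child["tag"] for child in entry["children"]]
    return tags == ["name", "address", "phone"]


SAMPLE = """
<phonebook>
<entry>
    <name>Smith, John Q.,</name>
    <address>123 Southmoor,</address>
    <phone>323-4567</phone>
</entry>
<entry>
    <name>Doe, Jane,</name>
    <phone>555-0000</phone>
</entry>
</phonebook>
"""

tree = build_tree(SAMPLE)
results = [entry_is_valid(e) for e in tree["children"]]
```

Both entries end up in the tree, since the tags alone are enough for
that; only the validity check notices that the second one has no
address.  A real validator would read the content models from the DTD
rather than have them written into the program.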

-Michael Sperberg-McQueen
 ACH / ACL / ALLC Text Encoding Initiative
 University of Illinois at Chicago