SGML and the TEI (notes on Michael Hart's comments)

Michael Sperberg-McQueen
(The following note is being cross-posted to GUTNBERG and TEI-L,
as was the original to which it replies.  Some reference is made to
other discussion on GUTNBERG, but no detailed knowledge of that
discussion is required.  Apologies for the cross-posting.  -CMSMcQ)

Subject:  SGML and the TEI

Thank you for your note about the TEI and SGML, which addresses a number
of important issues.  I'll take this opportunity to reply to your
comments, and to clarify some points where it appears clarification may
be worthwhile.

As I understand your posting, you make the following points.  Please
correct me if I've missed your meaning.

    1.  The TEI will be a wonderful thing if it encourages the
production of a vast library of easy to use electronic texts.
    2.  However, the TEI's recommendations are not universal, but
limited instead to the format defined by SGML.
    3.  SGML is too verbose:  it drowns the text under a mass of markup,
so SGML-marked text is difficult for humans to read.
    4.  If SGML codes occurred only at sentence, paragraph, or other
major boundaries, they would be easier to skip past in reading; but they
appear right in the middle of the text, and that makes the text harder
to read.
    5.  Facilities are needed, and should be provided, to strip the SGML
markup out of texts to allow those who don't want it to have the text
without it.
    6.  (Implicitly) Markup-free texts are the most universal electronic
texts possible.
    7.  (Implicitly) Electronic texts should be distributed in
ASCII-only form.
    8.  Electronic texts should be usable on many different machines,
with many different programs, under many different operating systems.
    9.  (therefore) Electronic texts should be distributed free of
markup.
    10.  SGML should be documented openly and publicly.  (I'm not
entirely certain that this is your point when you say "SGML should
provide more of an open architecture as hardware developers use the
term."  You may mean instead:  SGML ought to work with the wide variety
of existing software for indexing, concording, etc.)


Well, on a number of points we agree very fully.  I hope -- everyone
associated with the TEI hopes -- that the TEI will indeed help encourage
the production of large quantities of useful electronic texts, and I am
gratified that you see that as a potential result of the work.  We
should not let the obvious divergence of our notions as to what makes
for useful electronic text blind us to the fact that in this basic
respect we are on the same side of the fence.

Your second point raises the crucial question of what one wants or
expects from a 'universal' method.  The TEI's goal is to provide
mechanisms for text encoding which may be used with texts of any
language, of any period, in any genre, and which are not limited to
specific machines, operating systems, pieces of software, or types of
application.  Your aims, if I understand them rightly, are similar.

The hitch is this:  no one method of encoding -- that is, neither any
specific markup scheme, nor the scheme of avoiding all markup as far as
that is possible -- can handle all of the goals above.  If one settles
for a specific markup scheme (as the TEI has done, using a markup scheme
based on SGML), then there will be some software which cannot exploit
the markup in the text.  (But note well:  this is not the same as saying
there is software which cannot read the text at all.  See further
below.)  If one eschews all markup (as you propose doing, and apparently
have done in your electronic texts), then one cannot possibly transcribe
texts in languages other than modern English, and one will have trouble
even with modern English.  (What does one do, for example, with the
mirror writing in Alice in Wonderland?  ASCII does not have mirror
writing, and adding a note saying "this is in mirror script" is markup.)

So to the charge that using SGML makes the TEI encoding scheme less than
universal, I plead not guilty, or at least not very.  One cannot encode
texts in languages other than English without having markup for
non-ASCII characters.  And for the community we all want to serve, being
able to handle languages other than English is an absolute, immovable,
non-negotiable requirement.  There is more Greek commonly available in
machine-readable form, if I count my bytes right, than there is literary
English.  This is related to the issue of ASCII-only text, which comes
up further below.

Your third and fourth points complain that SGML is unreadable, that it
drowns the text under a lot of markup.  To this, I say first:  it
depends.  If you mark the text up in detail, you will have a lot of
markup.  But an SGML markup which marks only the structural units most
often marked in existing electronic texts (e.g. book, chapter, and
paragraph or canto, stanza, line) will not overwhelm the text at all.
And in those cases, the bulk of the markup will indeed appear at the
book, chapter, and paragraph or canto, stanza, and line boundaries.
Most examples of SGML show a somewhat more elaborate markup, not because
elaborate markup is inherent to SGML, but because elaborate markup shows
the advantages of SGML-style markup, as perceived by SGML's proponents,
very clearly.  (Specifically, SGML is capable of handling complex
structures somewhat more elegantly than most alternatives.)

Second, as Liam R. E. Quin points out in his posting, a good document
browser will hide the markup if you wish.  Any SGML-aware editor can do
that, and a lot of programs which aren't SGML-aware may be made to do it
with some ingenuity (though I won't vouch for WordStar!).

Third, a variety of techniques are available for making markup less
obtrusive by making some markup implicit.  (Marking paragraph breaks by
blank lines, for example.)  These are useful for local processing,
though they are too error-prone to be really desirable in interchanging
texts from one site to another.  This is a big and complicated issue,
which I am just going to skirt here.
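
As a rough illustration of implicit markup, the following Python sketch
turns blank-line paragraph breaks into explicit tags.  The <p> tag name
and the function name paragraphs_to_tags are assumptions made for this
example; they are not taken from the TEI recommendations.

```python
import re

def paragraphs_to_tags(text: str) -> str:
    """Make implicit paragraph breaks (blank lines) explicit as <p>...</p>."""
    # A paragraph break is one or more lines containing only whitespace.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse line breaks and runs of spaces inside each paragraph.
    return "\n".join(f"<p>{' '.join(b.split())}</p>" for b in blocks if b.strip())

sample = (
    "Call me Ishmael.  Some years ago,\n"
    "never mind how long precisely.\n"
    "\n"
    "There now is your insular city.\n"
)
print(paragraphs_to_tags(sample))
```

Note that if a single blank line is lost in transit, two paragraphs
merge silently and nothing in the file records that anything went wrong.
That is the kind of error that makes implicit markup risky for
interchange.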

Next, you say that facilities for stripping SGML markup out of texts are
much to be desired, for those that don't want any markup in their texts.
I agree, but am not sure anyone need provide any special software for
this, since any editor worth its salt should be able to find a left
angle bracket and delete until it finds the next right angle bracket.

(This, I submit, is one crucial difference between SGML and the internal
markup of something like WordStar or WordPerfect:  SGML markup is
explicitly delimited, and it is a simple matter to distinguish markup
from content, whether your purpose is to index or concord the file or to
strip out all the markup.)
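
The "find a left angle bracket and delete until the next right angle
bracket" procedure can be sketched in a few lines of Python.  This is a
deliberately naive illustration, not a conforming SGML parser:  it
ignores entity references, comments, marked sections, and attribute
values that contain ">".  The tag names in the sample are hypothetical.

```python
import re

# A left angle bracket, then everything up to the next right angle bracket.
TAG = re.compile(r"<[^>]*>")

def strip_markup(text: str) -> str:
    """Delete every <...> span, leaving only the content."""
    return TAG.sub("", text)

def split_markup(text: str):
    """Return (list of tags found, content with the tags removed)."""
    return TAG.findall(text), TAG.sub("", text)

sample = "<chapter n=1><p>Call me <name>Ishmael</name>.</p></chapter>"
print(strip_markup(sample))      # prints: Call me Ishmael.
tags, content = split_markup(sample)
print(tags)
```

Because the delimiters are explicit, the same pattern serves both
purposes mentioned above:  throwing the markup away, or setting it aside
so that an indexing or concordance program sees only the content.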

It is with your next two points (that electronic texts should be
markup-free and ASCII-only) that we reach the nub of the matter.

First, a distinction:  markup-free texts have as their content only the
contents of the text itself.  (I'll pass by the theoretical difficulties
posed by that formulation, and assume that for at least some cases we
can agree on what "the contents of the text itself" are.)  ASCII-only
texts contain no characters not defined by the American national
standard ANSI X3.4.  Since most microcomputer word processors insert
their markup in the form of non-ASCII characters, it is easy to conflate
the notions of ASCII-only text and markup-free text.  But the two are
logically distinct, and it's an important distinction for any discussion
of SGML and the TEI.
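
The two properties can be tested independently, as this small Python
sketch suggests.  The helper names are invented for the example, and the
tag test is only a crude heuristic (any <...> span counts as markup).

```python
import re

TAG = re.compile(r"<[^>]*>")

def is_ascii_only(data: bytes) -> bool:
    """True if every byte lies in the 7-bit range 0x00-0x7F."""
    return all(b < 0x80 for b in data)

def has_tags(data: bytes) -> bool:
    """Crude heuristic: does the data contain any <...> span?"""
    # Latin-1 maps every byte to a character, so decoding cannot fail.
    return TAG.search(data.decode("latin-1")) is not None

tagged_ascii = b"<p>Hello, world.</p>"
untagged_8bit = "na\u00efve caf\u00e9".encode("latin-1")

for name, data in [("tagged_ascii", tagged_ascii), ("untagged_8bit", untagged_8bit)]:
    print(name, "ASCII-only:", is_ascii_only(data), "has tags:", has_tags(data))
```

The first sample is ASCII-only but not markup-free; the second is
markup-free but not ASCII-only.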

SGML does not require any particular character set, and therefore (given
the proper declarations) SGML files may legally contain non-ASCII
characters.

The TEI recommendations for interchange of texts require conforming
texts to contain only a subset of the characters included in ASCII.

And therefore a TEI text is indeed ASCII-only, on any ASCII machine (and
EBCDIC-only on any EBCDIC machine), and can be read by any program
capable of reading ASCII (or EBCDIC) text files.  No text conforming to
the TEI, however, will be markup-free.  (The text itself may be, though
that's not recommended.  But all TEI-conforming texts must have a header
with bibliographic documentation saying what the file is, and the header
is separated from the text proper by markup.)

So in a very simple way, no software restrictions apply to TEI files:
if you can read this note with your software, you can read a TEI file
with it.  A great many people deal with SGML files now, using no special
software.  Special software, needless to say, makes many things a lot
easier, but none is required for TEI files the way it would be required,
say, for WordStar files or WordPerfect files.

In short, I agree that texts are best interchanged when they contain no
non-ASCII characters.  I can't agree, however, that they should contain
no markup, because most of the programs that most of us use cannot
recognize crucial points (like structural divisions) without some
explicit flag.  If I get a text from which the markup has been stripped
out, the first thing I must do, typically, is to put back into it markup
for the information I need my programs to have.  The notion that we
should each individually have to insert markup for (say) chapter
divisions in Moby Dick, or else live without them, just seems terribly
wasteful to me, as does the notion that we should have to re-format the
markup differently for every program we deal with.  (OCP wants to see <C
132>, WordCruncher wants to see |c132, LaTeX wants to see \chapter{132}
-- enough already!)
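
With one shared, explicitly delimited source form, the per-program
variants can be generated mechanically instead of being retyped.  The
Python sketch below assumes a hypothetical source tag of the form
<chapter n="132">; the three target notations are simply the ones quoted
above, and nothing else about those programs' input formats is implied.

```python
import re

# Target notations as quoted in the text; {n} is the chapter number.
FORMATS = {
    "ocp": "<C {n}>",
    "wordcruncher": "|c{n}",
    "latex": "\\chapter{{{n}}}",   # doubled braces give literal { and }
}

# Hypothetical shared source form for a chapter boundary.
CHAPTER = re.compile(r'<chapter n="(\d+)">')

def convert(text: str, target: str) -> str:
    """Rewrite every chapter tag in the notation the target program expects."""
    template = FORMATS[target]
    # A function replacement avoids backslash-escape handling in the template.
    return CHAPTER.sub(lambda m: template.format(n=m.group(1)), text)

source = '<chapter n="132">The Symphony'
for target in FORMATS:
    print(target, "->", convert(source, target))
```

Adding a fourth program means adding one entry to FORMATS, not editing
every text again.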

Obviously no one has to put in markup if they don't want to.  But since
so many of us do work to enrich our electronic texts with structural and
other information, it would seem a shame not to have any mechanism
whereby we can share the results of our work.

I'm getting close to the end.  We agree whole-heartedly on the desire to
make electronic texts usable on a wide range of machines, with a wide
range of software.  I believe that ASCII-only interchange is a key to
this, and that insisting that the documents be markup-free as well adds
nothing and removes a lot.  A lot of programs, after all, work best when
they know something about the structure of the text.  And we tell them
about the structure of the text by using markup.  So I don't think
markup-free texts are the way to achieve the goal of broad usability.

Your final point, about open architecture, is an important one.  It is
absolutely true that for the free exchange of texts we would all like to
make possible, we do not want to focus on any proprietary scheme, any
more than on a scheme which requires a specific piece of software.
WordStar is ruled out because it's proprietary and not publicly
documented (as well as because of any intrinsic limitations we might
find in it); TeX and LaTeX are ruled out because they are specific
pieces of software, for a specific, highly specialized task.

That is why the TEI chose to base its scheme on SGML.  SGML is not
proprietary, but an international standard freely available to anyone.
The TEI guidelines similarly are publicly available and will never be a
proprietary scheme.  Neither SGML nor the TEI application of it requires
any particular piece of software:  there is a range of SGML-conformant
and SGML-aware software in existence and under development, which
addresses a fairly wide range of intended uses.  There will be more,
because SGML is publicly documented.

So:  you are right.  Interchange schemes for electronic texts should be
based on an open (publicly documented, non-proprietary, general-purpose)
architecture.  And after several years of working on these issues, the
TEI has concluded that the best of the available open architectures for
electronic text is ... SGML.

(I should warn those new to SGML, however, that the standard itself
makes very heavy reading, because of the rather rigid restrictions
imposed by ISO on standards.  One of the books written to explain SGML
will be a lot easier going:  Martin Bryan and Eric van Herwijnen have
each written useful volumes, and I understand Charles Goldfarb has also
got a book out now, which I have not seen.)

This has gone on to some length, because the issues you raise are such
important ones.  I hope that this discussion can clarify the issues, and
I welcome your comments directed to that end.

Best regards,

-Michael Sperberg-McQueen
 ACH / ACL / ALLC Text Encoding Initiative
 University of Illinois at Chicago


Re: SGML and the TEI (notes on Michael Hart's comments)

In regard to the exchange between Hart and Sperberg:  in distinguishing
between straight ASCII text and text without markup, Sperberg doesn't quite
make one of his points.  Straight ASCII text is, in fact, marked up, though
in a minimalist way.  CR, LF, HT, SP, and a few others are there to show the
spacing and line structure of the text.

What Hart objects to, among other things, is the presence of something more
than this minimalist markup.  But, of course, SGML markup isn't intended for
human eyes, any more than WordStar's high order bit was.  One of the problems
with SGML tagging at the moment is a shortage (near absence) of serious word
processing tools based on it.