(The following note is being cross-posted to GUTNBERG and TEI-L, as was the original to which it replies. Some reference is made to other discussion on GUTNBERG, but no detailed knowledge of that discussion is required. Apologies for the cross-posting. -CMSMcQ)

Subject: SGML and the TEI

Thank you for your note about the TEI and SGML, which addresses a number of important issues. I'll take this opportunity to reply to your comments, and to clarify some points where it appears clarification may be worthwhile.

As I understand your posting, you make the following points. Please correct me if I've missed your meaning.

1. The TEI will be a wonderful thing if it encourages the production of a vast library of easy to use electronic texts.

2. However, the TEI's recommendations are not universal, but limited instead to the format defined by SGML.

3. SGML is too verbose: it drowns the text under a mass of markup, so SGML-marked text is difficult for humans to read.

4. If SGML codes occurred only at sentence, paragraph, or other major boundaries, they would be easier to skip past in reading; but they appear right in the middle of the text, and that makes the text harder to read.

5. Facilities are needed, and should be provided, to strip the SGML markup out of texts to allow those who don't want it to have the text without it.

6. (Implicitly) Markup-free texts are the most universal electronic texts possible.

7. (Implicitly) Electronic texts should be distributed in ASCII-only form.

8. Electronic texts should be usable on many different machines, with many different programs, under many different operating systems.

9. (Therefore) Electronic texts should be distributed free of markup.

10. SGML should be documented openly and publicly. (I'm not entirely certain that this is your point when you say "SGML should provide more of an open architecture as hardware developers use the term." You may mean instead: SGML ought to work with the wide variety of existing software for indexing, concording, etc.)
-----

Well, on a number of points we agree very fully. I hope -- everyone associated with the TEI hopes -- that the TEI will indeed help encourage the production of large quantities of useful electronic texts, and I am gratified that you see that as a potential result of the work. We should not let the obvious divergence of our notions as to what makes for useful electronic text blind us to the fact that in this basic respect we are on the same side of the fence.

Your second point raises the crucial question of what one wants or expects from a 'universal' method. The TEI's goal is to provide mechanisms for text encoding which may be used with texts of any language, of any period, in any genre, and which are not limited to specific machines, operating systems, pieces of software, or types of application. Your aims, if I understand them rightly, are similar.

The hitch is this: no one method of encoding -- that is, neither any specific markup scheme, nor the scheme of avoiding all markup as far as that is possible -- can handle all of the goals above. If one settles for a specific markup scheme (as the TEI has done, using a markup scheme based on SGML), then there will be some software which cannot exploit the markup in the text. (But note well: this is not the same as saying there is software which cannot read the text at all. See further below.) If one eschews all markup (as you propose doing, and apparently have done in your electronic texts), then one cannot possibly transcribe texts in languages other than modern English, and one will have trouble even with modern English. (What does one do, for example, with the mirror writing in Through the Looking-Glass? ASCII does not have mirror writing, and adding a note saying "this is in mirror script" is markup.)

So to the charge that using SGML makes the TEI encoding scheme less than universal, I plead not guilty, or at least not very.
One cannot encode texts in languages other than English without having markup for non-ASCII characters. And for the community we all want to serve, being able to handle languages other than English is an absolute, immovable, non-negotiable requirement. There is more Greek commonly available in machine-readable form, if I count my bytes right, than there is literary English. This is related to the issue of ASCII-only text, which comes up further below.

Your third and fourth points complain that SGML is unreadable, that it drowns the text under a lot of markup. To this, I say first: it depends. If you mark the text up in detail, you will have a lot of markup. But an SGML markup which marks only the structural units most often marked in existing electronic texts (e.g. book, chapter, and paragraph or canto, stanza, and line) will not overwhelm the text at all. And in those cases, the bulk of the markup will indeed appear at the book, chapter, and paragraph or canto, stanza, and line boundaries. Most examples of SGML show a somewhat more elaborate markup, not because elaborate markup is inherent to SGML, but because elaborate markup shows the advantages of SGML-style markup, as perceived by SGML's proponents, very clearly. (Specifically, SGML is capable of handling complex structures somewhat more elegantly than most alternatives.)

Second, as Liam R. E. Quin points out in his posting, a good document browser will hide the markup if you wish. Any SGML-aware editor can do that, and a lot of programs which aren't SGML-aware may be made to do it with some ingenuity (though I won't vouch for WordStar!).

Third, a variety of techniques are available for making markup less obtrusive by making some markup implicit. (Marking paragraph breaks by blank lines, for example.) These are useful for local processing, though they are too error-prone to be really desirable in interchanging texts from one site to another. This is a big and complicated issue, which I am just going to skirt here.
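
The blank-line convention mentioned above can be turned back into explicit markup mechanically before a text is interchanged. What follows is only a sketch in Python, not anything prescribed by SGML or the TEI: the function name `paragraphs_to_tags` and the choice of `p` as the tag are illustrative assumptions. It also shows why the convention is error-prone: any stray blank line (inside a quotation set off from the text, say) will be read as a paragraph break.

```python
import re


def paragraphs_to_tags(text: str, tag: str = "p") -> str:
    """Make implicit paragraph breaks (blank lines) explicit.

    Assumption for this sketch: a paragraph is any run of non-blank
    lines, and a line holding only spaces or tabs counts as blank.
    """
    # Split wherever one or more blank lines separate two runs of text.
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    # Rejoin the wrapped lines of each paragraph into a single line.
    cleaned = [" ".join(block.split()) for block in blocks if block.strip()]
    # Wrap each paragraph in an explicit start-tag and end-tag.
    return "\n".join(f"<{tag}>{para}</{tag}>" for para in cleaned)


if __name__ == "__main__":
    sample = "Call me Ishmael.\nSome years ago.\n\nIt is a way I have."
    print(paragraphs_to_tags(sample))
```

Local files can keep the lighter blank-line form; only the copy sent to another site needs the explicit tags.
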
Next, you say that facilities for stripping SGML markup out of texts are much to be desired, for those who don't want any markup in their texts. I agree, but am not sure anyone need provide any special software for this, since any editor worth its salt should be able to find a left angle bracket and delete until it finds the next right angle bracket. (This, I submit, is one crucial difference between SGML and the internal markup of something like WordStar or WordPerfect: SGML markup is explicitly delimited, and it is a simple matter to distinguish markup from content, whether your purpose is to index or concord the file or to strip out all the markup.)

It is with your next two points (that electronic texts should be markup-free and ASCII-only) that we reach the nub of the matter. First, a distinction: markup-free texts have as their content only the contents of the text itself. (I'll pass by the theoretical difficulties posed by that formulation, and assume that for at least some cases we can agree on what "the contents of the text itself" are.) ASCII-only texts contain no characters not defined by the American national standard ANSI X3.4. Since most microcomputer word processors insert their markup in the form of non-ASCII characters, it is easy to conflate the notions of ASCII-only text and markup-free text. But the two are logically distinct, and it's an important distinction for any discussion of SGML and the TEI.

SGML does not require any particular character set, and therefore (given the proper declarations) SGML files may legally contain non-ASCII characters. The TEI recommendations for interchange of texts require conforming texts to contain only a subset of the characters included in ASCII. And therefore a TEI text is indeed ASCII-only, on any ASCII machine (and EBCDIC-only on any EBCDIC machine), and can be read by any program capable of reading ASCII (or EBCDIC) text files. No text conforming to the TEI, however, will be markup-free.
(The text itself may be, though that's not recommended. But all TEI-conforming texts must have a header with bibliographic documentation saying what the file is, and the header is separated from the text proper by markup.)

So in a very simple way, no software restrictions apply to TEI files: if you can read this note with your software, you can read a TEI file with it. A great many people deal with SGML files now, using no special software. Special software, needless to say, makes many things a lot easier, but none is required for TEI files the way it would be required, say, for WordStar files or WordPerfect files.

In short, I agree that texts are best interchanged when they contain no non-ASCII characters. I can't agree, however, that they should contain no markup, because most of the programs that most of us use cannot recognize crucial points (like structural divisions) without some explicit flag. If I get a text from which the markup has been stripped out, the first thing I must do, typically, is to put back into it markup for the information I need my programs to have. The notion that we should each individually have to insert markup for (say) chapter divisions in Moby Dick, or else live without them, just seems terribly wasteful to me, as does the notion that we should have to re-format the markup differently for every program we deal with. (OCP wants to see <C 132>, WordCruncher wants to see |c132, LaTeX wants to see \chapter{132} -- enough already!)

Obviously no one has to put in markup if they don't want to. But since so many of us do work to enrich our electronic texts with structural and other information, it would seem a shame not to have any mechanism whereby we can share the results of our work.

I'm getting close to the end. We agree whole-heartedly on the desire to make electronic texts usable on a wide range of machines, with a wide range of software.
I believe that ASCII-only interchange is a key to this, and that insisting that the documents be markup-free as well adds nothing and removes a lot. A lot of programs, after all, work best when they know something about the structure of the text. And we tell them about the structure of the text by using markup. So I don't think markup-free texts are the way to achieve the goal of broad usability.

Your final point, about open architecture, is an important one. It is absolutely true that for the free exchange of texts we would all like to make possible, we do not want to focus on any proprietary scheme, any more than on a scheme which requires a specific piece of software. WordStar is ruled out because it's proprietary and not publicly documented (as well as because of any intrinsic limitations we might find in it); TeX and LaTeX are ruled out because they are specific pieces of software, for a specific, highly specialized task. That is why the TEI chose to base its scheme on SGML.

SGML is not proprietary, but an international standard freely available to anyone. The TEI guidelines similarly are publicly available and will never be a proprietary scheme. Neither SGML nor the TEI application of it requires any particular piece of software: there is a range of SGML-conformant and SGML-aware software in existence and under development, which addresses a fairly wide range of intended uses. There will be more, because SGML is publicly documented.

So: you are right. Interchange schemes for electronic texts should be based on an open (publicly documented, non-proprietary, general-purpose) architecture. And after several years of working on these issues, the TEI has concluded that the best of the available open architectures for electronic text is ... SGML.

(I should warn those new to SGML, however, that the standard itself makes very heavy reading, because of the rather rigid restrictions imposed by ISO on standards.
One of the books written to explain SGML will be a lot easier going: Martin Bryan and Eric van Herwijnen have each written useful volumes, and I understand Charles Goldfarb has also got a book out now, which I have not seen.)

This has gone on to some length, because the issues you raise are such important ones. I hope that this discussion can clarify the issues, and I welcome your comments directed to that end.

Best regards,

-Michael Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago
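
Two remarks in the note above lend themselves to a short illustration: that markup can be removed by deleting everything from a left angle bracket to the next right angle bracket, and that one explicit chapter tag can be rewritten for OCP, WordCruncher, or LaTeX. The Python below is a sketch under stated assumptions, not a conforming SGML tool. The input form `<chapter n="132">` and the names `strip_markup` and `retarget_chapters` are hypothetical; only the three output forms come from the note. The stripper deliberately ignores SGML features such as entity references, comments, and a ">" inside a quoted attribute value.

```python
import re

# Anything from "<" up to the next ">" is treated as markup (simplification).
TAG = re.compile(r"<[^>]*>")

# Hypothetical source form for a chapter start-tag; not taken from the TEI.
CHAPTER = re.compile(r'<chapter n="(\d+)">')

# Target forms quoted in the note above.
TARGETS = {
    "ocp": "<C {n}>",
    "wordcruncher": "|c{n}",
    "latex": "\\chapter{{{n}}}",
}


def strip_markup(text: str) -> str:
    """Delete every angle-bracket tag, leaving only the content."""
    return TAG.sub("", text)


def retarget_chapters(text: str, target: str) -> str:
    """Rewrite each chapter start-tag in the form one program expects."""
    template = TARGETS[target]
    # A function replacement avoids backslash handling in the LaTeX form.
    return CHAPTER.sub(lambda m: template.format(n=m.group(1)), text)


if __name__ == "__main__":
    sample = '<chapter n="132"><p>Call me <emph>Ishmael</emph>.</p>'
    print(strip_markup(sample))
    for name in TARGETS:
        print(retarget_chapters('<chapter n="132">', name))
```

The point of the second function is that the chapter number is recorded once, in one delimited form, and each program-specific form is derived from it rather than keyed in again.
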
In regard to the exchange between Hart and Sperberg-McQueen: in distinguishing between straight ASCII text and text without markup, Sperberg-McQueen doesn't quite make one of his points. Straight ASCII text is, in fact, marked up, though in a minimalist way. CR, LF, HT, SP, and a few others are there to show the spacing and line structure of the text. What Hart objects to, among other things, is the presence of something more than this minimalist markup. But, of course, SGML markup isn't intended for human eyes, any more than WordStar's high-order bit was.

One of the problems with SGML tagging at the moment is a shortage (near absence) of serious word processing tools based on it.