note forwarded from Michael S. Hart

Michael Sperberg-McQueen
This note from Michael S. Hart was diverted by the list server for
uninteresting technical reasons, but was intended for this list, as
well as for the GUTNBERG list which Mr. Hart moderates.
-CMSMcQ

--------

From:         "Michael S. Hart" <[hidden email]>
Subject:      Text Encoding and Decoding (Initiative)
X-To:         Text Encoding Initiative Discussion Group <[hidden email]>

This column is a response to the tenets of the Text Encoding
Initiative as perceived through various of their statements,
both in print and in the electronic media such as listserver
emailings on the order of Humanist.  If any readers have any
additional material which either supports or contradicts any
of the generalizations made herein, an email copy of them to
this address would be both appreciated and hopefully worth a
mention, if not inclusion, in a future column.

The Text Encoding Initiative (TEI) appears at first glance a
wonderful thing, a movement to encourage the production of a
vast library of easy to use electronic texts (etexts).  Text
encoding here refers not so much to the actual encoding, as
it were, of printed matter into universal etexts, as to one
specific computer oriented language, the Standard Generalized
Markup Language (SGML).

This language does not so much translate the text into etext
which can only be read by computers, as it makes additions to
the etexts, which point out various points of interest to an
army of scholars who are the target audience for such things
as a general rule.  SGML does not remove anything from etext
but it adds so much that it makes it difficult or impossible
for the normal reader to scan the material in the manner you
are likely to be scanning this column right now.

Instead of a universal format which can be read well by both
humans and computers with ease, various codes are entered by
the encoders right into the text, codes which do not appear,
as far as I have been able to determine, only at the ends of
sentences, paragraphs, pages, chapters or what have you.  If
this were the case, then humans could easily develop a sense
of reading procedures which would allow the eye and the mind
to easily skip over the notations, if a reader wanted to pay
attention to the text only.

I have mentioned this on several occasions to the members of
TEI with whom I am electronically acquainted, either via an
email link or via phone.  The responses have always been the
same:  This IS pure ASCII text and it doesn't need a method,
inclusive or exclusive to the TEI program, to strip it of an
interesting and useful set of added notation.

I predict that if and when SGML becomes widely spread, strip
features will be added not only to the authorized programs a
person might use to work with them, but also to the various,
and quite popular text search programs which include a strip
feature which removes the high bits from all WordStar terms.
Almost all programs now contain options which allow files to
be transported to other programs for other uses.  Unless TEI
is intentionally being narrow minded in the scope of people,
programs, and other, perhaps yet unforeseen applications, they
will provide the most universal electronic texts possible.
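
As an illustration of the kind of strip feature requested here,
the following is a minimal sketch in Python.  It is not a TEI or
SGML tool: the function name strip_tags and the sample text are
hypothetical, and it assumes markup in the simple <tag> ... </tag>
form only.  Real SGML also allows comments, marked sections,
entity references, and minimized tags, none of which this handles.

```python
import re

# Assumption: tags are simple "<...>" spans with no nested angle brackets.
# Full SGML (comments, marked sections, tag minimization) needs a real parser.
TAG_PATTERN = re.compile(r"<[^<>]*>")


def strip_tags(text: str) -> str:
    """Return text with every simple <...> tag removed."""
    return TAG_PATTERN.sub("", text)


if __name__ == "__main__":
    # Hypothetical marked-up sample, not taken from any TEI document.
    sample = "<p>Call me <name>Ishmael</name>.</p>"
    print(strip_tags(sample))
```

The words of the text pass through untouched; only the notation
between angle brackets is dropped.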

I continue to request this feature, much in the manner which
others requested WordStar strippers for the odd characters a
normal text reader would see at the end of lines, paragraphs
and other locations.  These characters were not in the lower
ASCII set most of us refer to as pure ASCII, and while there
were not as many of them as in most SGML texts I have looked
over, they actually changed the last character in some lines
and paragraphs by adding an eighth bit, which was useful for
the WordStar program, but annoying to the reader, especially
when several logical choices were apparent to the reader, or
perhaps no apparently logical choices at all.
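
A WordStar stripper of the sort described above can be sketched
in a few lines of Python.  This is an illustrative sketch, not
any particular historical utility: the name strip_high_bits is
hypothetical, and it assumes the only change needed is to clear
the eighth bit of each byte so that it falls back into the lower
ASCII range.

```python
def strip_high_bits(data: bytes) -> bytes:
    """Clear the eighth (high) bit of every byte, leaving 7-bit ASCII."""
    # 0x7F is binary 0111 1111, so the mask keeps the low seven bits.
    return bytes(b & 0x7F for b in data)


if __name__ == "__main__":
    # Hypothetical sample: the final "t" (0x74) stored with the
    # high bit set, giving 0xF4.
    marked = b"etex\xf4"
    print(strip_high_bits(marked))
```

Bytes already in the lower ASCII set are left unchanged, so the
function can be run safely over text that was never marked.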

So far, in the world of electronic text, each provider seems
to be insisting on pushing their own products at least as much,
if not more than the etexts themselves.  The policy includes
the inclusion of textual errors which allow identifications,
for the purposes of copyright protection, of electronic text
which would reside in the public domain if it were not for a
markup, page numbering, or other scheme to create artificial
but legal reasons for copyright protection.

Let us not see SGML and TEI be used in a similar manner, as a
restrictive rather than open academic policy.

Thank you for your interest,

Michael S. Hart, Director, Project Gutenberg
National Clearinghouse for Machine Readable Texts

THESE NOTES ARE USUALLY WRITTEN AT A LIVE TERMINAL, AND THE
CHOICE OF WORDS IS OFTEN MEANT TO BE SUCH AS TO PROVOKE THE
GREATEST POSSIBLE RESPONSE SHORT OF BEING OFFENSIVE.  TRUTH
IN THESE NOTES IS OF GREAT CONCERN, THE FORM IS SECONDARY -
OTHER THAN THE TOKEN EFFORT OF JUSTIFIED RIGHT MARGINATION.

BITNET:  HART@UIUCVMD      INTERNET:  [hidden email]
(*ADDRESS CHANGE FROM *VME* TO *VMD* AS OF DECEMBER 18!!**)
(THE GUTNBERG SERVER IS LOCATED AT [hidden email])

NEITHER THE ABOVE NAMED INDIVIDUALS NOR ORGANIZATIONS ARE
OFFICIAL REPRESENTATIVES OF ANY OTHER INSTITUTION NOR ARE
THE ABOVE COMMENTS MEANT TO IMPLY THE POLICIES OF ANY OTHER
PERSONS OR INSTITUTIONS, THOUGH OF COURSE WE WISH THEY DID.
