Since versions of sed (Stream EDitor) are available for nearly every
type of computer system, usually for free, perhaps sed scripts for
conversion into and out of TEI format would be a good, general
contribution to the people on this list. If we write too many machine-
specific translation programs, the Mac users, the MeSsyDOS users, the
Unix users, the VAX users, the Amiga users, the Atari ST users, the
Sinclair users ... in short, _someone_ will always be left out. Barring
sed scripts, awk scripts would be a good second choice, since gawk
(Gnu AWK) has also been ported to most CPUs, and is also free.
David Megginson suggests using sed or awk or ports thereof to translate
into and out of TEI. While this has its commendable sides, there are a
few trouble spots that are quite annoying even outside of TEI needs:
(1) The length of lines is restricted in both sed and awk.
(2) Both sed and awk operate on lines, which makes some parts of SGML
very difficult to describe and handle efficiently and correctly.
(3) Neither handles 8-bit data very cleanly, be it binary or 8-bit text.
(4) Neither handles arbitrary binary data with context sensitive meaning,
such as found in many proprietary text representation systems.
(5) Both sed and awk are easy to use for simple tasks, but complex
problems get exponentially more complex to solve with sed, less so
with awk.
(6) Both sed and awk are regular expression based. Regexps are powerful
yet get complex once you leave the character orientation they have.
SGML is not character-oriented, but token-oriented, and uses regular
expressions on tokens in its syntax. This can get arbitrarily
complex to represent in a character-based regular expression engine.
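To make points (2) and (6) concrete, here is a minimal sketch. It is
written in Python rather than sed or awk purely for illustration, and
the sample text, the simplified tag pattern, and the function names are
all made up for this example. It applies the same pattern once per line
and once to the whole stream, using a start tag that is split across
two lines.

```python
import re

# Hypothetical sample: the <note> start tag is split across two lines,
# which SGML allows but a line-at-a-time tool never sees as one unit.
SAMPLE = 'Some text <note\n  type="gloss">a gloss</note> more text\n'

# Simplified tag pattern (an assumption: no ">" inside attribute values,
# no comments, no marked sections, no markup minimization).
TAG = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9.-]*)([^>]*)>')


def tags_by_line(text):
    """Apply the pattern to each line separately, as sed or awk would."""
    found = []
    for line in text.splitlines():
        found.extend(m.group(1) + m.group(2) for m in TAG.finditer(line))
    return found


def tags_by_stream(text):
    """Apply the same pattern to the whole text as one stream."""
    return [m.group(1) + m.group(2) for m in TAG.finditer(text)]


if __name__ == "__main__":
    print(tags_by_line(SAMPLE))    # the split start tag is missed
    print(tags_by_stream(SAMPLE))  # both tags are found
```

Even the stream version only works because of the simplifying
assumptions in the pattern; real SGML would need a proper tokenizer.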
This is not to deride the value of awk or sed. I use awk to process
(not validate) simple SGML documents such as invoices and business
letters. I even used awk and sed to format and print an SGML document,
from SGML input to laser printer driving code output. It can be done,
but it usually requires multiple steps of sed and awk, and care must
be taken to "layer" the operations correctly so you handle everything.
Intermediate steps have to be designed. It's often easier to write up
something which builds on an SGML parser. There are a few SGML parsers
in the public domain, as well. NIST comes to mind.
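The layering described above might look something like the following
sketch, again in Python for illustration only. The one-token-per-line
intermediate form, its "(", ")" and "-" prefixes, the element names,
and both function names are invented for this example; they are not
taken from any particular tool. The first step turns the stream into
records, and the second step is then a simple line-oriented pass of the
kind sed and awk handle well.

```python
import re

# Simplifying assumption: no ">" inside attribute values, no comments,
# no marked sections. A tag, or a run of character data up to the next "<".
TOKEN = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>|([^<]+)')


def to_token_lines(text):
    """Step 1: rewrite the stream as one token per record.

    "(" marks a start tag, ")" an end tag, "-" character data.
    Line breaks inside data are folded to single spaces.
    """
    records = []
    for m in TOKEN.finditer(text):
        closing, name, data = m.groups()
        if name is not None:
            records.append((")" if closing else "(") + name.upper())
        elif data.strip():
            records.append("-" + " ".join(data.split()))
    return records


def format_records(records):
    """Step 2: line-oriented formatting; titles are set in capitals."""
    out, stack = [], []
    for rec in records:
        kind, rest = rec[0], rec[1:]
        if kind == "(":
            stack.append(rest)
        elif kind == ")":
            if stack:
                stack.pop()
        elif stack and stack[-1] == "TITLE":
            out.append(rest.upper())
        else:
            out.append(rest)
    return out


if __name__ == "__main__":
    doc = ('<letter><title>Invoice\nno. 42</title>\n'
           '<p>Please pay\npromptly.</p></letter>\n')
    for line in format_records(to_token_lines(doc)):
        print(line)
```

The point of the intermediate form is that the second step no longer
cares where the line breaks fell in the original document.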
Apropos of computer representations of text, I got a
chance to air my frustration with Macs today when talking to a graphic
designer and a typographer. They were so happy someone in the computer
business knew about typography and knew it was an art form you must
learn to master, not something which could spring out of a computer as
if it was instant knowledge. I got to plug SGML, telling them that
computer people could work with information, as they know, and the
typographers could work with the presentation, as they know, stressing
that each requires special knowledge, and that they could meet in a
language designed to separate the two. I think I got two new friends.
I have the C source code for the Gnu version of sed, and I would be
happy to mail it to anyone who would like it. There are, I think,
several MSDOS binary versions, and at least one for the Atari ST. Check
out your local BBS or archive site. For the Amiga, the best place to
look would be the Fred Fish (??) collection of free software. If there
is not a binary version for the Mac yet, a Mac user with Think C should
be able to port the program in an evening. Finally, sed comes as
standard issue with all Unix/Minix/Xenix etc. implementations.
If you would like a copy, send me mail at my Unix account, NOT to the
return address of this message.