[Again with his kind permission, I am posting the following set of
comments from David Stone. LB ]
Date: 19 Sep 90 1:41 PM
Your name: David Stone
Your postal address: 46 Pattison Lane
Your e-mail address (if any): Internet: [hidden email]
X.400: /C=GB/A=GOLD 400/P=PRIME/O=WILLEN R+D/S=STONE/G=DAVID/
Your occupation: Software Engineer
Your academic background: M.A.(Mathematics), M.Phil.(Linguistics)
Your immediate reactions to the Guidelines (please tick or cross
Relevance to your present concerns or interests: high x-medium low
Importance to your future research plans: high medium x-low
Comprehensibility/usability of current draft: x-high medium low
Detailed technical comment: [Please feel free to comment on any
part of the Guidelines or in general. Comments relating to spe-
cific parts of the text are most helpful if they include a sec-
tion or chapter number]
Here are my comments on draft version 1.0 of the TEI Guidelines dated
16 July 1990. They concentrate on character sets (chapter 3) merely because
that is an area in which I have some knowledge.
Please note that these remarks are entirely my personal responsibility;
they do not reflect the position of my employer Prime Computer at all,
which to my knowledge has no material interest in the TEI.
Some of my remarks are rather imperative; this is merely for brevity,
and is not meant to be rude in any way. All my remarks are of course
Before I start making detailed points, I should like to say that overall
the Guidelines seem to be good; I am sure you're right to use SGML, for
example. I was pleased to see the generally pragmatic approach being
followed which leaves the media of interchange unspecified, rather than
an ASN.1-based approach - for example - which would restrict you to OSI.
General comments on character sets and the Guidelines.
Firstly, I should like to restate the problem. The main problem is that
computer character sets in use do not have a repertoire at all adequate
for literary and linguistic use. A minor problem is that different
computers use different encodings for the characters they do provide.
If they had an adequate repertoire, computer character sets could be
used as sets of graphemes. But they don't; so we can't. Thus we need
to distinguish between graphemes and characters.
I would propose a framework of three levels: (1) the _grapheme
repertoire_; (2) the _interchange character repertoire_; and (3) the
_interchange character encoding_.
The Guidelines at present seem to be making distinctions like these, but
not consistently and explicitly. What I describe below makes the
distinctions clear (I hope).
The purpose of making these distinctions is to break down the problem
into manageable parts, and to separate the technology-dependent aspects
(what characters do computers support? how are they encoded?) from the
application-dependent (what language are we using? what script?).
To define the levels in more detail...
(1) the grapheme repertoire.
This is the set of characters naturally appropriate to the language in the
chosen method of writing, unrestrained by limitations of hardware or software.
This is roughly a set of "graphemes", where "grapheme" is as defined in
3.1.1 of the Guidelines.
No encoding or computer representation is specified at this level.
(2) the interchange character repertoire.
This is the set of characters in which the grapheme repertoire
is encoded. This set is chosen to be a lowest-common-denominator of
currently-available computer character repertoires: most probably at present
the nationally-invariant part of ISO 646, as the Guidelines suggest
No binary encoding is specified at this level.
(3) the interchange character encoding
This specifies the binary values in which the interchange character set
is encoded. The most obvious choice is ISO 646 encoding; but you
might wish to permit EBCDIC encoding as well.
The SGML standard calls this and the interchange character repertoire
together the "document character set" (ISO 8879, definitions 4.98).
These three levels fit nicely with SGML's. (1) corresponds with the SGML
abstract syntax; (2) with the SGML concrete syntax. At level (2) the SGML
and text proper are merged, and so the mapping to level (3) is common.
A diagram may be clearer:
| Abstract   | level 1 | Abstract    |
| syntax     |         | content     |
+------------+---------+-------------+
| Concrete   | level 2 | Interchange |
| syntax     |         | content     |
+------------+---------+-------------+
             | level 3 | Interchange |
             |         | encoding    |
This way of looking at the problem of character encoding has the advantage
of splitting up the problem and separating logically independent matters.
For example, the interchange repertoire and encoding are now seen as
separate; the Guidelines' "levels" (as in 3.2.7) associate them. One
might wish to use, for example, the ISO-646 invariant repertoire with
EBCDIC encoding. Or one might wish to use the ISO 6937-2 repertoire
with ISO (DP) 10646 encoding.
This separation has the advantage that users wanting to define a new
grapheme only have to define its mapping onto interchange characters,
and do not have to know any obscure numeric values.
It also allows the mappings between levels to be independently specified,
which is useful because I would expect the mapping from interchange
characters to encoding to be restricted to one or two well-known
mappings, whereas the mapping from graphemes to interchange characters
will vary depending very much on the document and its repertoire.
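As an illustration only, the independence of the two mappings might be
sketched in Python as below. The mapping table and the function names are
hypothetical, and Python's "cp037" codec merely stands in for "EBCDIC
encoding" (EBCDIC has many code pages). The sketch does not deal with a
literal "&" in the text.

```python
# Level 1 -> level 2: a hypothetical writing-scheme mapping from graphemes
# to sequences of interchange characters (SGML entity references here).
GRAPHEME_TO_INTERCHANGE = {
    "\u00fc": "&uuml;",    # u with diaeresis
    "\u00e9": "&eacute;",  # e with acute
}

def to_interchange(text: str) -> str:
    """Rewrite each grapheme using interchange characters only."""
    return "".join(GRAPHEME_TO_INTERCHANGE.get(ch, ch) for ch in text)

def encode(interchange_text: str, encoding: str) -> bytes:
    """Level 2 -> level 3: the binary encoding is chosen separately."""
    return interchange_text.encode(encoding)

interchange = to_interchange("f\u00fcr")      # same at level 2 in both cases
ascii_bytes = encode(interchange, "ascii")    # ISO 646-style encoding
ebcdic_bytes = encode(interchange, "cp037")   # one EBCDIC code page
```

Only the last step differs between the two encodings; the writing-scheme
table never mentions a numeric value.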
With this structure as a basis, I would suggest sections in the Guidelines
with outlines as follows:
The Writing Scheme
This, reworked to take account of what I have set out above, would define:
- the grapheme repertoire;
- the interchange character repertoire;
- and the mapping from grapheme to interchange character(s).
(plus oddments like the WSD name, date of specification, as in 3.2.6
of the Guidelines.)
SGML character entity sets are very often a convenient way of declaring
the mapping of graphemes into interchange characters. See my detailed
comments below on p50, 3.2.12.
Some TEI-standard grapheme repertoires
This section would define, as WSDs, some commonly-used repertoires for
documents. These should cope with all normal European Latin-based
alphabets, Greek, Cyrillic and any others widely used by the target
It would also define their mappings onto the TEI-standard interchange
repertoires. I would propose that SGML entity names be used as much as
possible, rather than ad-hoc methods (e.g. use &uuml; not u:). This
may be verbose but it will be more widely understood and will be
Defining your own grapheme repertoire
This section would explain how users who wished, for example, to produce
a corpus of Easter Island writing, would specify the repertoire of characters
It would also say how the user should define their mappings onto the
TEI-standard interchange repertoires.
Interchange character repertoires
This would list the permitted interchange character repertoires. I
would suggest that only one or perhaps two be permitted, to ensure
maximum interchangeability of documents.
I would also suggest that users _not_ be allowed to extend these,
although future editions of the Guidelines might.
The front-runner is - as the Guidelines say - the nationally-invariant
subset of ISO 646.
As hardware and software improve it might in time be possible to define
This would specify the permitted encodings to be used for each of the
TEI-standard interchange repertoires using SGML CHARSET declarations.
These also should be kept as few as possible. For example, ISO 646 encoding
of ISO 646 characters.
The encoding of new-line/end-of-record also needs to be defined here.
Detailed comments on the Guidelines
p40, 3.1.1, footnote 2: You _are_ concerned with the control
character(s) used to represent end-of-line/end-of-record, as you mention
on p49. I think you should explicitly mention that TAB characters are
banned, because on Un*x systems people are very used to using them.
p41, 3.1.2: ISO 8859 is in nine, not eight, parts. Part 9 is very similar to
part 1, except that it caters for Turkish rather than Icelandic.
p41, 3.1.3: The wrong standard is quoted here. Replace "ISO Standard
7350..." by "ISO Standard 2375 Procedure for registration of escape
sequences". ISO 7350 refers solely to subrepertoires of ISO 6937. ECMA
acts as the registrar for ISO 2375; the NCC in Manchester for ISO 7350.
p43, 3.1.5 says that ways of "wrapping-up" the complete encoded document
are not addressed by the Guidelines. I don't know whether this is wise.
There _are_ commonly-used "wrappers" and whenever users have a chance
they should be encouraged to use them. Because I would imagine that
many users of the Guidelines are not computing specialists, it would be
helpful to inform them of these common wrappers. For example, tape
sizes, densities, formats and so on; SDIF parameters for X.400 mail use;
a value of the "Encoding:" keyword (see RFC 1154) for SMTP mail.
p45, 3.2.1: LANG attribute: I would propose that this change _only_
the grapheme repertoire and its mapping to the interchange character
repertoire. I would also propose that the interchange character set be
fixed within any document. Many computer systems (simple text editors
in particular) are incapable of handling context-sensitive character
p46, 3.2.2: Note that the Swiss sometimes substitute for '<' and '>',
contrary to ISO 646. '_' is occasionally replaced by a left arrow
p46, 3.2.2: If it is necessary to redefine the concrete SGML syntax,
I think the Guidelines should propose just one redefinition. It will
be very confusing if everybody invents their own.
p46, 3.2.2: I agree with the last paragraph, except that I would
suggest you insist on transliterations using one interchange character
for one grapheme, to avoid the problems described in 3.1.7. If there
are not enough interchange characters and no public entities include the
required graphemes, new public entity sets should be registered
according to ISO 9070 and endorsed by the TEI committee.
p46, 3.2.4: Have you considered defining a collating sequence of
graphemes (alphabetical order) within the Writing Set Declaration? This
may be useful for enabling automatic generation of indexes from
documents. The order of the interchange character encodings is almost
certainly not the most natural and normal for most languages.
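A minimal sketch of the idea, in Python, assuming a hypothetical collating
sequence for a lowercase Swedish-style alphabet in which the accented
letters are written as entity references and sort after "z". The names
ORDER, graphemes and collation_key are illustrative only.

```python
import re

# Hypothetical collating sequence as a writing scheme might declare it:
# graphemes in alphabetical order, each in its interchange form.
ORDER = list("abcdefghijklmnopqrstuvwxyz") + ["&aring;", "&auml;", "&ouml;"]
RANK = {g: i for i, g in enumerate(ORDER)}

# One grapheme is either an entity reference or a single character.
TOKEN = re.compile(r"&[A-Za-z]+;|.")

def graphemes(word):
    """Split an interchange-form word into its graphemes."""
    return TOKEN.findall(word)

def collation_key(word):
    """Sort key built from the declared order, not from code values."""
    return [RANK[g] for g in graphemes(word)]

words = ["zon", "&auml;r", "apa"]
by_code = sorted(words)                       # order of the encoded values
by_scheme = sorted(words, key=collation_key)  # order declared in the scheme
```

Sorting by code value puts the entity-encoded word first because "&" has a
lower code than any letter; sorting by the declared sequence puts it last.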
p47, 3.2.4, item 2: This is one area in the Guidelines where I believe
the distinction between grapheme and interchange character has become
blurred. I propose that all definition of the interchange character set
(repertoire and encoding) be left to the SGML CHARSET declaration, so
that the WSD would just define a grapheme repertoire and its mapping to
interchange characters. Here is some proposed replacement text for 3.2.4
starting from item 2:
2. A specification of the meaning of each grapheme used. The
specification will include at least one of (a) or (b) below.
a. Reference to a public character entity, in Annex D of ISO 8879, or
registered according to ISO 9070;
b. A description of the grapheme, preferably both that by which native
speakers of the language in question refer to it, and (usually in
English) as used in ISO Character Set standards.
c. The unique sequence of interchange characters used to represent the
character. For entity references this will of course use SGML's &;
d. Certain special properties, such as being a diacritical mark.
Note that all the interchange characters will have been declared as
valid data characters by an SGML CHARSET declaration.
p49, 3.2.7: Conformance levels appear to me to be confused. They are a
mixture of specifying the interchange character repertoire and the
possible values of the interchange encoding, without specifying either
completely. I propose you replace this by a definition of an SGML
CHARSET corresponding to item 1, with a note that additional CHARSETs
may be permitted in the future.
p49, 3.2.7 item 1: To conform with ISO 646's recommendation, record
separation should be represented by CR LF (00/13 00/10) in that order.
I suggest you strongly recommend that senders use that convention, but
that receivers accept them as described here.
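The sender/receiver asymmetry could look like this in Python. This is a
sketch under the assumption that the receiver accepts CR LF, a lone CR or a
lone LF; the Guidelines' own list of accepted conventions may differ. The
function names are hypothetical.

```python
import re

# Receiver side: accept CR LF, lone CR or lone LF as a record separator.
# CR LF is listed first so that it is not read as two separators.
_RECORD_END = re.compile(r"\r\n|\r|\n")

def read_records(text):
    """Split received text into records, tolerating mixed conventions."""
    records = _RECORD_END.split(text)
    if records and records[-1] == "":
        records.pop()  # a terminator after the last record adds no record
    return records

def write_records(records):
    """Sender side: always terminate each record with CR LF."""
    return "".join(record + "\r\n" for record in records)
```

For example, read_records applied to text that mixes all three conventions
yields the same records that write_records would send with CR LF throughout.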
p49, 3.2.7: Specifying pack/unpack utilities separately seems to me
unnecessary. The writing system declaration should specify how the
graphemes required are mapped onto interchange characters. To have this
packing seems to introduce _another_ level into a scheme which already
has 3, as I have described. If different interchange repertoires are
permitted, then it may be useful to have programs to convert a document
from one to another, but such programs are already specified implicitly:
parse the document into graphemes using one writing scheme and re-encode
the graphemes into the other interchange character set using
the second writing scheme.
If somebody is using a local character set instead of a TEI-standard
interchange character set, the same method can be used to convert
from one to the other.
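The conversion just described can be sketched in Python as follows. The two
writing schemes, the grapheme names and the function names are all
hypothetical; SCHEME_B stands for a local character set that has the
accented letters directly.

```python
# Two hypothetical writing schemes: grapheme name -> interchange characters.
SCHEME_A = {"u-umlaut": "&uuml;", "s-sharp": "&szlig;"}
SCHEME_B = {"u-umlaut": "\u00fc", "s-sharp": "\u00df"}  # a local character set

def parse(text, scheme):
    """Parse text into graphemes using one scheme (longest code first)."""
    codes = sorted(((code, name) for name, code in scheme.items()),
                   key=lambda pair: -len(pair[0]))
    result, i = [], 0
    while i < len(text):
        for code, name in codes:
            if text.startswith(code, i):
                result.append(("named", name))
                i += len(code)
                break
        else:
            # Not covered by the scheme: the character stands for itself.
            result.append(("char", text[i]))
            i += 1
    return result

def emit(graphemes, scheme):
    """Re-encode a grapheme sequence using another scheme."""
    return "".join(scheme[value] if kind == "named" else value
                   for kind, value in graphemes)

def convert(text, source, target):
    return emit(parse(text, source), target)
```

No separate pack/unpack utility appears here: convert is built from nothing
but the two scheme tables, and it works in either direction.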
p50, 3.2.9: See my comment on p87.
p50, 3.2.10: The purpose of this section seems unclear. Is it to
define the interchange character set? If so, then 3.2.7 last paragraph
has already said that at present only one is valid (restricted ISO 646).
However, following sections suggest that it is concerned with defining
the grapheme repertoire. If so, then 3.2.10 should make clear that it
is only the _repertoire_ in the referenced standard that matters, not
p50, 3.2.12: Code. This is confusing: code suggests a number, whereas
the examples make clear that what is meant is one or more interchange
characters, and it is these which should be specified here.
p50, 3.2.12: Code: I propose that you deprecate the use of
multi-character graphemes, except entity references - see my comment on
p46, 3.2.2. These could cause problems for formatting programs which
are trying to work out, for example, whether to print transliterated
Cyrillic shch as two graphemes or one. If they must be permitted, a
warning should be included in the Guidelines to discourage users from
defining ambiguous encodings.
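To make the ambiguity concrete, here is a small Python sketch that counts
how many ways a string can be split into the codes of a scheme. The code
lists are hypothetical transliterations, and "w" in the second list is an
invented single-character code used only for contrast.

```python
from functools import lru_cache

def segmentations(text, codes):
    """Count the ways text can be read as a sequence of grapheme codes."""
    codes = tuple(codes)

    @lru_cache(maxsize=None)
    def count(i):
        if i == len(text):
            return 1  # reached the end: one complete reading
        return sum(count(i + len(code))
                   for code in codes if text.startswith(code, i))

    return count(0)

# "shch" may be one grapheme or "sh" followed by "ch": two readings.
ambiguous = segmentations("shch", ["sh", "ch", "shch"])
# With a distinct code for the third grapheme, only one reading remains.
unambiguous = segmentations("shch", ["sh", "ch", "w"])
```

A scheme is safe for formatting programs only if every text it can produce
has exactly one reading.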
p50, 3.2.12: Entity names. Why not allow all entity names registered
according to ISO 9070, rather than merely those in (the non-normative)
annex D of ISO 8879? This would allow the TEI committee to
define and register those grapheme sets needed by the users in a
standardised and public way.
The first users to use a particular new set could propose it for registration,
to avoid later users inventing a different, incompatible, standard.
p51, 3.2.12: Why permit these methods of writing accented letters?
They reduce interchangeability. Insist on entity names (standardised
whenever possible) like "&uuml;", as in the examples. If a language's
writing system really does require them to be separate graphemes (as p45
3.1.7 suggests is true for Greek, despite the practice of ISO 8859-7 and
ISO DP 10646), then I would suggest only DL be permitted, since this is
the universal practice of ISO standards (e.g. ISO 6937, ISO 5426).
(Well, strictly ISO 646 permits J with the BS - backspace - control
character as the join character, but this is little-used and later
standards have deprecated it.)
p51, 3.2.12: It might be convenient if there were another name given
for a grapheme, viz, the name used in ISO character set registration
(e.g. "capital E with ogonek" in ISO 6937), as well as the name used by
p51, 3.2.13: Are these "formal names" the values of the #PCDATA content
of the "standard" element? This isn't explicitly stated.
p51, 3.2.13: For the reason given in 3.1.2 and 3.1.4, "ASCII" should be
avoided as a formal name. It is too widely misunderstood.
p51, 3.2.13: The formal names should be consistent. Always prefix ISO
standards with "ISO": thus "ISO 8859-1" etc.
p51-2, 3.2.13: According to 3.1.1, "character set" is here used to mean
"repertoire", that is, the encoding is not here relevant. Thus, if
Latin_CP37 contains the same characters as ISO 8859-1 (even though the
encoding may be different), then one of them is redundant. See my comments
on p50, 3.2.10.
p87, 5.3.11: ISO/R 2014 has now been superseded by ISO 8601, which
specifies the writing of dates and times in all-numeric form. However,
it still permits yyyy-mm-dd, so only the reference needs changing. Note
that ISO 8601 also specifies a standard way of writing periods, thus:
start-date-time/end-date-time e.g. 1990-03-04/1990-08-01.
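A minimal Python sketch of reading such a period, assuming both ends are
plain calendar dates in yyyy-mm-dd form (times of day and the other forms
ISO 8601 allows are not handled). The function name parse_period is
hypothetical.

```python
from datetime import date

def parse_period(text):
    """Split 'start/end' and parse each end as a yyyy-mm-dd date."""
    start_text, end_text = text.split("/")
    return date.fromisoformat(start_text), date.fromisoformat(end_text)

start, end = parse_period("1990-03-04/1990-08-01")
length_in_days = (end - start).days
```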
p188, 7.5: The bibliography should include the ODA standard (ISO 8613,
T.400 series, ECMA-101) and CCITT recommendations (X.400), since they
are referred to here.
p191, 7.5.3: Electronic mail address formats should include one for X.400.
The common format (surprisingly, unstandardised) looks like
/C=GB/ADMD=GOLD 400/PRMD=PRIME/O=WILLEN R+D/S=STONE/G=DAVID/
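Since the format is unstandardised, the following Python sketch is only an
assumption about how such an address might be read: "/"-delimited fields,
each of the form KEY=value. The function name parse_x400 is hypothetical.

```python
def parse_x400(address):
    """Read '/KEY=value/KEY=value/' into a dictionary of attributes."""
    fields = {}
    for part in address.strip("/").split("/"):
        key, _, value = part.partition("=")
        fields[key] = value
    return fields

fields = parse_x400(
    "/C=GB/ADMD=GOLD 400/PRMD=PRIME/O=WILLEN R+D/S=STONE/G=DAVID/"
)
```

Values may contain spaces and "+" signs, as the example shows, so the sketch
splits only on "/" and on the first "=" of each field.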
I hope these comments are of some use. Their quantity reflects the
current lamentable state of the computer character sets used, and
unfortunately compromise and ugliness are necessary for a workable
solution. Good luck with the next draft!