Comments on Guidelines from David Stone

Lou Burnard-7
[Again with his kind permission, I am posting the following set of
comments from David Stone. LB ]

Date:     19 Sep 90  1:41 PM
  Your name: David Stone
  Your postal address: 46 Pattison Lane
                       Woolstone
                       MILTON KEYNES
                       MK15 0AY
                       ENGLAND

  Your e-mail address (if any): Internet: [hidden email]
    X.400: /C=GB/A=GOLD 400/P=PRIME/O=WILLEN R+D/S=STONE/G=DAVID/

  Your occupation: Software Engineer
  Your academic background: M.A.(Mathematics), M.Phil.(Linguistics)
  Your immediate reactions to the Guidelines (please tick or cross
  one only):

  Relevance to your present concerns or interests:   high  x-medium    low
  Importance to your future research plans:          high    medium  x-low
  Comprehensibility/usability of current draft:    x-high    medium    low

  Detailed technical comment:  [Please feel free to comment on any
  part of the Guidelines or in general.  Comments relating to spe-
  cific parts of the text are most helpful if they include a sec-
  tion or chapter number]


Here are my comments on draft version 1.0 of the TEI Guidelines dated
16 July 1990. They concentrate on character sets (chapter 3) merely because
that is an area in which I have some knowledge.

Please note that these remarks are entirely my personal responsibility;
they do not reflect the position of my employer Prime Computer at all,
which to my knowledge has no material interest in the TEI.

Some of my remarks are rather imperative; this is  merely  for  brevity,
and is not meant to be rude in any way. All my remarks are of course
suggestions.

Before I start making detailed points, I should like to say that overall
the  Guidelines seem to be good; I am sure you're right to use SGML, for
example.  I was pleased to see  the  generally  pragmatic  approach  being
followed  which leaves the media of interchange unspecified, rather than
an ASN.1-based approach - for example - which would restrict you to OSI.


General comments on character sets and the Guidelines.
======================================================

Firstly, I should like to restate the problem.  The main problem is that
computer  character sets in use do not have a repertoire at all adequate
for literary and linguistic use.  A  minor  problem  is  that  different
computers use different encodings for the characters they do provide.

If they had an adequate repertoire, computer  character  sets  could  be
used  as  sets of graphemes.  But they don't; so we can't.  Thus we need
to distinguish between graphemes and characters.

I would  propose  a  framework  of  three  levels:   (1)  the  _grapheme
repertoire_;  (2)  the  _interchange  character repertoire_; and (3) the
_interchange character encoding_.

The Guidelines at present seem to be making distinctions like these, but
not  consistently  and  explicitly.   What  I  describe  below makes the
distinctions clear (I hope).

The purpose of making these distinctions is to break  down  the  problem
into  manageable parts, and to separate the technology-dependent aspects
(what characters do computers support?   how are they encoded?) from the
application-dependent (what language are we using?   what script?).

To define the levels in more detail...

(1) the grapheme repertoire.

This is the set of characters naturally appropriate to the language in the
chosen method of writing, unrestrained by limitations of hardware or software.

This is roughly a set of "graphemes", where "grapheme" is as defined  in
3.1.1 of the Guidelines.

No encoding or computer representation is specified at this level.

(2) the interchange character repertoire.

This is the set of characters in which the grapheme repertoire
is encoded. It is chosen as a lowest common denominator of
currently available computer character repertoires: most probably, at present,
the nationally-invariant part of ISO 646, as the Guidelines suggest
in 3.2.2.

No binary encoding is specified at this level.

(3) the interchange character encoding

This specifies the binary values in which the interchange character set
is encoded. The most obvious choice is ISO 646 encoding; but you
might wish to permit EBCDIC encoding as well.

The SGML standard calls this and the interchange character repertoire
together the "document character set" (ISO 8879, definition 4.98).


These three levels fit nicely with SGML's. (1) corresponds with the SGML
abstract syntax; (2) with the SGML concrete syntax. At level (2) the SGML
and text proper are merged, and so the mapping to level (3) is common.
A diagram may be clearer:


SGML                               Content
====                               =======

+------------------+               +----------------+
|  Abstract        |   level 1     |  Abstract      |
|  syntax          |               |  content       |
+------------------+               +----------------+
         |                                  |
         |                                  |
         |                                  |
         |                                  |
+------------------+               +----------------+
|  Concrete        |   level 2     |  Interchange   |
|  syntax          |               |  content       |
+------------------+               +----------------+
         |                                  |
         +----------------------------------+
                           |
                           |
                           |
                +---------------------+
level 3         |  Interchange        |
                |  encoding           |
                +---------------------+


This way of looking at the problem of character encoding has the advantage
of splitting up the problem and separating logically independent matters.

For example, the interchange repertoire and encoding  are  now  seen  as
separate;  the  Guidelines'  "levels" (as in 3.2.7) associate them.  One
might wish to use, for example, the ISO-646  invariant  repertoire  with
EBCDIC  encoding.   Or  one  might wish to use the ISO 6937-2 repertoire
with ISO (DP) 10646 encoding.

This separation has the advantage that users wanting  to  define  a  new
grapheme  only  have  to define its mapping onto interchange characters,
and do not have to know any obscure numeric values.

It also allows the mappings between levels to be independently specified,
which  is  useful  because  I  would expect the mapping from interchange
characters to encoding  to  be  restricted  to  one  or  two  well-known
mappings,  whereas  the mapping from graphemes to interchange characters
will vary depending very much on the document and its repertoire.
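
[Editorial illustration] The separation of the three levels can be sketched
in Python. This is only a sketch under stated assumptions: the mapping table
and function name are hypothetical, Python's "ascii" codec stands in for an
ISO 646 encoding, and "cp037" stands in for one EBCDIC code page.

```python
# Level 1 -> level 2: graphemes mapped onto interchange characters.
# The table is hypothetical; a real writing scheme would declare it.
GRAPHEME_TO_INTERCHANGE = {
    "ü": "&uuml;",
    "é": "&eacute;",
}


def to_interchange(text: str) -> str:
    """Rewrite each grapheme that has a declared interchange form."""
    return "".join(GRAPHEME_TO_INTERCHANGE.get(ch, ch) for ch in text)


# Level 2 -> level 3: the same interchange characters, two encodings.
interchange = to_interchange("Müller")
as_iso646 = interchange.encode("ascii")  # stand-in for ISO 646 encoding
as_ebcdic = interchange.encode("cp037")  # stand-in for an EBCDIC encoding

print(interchange)
print(as_iso646 != as_ebcdic)                    # the byte values differ
print(as_ebcdic.decode("cp037") == interchange)  # the repertoire is unchanged
```

Only the first table depends on the document; the second step is one of a
small number of well-known mappings and needs no per-document definition.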

With this structure as a basis, I would suggest sections in the Guidelines
with outlines as follows:

The Writing Scheme
==================

This, reworked to take account of what I have set out above, would define:

- the grapheme repertoire;

- the interchange character repertoire;

- and the mapping from grapheme to interchange character(s).

(plus oddments like the WSD name, date of specification, as in 3.2.6
of the Guidelines.)

SGML character entity sets are very often a convenient way of  declaring
the  mapping  of graphemes into interchange characters.  See my detailed
comments below on p50, 3.2.12.

Some TEI-standard grapheme repertoires
======================================

This section would define, as WSDs, some commonly-used  repertoires  for
documents.   These  should  cope  with  all  normal European Latin-based
alphabets, Greek, Cyrillic and any others  widely  used  by  the  target
users.

It would also define their mappings onto  the  TEI-standard  interchange
repertoires.   I would propose that SGML entity names be used as much as
possible, rather than ad-hoc methods (e.g.  use &uuml;  not  u:).   This
may  be  verbose  but  it  will  be  more  widely understood and will be
parsable unambiguously.
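
[Editorial illustration] A short Python sketch of why delimited entity
references parse unambiguously while an ad-hoc digraph does not. The entity
table and both function names are hypothetical, and the ad-hoc rule ("u:"
means u-umlaut) is just one assumed convention.

```python
import re

# Hypothetical entity table in the style of the ISO 8879 Annex D names.
ENTITIES = {"uuml": "ü", "ouml": "ö"}


def expand_entities(text: str) -> str:
    """Replace &name; references, leaving unknown names untouched."""
    def repl(match):
        return ENTITIES.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", repl, text)


def expand_adhoc(text: str) -> str:
    """Assumed ad-hoc scheme in which 'u:' stands for u-umlaut."""
    return text.replace("u:", "ü")


print(expand_entities("M&uuml;ller"))  # delimited reference is recognised
print(expand_entities("Menu: soup"))   # ordinary colon is left alone
print(expand_adhoc("Menu: soup"))      # ordinary colon is misread
```

The ad-hoc scheme cannot tell a real "u" followed by a colon from its own
digraph; the entity reference carries its own delimiters.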

Defining your own grapheme repertoire
=====================================

This section would explain how users who wished, for example, to produce
a corpus of Easter Island writing, would specify the repertoire of characters
needed.

It would also say how the user should define  their  mappings  onto  the
TEI-standard interchange repertoires.

Interchange character repertoires
=================================

This would list the  permitted  interchange  character  repertoires.   I
would  suggest  that  only  one  or  perhaps two be permitted, to ensure
maximum interchangeability of documents.

I would also suggest that users _not_ be allowed to extend these,
although future editions of the Guidelines might.

The front-runner is - as the Guidelines say - the nationally-invariant
subset of ISO 646.

As hardware and software improve it might in time be possible to  define
additional repertoires.

Interchange encodings
=====================

This would specify the permitted encodings to be used for each of the
TEI-standard interchange repertoires using SGML CHARSET declarations.

These also should be kept as few as possible. For example, ISO 646 encoding
of ISO 646 characters.

The encoding of new-line/end-of-record also needs to be defined here.



Detailed comments on the Guidelines
===================================

p40,  3.1.1,  footnote  2:   You  _are_  concerned  with   the   control
character(s) used to represent end-of-line/end-of-record, as you mention
on p49.  I think you should explicitly mention that TAB  characters  are
banned, because on Un*x systems people are very used to using them.

p41, 3.1.2: ISO 8859 is in nine, not eight, parts. Part 9 is very similar to
part 1, except that it caters for Turkish rather than Icelandic.

p41, 3.1.3:  The wrong standard is quoted here.  Replace  "ISO  Standard
7350..."  by  "ISO  Standard  2375  Procedure for registration of escape
sequences".  ISO 7350 refers solely to subrepertoires of ISO 6937.  ECMA
acts as the registrar for ISO 2375; the NCC in Manchester for ISO 7350.

p43, 3.1.5 says that ways of "wrapping-up" the complete encoded  document
are not addressed by the Guidelines.  I don't know whether this is wise.
There _are_ commonly-used "wrappers" and whenever users  have  a  chance
they  should  be  encouraged  to use them.  Because I would imagine that
many users of the Guidelines are not computing specialists, it would  be
helpful  to  inform  them  of  these common wrappers.  For example, tape
sizes, densities, formats and so on; SDIF parameters for X.400 mail use;
a value of the "Encoding:" keyword (see RFC 1154) for SMTP mail.

p45, 3.2.1:  LANG attribute:  I would propose that  this  change  _only_
the  grapheme  repertoire  and  its mapping to the interchange character
repertoire.  I would also propose that the interchange character set  be
fixed  within  any document.  Many computer systems (simple text editors
in particular) are incapable  of  handling  context-sensitive  character
display.

p46, 3.2.2:  Note that the Swiss sometimes substitute for '<'  and  '>',
contrary  to  ISO  646.   '_'  is  occasionally replaced by a left arrow
(non-standard).

p46, 3.2.2: If it is necessary to redefine the concrete SGML syntax,
I think the Guidelines should propose just one redefinition. It will
be very confusing if everybody invents their own.

p46, 3.2.2:  I agree with  the  last  paragraph,  except  that  I  would
suggest  you  insist on transliterations using one interchange character
for one grapheme, to avoid the problems described in  3.1.7.   If  there
are not enough interchange characters and no public entities include the
required  graphemes,  new  public  entity  sets  should  be   registered
according to ISO 9070 and endorsed by the TEI committee.

p46, 3.2.4:  Have  you  considered  defining  a  collating  sequence  of
graphemes  (alphabetical order) within the Writing System Declaration?  This
may  be  useful  for  enabling  automatic  generation  of  indexes  from
documents.   The  order of the interchange character encodings is almost
certainly not the most natural and normal for most languages.
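
[Editorial illustration] A minimal Python sketch of sorting by a declared
collating sequence rather than by code value. The sequence shown is an
assumed Swedish-like ordering in which a-ring, a-umlaut and o-umlaut follow
z; the names COLLATION and collation_key are hypothetical.

```python
# Hypothetical collating sequence, as a writing scheme might declare it.
COLLATION = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {grapheme: index for index, grapheme in enumerate(COLLATION)}


def collation_key(word: str) -> list[int]:
    """Build a sort key from the declared grapheme order."""
    return [RANK[grapheme] for grapheme in word.lower()]


words = ["zon", "ära", "åka", "apa"]
by_declaration = sorted(words, key=collation_key)
by_code_value = sorted(words)

print(by_declaration)
print(by_code_value)
```

The two orders differ because the code values happen to place a-umlaut before
a-ring, whereas the declared sequence puts a-ring first.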

p47, 3.2.4, item 2:  This is one area in the Guidelines where I  believe
the  distinction  between  grapheme and interchange character has become
blurred.  I propose that all definition of the interchange character set
(repertoire  and  encoding)  be left to the SGML CHARSET declaration, so
that the WSD would just define a grapheme repertoire and its mapping  to
interchange characters. Here is some proposed replacement text for 3.2.4
starting from item 2:

===============================================================================

2. A specification of the meaning of each grapheme used. The
specification will include at least one of (a) or (b) below.

a. Reference to a public character entity, in Annex D of ISO 8879, or
registered according to ISO 9070;

b.  A description of the grapheme, preferably both that by which  native
speakers  of  the  language  in  question  refer  to it, and (usually in
English) as used in ISO Character Set standards.

c.  The unique sequence of interchange characters used to represent  the
character.   For  entity  references  this  will of course use SGML's &;
notation.

d. Certain special properties, such as being a diacritical mark.

Note that all the interchange characters  will  have  been  declared  as
valid data characters by an SGML CHARSET declaration.

===============================================================================

p49, 3.2.7:  Conformance levels appear to me to be confused.  They are a
mixture  of  specifying  the  interchange  character  repertoire and the
possible values of the interchange encoding, without  specifying  either
completely.   I  propose  you  replace  this  by a definition of an SGML
CHARSET corresponding to item 1, with a note  that  additional  CHARSETs
may be permitted in the future.

p49, 3.2.7 item 1:  To conform with  ISO  646's  recommendation,  record
separation  should  be represented by CR LF (00/13 00/10) in that order.
I suggest you strongly recommend that senders use that  convention,  but
that receivers accept them as described here.
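
[Editorial illustration] The "strict sender, tolerant receiver" convention
might look like this in Python. Which separators a receiver should accept is
an assumption here (CR LF, bare LF and bare CR), since the Guidelines' own
list is not reproduced in these comments; both function names are
hypothetical.

```python
import re


def read_records(data: str) -> list[str]:
    """Receiver: accept CR LF, bare LF, or bare CR as a record separator."""
    return re.split(r"\r\n|\n|\r", data)


def write_records(records: list[str]) -> str:
    """Sender: always emit CR LF, in that order, between records."""
    return "\r\n".join(records)


mixed = "first\r\nsecond\nthird\rfourth"
print(read_records(mixed))
print(repr(write_records(read_records(mixed))))
```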

p49, 3.2.7:  Specifying pack/unpack utilities  separately  seems  to  me
unnecessary.   The  writing  system  declaration  should specify how the
graphemes required are mapped onto interchange characters.  To have this
packing  seems  to introduce _another_ level into a scheme which already
has 3, as I have described.  If different  interchange  repertoires  are
permitted,  then it may be useful to have programs to convert a document
from one to another, but such programs are already specified implicitly:
parse the document into graphemes using one writing scheme and re-encode
the graphemes into the other interchange character  set  using
the second writing scheme.
If somebody is using a local character set instead of a TEI-standard
interchange character set, the same method can be used to convert
from one to the other.
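
[Editorial illustration] The conversion described above (parse into graphemes
with one writing scheme, re-encode with another) can be sketched in Python.
Both scheme tables are hypothetical: one uses entity references, the other
assumes a local character set that holds the graphemes directly.

```python
# Two hypothetical writing schemes for the same grapheme repertoire.
SCHEME_ENTITY = {"ü": "&uuml;", "ß": "&szlig;"}
SCHEME_LOCAL = {"ü": "ü", "ß": "ß"}


def decode(text: str, scheme: dict[str, str]) -> list[str]:
    """Parse text into graphemes, trying the longest codes first."""
    codes = sorted(((code, g) for g, code in scheme.items()),
                   key=lambda pair: len(pair[0]), reverse=True)
    graphemes, i = [], 0
    while i < len(text):
        for code, grapheme in codes:
            if text.startswith(code, i):
                graphemes.append(grapheme)
                i += len(code)
                break
        else:
            graphemes.append(text[i])  # character stands for itself
            i += 1
    return graphemes


def encode(graphemes: list[str], scheme: dict[str, str]) -> str:
    """Re-encode graphemes using the target scheme's codes."""
    return "".join(scheme.get(g, g) for g in graphemes)


def convert(text: str, source: dict[str, str], target: dict[str, str]) -> str:
    return encode(decode(text, source), target)


print(convert("Stra&szlig;e in M&uuml;nchen", SCHEME_ENTITY, SCHEME_LOCAL))
print(convert("Straße", SCHEME_LOCAL, SCHEME_ENTITY))
```

No separate pack/unpack step appears: the two scheme tables are the whole
specification of the conversion.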

p50, 3.2.9: See my comment on p87.

p50, 3.2.10:  The purpose of this  section  seems  unclear.   Is  it  to
define  the  interchange  character  set?   If so, then 3.2.7 last paragraph
has already said that at present only one is valid (restricted ISO 646).
However,  following  sections suggest that it is concerned with defining
the grapheme repertoire.  If so, then 3.2.10 should make clear  that  it
is  only  the  _repertoire_ in the referenced standard that matters, not
the encoding.

p50, 3.2.12:  Code.  This is confusing:  code suggests a number, whereas
the  examples  make  clear that what is meant is one or more interchange
characters, and it is these which should be specified here.

p50,  3.2.12:   Code:   I  propose  that  you  deprecate  the   use   of
multi-character  graphemes, except entity references - see my comment on
p46, 3.2.2.  These could cause problems for  formatting  programs  which
are  trying  to  work  out, for example, whether to print transliterated
Cyrillic shch as two graphemes or one.  If they  must  be  permitted,  a
warning  should  be  included in the Guidelines to discourage users from
defining ambiguous encodings.
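
[Editorial illustration] One way to detect an ambiguous multi-character
encoding is to count how many ways a string can be split into declared codes.
The Python sketch below assumes a deliberately tiny, hypothetical set of
transliteration codes; count_parses is not part of any TEI tool.

```python
def count_parses(text: str, codes: set[str]) -> int:
    """Count the ways `text` can be split into a sequence of codes."""
    ways = [0] * (len(text) + 1)
    ways[0] = 1  # the empty prefix has exactly one parse
    for end in range(1, len(text) + 1):
        for code in codes:
            start = end - len(code)
            if start >= 0 and text.startswith(code, start):
                ways[end] += ways[start]
    return ways[len(text)]


# Hypothetical codes: "shch" as one grapheme clashes with "sh" + "ch".
ambiguous = {"sh", "ch", "shch"}
# Replacing the long code with a single character removes the clash.
unambiguous = {"sh", "ch", "w"}

print(count_parses("shch", ambiguous))
print(count_parses("shch", unambiguous))
```

A count above one means a formatting program cannot know whether one grapheme
or two was intended.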

p50, 3.2.12:  Entity names.  Why not allow all entity  names  registered
according  to  ISO 9070, rather than merely those in (the non-normative)
annex D of ISO 8879?   This would allow the TEI    committee  to
define  and  register  those  grapheme  sets  needed  by  the users in a
standardised and public way.
The first users to use a particular new set could propose it for registration,
to avoid later users inventing a different, incompatible, standard.

p51, 3.2.12:  Why permit these  methods  of  writing  accented  letters?
They  reduce  interchangeability.   Insist on entity names (standardised
whenever possible) like "&uuml;", as in the examples.  If  a  language's
writing system really does require them to be separate graphemes (as p45
3.1.7 suggests is true for Greek, despite the practice of ISO 8859-7 and
ISO  DP 10646), then I would suggest only DL be permitted, since this is
the universal practice of ISO standards  (e.g.   ISO  6937,  ISO  5426).
(Well,  strictly  ISO  646  permits  J with the BS - backspace - control
character as the join character,  but  this  is  little-used  and  later
standards have deprecated it.)

p51, 3.2.12:  It might be convenient if there were  another  name  given
for  a  grapheme,  viz,  the name used in ISO character set registration
(e.g.  "capital E with ogonek" in ISO 6937), as well as the name used by
native speakers.

p51, 3.2.13: Are these "formal names" the values of the #PCDATA content
of the "standard" element? This isn't explicitly stated.

p51, 3.2.13:  For the reason given in 3.1.2 and 3.1.4, "ASCII" should be
avoided as a formal name.  It is too widely misunderstood.

p51, 3.2.13:  The formal names should be consistent.  Always prefix  ISO
standards with "ISO":  thus "ISO 8859-1" etc.

p51-2, 3.2.13:  According to 3.1.1, "character set" is here used to mean
"repertoire",  that  is,  the  encoding  is not here relevant.  Thus, if
Latin_CP37 contains the same characters as ISO 8859-1 (even  though  the
encoding may be different), then one of them is redundant. See my comments
on p50, 3.2.10.

p87, 5.3.11:  ISO/R 2014 has now been  superseded  by  ISO  8601,  which
specifies  the writing of dates and times in all-numeric form.  However,
it still permits yyyy-mm-dd, so only the reference needs changing.  Note
that  ISO  8601  also specifies a standard way of writing periods, thus:
start-date-time/end-date-time e.g.  1990-03-04/1990-08-01.
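
[Editorial illustration] A small Python sketch of reading a period written
this way. It assumes both halves are plain yyyy-mm-dd dates; times, and the
other period forms the standard allows, are not handled. The function name
parse_period is hypothetical.

```python
from datetime import date


def parse_period(text: str) -> tuple[date, date]:
    """Split 'start/end' and parse each half as a yyyy-mm-dd date."""
    start_text, end_text = text.split("/")
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if end < start:
        raise ValueError("period ends before it starts")
    return start, end


start, end = parse_period("1990-03-04/1990-08-01")
print(start, end, (end - start).days)
```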

p188, 7.5:  The bibliography should include the ODA standard (ISO  8613,
T.400  series,  ECMA-101)  and CCITT recommendations (X.400), since they
are referred to here.

p191, 7.5.3: Electronic mail address formats should include one for X.400.
The common format (surprisingly, unstandardised) looks like
/C=GB/ADMD=GOLD 400/PRMD=PRIME/O=WILLEN R+D/S=STONE/G=DAVID/
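
[Editorial illustration] Because this slash-delimited form is a convention
rather than a standard, any parser is a guess. The Python sketch below assumes
only what the example shows: fields separated by "/", each of the form
KEY=value, with no "/" inside a value. The function name parse_x400 is
hypothetical.

```python
def parse_x400(address: str) -> dict[str, str]:
    """Split a slash-delimited X.400 address into its KEY=value fields."""
    fields = {}
    for part in address.strip("/").split("/"):
        key, _, value = part.partition("=")
        fields[key] = value
    return fields


parsed = parse_x400(
    "/C=GB/ADMD=GOLD 400/PRMD=PRIME/O=WILLEN R+D/S=STONE/G=DAVID/"
)
print(parsed["C"], parsed["ADMD"], parsed["O"])
```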


I hope these comments are of some  use.   Their  quantity  reflects  the
current  lamentable  state  of  the  computer  character  sets used, and
unfortunately compromise  and  ugliness  are  necessary  for  a  workable
solution.  Good luck with the next draft!

David Stone.