bit width (7-, 8-, 16-, 32-bit encodings ...)


Glenn Adams
   Date:         Fri, 5 Oct 90 16:59:39 CDT
   From: Michael Sperberg-McQueen 312 996-2477 -2981 <[hidden email]>

You mention two standards of representation: *interchange* and
*local processing*.  Since *interchange*, i.e., the form and content
of the information shared, is the primary concern in forming a
standard, it is at this level that issues of encoding must be
addressed.

   *Interchange* is a different matter, and where the issue of support will
   really come home, I think.  The simple fact is that many current
   network, tape, and other communications channels do not reliably handle
   7-bit data, let alone 8-, 16-, or 32-bit data.

Now, while issues of efficiency of transfer should figure strongly
in an *interchange* encoding, there must be some minimum level of
support that is assumed from the communication medium upon which the
interchange occurs.  It would seem from your previous statement that
you don't assume much.  In fact, you appear to assume two things:
(1) that the communication medium supports only 7-bit data; and (2)
that the communication medium is unreliable, even for 7-bit data.

It seems to me that this level of assumption is in fact vacuous, i.e.,
you assume nothing at all about the underlying medium.  This seems
to be much too weak a position and potentially defeating to the primary
goals of TEI.  Real networks (excluding BITNET) reliably support 8-bit
transfer with end-to-end checksums.  All applications using TCP/IP and
ISO/TP4 assume this level of support.  Why should TEI assume less?
Perhaps it's time to look beyond BITNET (pun intended) to the standard
of communications reliability expected elsewhere in interchange these
days.

   A Writing System Declaration describes the graphemes of a script,
   specifies how the graphemes are represented by bytes, names the language
   it is used to represent, and assigns an identifier to the combination of
   natural language and character-set/transliteration.

I find this definition very interesting and very close to a similar
characterization I have developed.  I am somewhat unclear, however, about
the "combination of natural language and character-set/transliteration".
Could you give an example here?

Also, I find it useful to consider the definition of multiple encodings
or mappings from graphemes to bits.  Thus my definition would consist of
the triple < L, S, G >, where L is the language being represented, S is
a Script, i.e., a set of graphemes, and G (Gamma) is a mapping from S to
N, the set of non-negative integers.  I can think of at least two good
reasons for employing multiple Gammas for a single Script S (both are
sketched after this list):

(1) A single Script may contain more graphemes than L ever uses.  For
example, I am not aware of any single L that uses all the elements of
ISO8859-1.  And if we include all of ISO8859-1 through -4 in S, then we
cannot represent all of these graphemes distinctly in an 8-bit field
unless unused elements are excluded from the mapping or, alternatively,
mapped to a special encoding which is non-invertible.

(2) The assignment of integers to graphemes could be done so as to
facilitate or optimize collation within L.
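
To make the triple concrete, here is a minimal sketch in Python, with
entirely hypothetical names, of one Script S paired with two different
Gammas: a dense mapping covering only the graphemes one language
actually uses, and a mapping whose integer order follows another
language's collation.

    from dataclasses import dataclass
    from typing import Dict, FrozenSet

    @dataclass(frozen=True)
    class WritingSystem:
        language: str            # L: the natural language represented
        script: FrozenSet[str]   # S: the set of graphemes
        gamma: Dict[str, int]    # G: mapping from (part of) S to non-negative integers

    # One Script S shared by two writing systems.
    S = frozenset("abcdefghijklmnopqrstuvwxyzåäö")

    # Gamma 1: encode only the graphemes English uses; å, ä, ö stay unmapped.
    gamma_english = {g: i for i, g in enumerate(sorted("abcdefghijklmnopqrstuvwxyz"))}

    # Gamma 2: integer order chosen to match a Swedish-style collation,
    # with å, ä, ö sorting after z.
    collation_order = list("abcdefghijklmnopqrstuvwxyz") + ["å", "ä", "ö"]
    gamma_swedish = {g: i for i, g in enumerate(collation_order)}

    english = WritingSystem("en", S, gamma_english)
    swedish = WritingSystem("sv", S, gamma_swedish)

    # Sorting by Gamma 2 gives language-appropriate collation directly.
    assert sorted("åac", key=gamma_swedish.get) == ["a", "c", "å"]

Under this reading, "multiple Gammas for a single S" is simply multiple
tables over the same grapheme set, so collation and field-width choices
live in the mapping rather than in the Script itself.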

This mapping would be similar to a color lookup table on a pseudo-color
display system, where you encode only the entries actually used, but
each code stands for an element of a larger set than the codes alone
could indicate.  Alternatively, the lookup table could be a direct
one-to-one mapping onto every element of S, which would be more like a
direct-color system.
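
Pushing the display analogy one step further, here is a small sketch
(again with hypothetical names) of the two lookup-table styles: a direct
mapping that gives every element of S its own code, and an indexed
mapping that collapses unused graphemes onto a single non-invertible
replacement code.

    # The full Script S, including graphemes some languages never use.
    S = sorted(set("abcdefghijklmnopqrstuvwxyzåäöþð"))

    # "Direct color" style: every element of S gets its own code (invertible).
    direct = {g: i for i, g in enumerate(S)}
    assert len(set(direct.values())) == len(S)   # one-to-one

    # "Pseudo color" style: only the graphemes a given language uses get
    # distinct codes; everything else shares one replacement code, so the
    # table fits a small field but cannot be inverted for unused graphemes.
    used = set("abcdefghijklmnopqrstuvwxyz")
    REPLACEMENT = 0
    indexed = {}
    code = 1
    for g in S:
        if g in used:
            indexed[g] = code
            code += 1
        else:
            indexed[g] = REPLACEMENT   # non-invertible by design

    assert indexed["å"] == indexed["þ"] == REPLACEMENT

The first table needs a wider field but round-trips; the second fits in
fewer bits at the cost of losing the unused graphemes, which is the
trade-off noted in (1) above.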