Character Representations [was Text Editors for SGML conversion]

Glenn Adams-2
   Date:         Tue, 9 Oct 90 16:50:15 EDT
   From: Richard Ristow <AP430001%[hidden email]>

   + All markup, including representation of glyphs not in the sets, use
     a standard set of glyphs that are part of ALL the glyph sets

This implies that all markup must use English, or perhaps some other language
that can be adequately represented using the graphs available in ASCII.
That seems to defeat the aim of having authors work with TEI text directly
(if they don't happen to read English, etc.).  Presumably, for these authors
TEI is then no different from a binary file generated by a compiler.  Here
the compiler must transform the editable text into TEI-conforming text,
i.e., transliterate each non-English graph or substitute an entity name
for it.

Secondly, when you say that a standard set of glyphs is part of ALL the
glyph sets, are you suggesting that every glyph set contain the ASCII graphs
as a subset?  Few of the standards, e.g., ISO8859-5 through 8, do this.

   + All glyphs in ANY of the sets have representations determined by the
     TEI standard, using only the above shared set of glyphs.  (Therefore,
     text represented in any TEI standard is representable in any other
     standard, and automatic translation from any standard to any other
     is straightforward.)

Your use of the term glyph is a bit troubling.  It is a legacy of European
and Asian text processing that there is very nearly a one-to-one mapping
between what we might call "characters" and "glyphs."  As a result there
has been a systematic conflation of the two notions represented by these
terms, viz., information and presentation, respectively.  This one-to-one
mapping breaks down in writing systems that make widespread use of one or
more of the following:

  - positional variation, i.e., context sensitive allographs
  - ligaturing
  - digraphs, trigraphs, etc.
  - non-monotonic rendering order

Good examples of such writing systems are Arabic, Devanagari and its
derivatives, and Hangul.  In each of these writing systems, one does not want
to represent glyphs that can be deterministically generated.  Instead, an
abstract form is represented in the character set.  For example, if you look
at the ISO8859-6 or ASMO Arabic character sets, only the independent forms
are represented, not the positional variants.

On the other hand, Hangul word processors typically represent each of the few
thousand ligatures with separate codepoints.  Similarly, ASCII uses separate
codepoints for the admixture of letter and case graphemes.  So, admittedly,
the existing state of representations is a bit unclear.

I view the general transformation from information to presentation, i.e.,
rendering, as follows:

        Grapheme(s) -> Graph (or allograph) -> Glyph

Each of the two transformations represents knowledge not contained in the
grapheme as such.  The first function from grapheme to graph uses the
knowledge of the writing system to produce the positional variants, ligatures,
digraphs, ordering, etc.  The second function takes into account appearance
and style information in determining things like font family, face, size,
posture, weight, color, texture, etc.
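
The two functions can be sketched roughly as follows.  This is only a toy
illustration: the ligature table, the `Glyph` record, and the function names
are hypothetical, and a real writing system would need far richer rules
(positional variants, reordering, etc.) in the first step.

```python
from dataclasses import dataclass

# Hypothetical writing-system knowledge: grapheme sequences that fuse
# into a single graph (here, one ligature rule only).
LIGATURES = {("f", "i"): "fi-ligature"}


@dataclass(frozen=True)
class Glyph:
    """A graph with appearance and style information attached."""
    graph: str
    family: str
    size_pt: int


def graphemes_to_graphs(graphemes):
    """First function: apply writing-system rules (only ligaturing here)."""
    graphs = []
    i = 0
    while i < len(graphemes):
        pair = tuple(graphemes[i:i + 2])
        if pair in LIGATURES:
            graphs.append(LIGATURES[pair])
            i += 2
        else:
            graphs.append(graphemes[i])
            i += 1
    return graphs


def graphs_to_glyphs(graphs, family, size_pt):
    """Second function: add style, knowing nothing about the writing system."""
    return [Glyph(g, family, size_pt) for g in graphs]


graphs = graphemes_to_graphs(list("fin"))
glyphs = graphs_to_glyphs(graphs, family="Times", size_pt=12)
print(graphs)
print(glyphs)
```

Note that only the first function needs to know anything about the writing
system, and only the second needs to know anything about fonts.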

In the case of European and Asian languages, we tend to think of a "character"
as a shorthand for a "graph" as in the scheme above.  Alternatively, we can
think of a "character" in this sense as an abbreviation for a string of
graphemes.  For example, an ASCII \101 stands for an Uppercase-A which is
really the combination of two graphemes: an independent letter grapheme "a"
and a bound case grapheme "upper".  And since glyphs are merely graphs with
style and appearance mixed in, referring to a "graph" as a "glyph" is
unambiguous.  Similarly in Korean word processors, the syllable graph "KIM"
is assigned a single codepoint in KSC5601.  But this graph is really an
abbreviation of three graphemes 'K', 'I', and 'M', which, interestingly
enough, are entered through the keyboard with three keystrokes.  We can
even imagine creating a new character set for English which has a codepoint
for each possible graphical syllable of the language, or even a codepoint
for each morpheme in the English language.  Obviously, both of these latter
representations would seem unnatural and inefficient.  However, this is
precisely how speakers of Korean and Chinese, respectively, tend to view
their own languages, and this is how they encode them in computers.
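
The "abbreviation" view can be illustrated with a small sketch.  It assumes
Python's standard `unicodedata` module and Unicode code points, which stand in
here for KSC5601 (the KSC5601 codepoint values themselves are not shown); the
precomposed syllable U+AE40 plays the role of the graph "KIM".

```python
import unicodedata

# ASCII octal 101 is the single codepoint for uppercase A ...
assert ord("A") == 0o101
# ... which can be viewed as the letter grapheme "a" plus a bound
# case grapheme "upper".
assert "a".upper() == "A"

# The syllable graph KIM as one precomposed codepoint
# (Unicode is assumed here, not KSC5601).
kim = "\uAE40"

# Canonical decomposition expands the abbreviation into its three
# constituent graphemes (an initial, a vowel, and a final).
jamo = unicodedata.normalize("NFD", kim)
for ch in jamo:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Recomposition gives back the single syllable codepoint.
assert unicodedata.normalize("NFC", jamo) == kim
```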

To return to the case of Arabic, we see the representation not of graphs or
even grapheme bundles but of individual graphemes.  Furthermore, none of
these graphemes corresponds to one particular graph (or glyph).  (I'm
ignoring the fact that there also exist "independent" allographs of the
various graphemes.)  The rendering process must then select the appropriate
allograph, which in turn corresponds to a glyph in a font that can be
depicted.
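
A much simplified sketch of that selection step is given below.  It assumes
every grapheme joins on both sides (not true of Arabic in general), handles a
single letter only, and borrows Unicode presentation-form code points to stand
in for the allographs; the table and function names are hypothetical.

```python
import unicodedata

# The abstract grapheme BEH and a hypothetical allograph table for it,
# using Unicode presentation-form code points as stand-ins for graphs.
BEH = "\u0628"
ALLOGRAPHS = {
    BEH: {
        "isolated": "\uFE8F",
        "final": "\uFE90",
        "initial": "\uFE91",
        "medial": "\uFE92",
    },
}


def positional_form(index, length):
    """Choose a form from position alone (assumes dual-joining letters)."""
    if length == 1:
        return "isolated"
    if index == 0:
        return "initial"
    if index == length - 1:
        return "final"
    return "medial"


def select_allographs(graphemes):
    """Map a run of encoded graphemes to context-sensitive allographs."""
    n = len(graphemes)
    return "".join(
        ALLOGRAPHS[g][positional_form(i, n)] for i, g in enumerate(graphemes)
    )


shaped = select_allographs(BEH * 3)
for ch in shaped:
    # Each allograph's compatibility decomposition names the one
    # abstract grapheme it was generated from.
    print(f"U+{ord(ch):04X}", unicodedata.decomposition(ch))
```

The encoded text holds the same grapheme three times; only the rendering
step distinguishes the initial, medial, and final forms.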

The choice of whether to represent graphemes or graphs seems to have a
lot to do with how each culture thinks about the units and segmentation
of their language.  Issues of memory efficiency seem to have been less
important in influencing this choice.  However, certain problems arise
when it is the abbreviations that are encoded.  For example, because
lowercase and uppercase letters are encoded differently, we must normalize
case before doing identity comparisons.  Similarly, searching for parts of
Korean syllables, say all words that start with 'K', becomes a problem when
you represent the entire syllable as an atomic unit.
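
Both problems can be sketched as follows, again assuming Python and Unicode
rather than any encoding under discussion here; the helper names and the
three sample words are made up for illustration.

```python
import unicodedata


def same_word(a, b):
    """Identity comparison after normalizing away the case grapheme."""
    return a.casefold() == b.casefold()


# The initial-consonant grapheme corresponding to the 'K' of KIM.
KIYEOK = "\u1100"


def starts_with_initial(word, initial):
    """Expand the first syllable into graphemes, then test its initial."""
    if not word:
        return False
    return unicodedata.normalize("NFD", word[0]).startswith(initial)


# Sample words written with atomic syllable codepoints:
# KIM, KANG, and NAMU (two syllables).
words = ["\uAE40", "\uAC15", "\uB098\uBB34"]
matches = [w for w in words if starts_with_initial(w, KIYEOK)]

print(same_word("TEI", "tei"))
print(matches)
```

In both cases the comparison only works once the abbreviation has been
expanded back into its graphemes, or at least normalized with respect to them.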