Text Editors for SGML conversion


Major
> She uses EMACS and a keyboard macro to do the conversion in one keystroke.
> If an editor doesn't provide this facility, then it isn't worth being called
> an editor.

Well put.

> On another matter, I have seen numerous instances of the phrase "SGML uses
> ASCII."  At the same time, I see statements that "SGML supports unrestricted
> text encodings in non-english text."  How is this possible?

<WITH-ACCENT-GRAVE>e<\WITH-ACCENT-GRAVE>     and
<RANDOM-HAN-CHARACTER>12037<\RANDOM-HAN-CHARACTER>

are both plain ASCII. I have no idea whether SGML does anything like this.
PostScript, which can describe anything your printer can print, is
plain ASCII.
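
As a rough illustration of the point above (not of anything SGML itself defines), here is a minimal Python sketch that writes every non-ASCII character as an ASCII-only tagged code point and reads it back. The <CHAR> tag and both function names are made up for this example.

```python
import re

def to_ascii_markup(text):
    """Replace each non-ASCII character with an ASCII-only tagged code point."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # <CHAR> is a hypothetical tag, not an SGML or TEI construct.
            out.append("<CHAR>%d</CHAR>" % ord(ch))
    return "".join(out)

def from_ascii_markup(marked):
    """Expand the hypothetical <CHAR>n</CHAR> tags back into characters."""
    return re.sub(r"<CHAR>(\d+)</CHAR>",
                  lambda m: chr(int(m.group(1))),
                  marked)

sample = "tr\u00e8s \u6f22"          # e with grave accent, and one Han character
encoded = to_ascii_markup(sample)
encoded.encode("ascii")              # raises UnicodeEncodeError if not pure ASCII
assert from_ascii_markup(encoded) == sample
```

The sketch does not escape text that already happens to contain a literal <CHAR> tag; a real scheme would need a rule for that case.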

Happy hacking! |  Just think of the world as public transport:
                        |    I can understand the graffiti, but
Major |    why do they run the exhaust into
[hidden email] |    the cabin?

Re: Text Editors for SGML conversion

Eric van Herwijnen
On 8-bit and 16-bit text:

An SGML document has three parts: the SGML declaration, the DTD and
the document instance.

The SGML declaration does not have to be part of the document, and if
it is not, a default declaration somewhere on the system is used.

The SGML declaration defines the character sets which are used
for the syntax (among others, those '<', '</' and '>' that have caused
quite a bit of discussion lately), and those which are used in the
document instance. There is a mechanism for using any ISO-registered
character set. Indeed, Martin Bryan has used SGML for coding Japanese
text. You can find examples in the technical report ISO/TR 9573.

The declaration also contains a mapping from one set to another, i.e.,
you can use a 7-bit character set for the syntax and an 8-bit character
set for the document.

The spirit of the standard is, however, not to have a parser figure out
what the document is coded in, but to have all parties which are to
use a given SGML application agree on the character sets which are used.

I hope this helps. Eric van Herwijnen

Re: Text Editors for SGML conversion

Richard Ristow
In reply to this post by Major
>   >if this is ... the proposed solution, i.e., 7-bit encodings, then SGML
>   >will fail miserably:  it must unequivocally demand 8-bit encodings
>   >AKA ISO2022 at a minimum.
 ( ... )
> ... if SGML/TEI limits one to 7-bit encodings, then I'm going to unsubscribe
>from this list and forget about TEI and SGML.

A couple of postings have raised the issue of character encodings in TEI-SGML.
At the risk of putting words in the developers' mouths, I conjecture that it
originates in an unfortunate presentation causing a misunderstanding of the
TEI standards.  In a posting of 1 October, Michael Sperberg-McQueen writes:

> The TEI recommendations for interchange of texts require conforming
> texts to contain only a subset of the characters included in ASCII.
>
> And therefore a TEI text is indeed ASCII-only, ( ... )

Since "ASCII" is commonly understood as a certain mapping between glyphs and
7-bit patterns, it is not surprising that this or similar statements appear
to raise the character representation issue.  I understand the use of "ASCII"
for TEI purposes to be a little different, however:  it is shorthand for a
character set made up of abstract glyphs, chosen so that among other properties
+ All have recognized graphic representations, and graphics for all are
  present on all commonly used computer printers and terminals.  (I gather
  that a special case is the line-end, which has no graphic but is marked
  by the change to a new printer or terminal line.)
+ All have recognized representations in 7-bit ASCII, and all commonly used
  computer printers and terminals denoted as "ASCII devices" produce the
  standard graphic from the 7-bit ASCII representation of each glyph
  (this is the main sense in which they are "ASCII" characters).
+ All have recognized representations in EBCDIC.

The standard, I take it, requires that the text be entirely represented in
these abstract glyphs, NOT that it be in the coding convention "ASCII" --
otherwise, translation of a TEI-conforming text to EBCDIC would make it non-
conforming, which is elsewhere stated not to be the case.  (Actually, a
printed representation would then be technically non-conforming, as the
glyphs would be represented by graphics rather than by ASCII bit patterns.)

The issue of 7-, 8-, 16-bit or other representation in TEI is then, I think,
no issue at all.  I understand the TEI to be specifying NO representation,
but only the use of a certain set of abstract glyphs.
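
To make the distinction between abstract characters and their coded representation concrete, here is a small Python sketch. It assumes Python's "cp037" codec as one example of an EBCDIC code page; the function name and the sample string are invented for illustration.

```python
def encode_both(text):
    """Return the same characters coded as 7-bit ASCII and as EBCDIC (cp037)."""
    ascii_bytes = text.encode("ascii")
    ebcdic_bytes = text.encode("cp037")   # one EBCDIC code page among several
    return ascii_bytes, ebcdic_bytes

text = "<p>TEI text</p>"
ascii_bytes, ebcdic_bytes = encode_both(text)

# The byte patterns differ ...
assert ascii_bytes != ebcdic_bytes
# ... but both decode to the same sequence of abstract characters.
assert ascii_bytes.decode("ascii") == ebcdic_bytes.decode("cp037") == text
```

On this reading, a conformance check would look at the decoded characters, not at which of the two byte sequences is stored in the file.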

However, while I disagree that the issue of character representations
exists, there are clearly substantive issues behind it.  I see two major
ones, but am now risking putting words into the mouths of TEI's questioners
as well as its developers, and welcome clarification by both.

First, the set of graphics in the TEI standard set is inadequate for even
the major European languages based on the Roman alphabet, let alone the
rest of the world's languages.  I take it that the TEI developers have
recognized this from the start, and that SGML conventions are included
in the TEI standard so that expanded glyph sets as well as markup may be
represented in TEI-conforming files.

Second, even if the TEI glyph set and SGML conventions allow all text to
be represented, the small TEI glyph set makes such representation needlessly
clumsy compared to representations using larger glyph sets ("8-bit" and
"16-bit").  This is a real issue, and won't go away.  I would personally
support the TEI standard using a restricted glyph set, since conforming
files may be displayed and used (though not conveniently nor as originally
intended) on any computer equipment likely to be available.  However, clearly
the restricted representation is often clumsy, and the software to convert
it to a fuller representation on equipment with expanded character sets
may be so as well.  My expectation is the development of a set of TEI
standards using a variety of glyph sets, with the properties that

+ All markup, including the representation of glyphs not in the sets, uses
  a standard set of glyphs that are part of ALL the glyph sets.
+ All glyphs in ANY of the sets have representations determined by the
  TEI standard, using only the above shared set of glyphs.  (Therefore,
  text represented in any TEI standard is representable in any other
  standard, and automatic translation from any standard to any other
  is straightforward.)
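
The two properties above can be sketched in Python as follows. This is only an illustration under stated assumptions: the shared set is taken to be 7-bit ASCII, glyphs outside it are written as "&name;" references, and the three-entry name table and both function names are hypothetical rather than taken from any TEI document.

```python
import re

# Hypothetical table: reference name -> character it stands for.
NAMES = {"egrave": "\u00e8", "eacute": "\u00e9", "uuml": "\u00fc"}
CHARS = {ch: name for name, ch in NAMES.items()}

def to_shared(data, encoding):
    """Decode bytes from one glyph set and rewrite them in the shared set only."""
    out = []
    for ch in data.decode(encoding):
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append("&%s;" % CHARS[ch])   # KeyError if the table lacks the glyph
    return "".join(out)

def from_shared(shared, encoding):
    """Expand references the target glyph set can show; leave the others as-is."""
    def expand(match):
        ch = NAMES[match.group(1)]
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            return match.group(0)            # target set lacks this glyph
        return ch
    return re.sub(r"&([a-z]+);", expand, shared).encode(encoding)

latin1_bytes = "tr\u00e8s".encode("latin-1")
shared = to_shared(latin1_bytes, "latin-1")
cp437_bytes = from_shared(shared, "cp437")   # a different 8-bit glyph set
ascii_bytes = from_shared(shared, "ascii")   # restricted set keeps the reference
```

Going through the shared form means each glyph set needs only one converter in each direction, not one per pair of sets. The sketch does not handle a literal "&" in the text, which a real convention would have to escape.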

Standards will then be set up using the glyph sets of such 8-bit and
16-bit coding schemes as become sufficiently established to make the
effort worthwhile.  The broad utility of such expanded-set standards,
however, will depend on there being graphic and machine-readable
representations of all glyphs in the standard that are as widely
recognized as the "7-bit ASCII" standard is today.