> She uses EMACS and a keyboard macro to do the conversion in one keystroke.
> If an editor doesn't provide this facility, then it isn't worth being called
> an editor.

Well put.

> On another matter, I have seen numerous instances of the phrase "SGML uses
> ASCII." At the same time, I see statements that "SGML supports unrestricted
> text encodings in non-english text." How is this possible?

<WITH-ACCENT-GRAVE>e<\WITH-ACCENT-GRAVE>
and
<RANDOM-HAN-CHARACTER>12037<\RANDOM-HAN-CHARACTER>

are both plain ASCII. I have no idea if SGML does anything like this.
PostScript, which can describe anything your printer can print, is plain ASCII.

Happy hacking!

| Just think of the world as public transport:
| I can understand the graffiti, but
| why do they run the exhaust into
| the cabin?

Major
[hidden email]
On 8-bit and 16-bit text:
An SGML document has three parts: the SGML declaration, the DTD and the document instance. The SGML declaration does not have to be part of the document, and if it is not, a default declaration somewhere on the system is used.

The SGML declaration defines the character sets which are used for the syntax (among others, those '<', '</' and '>' that have caused quite a bit of discussion lately), and those which are used in the document instance. There is a mechanism for using any ISO registered character set. Indeed Martin Bryan has used SGML for coding Japanese text. You can find examples in the technical report ISO 9753. It also contains a mapping from one set to another, i.e. you can use a 7-bit character set for the syntax and an 8-bit character set for the document.

The spirit of the standard is, however, not to have a parser figure out what the document is coded in, but to have all parties which are to use a given SGML application agree on the character sets which are used.

I hope this helps.

Eric van Herwijnen
In reply to this post by Major
> >if this is ... the proposed solution, i.e., 7-bit encodings, then SGML
> >will fail miserably: it must unequivocally demand 8-bit encodings
> >AKA ISO2022 at a minimum. ( ... )
> ... if SGML/TEI limits one to 7-bit encodings, then I'm going to unsubscribe
>from this list and forget about TEI and SGML.

A couple of postings have raised the issue of character encodings in TEI-SGML. At the risk of putting words in the developers' mouths, I conjecture that it originates in an unfortunate presentation causing a misunderstanding of the TEI standards. In a posting of 1 October, Michael Sperberg-McQueen writes:

> The TEI recommendations for interchange of texts require conforming
> texts to contain only a subset of the characters included in ASCII.
>
> And therefore a TEI text is indeed ASCII-only, ( ... )

Since "ASCII" is commonly understood as a certain mapping between glyphs and 7-bit patterns, it is not surprising that this or similar statements appear to raise the character representation issue. I understand the use of "ASCII" for TEI purposes to be a little different, however: it is shorthand for a character set made up of abstract glyphs, chosen so that among other properties

+ All have recognized graphic representations, and graphics for all are present on all commonly used computer printers and terminals. (I gather that a special case is the line-end, which has no graphic but is marked by the change to a new printer or terminal line.)

+ All have recognized representations in 7-bit ASCII, and all commonly used computer printers and terminals denoted as "ASCII devices" produce the standard graphic from the 7-bit ASCII representation of each glyph (this is the main sense in which they are "ASCII" characters).

+ All have recognized representations in EBCDIC.

The standard, I take it, requires that the text be entirely represented in these abstract glyphs, NOT that it be in the coding convention "ASCII" -- otherwise, translation of a TEI-conforming text to EBCDIC would make it non-conforming, which is elsewhere stated not to be the case.
(Actually, a printed representation would then be technically non-conforming, as the glyphs would be represented by graphics rather than by ASCII bit patterns.)

The issue of 7-, 8-, 16-bit or other representation in TEI is then, I think, no issue at all. I understand the TEI to be specifying NO representation, but only the use of a certain set of abstract glyphs.

However, while I disagree that the issue of character representations exists, there are clearly substantive issues behind it. I see two major ones, but am now risking putting words into the mouths of TEI's questioners as well as its developers, and welcome clarification by both.

First, the set of graphics in the TEI standard set is inadequate for even the major European languages based on the Roman alphabet, let alone the rest of the world's languages. I take it that the TEI developers have recognized this from the start, and that SGML conventions are included in the TEI standard so that expanded glyph sets as well as markup may be represented in TEI-conforming files.

Second, even if the TEI glyph set and SGML conventions allow all text to be represented, the small TEI glyph set makes such representation needlessly clumsy compared to representations using larger glyph sets ("8-bit" and "16-bit"). This is a real issue, and won't go away.

I would personally support the TEI standard using a restricted glyph set, since conforming files may be displayed and used (though not conveniently nor as originally intended) on any computer equipment likely to be available. However, clearly the restricted representation is often clumsy, and the software to convert it to a fuller representation on equipment with expanded character sets may be so as well.
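The distinction drawn above between a set of abstract glyphs and a coding convention can be illustrated with a minimal Python sketch. It is only an illustration, not anything taken from the TEI recommendations: the sample string is invented, and `cp037` is assumed here as one representative EBCDIC code page among several.

```python
# The same abstract characters under two coding conventions.
text = "TEI text"  # sample string, invented for illustration

ascii_bytes = text.encode("ascii")    # 7-bit ASCII bit patterns
ebcdic_bytes = text.encode("cp037")   # one EBCDIC code page (an assumption)

# The bit patterns differ between the two conventions ...
print(ascii_bytes.hex())
print(ebcdic_bytes.hex())
assert ascii_bytes != ebcdic_bytes

# ... but both decode to the same sequence of abstract characters.
assert ascii_bytes.decode("ascii") == text
assert ebcdic_bytes.decode("cp037") == text
```

On this reading, a text stays within the restricted set as long as every character has a representation in both conventions, whichever bit patterns happen to carry it.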
My expectation is the development of a set of TEI standards using a variety of glyph sets, with the properties that

+ All markup, including representation of glyphs not in the sets, uses a standard set of glyphs that are part of ALL the glyph sets

+ All glyphs in ANY of the sets have representations determined by the TEI standard, using only the above shared set of glyphs. (Therefore, text represented in any TEI standard is representable in any other standard, and automatic translation from any standard to any other is straightforward.)

Standards will then be set up using the glyph sets of such 8-bit and 16-bit coding schemes as become sufficiently established to make the effort worthwhile. The broad utility of such expanded-set standards, however, will depend on there being graphic and machine-readable representations of all glyphs in the standard that are as widely recognized as the "7-bit ASCII" standard is today.
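The two properties proposed above for a family of glyph-set standards can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the `&#N;` escape form, the use of Unicode code points as the numbers, and the function names `encode_for` and `decode_refs` are hypothetical and are not taken from the TEI or SGML standards.

```python
import re

# Hypothetical escape: "&#N;" where N is a Unicode code point (an assumption).
_REF = re.compile(r"&#(\d+);")

def encode_for(text: str, charset: str) -> str:
    """Write text using only glyphs of `charset`, escaping all others.

    The escape itself uses only '&', '#', digits and ';', which are
    assumed to belong to every glyph set (the shared set).
    """
    out = []
    for ch in text:
        if ch == "&":
            out.append("&#38;")  # always escaped, so references stay unambiguous
            continue
        try:
            ch.encode(charset)   # is this glyph present in the target set?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(f"&#{ord(ch)};")
    return "".join(out)

def decode_refs(marked: str) -> str:
    """Expand every escape back into the abstract character it names."""
    return _REF.sub(lambda m: chr(int(m.group(1))), marked)

original = "\u00e8 & \u6f22"  # e-grave, ampersand, one Han character

as_7bit = encode_for(original, "ascii")     # restricted glyph set
as_8bit = encode_for(original, "latin-1")   # larger glyph set

# Translation between any two sets goes through the abstract characters.
assert decode_refs(as_7bit) == decode_refs(as_8bit) == original
assert encode_for(decode_refs(as_8bit), "ascii") == as_7bit
```

In this sketch the 8-bit form keeps the accented letter as a single character while the 7-bit form spells it out as a reference, which is the clumsiness discussed earlier, yet either form can be converted mechanically to the other.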