accents, entities and size of text

VERONIS
I have heard at least two arguments against the SGML entities:

(1) they expand texts in a prohibitive way;
(2) they are difficult to read.

To test the validity of the arguments, I just re-coded Maupassant's Menuet
(French), and I thought that you might be interested in the results:

Type of coding                               # chars  expansion
------------------------------------------  --------  ---------
Original text, accents coded in the
Macintosh character set                         9169

Text with accents coded as SGML entities       10593     115.5%

Text with accents coded a` la TeX
(e grave = \`e, etc.)                           9585     104.5%

I tried the TeX-like coding because many people working on French use some
home-made scheme of this kind. It seems to be the most compact ISO 646
representation one can find without too many ambiguities to resolve.

The difference between this encoding and the supposedly very wasteful SGML
entity coding is not very big: about 11 percentage points, nothing like
multiplying the size of the text by three or four. Therefore the first
argument doesn't hold (for French).
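
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python. The character counts are the ones reported above; the names COUNTS and
expansion_percent are only illustrative and are not part of any existing tool.

```python
# Character counts as reported in the table above.
COUNTS = {
    "macintosh": 9169,  # original text, accents in the Macintosh set
    "sgml": 10593,      # accents coded as SGML entities
    "tex": 9585,        # accents coded in TeX-like notation
}


def expansion_percent(recoded: int, original: int) -> float:
    """Size of the recoded text as a percentage of the original size."""
    return round(100.0 * recoded / original, 1)


for name in ("sgml", "tex"):
    # Prints: sgml 115.5, then tex 104.5
    print(name, expansion_percent(COUNTS[name], COUNTS["macintosh"]))
```

The gap between the two ISO 646 codings is therefore about 11 points of the
original size.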

As far as the second argument is concerned, I have of course heard many times
the counter-argument that this type of encoding is not intended to be read by
humans, but should just serve the purpose of transmission. Unfortunately, most
people I know who work on French have to deal with these codes directly at one
time or another, simply because nobody yet has the software to do all the
necessary conversions. This speaks strongly for the development of public
domain software to perform these tasks. I have the feeling that the success of
the TEI depends in large part on the availability of such software for free,
or cheaply.

Anyway, just for a test, here are the SGML and TeX-like versions of the same
fragment.

J' ai cinquante ans. J' &eacute;tais jeune alors et j' &eacute;tudiais le
droit. Un peu triste, un peu r&ecirc;veur, impr&eacute;gn&eacute; d' une
philosophie m&eacute;lancolique, je n' aimais gu&egrave;re les caf&eacute;s
bruyants, les camarades braillards, ni les filles stupides. Je me levais
t&ocirc;t; et une de mes plus ch&egrave;res volupt&eacute;s &eacute;tait de me
promener seul, vers huit heures du matin, dans la p&eacute;pini&egrave;re du
Luxembourg.

J' ai cinquante ans. J' \'etais jeune alors et j' \'etudiais le droit. Un peu
triste, un peu r\^eveur, impr\'egn\'e d' une philosophie m\'elancolique, je n'
aimais gu\`ere les caf\'es bruyants, les camarades braillards, ni les filles
stupides. Je me levais t\^ot; et une de mes plus ch\`eres volupt\'es \'etait de
me promener seul, vers huit heures du matin, dans la p\'epini\`ere du
Luxembourg.

The second one is probably easier to read, but not really wonderful either.
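
As an illustration of the kind of conversion software meant here, a small
sketch in Python that recodes accented letters into either of the two codings
discussed. The ACCENTS table is an assumption: it covers only the four accented
letters occurring in the fragment, uses the ISO Latin 1 entity names, and would
need extending for real texts. The function names are hypothetical.

```python
import unicodedata

# Illustrative table only: accented letter -> (SGML entity, TeX-like coding).
ACCENTS = {
    "\u00e9": ("&eacute;", "\\'e"),  # e acute
    "\u00e8": ("&egrave;", "\\`e"),  # e grave
    "\u00ea": ("&ecirc;", "\\^e"),   # e circumflex
    "\u00f4": ("&ocirc;", "\\^o"),   # o circumflex
}


def recode(text: str, column: int) -> str:
    """Replace each accented letter found in ACCENTS by the chosen coding."""
    # Normalise first so that letter + combining accent matches the table keys.
    text = unicodedata.normalize("NFC", text)
    return "".join(ACCENTS[c][column] if c in ACCENTS else c for c in text)


def to_sgml(text: str) -> str:
    """Accented letters as SGML entities."""
    return recode(text, 0)


def to_tex(text: str) -> str:
    """Accented letters in TeX-like notation."""
    return recode(text, 1)


sample = "J' \u00e9tais jeune alors"
print(to_sgml(sample))  # J' &eacute;tais jeune alors
print(to_tex(sample))   # J' \'etais jeune alors
```

Letters that are not in the table pass through unchanged, so a missing accent
goes unnoticed; a real converter would want to report such characters instead.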


Re: accents, entities and size of text

Eric van Herwijnen
I completely agree. The space requirements for keeping text in SGML,
compared to, say, producing PostScript output, are negligible.