ISO 646 & networks


In his very helpful note on "French characters and other 'special' characters"
MSMcQ says:

>If I sent a file containing JV's example, [with accents coded by SGML entities
>and using only ISO 646 characters, if I understand correctly] it would arrive
>anywhere on this net in readable form.

This raises another, related, question. Obviously, such a format would be much
safer, but there is still no guarantee that the file received at the other end
will be correct.

Networks behave strangely with (at least):

(1) lines longer than 80 characters (which are typically truncated, or wrapped);
(2) spaces at the end of lines (which are typically stripped off).

The result of (1) is that you can't send most texts in their original form.
You have to process them (or let the network do it in its own way) to make sure
that no line is more than 80 characters long.
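To make the two behaviors concrete, here is a minimal sketch in Python. The function `simulate_gateway` is hypothetical, not any real mailer's algorithm: it assumes the network truncates (rather than wraps) at 80 characters and strips only spaces at the end of each line.

```python
def simulate_gateway(text, width=80):
    """Mimic a hypothetical gateway: cut each line at `width` characters,
    then strip any spaces left at the end of the line."""
    received = []
    for line in text.split("\n"):
        received.append(line[:width].rstrip(" "))
    return "\n".join(received)


original = "x" * 100 + "\n" + "ends with spaces   " + "\n" + "short line"
received = simulate_gateway(original)

# The first two lines are damaged; only the short line survives intact.
print(received == original)
print([len(line) for line in received.split("\n")])
```

A text passes through unchanged only if every line already fits the width and has no trailing space, which is exactly the constraint discussed above.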

A reasonable way to do that is to break between words, as close as possible to
the 80th character. But this means that there is usually a space or punctuation
mark at the end of the line. This space can very well be lost by virtue of (2)
above. It is worse if there were several spaces, and worse still if the text is
not just composed of "words" but contains various markup and interpretational
material.
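The following Python sketch illustrates the problem under stated assumptions: `break_lines` is a hypothetical helper that breaks at the last space at or before the limit (falling back to a hard break when there is none), and the network is assumed to strip trailing spaces. A width of 20 is used instead of 80 only to keep the example short.

```python
def break_lines(text, width=80):
    """Break `text` into lines of at most `width` characters.
    Each break falls just after the last space within the limit, so the
    space stays at the end of the line; with no space, break hard."""
    lines = []
    while len(text) > width:
        cut = text.rfind(" ", 0, width)
        if cut == -1:
            cut = width - 1  # no space available: split inside the token
        lines.append(text[:cut + 1])
        text = text[cut + 1:]
    lines.append(text)
    return lines


text = "alpha beta  gamma <a-very-long-markup-token> delta"
lines = break_lines(text, width=20)

# Before transmission the breaking is lossless.
print("".join(lines) == text)

# Assumed network behavior: trailing spaces are stripped from each line.
stripped = [line.rstrip(" ") for line in lines]

glued = "".join(stripped)    # loses the space before the markup
spaced = " ".join(stripped)  # inserts a space inside the markup token
print(glued == text, spaced == text)
```

Neither way of rejoining the received lines gives back the original here: joining with nothing loses a real space, and joining with a space splits a token that was whole.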

Therefore, when the receiver tries to rebuild the text, s/he has to reassemble
the lines, and I am not sure that systematically re-inserting a space is a good
idea, since it may cause other problems by separating things which were
not separated in the original text. Also, multiple spaces would be reduced to
one.

In fact, for these very reasons, texts encoded in Microsoft Word's RTF
format do not travel very well (without more processing) over the networks,
although the character set is quite close to ISO 646.

The solution typically used by Macintosh users is to encode their texts not
with RTF, but with BinHex, which ensures (1) that ISO 646 is respected, but
also (2) that all transmitted lines are < 80 characters long, and (3) that no
space is lost. Of course, only a Mac (as far as I know) can decode the text
correctly at the other end. Uuencode/uudecode work in a rather similar way.
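As a rough illustration of why such encodings survive the trip, here is a Python sketch using the uuencode line coding from the standard `binascii` module (not BinHex). The helper names `encode_for_transit` and `decode_after_transit` are hypothetical, and the sketch omits the `begin`/`end` framing of a real uuencoded file. The `backtick=True` option (Python 3.7 and later) is an assumption of this sketch: it writes zero groups as a backtick instead of a space, so no encoded line can end in a space.

```python
import binascii


def encode_for_transit(data):
    """Encode bytes as uuencode lines: 45 input bytes per line, which gives
    at most 61 printable characters per output line."""
    lines = []
    for i in range(0, len(data), 45):
        chunk = data[i:i + 45]
        encoded = binascii.b2a_uu(chunk, backtick=True)
        lines.append(encoded.decode("ascii").rstrip("\n"))
    return lines


def decode_after_transit(lines):
    """Decode uuencode lines back into the original bytes."""
    return b"".join(binascii.a2b_uu(line.encode("ascii")) for line in lines)


text = "a line that ends in spaces   \n" + "x" * 120 + "\n"
wire = encode_for_transit(text.encode("ascii"))

# Assumed network behavior: trailing spaces are stripped from each line.
wire = [line.rstrip(" ") for line in wire]

restored = decode_after_transit(wire).decode("ascii")
print(max(len(line) for line in wire))
print(restored == text)
```

The long line and the trailing spaces of the original travel inside the encoded data, so the two network behaviors described earlier have nothing to damage.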

Has this problem been discussed within the TEI? Is there any TEI-conformant

As a corollary, using Sed or Awk, as suggested earlier on this list, would not
be enough to ensure proper transmission (assuming that Sed and Awk would be
appropriate; see Erik Naggum's note). You would need to (1) Sed or Awk your
texts, and then (2) BinHex or uuencode them for transmission.

Jean Veronis           [hidden email]   [hidden email]