Comments on TEI Guidelines from Geoffrey Sampson

Lou Burnard-7
[With his kind permission, I post the following set of comments
from Geoffrey Sampson, formerly Professor of Linguistics at the
University of Leeds and probably well known to many computational
linguists.
             -Lou Burnard]

18 August 1990

I have read this document carefully (except for one or two
sections that seemed distant from my concerns) and with
great interest.  It is clearly heavily relevant to my work,
which involves intensive use (including exchange with other
groups) of various versions of large corpora of written and
spoken English (my group is currently working mainly with
the LOB, Brown, and London-Lund Corpora, in each case using
a range of different <q>editions</q> of the respective corpora
with different formats and different levels of linguistic
annotation).  The composers of this document have to my
mind done a remarkable job in marshalling a very diverse
range of detailed considerations into a coherent set of
proposals.  I should like to stress my admiration for this
achievement at the outset, in order to establish the
background against which I make some specific points which
must, unfortunately, be somewhat cold-waterish.

One problem I have is that although my research team would
seem to be a good example of the kind of group which would
use these guidelines, if they come to be generally
accepted, in practice I can't at this point see how we
could actually do so.  One difficulty is the very bulky and
hard-to-read format imposed by the TEI standards.  A lot of
our work involves files of sentences annotated with
labelled bracketings representing grammatical structure:  I
think if we changed our current format for these to a
TEI-conformant system, each file would become at least 35
times bigger (and there would be no room for them on our
discs), and they would become very difficult for the
researchers to understand and work with.  If one were to
say that the TEI-conformant versions would be created only
at the point where a file is exchanged with another group,
and for internal purposes we stick to our own format, then
I think the TEI format would be so psychologically
peripheral that in fact it would not get used at all.  And
probably a more serious difficulty is that the system is
sufficiently complex that any use of it would mean a
significant manpower drain:  one member of our group would
have to take charge of TEI conformance, and this work would
be quite a significant fraction of that researcher's
duties.  I don't see where we would find that manpower.
Most of these difficulties are of course only difficulties
because we are an academic research group running on a
shoestring.  But in fact our own is a fairly thick
shoestring; we have in the last few years been quite lucky
in securing research funding, and I think we have a more
comfortable resource situation than many comparable groups:
so, if we would find it hard to use these guidelines, many
others will also.  The problem about disc space for bulky
files may admittedly die away in due course as this aspect
of the technology makes a further quantum leap, but in the
foreseeable future it is a serious one.
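
To make the size concern concrete, here is a minimal sketch that
rewrites a labelled bracketing as nested start/end tags and compares
the lengths.  The bracket syntax, the element names "phrase" and "w"
and the attribute "type" are all hypothetical, chosen only for
illustration; they are neither my group's actual format nor the
encoding the draft prescribes.

```python
import re


def bracket_to_tags(text: str) -> str:
    """Rewrite labelled bracketing, e.g. "[S [N it] [V works]]", as nested tags.

    The element names "phrase" and "w" are illustrative assumptions only.
    """
    out = []
    depth = 0
    # Tokens: an opening bracket with its label, a closing bracket, or a word.
    for token in re.findall(r"\[[^\s\[\]]+|\]|[^\s\[\]]+", text):
        if token.startswith("["):
            depth += 1
            out.append(f'<phrase type="{token[1:]}">')
        elif token == "]":
            if depth == 0:
                raise ValueError("closing bracket without an opening one")
            depth -= 1
            out.append("</phrase>")
        else:
            out.append(f"<w>{token}</w>")
    if depth != 0:
        raise ValueError("unclosed bracket")
    return "".join(out)


source = "[S [N the cat] [V sat]]"
tagged = bracket_to_tags(source)
print(tagged)
print(len(source), "characters before,", len(tagged), "after")
```

The growth factor depends heavily on how long the element and
attribute names are, so the figure printed for this toy sentence
should not be read as an estimate for real corpus files.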

The other major issue that strikes me is that the
guidelines are very strongly biassed towards the special
problems involved in literary texts, although for ordinary
natural language processing purposes I believe these
problems are not central.  (I appreciate of course that
this bias is a natural consequence of the balance of the
sponsoring bodies; nevertheless, the Initiative sets out to
provide for any computational use of natural-language text
- if the guidelines were intended <emph>only</emph> to be used in a
literary context, they would be of no relevance for my
group's work and it would be inappropriate for me to
comment on them.)  At one point in the draft, for instance
(sec. 5.2.4), I noticed a special comment about the
encoding of <q>epistolary novels</q> (a fairly abstruse concept
for many of us!), while I found little or no recognition
anywhere of the many special problems of spoken as opposed
to written material, although in the NLP spectrum speech is
surely as important as, or more important than, all genres
of written language put together.  If some version of the TEI
is to play a useful role in the future, it would seem very
unfortunate if it does not appeal to those working with
spoken as well as those working with written texts, but the
current draft has nothing for the speech researchers.

It may be worth mentioning that some features of the
guidelines that I find problematical are undoubtedly
features of SGML, which the TEI has taken as given, rather
than features that the TEI has invented; however, I had no
prior knowledge of SGML, so I have criticized the
guidelines as a whole without paying attention to whether
some particular point was inherited from SGML.


Comments on specific points:

2.1.1.2, <q>document types</q>:  I find myself wondering where
this very rigorous concept would find an application in
practice, given the fairly anarchic formats found in most
kinds of real-life documents.

2.1.4.2:  <q>only the POEM element requires an end tag</q> -
doesn't the ANTHOLOGY element also require one?

2.1.4.3:  <q>#PCDATA</q>:  It is a pity to allot an important
role to a national character such as hash.  In many British
systems, pressing the hash key gives a pound sign.  My
Macintosh keyboard doesn't have the hash sign shown on any
key, though there is a way to get it by pressing two keys
at once.

2.1.5.1, middle of p. 20:  <q>notes or variants can appear at
any point in the content of a poem element</q>:  I wondered
whether this is recursive, so that they can occur anywhere
within an element of a poem element, but I'm not sure
whether or not this makes sense with respect to <q>element</q>
in the TEI sense - I probably need to read this section
again.

2.1.5.2, middle of p. 22:  I don't understand why it is
desirable to set the general system up in such a way that
one has to say that pages and stanzas are units of separate
superordinate structures (<q>anthology</q> and <q>p.anth</q>), while
the truth is (in this and, surely, many other cases) that
the lower-level units are independent ways of dividing up a
single superordinate structure.

2.1.6, p. 27, <q>POEMREF ID=Rose</q>:  This seems to be going
beyond the representation of text itself to create a
logical notation for showing certain aspects of the meaning
of the text.  I'm not sure what is gained; why not just
leave the reference in whatever form the writer used?
(Perhaps the answer is, if the document is produced by many
writers and the form of all cross-references is to be
decided once and for all by an editor, this offers a
neutral way to include the references in the writers' MSS.
But this seems rather different from the sort of purposes I
thought the TEI was designed to serve.)

3.1:  <q>every word belongs to a single language</q>:  even with
the proviso in brackets (and it would be a <emph>lot of</emph> proper
names), this comment seems over-simple.  The boundary
between being a word of language A used on occasion by a
speaker/writer of language B, and being a word of language
B that was borrowed from A, is a very blurry one (and the
nature of the boundary varies quite a lot depending on the
identity of A and B).

<q>phonetically-based writing systems</q>:  should read
<q>alphabetic systems</q> (syllabaries are also phonetically
based); and the term <q>calligraphic</q> in the same para is
misused.

3.1.1:  I did not follow the distinction between
<q>character</q> and <q>grapheme</q>.  (Both of them equally sound
like what in my book <emph>Writing Systems</emph> I call <q>graph</q>.  The
term <q>grapheme</q> is sometimes used in the literature with a
more specific sense that I think is not intended here:  for
instance, the two forms of lower-case Greek sigma may be
called allographs of a single grapheme.)

4.1.6:  I can see that this is realistic in the context of
editions of literary texts, where consecutive editions or
versions may perhaps exhibit only a small number of
differences.  But in the context of linguistic analysis
within natural language processing, it is unexceptional to
be producing consecutive versions of text files weekly or
daily which may contain annotational differences as dense
as, say, one per word; in this context, which is more
familiar to me, I feel that the guidelines are offering an
impossible ideal.

5.3.3:  I predict that someone who attempts to apply this
to real corpus data will soon hit cases which show that the
boundaries between inverted commas for reported speech,
<q>scare quotes</q>, etc. are actually blurred and not always
resolvable.

5.5.1, <tag>publ.city</tag>:  This is an Americanism, and one which
always seems puzzling to me even in American terms.  When
American forms identify the elements of an address to be
filled in as including <q>city</q>, my reaction is <q>But I don't
live in a city</q>; and although the meaning of <q>city</q> varies
transatlantically I didn't think Americans all lived in
<q>cities</q> either.  Why not <tag>publ.place</tag>, which has the added
advantage of not looking like the word <q>publicity</q>?

5.7.2:  I suppose it is unreasonably pedantic, even with
such literary colleagues, to suggest that <q>cross-reference</q>
should not be abbreviated <q>xref</q> since an X is a saltire
not a cross?  Yes, it is unreasonably pedantic.

5.8.1, <q>S-units</q>:  I have had to spend a lot of time
worrying about how to divide authentic corpus text up into
sentence-like segments in a consistent way.  It is a far
knottier problem than this section suggests (indeed,
ultimately I would think perfectly consistent standards are
unattainable).  I was particularly puzzled by the last
comment on p.103, which suggests that if you know that a
word-sequence is an S-unit you can predict its final
punctuation mark:  that is obviously not so, so I must have
misunderstood the comment.
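
As a small illustration of the difficulty, the sketch below applies
a deliberately naive rule (split after a full stop, question mark or
exclamation mark followed by white space) to an invented sentence.
The rule and the function name are my own assumptions for the sake
of the example, not a procedure taken from the draft.

```python
import re


def naive_s_units(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by white space."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Dr. Smith arrived at 5 p.m. yesterday. He left."
for unit in naive_s_units(sample):
    print(repr(unit))
```

The abbreviations "Dr." and "p.m." each trigger a spurious break, so
the two sentences come out as four fragments: final punctuation and
S-unit boundaries do not predict one another in either direction.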

8.1, p.130, end of middle para:  <q>work in progress on a set
of entity definitions for commonly used parts of speech and
other grammatical information...</q>:  I have done a huge
amount of work in this area and am planning to do a lot
more, if it is any use to you.

8.2.2, middle of p.138:  I was slightly startled to find
that the specimen tree illustrated was in dependency rather
than phrase-structure form, dependency notation being
definitely a minority taste in at least the
English-speaking linguistic world.  Presumably the TEI is
not systematically favouring dependency over PS trees, is
it?  (Personally I like dependency trees, but a lot of
linguists don't.)

6.2.4, p.140:  This example of a wordtagged sentence seems
to be intended to exemplify the LOB wordtags (indeed I see
the item <tag>level type=LOB</tag>), so you may as well get it
right:  the pronoun <q>I</q> would be PP1A, and <q>the</q>
should be ATI.  (No variant of the LOB/Brown tagsets uses a
tag ATD, and those variants that include PN use it for a
different sort of pronoun.)

A.2, p.210:  We would regard it as a very bad idea in
constructing a new formatting of LOB-type material to
replace the reference systems based on lines with one based
on S-units, because the identity of S-units is hazy as
already mentioned, and in practice we find that as research
continues one often changes decisions about segmentation
into sentences; the reference system based on lines is a
conveniently constant system.  (Commonly, we would use
files which assign words <emph>both</emph> line-based reference numbers
<emph>and</emph> sentence-based reference numbers, but the former would
be regarded as the important, fundamental system.)  The
problem of sentence segmentation occurs in spades with
spoken material such as the London-Lund Corpus, of course,
where the concept of <q>sentence</q> is really not applicable at
all.
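
The dual numbering described above can be sketched as follows.  The
class and function names, and the use of (line, position) pairs as
references, are hypothetical choices for this example rather than a
description of our actual files.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Token:
    word: str
    line_ref: Tuple[int, int]  # (line number, position in line), fixed by the layout


def tokens_from_lines(lines: List[str]) -> List[Token]:
    """Assign each word a line-based reference that never changes."""
    return [
        Token(word, (line_no, pos))
        for line_no, line in enumerate(lines, start=1)
        for pos, word in enumerate(line.split(), start=1)
    ]


def number_sentences(
    tokens: List[Token], starts: Set[int]
) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Map each line reference to (sentence number, position in sentence).

    `starts` holds the token indices at which a new sentence begins;
    the first token always begins sentence 1.
    """
    refs = {}
    sentence, pos = 0, 0
    for i, token in enumerate(tokens):
        if i == 0 or i in starts:
            sentence, pos = sentence + 1, 0
        pos += 1
        refs[token.line_ref] = (sentence, pos)
    return refs


tokens = tokens_from_lines(["well I think so", "yes quite"])
first_pass = number_sentences(tokens, {4})     # break before "yes"
revised = number_sentences(tokens, {1, 4})     # later decision: also break after "well"
print(first_pass[(2, 1)], revised[(2, 1)])
```

Revising the segmentation renumbers every later sentence, whereas
the line-based keys are untouched, which is why the line system is
the one we treat as fundamental.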
                                         TSI comments 0818/sh2