Disambiguation

Lou Burnard-7
I'm currently transforming some texts which were prepared by
someone else into something approaching TEI-conformant SGML, and
expect to be doing quite a lot of that sort of thing over the
next year or two. Most of the job is fairly straightforward --
what the person who prepared the text in the first place encoded
and what the TEI proposes should be encoded are not a million miles apart
(if they were something would be seriously wrong somewhere) --
and involves little more than some string twiddlings, for which
languages like Icon or Snobol are perfect (I use the latter
though if I were young again I'd use the former). But
occasionally...

For example, here's a problem which has just turned up on which
I'd appreciate comments from the collective wisdom, and of which
I should like to warn the collective unconscious.

The texts on which I am working were originally prepared for a
concordance. Consequently, they have a very detailed reference
scheme (which I can handle) and also go to some trouble to
distinguish homographs. This is done by adding a coded suffix to
a fairly haphazard selection of words, some 8% of the total of
different words in the text I'm looking at, some 12% of the
running text length. For example, `associate' (noun) appears as
`associate$0$', `associate' (verb) as `associate$1$'. `$0' always
follows a noun (but not every noun, by any means), `$1' always
follows a verb and so on. The tag also distinguishes senses or
other subdivisions for some words: thus `ball', noun, in the
sense of a spherical object, appears as "ball$0#1$", and `ball'
as a social gathering as "ball$0#2$".
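
As a minimal sketch of how the suffix scheme described above might be taken apart (in Python rather than Snobol or Icon): it assumes every tag has the shape word$C$ or word$C#S$, with a single category digit C and a numeric sense S. Only the noun (0) and verb (1) categories and the two `ball' senses come from the description above; the names TAGGED and parse_token are purely illustrative.

```python
import re

# Assumed token shape: word, '$', one category digit, optional '#sense', closing '$'.
TAGGED = re.compile(r"(?P<word>[A-Za-z'-]+)\$(?P<cat>\d)(?:#(?P<sense>\d+))?\$")

def parse_token(token):
    """Split a token into (word, category, sense).

    Untagged tokens come back unchanged with None for category and sense.
    """
    m = TAGGED.fullmatch(token)
    if m is None:
        return (token, None, None)
    return (m.group("word"), m.group("cat"), m.group("sense"))

print(parse_token("ball$0#1$"))
print(parse_token("associate$1$"))
print(parse_token("ball"))
```

Keeping category and sense as separate fields leaves every later choice open: they can be dropped, re-joined as `0#1', or written out as attributes.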

It's important to realise that these tags are not intended to
provide a full-blown linguistic analysis -- there are only nine
categories, of which the last two are "idiom/fossil/collocation"
and "infinitive particle or mixed categories". They are only
there to distinguish homographs. `Bath' (as a proper name) gets a
tag to distinguish it from `bath' as a common noun -- but no
other place names are tagged. So neither the TEI tags for
linguistic analysis nor the tags for place names seem
appropriate.

My question is: what shall I do with these tags? There seem to be
four possibilities:

1a. Throw them away

1b. Ignore them, i.e. just leave them in the text as funny-looking
tokens which the application will have to sort out as best it can.
(They will of course be documented in the TEI.Header, so what
more could you ask?)

2. Tag the word or phrase to which they belong as a distinct
segment (I suppose the S tag would do for this), including their
value on a suitable attribute. Something like this:
     <s category='0#1'>ball</s>
This would involve defining a new attribute of course, with a
default value of `unspecified'.

3. Represent the word plus its disambiguating tag as an entity.
Something like &ball01; perhaps, which could be defined simply as
"ball", if the distinction is not to be kept, or some other string
if it is.

1a.  seems a shame: for some applications (such as making a word
index) the disambiguating tags are very useful.

1b. is the easiest course of action but feels unwholesome

2. looks like overkill and moreover invites the question as to
why only some words or phrases get segmented in this way

3. would be easily the most satisfactory if there weren't quite
so many entities to define -- about 500 in all
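
Options 2 and 3 are both mechanical enough to sketch. The Python below is only an illustration under stated assumptions: the tag shape (word$C$ or word$C#S$), the attribute name `category', and the entity-naming rule (word + category + sense, giving names like ball01) are taken from the examples above or invented for the purpose, and the function names as_segments and as_entities are hypothetical. Words are restricted to plain letters here so that the generated entity names stay simple.

```python
import re

# Assumed tag shape: letters, '$', one category digit, optional '#sense', '$'.
TAGGED = re.compile(r"([A-Za-z]+)\$(\d)(?:#(\d+))?\$")

def as_segments(text):
    """Option 2: wrap each tagged word in <s> carrying the code as an attribute."""
    def repl(m):
        word, cat, sense = m.groups()
        value = cat if sense is None else f"{cat}#{sense}"
        return f"<s category='{value}'>{word}</s>"
    return TAGGED.sub(repl, text)

def as_entities(text):
    """Option 3: replace each tagged word by an entity reference.

    Returns the rewritten text plus one declaration per distinct entity,
    each expanding to the bare word (i.e. the distinction is not kept).
    """
    declarations = {}
    def repl(m):
        word, cat, sense = m.groups()
        name = f"{word}{cat}{sense or ''}"
        declarations[name] = f'<!ENTITY {name} "{word}">'
        return f"&{name};"
    return TAGGED.sub(repl, text), sorted(declarations.values())

print(as_segments("a ball$0#1$ rolled"))
body, decls = as_entities("ball$0#1$ and ball$0#2$")
print(body)
print("\n".join(decls))
```

Since the declarations fall out of the same pass over the text, the 500 or so entities would not have to be written by hand, though they would still have to travel with every text.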

Any ideas or counter-suggestions gratefully received.

Lou Burnard

(wearing Oxford Text Archive hat, rather than TEI one)

Re: Disambiguation

David R. Chesnutt-2
   While making no claim to "wisdom" here, my personal opinion is that
you should simply leave the tags in place with documentation in the
header.  You have foreseen that the tags *may* be useful in some instances;
therefore, it does seem a shame to throw them away and the other choices
do seem like overkill.
   I also gather that you feel the tags could be easily removed by
potential users who feel they are irrelevant.  Thus, to leave the tags
in place would not make the text less useful.
   I suspect that most text that is transformed to TEI standards will
present similar problems.  In our original transcriptions of letters
which are published in the Laurens Papers, we mark the hyphenation of
words.  In the files used for typesetting, the markup is eliminated.
If I were transforming the letter files into TEI conformant text, I
would retain the hyphenation markup and probably let the "local"
markup (hyphen=/=ated) stand.
  In short, I vote for retention but without further markup.
I'll be interested to see what others have to say, because I am
actually working on the problem of converting some of our files
to TEI standards.

  Happy coding Lou... David Chesnutt

Re: Disambiguation

Frank Wm. Tompa
In reply to this post by Lou Burnard-7
My interpretation of the TEI philosophy is that we wish to preserve
the data that is there, but not impose too many requirements on
providers of the data.  From Lou's list of possibilities, I then
vote for a variant of 2, which he states as:

    2. Tag the word or phrase to which they belong as a distinct
    segment (I suppose the S tag would do for this), including their
    value on a suitable attribute. Something like this:
         <s category='0#1'>ball</s>

but I would want to tag it more properly as 'noun, homograph 1'
by using more case-specific tags. I disagree with Lou's conclusion

    So neither the TEI tags for linguistic analysis nor the tags
    for place names seem appropriate.

The fact that not all words are tagged nor all place names marked
should not force us to water down the information that we have,
namely that some words are tagged and that some place names are marked
(and they are marked to distinguish their roles).

                                        Frank Tompa

Re: Disambiguation

Robert A Amsler-2
In reply to this post by Lou Burnard-7
Clearly the disambiguation information needs to be recorded.
There are people for whom that will be the ONLY interesting
aspect of that text. Such disambiguation information can be used to test
parsers to determine whether they can correctly identify
the meaning of the words from their contexts.

I would think something like, <word pos=0 sense=1>ball</word>
would have to be used.

The verboseness of the encoding shouldn't be a factor.
I would suggest removing the tags from the words themselves
though, since this would appear to be markup and as such
shouldn't become embedded in another markup as text.
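
A rough Python sketch of this suggestion follows. It moves the codes out of the words and into attributes, and also collects them as a list that could serve as the expected answers when testing a parser. The element and attribute names follow the <word pos=0 sense=1> example above; the tag shape (word$C$ or word$C#S$) and the function name separate_markup are assumptions made for illustration.

```python
import re

# Assumed tag shape: word, '$', one category digit, optional '#sense', '$'.
TAGGED = re.compile(r"([A-Za-z'-]+)\$(\d)(?:#(\d+))?\$")

def separate_markup(text):
    """Return (text with <word> elements, list of (word, pos, sense) tuples).

    The embedded '$' codes are removed from the words themselves and
    recorded both as attributes and as a separate answer list.
    """
    gold = []
    def repl(m):
        word, pos, sense = m.groups()
        gold.append((word, pos, sense))
        attrs = f"pos={pos}" if sense is None else f"pos={pos} sense={sense}"
        return f"<word {attrs}>{word}</word>"
    return TAGGED.sub(repl, text), gold

marked, gold = separate_markup("they associate$1$ at the ball$0#2$")
print(marked)
print(gold)
```

A parser under test would be run on the text with all markup stripped, and its guesses compared against the collected list.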