A set of <w> attributes for lightweight linguistic annotation

Martin Mueller

 

Today’s posting by Philip Stroebel and the response by Michael Sperberg-McQueen prompt me to post a draft that Piotr Banski, Giuseppe Celano, Susanne Haaf, and I have been thinking about. It concerns a set of attributes for <w> that will support lightweight, inline linguistic annotation of historical corpora. I was particularly taken with Michael’s judicious (as always) analysis of the trade-offs between inline and stand-off markup.

 

Below is the draft.

To: TEI List

From: Piotr Banski, Giuseppe Celano, Susanne Haaf, Martin Mueller

 

Coming from different projects that use linguistic annotation on a corpus-wide scale (DTA, EEBO-TCP, IDS,  Perseus), we think it would be useful to have a set of attributes for <w> that support lightweight linguistic annotation. We are putting this memo on the TEI list to explore whether other folks feel a similar need and whether the needs can be met by a reasonably compact set sufficiently capacious to meet many needs. Piotr Banski has an informal proposal for wordAttributes at https://github.com/LingSIG/wordAttributes.

 

Linguistic annotation is never a truly lightweight enterprise because it ‘explicitates’ for the machine at least some of the myriad rules and facts that a child tacitly brings to the task of making sense of simple utterances. It is meant to be processed rather than read.  In the context of linguistic annotation “lightweight” is a relative concept.  Roughly speaking, it inserts into a text some rudiments of readerly knowledge in a form that a machine can process, and it does so in a format that the guardians of the machine find tolerably easy to manage.

 

We are proposing five attributes: @lemma, @pos, @reg, @feats, and @join. @feats is taken from Universal Dependencies and supports more detailed morphological description than @pos. @join is taken from the Morpho-syntactic Annotation Framework (MAF), of which I believe Laurent Romary has been the spiritus rector. It is useful for managing the absence of whitespace around punctuation and contracted forms.
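To make the role of @feats concrete, here is a minimal sketch of how a processor might unpack such a value. It assumes that @feats would carry the same syntax as the FEATS column in Universal Dependencies (Name=Value pairs separated by "|", multiple values separated by commas, "_" for none); the memo itself does not fix the value syntax, and the function name parse_feats is purely illustrative.

```python
def parse_feats(value):
    """Split a UD-style feature string into a dict of name -> list of values.

    Assumption: the attribute value follows the UD FEATS convention;
    an empty string or "_" means that no features are recorded.
    """
    if value in ("", "_"):
        return {}
    feats = {}
    for pair in value.split("|"):
        name, _, val = pair.partition("=")
        # UD allows several values for one feature, e.g. Case=Acc,Dat
        feats[name] = val.split(",")
    return feats


example = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(example["Person"])
```

Keeping the whole feature bundle in one attribute value means it can travel with the token as a single string and be unpacked only when a query needs it.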

 The following line from a linguistically annotated version of the TCP transcription of Fletcher’s “Rollo Duke of Normandy” illustrates the use of these attributes, except for @feats:

 

<l>
    <w lemma="come" pos="vvb" reg="Come" xml:id="A00959-027-b-2140" join="left">Come</w>
    <w lemma="we" pos="pns" reg="we" xml:id="A00959-027-b-2150">we</w>
    <w lemma="be" pos="vvb" reg="are" xml:id="A00959-027-b-2160">are</w>
    <w lemma="stark" pos="av_j" reg="stark" xml:id="A00959-027-b-2170">starke</w>
    <w lemma="nought" pos="pi-x" reg="naught" xml:id="A00959-027-b-2180">nought</w>
    <w lemma="all" pos="d" reg="all" xml:id="A00959-027-b-2190">all</w>
    <pc xml:id="A00959-027-b-2200" join="left">;</pc>
    <w lemma="bad" pos="j" reg="bad" xml:id="A00959-027-b-2210">bad</w>
    <w lemma="be" pos="vvz" reg="'s" join="left" xml:id="A00959-027-b-2211">'s</w>
    <w lemma="the" pos="d" reg="the" xml:id="A00959-027-b-2220">the</w>
    <w lemma="best" pos="j-s" reg="best" xml:id="A00959-027-b-2230">best</w>
    <w lemma="on" pos="acp-av" reg="on" xml:id="A00959-027-b-2240">on</w>
    <w lemma="we" pos="pno" reg="'s" join="left" xml:id="A00959-027-b-2241">'s</w>
    <pc xml:id="A00959-027-b-2250" join="left">,</pc>
</l>
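As an illustration of what @join buys a processor, the following sketch rebuilds the running text of part of that line from its tokens. It rests on two assumptions that the memo does not spell out: that join="left" means "no whitespace between this token and the preceding one", and that the elements are in no namespace (a real TEI file would put them in the TEI namespace). The function name surface_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Second half of the sample line, without xml:id for brevity.
LINE = """<l>
  <w lemma="all" pos="d" reg="all">all</w>
  <pc join="left">;</pc>
  <w lemma="bad" pos="j" reg="bad">bad</w>
  <w lemma="be" pos="vvz" reg="'s" join="left">'s</w>
  <w lemma="the" pos="d" reg="the">the</w>
  <w lemma="best" pos="j-s" reg="best">best</w>
  <w lemma="on" pos="acp-av" reg="on">on</w>
  <w lemma="we" pos="pno" reg="'s" join="left">'s</w>
  <pc join="left">,</pc>
</l>"""


def surface_text(line_xml):
    """Concatenate <w> and <pc> tokens, honouring join="left"."""
    parts = []
    for tok in ET.fromstring(line_xml):
        if tok.tag not in ("w", "pc"):
            continue
        # Assumed semantics: a token with join="left" attaches to its
        # predecessor; every other non-initial token is preceded by a space.
        if parts and tok.get("join") != "left":
            parts.append(" ")
        parts.append((tok.text or "").strip())
    return "".join(parts)


print(surface_text(LINE))  # all; bad's the best on's,
```

Because the whitespace decision lives on the token itself, the indentation and line breaks between elements can be discarded without losing the original spacing.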

 

 

 

From the linguist’s perspective, the current TEI rules look odd. There is no POS attribute, which is the first thing a linguist would look for. There is, however, a lemma attribute. If one, why not the other? In earlier correspondence on this topic, Lou Burnard and Paul Schaffner asked whether the needs of lightweight annotation could not be met by @ana, @corresp, @datcat, @function, @subtype, @lemma, @lemmaRef, @type, and @valueDataCat. The answer to this is “yes, they can,” but not in a manner that linguists will find intuitive or inviting. And if they don’t find the TEI sufficiently intuitive, they will go elsewhere.

 

There are many instances where the TEI has accommodated domain-specific needs. We have <pb/> as syntactic sugar for <milestone unit="page"/>. Who “needs” <l> when <ab type="verseline"> adequately expresses the encoder’s intention? And so on.

 

Ease of processing is another concern, especially in large-scale projects. The <choice> mechanism is certainly one way of expressing alternation between an original and a regularized spelling. But it has several disadvantages. It does not scale well. It creates ambiguities: do you put <choice> inside <w> or <w> inside <choice>? And it creates a need for special handling. If you think of a text as a sequence of <w> and <pc> tokens, it is simpler if all the metadata associated with a <w> element exist at the same level and can be handled in the same fashion. If you want to move annotation data from one environment (XML) to another (SQL), it is trivial to move attributes into columns. Moving a mix of elements and attributes into SQL is possible, but it requires more thought.
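The attribute-to-column move can be sketched in a few lines. This is only an illustration under stated assumptions: the table layout, the column name join_attr (chosen because JOIN is an SQL keyword), and the function name load_tokens are all hypothetical, and the elements are assumed to be in no namespace.

```python
import sqlite3
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

LINE = """<l>
  <w lemma="bad" pos="j" reg="bad" xml:id="A00959-027-b-2210">bad</w>
  <w lemma="be" pos="vvz" reg="'s" join="left" xml:id="A00959-027-b-2211">'s</w>
  <w lemma="the" pos="d" reg="the" xml:id="A00959-027-b-2220">the</w>
  <w lemma="best" pos="j-s" reg="best" xml:id="A00959-027-b-2230">best</w>
</l>"""


def load_tokens(line_xml):
    """Load every <w> into an in-memory SQLite table, one column per attribute."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token (id TEXT PRIMARY KEY, form TEXT, lemma TEXT,"
        " pos TEXT, reg TEXT, join_attr TEXT)"
    )
    rows = [
        (w.get(XML_ID), w.text, w.get("lemma"), w.get("pos"),
         w.get("reg"), w.get("join"))  # absent attributes become NULL
        for w in ET.fromstring(line_xml).iter("w")
    ]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


conn = load_tokens(LINE)
print(conn.execute("SELECT id, form FROM token WHERE lemma = 'be'").fetchall())
```

Each token becomes one row with no tree walking beyond the token itself; a <choice>-based encoding would need extra logic to decide which child supplies the form and which the regularization.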

 

Can a @reg attribute hold all the data it needs to hold? According to Lou Burnard, it was abolished “on the grounds that you might well want to include in its value some markup (e.g. a <g> element) which would then not be correctly processed since attribute values may not contain markup.” This is theoretically possible but will be very rare in practice. So there may be some wisdom in maintaining a simpler alternative that will work well most of the time.
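For the common case, reading a regularized version of the text off @reg is a one-liner per token. The sketch below assumes that a missing @reg means "use the transcribed form as is" (a convention not stated in the memo), and that elements are in no namespace; the function name regularized is illustrative.

```python
import xml.etree.ElementTree as ET

LINE = """<l>
  <w lemma="stark" pos="av_j" reg="stark">starke</w>
  <w lemma="nought" pos="pi-x" reg="naught">nought</w>
  <w lemma="all" pos="d">all</w>
</l>"""


def regularized(line_xml):
    """Return the regularized form of each <w>, falling back to its text."""
    forms = []
    for w in ET.fromstring(line_xml).iter("w"):
        original = (w.text or "").strip()
        # Attribute values are plain strings, so @reg can never carry
        # child markup such as <g>; that is the limitation quoted above.
        forms.append(w.get("reg", original))
    return forms


print(regularized(LINE))  # ['stark', 'naught', 'all']
```

The trade-off is visible in the comment: the attribute is trivially easy to read, at the cost of being unable to hold the rare regularization that itself needs markup.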

 

As Piotr put it in his wordAttributes draft: The overall principle invoked here is the well-known KISS (for "keep it simple, scholar, or else you will end up alone in your armchair, while many corpus linguists adopt other annotation formats").

 

 

 


Re: A set of <w> attributes for lightweight linguistic annotation

Eduard Drenth

In my post of yesterday I addressed exactly this 'lightweight and inline linguistic annotation' with a ready-to-use proposal. Perhaps you can take a look at that.


Eduard Drenth, Software Architekt


[hidden email]


Doelestrjitte 8

8911 DX  Ljouwert

+31 58 234 30 47


gpg: https://sks-keyservers.net/pks/lookup?op=get&search=0x065EF82A1E02CC43




From: TEI (Text Encoding Initiative) public discussion list <[hidden email]> on behalf of Martin Mueller <[hidden email]>
Sent: Thursday, February 2, 2017 12:11 AM
To: [hidden email]
Subject: A set of <w> attributes for lightweight linguistic annotation
 

 
