tagging multiple lemmas to ambiguous words

Levi Damsma
Dear everyone,

I am currently lemmatising and POS-tagging an Old Frisian text (the elder 'Skeltanariucht' from MS Junius 49) for the Frisian Academy (Fryske Akademy), and I am wondering how to cope with ambiguity: specifically, a word that could be assigned two different lemmata depending on how one interprets it.

An example: "tha banne" ('the ban/the summons') could be the dative case of a masculine "thi ban" or of a neuter "thet ban". This word, "bon", appears as both masculine and neuter elsewhere in this text, so it is not possible to determine whether I should tag the article in this example with the lemma "thet" or "thi". Ideally, in our online edition of this text, I want this word to link to both lemmata.

First I was thinking of something like <choice></choice> (which I now use for corrections with <sic> and <corr>), but maybe this is not what I want, because I want to show both lemmata, not switch views between them. I'd rather just find a way to add two lemmata in one <w>/word.

I do not have much experience with TEI, so maybe I am overlooking a very simple solution.
Does anybody have a good solution, or maybe just some thoughts which could point me in the right direction?

Thanks!
Levi

Re: tagging multiple lemmas to ambiguous words

Piotr Bański-2
Dear Levi,

I wonder how far along in the process you are and how open you are to
making your markup a bit more complex (the alternative being to keep it
simple and just kludge the more complex cases).

And re(re,re)reading your message, I am actually not sure how detailed
you want to be and what your initial assumptions are, because it
seems that at the beginning you are talking about two lemmas of the noun
and then shift the focus to two lemmas of the article (my conjecture
being that each article lemma defines its own paradigm depending on its
gender).

I assume you do something like

<w lemma="thi" type="art" subtype="dat.masc">tha</w>

and would like to be able to signal the possibility of

<w lemma="thet" type="art" subtype="dat.neut">tha</w>


(and at this point, let me state that this alone seems grist to the mill
for the suggestion of rationalizing simple w-level linguistic markup,
voiced recently on this list)

So you have two distinct ordered sets of values for a single word-sized
piece of text that you would like to express together. The way out that
I see is either to keep to the simple version and get a bit kludgy for
the complex cases by doing:

<w ana="#thi_thet">tha</w>

where "#thi_thet" identifies a place in the document where you list the
relevant feature complexes, and your processor knows that when it sees
the @ana attribute, it should do some special magic. Two remarks now:
1. that "place in the document" can be under <standoff> (a sibling of
<text>, approved by the Council long ago but still absent from the
online documentation)
2. a less kludgy version of the above would involve using @ana across
the board, on all <w> elements.
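
A minimal sketch of the "special magic" a processor might perform for the @ana option, using only Python's standard library. Everything in the sample document is an assumption made for illustration: TEI namespace declarations are omitted, the container is written as standOff (the lookup does not depend on its name), and the layout of the feature structure is modelled on the National Corpus of Polish fragment rather than prescribed by the message. The helper names `feature` and `interpretations` are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document (assumption: no TEI namespace declared).
DOC = """
<TEI>
  <standOff>
    <fs xml:id="thi_thet" type="morph">
      <f name="interps">
        <fs type="lex">
          <f name="base"><string>thi</string></f>
          <f name="ctag"><symbol value="art"/></f>
          <f name="msd"><symbol value="dat.masc"/></f>
        </fs>
        <fs type="lex">
          <f name="base"><string>thet</string></f>
          <f name="ctag"><symbol value="art"/></f>
          <f name="msd"><symbol value="dat.neut"/></f>
        </fs>
      </f>
    </fs>
  </standOff>
  <text><body><p><w ana="#thi_thet">tha</w> <w>banne</w></p></body></text>
</TEI>
"""


def feature(fs, name):
    """Return the value of the named <f>: @value of a <symbol>, else the text."""
    f = fs.find(f"f[@name='{name}']")
    if f is None or len(f) == 0:
        return None
    value = f[0]
    return value.get("value") if value.tag == "symbol" else value.text


def interpretations(root, word):
    """Return (lemma, pos, msd) triples for every <fs> that word/@ana points to."""
    # Index every element that carries an xml:id so pointers can be resolved.
    index = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    results = []
    # @ana may hold several whitespace-separated pointers.
    for pointer in word.get("ana", "").split():
        target = index.get(pointer.lstrip("#"))
        if target is None:
            continue
        for lex in target.iterfind("f[@name='interps']/fs"):
            results.append(tuple(feature(lex, n) for n in ("base", "ctag", "msd")))
    return results


root = ET.fromstring(DOC)
for w in root.iter("w"):
    print(w.text, interpretations(root, w))
```

A word without @ana simply yields an empty list, so the same function could be applied to every <w> if @ana were used across the board.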

Or you get more complex by invoking ISO MAF (Morpho-syntactic annotation
framework)... [1][2]

[1]: http://www.iso.org/iso/catalogue_detail.htm?csnumber=51934
[2]: https://jtei.revues.org/523#tocto2n5

... and mapping it to TEI in some clever way. A clever way could again
involve the approved but still unofficial <standoff> element in the same
document, or a series of documents the way that e.g. the National Corpus
of Polish [3] did.

[3]: http://nlp.ipipan.waw.pl/TEI4NKJP/


Below, I paste a fragment of the file (warning: large!) that you can find at

http://nlp.ipipan.waw.pl/TEI4NKJP/example_all_levels_1M/ann_morphosyntax.xml

In the fragment below (I suggest pasting it into an
XML editor for highlighting), the "interps" feature lists all possible
interpretations of the string "młodzi", while the "disamb" feature
presents the result of automatic disambiguation.
"base" stands for lemma, "ctag" for part of speech, and "msd" for
morpho-syntactic description (we used the CES names for sentimental
reasons).

<seg corresp="ann_segmentation.xml#segm_1.2-seg" xml:id="morph_1.2-seg">
  <fs type="morph">
    <f name="orth"><string>młodzi</string></f><!-- młodzi [5,6] -->
    <f name="interps">
      <fs type="lex" xml:id="morph_1.2.1-lex">
        <f name="base"><string>młody</string></f>
        <f name="ctag"><symbol value="adj"/></f>
        <f name="msd">
          <vAlt>
            <symbol value="pl:nom:m1:pos" xml:id="morph_1.2.1.1-msd"/>
            <symbol value="pl:voc:m1:pos" xml:id="morph_1.2.1.2-msd"/>
          </vAlt>
        </f>
      </fs>
      <fs type="lex" xml:id="morph_1.2.2-lex">
        <f name="base"><string>młody</string></f>
        <f name="ctag"><symbol value="subst"/></f>
        <f name="msd">
          <vAlt>
            <symbol value="pl:nom:m1" xml:id="morph_1.2.2.1-msd"/>
            <symbol value="pl:voc:m1" xml:id="morph_1.2.2.2-msd"/>
          </vAlt>
        </f>
      </fs>
      <fs type="lex" xml:id="morph_1.2.3-lex">
        <f name="base"><string>młodzi</string></f>
        <f name="ctag"><symbol value="subst"/></f>
        <f name="msd">
          <vAlt>
            <symbol value="pl:nom:m1" xml:id="morph_1.2.3.1-msd"/>
            <symbol value="pl:voc:m1" xml:id="morph_1.2.3.2-msd"/>
          </vAlt>
        </f>
      </fs>
      <fs type="lex" xml:id="morph_1.2.4-lex">
        <f name="base"><string>młodzie</string></f>
        <f name="ctag"><symbol value="subst"/></f>
        <f name="msd"><symbol value="pl:gen:n" xml:id="morph_1.2.4.1-msd"/></f>
      </fs>
      <fs type="lex" xml:id="morph_1.2.5-lex">
        <f name="base"><string>młódź</string></f>
        <f name="ctag"><symbol value="subst"/></f>
        <f name="msd">
          <vAlt>
            <symbol value="sg:gen:f" xml:id="morph_1.2.5.1-msd"/>
            <symbol value="sg:dat:f" xml:id="morph_1.2.5.2-msd"/>
            <symbol value="sg:loc:f" xml:id="morph_1.2.5.3-msd"/>
            <symbol value="sg:voc:f" xml:id="morph_1.2.5.4-msd"/>
            <symbol value="pl:gen:f" xml:id="morph_1.2.5.5-msd"/>
          </vAlt>
        </f>
      </fs>
    </f>
    <f name="disamb">
      <fs feats="#an8003" type="tool_report">
        <f fVal="#morph_1.2.1.1-msd" name="choice"/>
        <f name="interpretation">
          <string>młody:adj:pl:nom:m1:pos</string>
          <!-- interpretation -->
        </f>
      </fs>
    </f>
  </fs>
</seg>
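
To make the relation between "interps" and "disamb" concrete, here is a rough sketch in Python (standard library only) that enumerates the candidate readings and follows the @fVal pointer to the chosen one. The embedded fragment is an abbreviated copy of the one above (two of the five lexemes, no @feats, no namespace declarations), and the function names `readings` and `chosen` are hypothetical; this is not the tooling used for the corpus.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Abbreviated copy of the fragment (assumption: namespaces omitted).
FRAGMENT = """
<seg xml:id="morph_1.2-seg">
 <fs type="morph">
  <f name="orth"><string>młodzi</string></f>
  <f name="interps">
   <fs type="lex" xml:id="morph_1.2.1-lex">
    <f name="base"><string>młody</string></f>
    <f name="ctag"><symbol value="adj"/></f>
    <f name="msd"><vAlt>
     <symbol value="pl:nom:m1:pos" xml:id="morph_1.2.1.1-msd"/>
     <symbol value="pl:voc:m1:pos" xml:id="morph_1.2.1.2-msd"/>
    </vAlt></f>
   </fs>
   <fs type="lex" xml:id="morph_1.2.4-lex">
    <f name="base"><string>młodzie</string></f>
    <f name="ctag"><symbol value="subst"/></f>
    <f name="msd"><symbol value="pl:gen:n" xml:id="morph_1.2.4.1-msd"/></f>
   </fs>
  </f>
  <f name="disamb">
   <fs type="tool_report">
    <f fVal="#morph_1.2.1.1-msd" name="choice"/>
   </fs>
  </f>
 </fs>
</seg>
"""


def readings(seg):
    """Yield (base, ctag, msd, msd_id) for every candidate interpretation."""
    for lex in seg.iterfind("fs/f[@name='interps']/fs"):
        base = lex.findtext("f[@name='base']/string")
        ctag = lex.find("f[@name='ctag']/symbol").get("value")
        # iter() reaches each <symbol> whether or not it is wrapped in <vAlt>.
        for sym in lex.find("f[@name='msd']").iter("symbol"):
            yield base, ctag, sym.get("value"), sym.get(XML_ID)


def chosen(seg):
    """Return the (base, ctag, msd) reading that disamb/@fVal points at, or None."""
    choice = seg.find("fs/f[@name='disamb']/fs/f[@name='choice']")
    if choice is None:
        return None
    target = choice.get("fVal", "").lstrip("#")
    for base, ctag, msd, msd_id in readings(seg):
        if msd_id == target:
            return base, ctag, msd
    return None


seg = ET.fromstring(FRAGMENT)
for reading in readings(seg):
    print(reading)
print("disambiguated:", chosen(seg))
```

The point of the design is that every alternative stays in the document and the disambiguation result is only a pointer to one of them, so an undecidable case can simply be left without a choice.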


I am hopeful that some middle-ground examples from others on this list
are forthcoming.

HTH and best regards,

   Piotr



On 23/11/16 09:35, Levi Damsma wrote:



--
Piotr Bański, Ph.D.
Senior Researcher,
Institut für Deutsche Sprache,
R5 6-13
68-161 Mannheim, Germany

Re: tagging multiple lemmas to ambiguous words

Levi Damsma
Dear Piotr,

Thank you for your answer. We're leaning towards your 'kludgy' option; I think it suits our needs.

<w ana="#thî_1_thet_1">tha</w>

referring to an interpretation group:

<interpGrp xml:id="thî_1_thet_1">
 <interp type="lemma">thî_1</interp>
 <interp type="lemma">thet_1</interp>
</interpGrp>

(The "_1" is to distinguish between different lemmata in our online Old Frisian dictionary.)
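
As a rough illustration of how an edition could show both lemmata for one word, the sketch below resolves each <w>'s @ana to the <interp type="lemma"> children of the matching <interpGrp>. It is only a sketch under assumptions: the surrounding document structure, the placement of the <interpGrp>, the omission of namespace declarations, and the function names `lemmata` and `render` are all invented for the example, and turning the lemma identifiers into dictionary links is left out because the dictionary's URL scheme is not given here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical wrapper around the two snippets above (namespaces omitted).
DOC = """
<TEI>
  <text>
    <body>
      <p><w ana="#thî_1_thet_1">tha</w> <w>banne</w></p>
      <interpGrp xml:id="thî_1_thet_1">
        <interp type="lemma">thî_1</interp>
        <interp type="lemma">thet_1</interp>
      </interpGrp>
    </body>
  </text>
</TEI>
"""


def lemmata(root, word):
    """Return the lemma identifiers of every <interpGrp> that word/@ana points to."""
    groups = {g.get(XML_ID): g for g in root.iter("interpGrp")}
    found = []
    for pointer in word.get("ana", "").split():
        group = groups.get(pointer.lstrip("#"))
        if group is not None:
            found += [i.text for i in group.iterfind("interp[@type='lemma']")]
    return found


def render(root):
    """Return one 'form [lemma | lemma]' string per <w> for a plain-text preview."""
    lines = []
    for w in root.iter("w"):
        lems = lemmata(root, w)
        lines.append(f"{w.text} [{' | '.join(lems)}]" if lems else w.text)
    return lines


for line in render(ET.fromstring(DOC)):
    print(line)
```

Words without @ana are passed through unchanged, so unambiguous words can keep whatever simpler markup they already have.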

Best wishes,
Levi

________________________________________
From: Piotr Banski <[hidden email]>
Sent: Wednesday, 23 November 2016 14:48
To: Levi Damsma; [hidden email]
Subject: Re: tagging multiple lemmas to ambiguous words
