Representation of linguistic information and word and sentence alignments in XML files

Phillip Ströbel
Dear members of the TEI community,

I have several questions regarding the annotation of linguistic information in a corpus consisting of XML documents.

We would like to add parsing information to sentences. We represent a normal sentence like this:

<s id="a1-s6" lang="de">
        <w id="a1-s6-w1" lemma="d" pos="ART">Die</w>
        <w id="a1-s6-w2" lemma="weit" pos="ADJA">weiten</w>
        <w id="a1-s6-w3" lemma="," pos="$,">,</w>
        <w id="a1-s6-w4" lemma="öde" pos="ADJA">öden</w>
        <w id="a1-s6-w5" lemma="Gebiet" pos="NN">Gebiete</w>
        <w id="a1-s6-w6" lemma="," pos="$,">,</w>
        <w id="a1-s6-w7" lemma="d" pos="PRELS">die</w>
        <w id="a1-s6-w8" lemma="Kälte" pos="NN">Kälte</w>
        <w id="a1-s6-w9" lemma="," pos="$,">,</w>
        <w id="a1-s6-w10" lemma="d" pos="PRELS">der</w>
        <w id="a1-s6-w11" lemma="Schnee" pos="NN">Schnee</w>
        <w id="a1-s6-w12" lemma="sein" pos="VAFIN">sind</w>
        <w id="a1-s6-w13" lemma="mein" pos="PPOSAT">meine</w>
        <w id="a1-s6-w14" lemma="bevorzugt" pos="ADJA">bevorzugte</w>
        <w id="a1-s6-w15" lemma="Umgebung" pos="NN">Umgebung</w>
        <w id="a1-s6-w16" lemma=";" pos="$.">;</w>
 </s>

The "a1" in the sentence and word IDs corresponds to the article in which a sentence occurs. Moreover, we have a lemma and a part-of-speech tag (pos) for every word.

Our first idea is to include the parsing information directly in the w-elements, which would look like this:

<s id="a1-s6" lang="de">
        <w deprel="DET" head="a1-s6-w5" id="a1-s6-w1" lemma="d" pos="ART">Die</w>
        <w deprel="ATTR" head="a1-s6-w5" id="a1-s6-w2" lemma="weit" pos="ADJA">weiten</w>
        <w deprel="-PUNCT-" head="a1-s6-w2" id="a1-s6-w3" lemma="," pos="$,">,</w>
        <w deprel="KON" head="a1-s6-w2" id="a1-s6-w4" lemma="öde" pos="ADJA">öden</w>
        <w deprel="NEB" head="a1-s6-w12" id="a1-s6-w5" lemma="Gebiet" pos="NN">Gebiete</w>
        <w deprel="-PUNCT-" head="a1-s6-w5" id="a1-s6-w6" lemma="," pos="$,">,</w>
        <w deprel="SUBJ" head="a1-s6-w12" id="a1-s6-w7" lemma="d" pos="PRELS">die</w>
        <w deprel="PRED" head="a1-s6-w12" id="a1-s6-w8" lemma="Kälte" pos="NN">Kälte</w>
        <w deprel="-PUNCT-" head="a1-s6-w8" id="a1-s6-w9" lemma="," pos="$,">,</w>
        <w deprel="DET" head="a1-s6-w11" id="a1-s6-w10" lemma="d" pos="PRELS">der</w>
        <w deprel="KON" head="a1-s6-w8" id="a1-s6-w11" lemma="Schnee" pos="NN">Schnee</w>
        <w head="root" id="a1-s6-w12" lemma="sein" pos="VAFIN">sind</w>
        <w deprel="DET" head="a1-s6-w15" id="a1-s6-w13" lemma="mein" pos="PPOSAT">meine</w>
        <w deprel="ATTR" head="a1-s6-w15" id="a1-s6-w14" lemma="bevorzugt" pos="ADJA">bevorzugte</w>
        <w deprel="PRED" head="a1-s6-w12" id="a1-s6-w15" lemma="Umgebung" pos="NN">Umgebung</w>
        <w deprel="-PUNCT-" head="a1-s6-w15" id="a1-s6-w16" lemma=";" pos="$.">;</w>
</s>

Every word except the root takes the ID of another word in the same sentence as its head, and every w-element except the root specifies its dependency relation to that head.

Is this a practical solution?

Moreover, we would like to integrate sentence alignment information, since we are working with a multilingual corpus.

Our idea is to integrate this information directly into the s-elements (as we do for articles which are translated). This could look as follows:

<s id="a1-s6" lang="de" alignment_targets="SAC-Jahrbuch_1990_fr.xml:a1-s5">

The alignment target specifies the article and sentence ID in the yearbook in the other language which corresponds to the current s-element. Or would it be better to provide this information in a link group or just link element and then have the words in another, separate element?

<s id="a1-s6" lang="de">
    <link xtype="1-2" xtargets="SAC-Jahrbuch_1990_fr.xml:a1-s5;SAC-Jahrbuch_1990_fr.xml:a1-s6"/>
    <words>
        <w deprel="DET" head="a1-s6-w5" id="a1-s6-w1" lemma="d" pos="ART">Die</w>
        <w deprel="ATTR" head="a1-s6-w5" id="a1-s6-w2" lemma="weit" pos="ADJA">weiten</w>
        <w deprel="-PUNCT-" head="a1-s6-w2" id="a1-s6-w3" lemma="," pos="$,">,</w>
        <w deprel="KON" head="a1-s6-w2" id="a1-s6-w4" lemma="öde" pos="ADJA">öden</w>
        ....
    </words>
</s>

Same for word alignment: Separate link groups or alignment information directly in the w-elements, like so:

<w deprel="DET" head="a1-s6-w5" id="a1-s6-w1" lemma="d" pos="ART" alignment_targets="SAC-Jahrbuch_1990_fr.xml:a1-s6-w1 SAC-Jahrbuch_1990_it.xml:a1-s5-w1">Die</w>

I'm looking forward to your suggestions.

Kind regards,

Phillip

Re: Representation of linguistic information and word and sentence alignments in XML files

C. M. Sperberg-McQueen
> On Feb 1, 2017, at 4:01 AM, Phillip Ströbel <[hidden email]> wrote:
>
> Dear members of the TEI community,
>
> I have several questions regarding the annotation of linguistic information in a corpus consisting of XML documents.
>
> We would like to add parsing information to sentences. We represent a normal sentence like this:
>
> ...
> The "a1" in the sentence and word IDs corresponds to the article in which a sentence occurs. Moreover, we have a lemma and a part-of-speech tag (pos) for every word.
>
> Our first idea is to include the parsing information directly in the w-elements, which would look like this:
>
> <s id="a1-s6" lang="de">
>        <w deprel="DET" head="a1-s6-w5" id="a1-s6-w1" lemma="d" pos="ART">Die</w>
>        <w deprel="ATTR" head="a1-s6-w5" id="a1-s6-w2" lemma="weit" pos="ADJA">weiten</w>
>        <w deprel="-PUNCT-" head="a1-s6-w2" id="a1-s6-w3" lemma="," pos="$,">,</w>
>        <w deprel="KON" head="a1-s6-w2" id="a1-s6-w4" lemma="öde" pos="ADJA">öden</w>
>        <w deprel="NEB" head="a1-s6-w12" id="a1-s6-w5" lemma="Gebiet" pos="NN">Gebiete</w>
>        <w deprel="-PUNCT-" head="a1-s6-w5" id="a1-s6-w6" lemma="," pos="$,">,</w>
>        <w deprel="SUBJ" head="a1-s6-w12" id="a1-s6-w7" lemma="d" pos="PRELS">die</w>
>        <w deprel="PRED" head="a1-s6-w12" id="a1-s6-w8" lemma="Kälte" pos="NN">Kälte</w>
>        <w deprel="-PUNCT-" head="a1-s6-w8" id="a1-s6-w9" lemma="," pos="$,">,</w>
>        <w deprel="DET" head="a1-s6-w11" id="a1-s6-w10" lemma="d" pos="PRELS">der</w>
>        <w deprel="KON" head="a1-s6-w8" id="a1-s6-w11" lemma="Schnee" pos="NN">Schnee</w>
>        <w head="root" id="a1-s6-w12" lemma="sein" pos="VAFIN">sind</w>
>        <w deprel="DET" head="a1-s6-w15" id="a1-s6-w13" lemma="mein" pos="PPOSAT">meine</w>
>        <w deprel="ATTR" head="a1-s6-w15" id="a1-s6-w14" lemma="bevorzugt" pos="ADJA">bevorzugte</w>
>        <w deprel="PRED" head="a1-s6-w12" id="a1-s6-w15" lemma="Umgebung" pos="NN">Umgebung</w>
>        <w deprel="-PUNCT-" head="a1-s6-w15" id="a1-s6-w16" lemma=";" pos="$.">;</w>
> </s>
>
> Every word except the root takes the ID of another word in the same sentence as its head, and every w-element except the root specifies its dependency relation to that head.
>
> Is this a practical solution?

It looks practical to me, for what that’s worth, given that your
parsing information is a dependency tree and not a constituency tree.
But my view is based only on recent introspection; I think a structure
like this is natural for dependency trees, but I don’t have experience
with this or any other style of representation for dependency trees in
practice; I hope someone else will.

One point that may be of interest: the Prague Czech-English Dependency
Corpus treats the sentence-ending punctuation as the root of the
sentence; I think this may make it slightly simpler to check the tree
for wellformedness.  In the case of your sentence, that would involve
changes to:

    <w deprel="…" head="a1-s6-w16" id="a1-s6-w12"
          lemma="sein" pos="VAFIN">sind</w>
    <w deprel="..." head="a1-s6" id="a1-s6-w16"
          lemma=";" pos="$.">;</w>
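
A well-formedness check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the attribute names from the example above (id, head), it treats a head as marking the root when it is either the literal "root" or the ID of the enclosing s element, and the function name check_dependency_tree and the abbreviated three-word sample are made up here.

```python
import xml.etree.ElementTree as ET


def check_dependency_tree(s_elem):
    """Return a list of problems found in one <s> element (empty if none).

    Assumes each <w> has @id and @head, and that the root word's @head is
    either the literal "root" or the @id of the enclosing <s>.
    """
    root_markers = {"root", s_elem.get("id")}
    heads = {w.get("id"): w.get("head") for w in s_elem.iter("w")}
    problems = []

    # Exactly one word should point at a root marker.
    roots = [wid for wid, head in heads.items() if head in root_markers]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")

    # Every other head must be the ID of a word in the same sentence.
    for wid, head in heads.items():
        if head not in root_markers and head not in heads:
            problems.append(f"{wid}: head {head!r} is not a word in this sentence")

    # Follow head pointers upward from every word; revisiting a word means a cycle.
    for wid in heads:
        seen = set()
        current = wid
        while current in heads:
            if current in seen:
                problems.append(f"{wid}: cycle in head chain")
                break
            seen.add(current)
            current = heads[current]
    return problems


# Abbreviated sample: three words taken from the sentence discussed above.
SAMPLE = """
<s id="a1-s6" lang="de">
  <w id="a1-s6-w13" head="a1-s6-w15" deprel="DET">meine</w>
  <w id="a1-s6-w15" head="a1-s6-w12" deprel="PRED">Umgebung</w>
  <w id="a1-s6-w12" head="root">sind</w>
</s>
"""

print(check_dependency_tree(ET.fromstring(SAMPLE)))
```

Because the check accepts either root convention, the same function should work whether the root is the finite verb or the sentence-ending punctuation.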


> Moreover, we would like to integrate sentence alignment information, since we are working with a multilingual corpus.
>
> Our idea is to integrate this information directly into the s-elements (as we do for articles which are translated). This could look as follows:
>
> <s id="a1-s6" lang="de" alignment_targets="SAC-Jahrbuch_1990_fr.xml#a1-s5">

I think you may do better to treat the value of @alignment_targets as a sequence of
one or more URIs (so:  change “:” to “#”).
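
One practical effect of the "#" form is that generic URI tooling can separate the document from the ID without any project-specific parsing. A minimal sketch, assuming the attribute value is a whitespace-separated list of URI references (the helper name split_targets is hypothetical):

```python
from urllib.parse import urldefrag


def split_targets(alignment_targets):
    """Split a whitespace-separated list of URI references into
    (document, fragment) pairs."""
    pairs = []
    for ref in alignment_targets.split():
        # urldefrag returns the URI without its fragment, plus the fragment.
        document, fragment = urldefrag(ref)
        pairs.append((document, fragment))
    return pairs


print(split_targets("SAC-Jahrbuch_1990_fr.xml#a1-s5 SAC-Jahrbuch_1990_fr.xml#a1-s6"))
```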

>
> The alignment target specifies the article and sentence ID in the yearbook in the other language which corresponds to the current s-element. Or would it be better to provide this information in a link group or just link element and then have the words in another, separate element?
>
> <s id="a1-s6" lang="de">
>    <link xtype="1-2" xtargets="SAC-Jahrbuch_1990_fr.xml:a1-s5;SAC-Jahrbuch_1990_fr.xml:a1-s6"/>
>    <words>
>        <w deprel="DET" head="a1-s6-w5" id="a1-s6-w1" lemma="d" pos="ART">Die</w>
>        <w deprel="ATTR" head="a1-s6-w5" id="a1-s6-w2" lemma="weit" pos="ADJA">weiten</w>
>        <w deprel="-PUNCT-" head="a1-s6-w2" id="a1-s6-w3" lemma="," pos="$,">,</w>
>        <w deprel="KON" head="a1-s6-w2" id="a1-s6-w4" lemma="öde" pos="ADJA">öden</w>
>        ....
>    </words>
> </s>

I’d expect that to amount to six of one and a half-dozen of the other
(that is, for the two approaches to be about equivalent in their
advantages and disadvantages), but I may be wrong.

The separate link element might make it easier to define a map with
(language-name -> URI) pairs, to make it easier for software to know
where to find, say, the French equivalent, without having to rely on
parsing the URI for the equivalent.  (In general, constructing URIs so
that humans can understand them is a good idea, but designing the
markup and software so that the software can treat URIs as opaque
strings will give you a more robust design.)  That might tip the
balance.
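
Such a map could be sketched as follows. The markup here is an assumption made for illustration only: one link element per target language, with hypothetical attributes lang and target (the original post uses xtype and xtargets instead), and sample URI values that are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one <link> per target language. The attribute names
# "lang" and "target" and the URI values are assumptions made for this sketch.
SAMPLE = """
<s id="a1-s6" lang="de">
  <link lang="fr" target="SAC-Jahrbuch_1990_fr.xml#a1-s5"/>
  <link lang="it" target="SAC-Jahrbuch_1990_it.xml#a1-s5"/>
</s>
"""


def targets_by_language(s_elem):
    """Map each target language to the list of URIs given for it."""
    mapping = {}
    for link in s_elem.findall("link"):
        mapping.setdefault(link.get("lang"), []).append(link.get("target"))
    return mapping


mapping = targets_by_language(ET.fromstring(SAMPLE))
print(mapping["fr"])
```

The software looks up the language first and then passes the URI along untouched, so nothing depends on how the file names happen to be spelled.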

>
> Same for word alignment: Separate link groups or alignment information directly in the w-elements, like so:
>
> <w deprel="DET" head="a1-s6-w5" id="a1-s6-w1" lemma="d" pos="ART" alignment_targets="SAC-Jahrbuch_1990_fr.xml:a1-s6-w1 SAC-Jahrbuch_1990_it.xml:a1-s5-w1">Die</w>
>
> I'm looking forward to your suggestions.

I don’t see decisive arguments either way here, but others may do so.

Some will suggest that putting alignment information onto the w
elements makes the markup “too heavy” or too “interpretive”.  It’s
true that with the alignment information in the w elements, the XML
source starts to get a bit harder to read, so you may need specialized
software to provide legible views.  But that’s probably not a good
argument for moving to standoff alignment, since standoff alignment
makes additional specialized software even more necessary.  On the
other hand, with standoff alignment it’s probably easier to lock the
base data to ensure it doesn’t get changed by accident, and to record
multiple competing alignments (which may be helpful inside your
workflow even if your final product will only present one alignment).
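
To illustrate the point about competing alignments, here is a rough sketch of how a standoff file might hold two alignments side by side. Everything in the sample is assumed for the example: the wrapper element alignments, the use of linkGrp with an n attribute to name each alignment, the target attribute, the labels aligner-A and aligner-B, and the German file name SAC-Jahrbuch_1990_de.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical standoff document: element names, attribute names, labels and
# the German file name are all assumptions made for this sketch.
STANDOFF = """
<alignments>
  <linkGrp n="aligner-A">
    <link target="SAC-Jahrbuch_1990_de.xml#a1-s6-w1 SAC-Jahrbuch_1990_fr.xml#a1-s6-w1"/>
  </linkGrp>
  <linkGrp n="aligner-B">
    <link target="SAC-Jahrbuch_1990_de.xml#a1-s6-w1 SAC-Jahrbuch_1990_fr.xml#a1-s6-w2"/>
  </linkGrp>
</alignments>
"""


def alignments_for(root, source_uri):
    """For one source URI, collect the aligned URIs proposed by each group."""
    result = {}
    for group in root.findall("linkGrp"):
        for link in group.findall("link"):
            uris = link.get("target", "").split()
            if source_uri in uris:
                others = [u for u in uris if u != source_uri]
                result.setdefault(group.get("n"), []).extend(others)
    return result


proposals = alignments_for(
    ET.fromstring(STANDOFF), "SAC-Jahrbuch_1990_de.xml#a1-s6-w1"
)
for name, targets in proposals.items():
    print(name, targets)
```

The base files are never touched in this arrangement, and dropping or adding a competing alignment means editing only the standoff document.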



********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
[hidden email]
http://www.blackmesatech.com
********************************************