standardizing linguistic encoding


Eduard Drenth

Dear all,

Here in Holland we are developing a standard to encode linguistic and lemma information for various word situations using TEI. We tried several solutions (tei:fs/tei:f, tei:interp, tei:span, ...) and finally chose a TEI customization, which gives us standard XSD validation, editor support and a simple, focused solution. For linguistic terminology we follow http://universaldependencies.org/ as much as possible.


We are curious what you think; see below for details. We hope this solution may be of use to those who want to encode linguistic information using TEI. It may also help standardize linguistic encoding in TEI.


If this is all worthwhile, I would like to donate/publish the solution somewhere.


snippet customization:


            <schemaSpec ident="tdb" docLang="en" prefix="tei_" xml:lang="en">

                ..

                ..

                <classSpec type="atts" ident="att.linguistics" module="analytics">

                    <attList>
                        <attDef ident="linguistics" ns="http://www.fryske-akademy.org/grammar/1.0">
                            <desc>
                                documentation....
                            </desc>
                            <datatype maxOccurs="unbounded">
                                <dataRef key="teidata.enumerated"/>
                            </datatype>
                            <valList type="closed">
                                <valItem ident="Features.Abbr">
                                    <desc>Boolean feature. Is this an abbreviation?</desc>
                                </valItem>
                                <valItem ident="Features.Poss">
                                    <desc>Boolean feature of pronouns, determiners or adjectives. It tells whether the word is possessive.</desc>
                                </valItem>
                                <valItem ident="PronType.Prs">
                                    <desc>personal pronoun or determiner</desc>
                                </valItem>

                                 ..

                                 ..


example word encoding:


<tei:w fa:linguistics="Pos.NOUN " lemmaRef="inprogress://lemmasystem/Hollands/frik/1" lemma="frik">Frik</tei:w>
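A minimal sketch of how such a word element could be read back, assuming the `fa` prefix is bound to the namespace URI given in the attDef above and that `fa:linguistics` holds whitespace-separated tokens, most of them of the form `Category.Value`. The function name `read_word` and the returned dictionary layout are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumption: the fa prefix is bound to the ns value shown in the attDef.
FA_NS = "http://www.fryske-akademy.org/grammar/1.0"

SAMPLE = (
    f'<tei:w xmlns:tei="{TEI_NS}" xmlns:fa="{FA_NS}" '
    'fa:linguistics="Pos.NOUN " '
    'lemmaRef="inprogress://lemmasystem/Hollands/frik/1" '
    'lemma="frik">Frik</tei:w>'
)


def read_word(xml_text):
    """Return the form, lemma and linguistic tokens of one tei:w element."""
    w = ET.fromstring(xml_text)
    # split() discards the trailing space seen in the example value.
    tokens = w.get(f"{{{FA_NS}}}linguistics", "").split()
    features = {}
    other = []
    for token in tokens:
        category, sep, value = token.partition(".")
        if sep:
            features.setdefault(category, []).append(value)
        else:
            # Tokens without a dot are kept separately, uninterpreted.
            other.append(token)
    return {
        "form": w.text,
        "lemma": w.get("lemma"),
        "features": features,
        "other": other,
    }


word = read_word(SAMPLE)
print(word)
```

Keeping the values as plain tokens in one attribute means a query only has to split on whitespace; whether the category prefix should be interpreted at all is left open here.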


example split word encoding:


<tei:w xml:id="staet-op-176" rendition="#split">staet</tei:w>

<tei:w fa:linguistics="Pos.ADV " lemmaRef="inprogress://lemmasystem/Hollands/al/3" lemma="al">al</tei:w><tei:w>wringende</tei:w>

<tei:w xml:id="staet-op-179" rendition="#split">op</tei:w>

<tei:join result="w" scope="root" lemma="opstean" target="#staet-op-176 #staet-op-179" lemmaRef="inprogress://lemmasystem/Hollands/opstean/1" fa:linguistics="th-si-pa Pos.VERB "/>
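The split-word pattern above can be resolved mechanically by following the `target` pointers of the join. The following is only a sketch under stated assumptions: the wrapping `tei:s` element is added here to obtain a single root, the `fa:linguistics` and `lemmaRef` attributes are left out for brevity, and the helper name `resolve_joins` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: a tei:s wrapper, only there to give the fragment one root.
SAMPLE = f"""<tei:s xmlns:tei="{TEI_NS}">
  <tei:w xml:id="staet-op-176" rendition="#split">staet</tei:w>
  <tei:w lemma="al">al</tei:w><tei:w>wringende</tei:w>
  <tei:w xml:id="staet-op-179" rendition="#split">op</tei:w>
  <tei:join result="w" scope="root" lemma="opstean"
            target="#staet-op-176 #staet-op-179"/>
</tei:s>"""


def resolve_joins(root):
    """Collect, for every tei:join, its lemma and the text of its targets."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    results = []
    for join in root.iter(f"{{{TEI_NS}}}join"):
        ids = [ref.lstrip("#") for ref in join.get("target", "").split()]
        parts = [by_id[i].text for i in ids]
        results.append({"lemma": join.get("lemma"), "parts": parts})
    return results


joins = resolve_joins(ET.fromstring(SAMPLE))
print(joins)
```

The parts stay in the order given by `target`, so the surface order of the split word is preserved independently of the lemma.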


example of a word consisting of multiple lemmas (we don't use this yet):


<tei:choice>
  <tei:orig>
    <tei:w fa:linguistics=".....">aint</tei:w>
  </tei:orig>
  <tei:reg>
    <tei:w lemma="be" fa:linguistics="...">am</tei:w>
    <tei:w lemma="not" fa:linguistics="...">not</tei:w>
  </tei:reg>
</tei:choice>


Bye,


Eduard Drenth, Software Architekt


[hidden email]


Doelestrjitte 8

8911 DX  Ljouwert

+31 58 234 30 47


gpg: https://sks-keyservers.net/pks/lookup?op=get&search=0x065EF82A1E02CC43

Re: standardizing linguistic encoding

Piotr Bański-2
Dear Eduard, [also addressing Philip and actually all... ]

It's probably my conditioning as a member of various standardization
bodies that makes red lights flash in my head upon reading that you are
"developing a standard"... :-) I believe that standards are better
"developed" (or, more precisely, codified) on the basis of existing best
practices or other existing standards. As it is, I can observe that your
proposed encoding mixes up the level of tokens with the level of word
forms (in ISO MAF terminology[1]), and while it can be suitable for your
purposes, it is far from optimal in standardization terms.

[at this point, the camera pans out]

This year promises to be quite exciting for the TEI Linguistics SIG[2],
given that:

(1) ISO LMF [3] is up for renewal and restructuring, and that several
teams (among others, from ENeL, PARTHENOS, CLARIN, and LingSIG) are
currently working on various modules for it,

(2) ISO Tiger [4] is nearing publication (as in: weeks rather than
months) and opening a way for ISO TEIger, a TEI serialization of the ISO
model for syntactic encoding,

(3) there is a rising push for streamlining inline linguistic markup,
coming from, among others, Martin Mueller's Early Print Project, BBAW's
existing practice (presented by Susanne Haaf at various TEI meetings),
the Ancient Greek Dependency Treebank (represented in this mailing list
by Giuseppe Celano, I believe), and now we learn of Philip Ströbel's
project and yours. And there are others. A tiny reflex of that is
contained at the LingSIG GitHub space [5], which is only meant as a
_seed_ for collaborative effort rather than any personal statement.

Andreas Witt and I are thinking of how to address and channel this
boiling mass of initiatives. One possibility could be to target the
upcoming TEI Members Meeting[6] and have a focused pre-conference
workshop designed to formulate a very precise and very concrete proposal
for grammatical encoding synchronized across inline, standoff and
dictionary markup, a proposal that we could submit to the TEI Technical
Council at the end of the day. "The day" seems distant, but if we want
to have a serious proposal at the end of it, work should start about now.

May I invite all interested parties to join the Linguistics SIG mailing
list (by going to [7]) and GitHub space (by sending me, off-list, your
github username), and to, well, have a go at it... :-)

Best regards,

   Piotr

[1]: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=51934
[2]: http://wiki.tei-c.org/index.php/SIG:TEI_for_Linguists
[3]: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=68516
[4]: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=62491
[5]: https://github.com/LingSIG/wordAttributes/wiki
[6]: http://members.tei-c.org/Events/meetings







--
Piotr Bański, Ph.D.
Senior Researcher,
Institut für Deutsche Sprache,
R5 6-13
68-161 Mannheim, Germany
Re: standardizing linguistic encoding

Piotr Bański-2
(with thanks to Lou...)

On 02/02/17 12:24, I wrote:
[...]

> May I invite all interested parties to join the Linguistics SIG mailing
> list (by going to [7])

[7]: https://listserv.brown.edu/archives/cgi-bin/wa?A0=TEI-LINGUISTICS
Re: standardizing linguistic encoding

Eduard Drenth
In reply to this post by Piotr Bański-2
Thanks for your response! "Standard" in my case means a practical, usable way of encoding linguistic information in corpora using TEI.

Indeed, the theme is covered by https://github.com/LingSIG/wordAttributes/wiki. Good to know about http://wiki.tei-c.org/index.php/SIG:TEI_for_Linguists as well.

We have chosen to continue along our current path: it doesn't deviate too much from uncustomized TEI, offers good support for editing and querying, satisfies our linguists, adheres to http://universaldependencies.org and is easy to convert to the 'real standard' when it is released.

Perhaps our approach can be useful input for https://github.com/LingSIG/wordAttributes; it is the result of quite extensive testing and discussion.

Eduard Drenth, Software Architekt

[hidden email]

Doelestrjitte 8
8911 DX  Ljouwert
+31 58 234 30 47

gpg: https://sks-keyservers.net/pks/lookup?op=get&search=0x065EF82A1E02CC43

Re: standardizing linguistic encoding

Emmanuel NGUE UM
In reply to this post by Eduard Drenth
Hi,

I am an Africa-based linguist, and I have been following many of the discussions on the TEI mailing list.

I am not a TEI practitioner per se, but I am aware of the many application scenarios of this technology, including text corpus building.

A couple of months ago, I sent an e-mail to the TEI mailing list asking whether anyone knew of a TEI-based or TEI-inspired framework for encoding prosodic phenomena such as tones, especially in African tone languages. I got one or two responses from members. Unfortunately, these responses did not address my specific concern.

I wish to join the ongoing discussion about 'standardizing linguistic encoding' in order to bring the issue of "tone encoding" to the fore of TEI standards development.

For the sake of clarification, and given that not everyone is necessarily an expert in tone languages, let me explain by example what tone is in African tone languages.

Consider the following tokens from Basaa, a Bantu language spoken in Cameroon:

(1) hól : to sharpen

(2) hòl : to pay the dowry

(3) hôl (as in á hôl): let him sharpen

(4) hŏl : pay the dowry! (imperative)

In (1) through (4), the difference in meaning between these words is attributed to the difference in the relative pitch level of the syllable: "high" in (1), "low" in (2), contour or two-level "high-low" in (3), and contour or two-level "low-high" in (4).

While the semantics associated with tone levels in (1) and (2) is lexically encoded, the ones in (3) and (4) are complemented with grammatical information, namely hortative in (3) and imperative in (4), thus resulting in complex (contour) tone shapes in writing.

Tone representation in the above examples is graphical and is meant simply to anchor pitch melody; this form of representation does not say much about the semantics associated with a specific pitch level in and across words. This is mostly because pitch 'labels' (high, low, low-high, high-low) do not encode persistent meaning, but may instead trigger an array of grammatical information such as tense, mood, aspect, negation, etc., depending on the context.

I personally believe that, for better processability and representation of textual information in tone languages, there is a need to develop an unambiguous encoding framework devoid of graphical representation of tones, and I believe TEI to be one possible response to this.
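As a rough illustration of what separating tone from its graphical carrier could look like, here is a minimal sketch that uses Unicode canonical decomposition (NFD) to detach the diacritic from the base letter. The mapping from diacritics to pitch labels is an assumption based on common Africanist practice (acute = high, grave = low, circumflex = high-low, breve or caron = low-high), and the function name `split_tones` is hypothetical. It says nothing about the grammatical meaning a tone carries in context.

```python
import unicodedata

# Assumed convention: combining mark -> pitch label.
TONE_MARKS = {
    "\u0301": "high",      # combining acute accent
    "\u0300": "low",       # combining grave accent
    "\u0302": "high-low",  # combining circumflex
    "\u030c": "low-high",  # combining caron
    "\u0306": "low-high",  # combining breve
}


def split_tones(word):
    """Return the toneless base string and the list of tone labels."""
    base = []
    tones = []
    # NFD splits a precomposed letter into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", word):
        if ch in TONE_MARKS:
            tones.append(TONE_MARKS[ch])
        else:
            base.append(ch)
    return "".join(base), tones


# The four Basaa forms from examples (1) to (4), written with escapes.
for form in ["h\u00f3l", "h\u00f2l", "h\u00f4l", "h\u014fl"]:
    print(split_tones(form))
```

Once the labels are detached like this, they could be carried in an attribute or a stand-off layer rather than in the orthography; how TEI should model that is the open question raised here.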

Because TEI is an open standard that is meant to be tailored to the specific needs of its users, I think it is our responsibility as Africanists and Bantuists to raise the TEI community's awareness of the need to account for the specificities of the languages we work on when it comes to standardizing linguistic encoding.

Best

Emmanuel Ngué Um
Language Archivist for ALORA








--
Piotr Bański, Ph.D.
Senior Researcher,
Institut für Deutsche Sprache,
R5 6-13
68-161 Mannheim, Germany


Re: standardizing linguistic encoding

Piotr Bański-2
Hi Emmanuel,

It's great to hear from you. You may be pleased to hear about a set of
tools that can be used to encode the information you need very
precisely, and to attach that information to objects of any granularity
("standard" tokens, morphs, phrases) and any sort (orthographic,
prosodic, morphological). These tools have been defined jointly by the
TEI and ISO, and the TEI description is free and well-tested, and you
can read about it in the chapter on feature structures:

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html

(I'd suggest skipping 18.11 when reading this chapter for the first time)

These tools make it possible for you to describe practically any feature
matrices needed in linguistics. And the TEI has mechanisms for attaching
them to linguistic/textual objects.
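
As a rough illustration of what such a feature matrix and its attachment can look like, here is a minimal Python sketch that builds a TEI `<fs>` and points a `<w>` at it through `@ana`. The elements `fs`, `f`, `symbol` and the attribute `ana` come from the Guidelines; the feature names `level` and `function`, the identifier `tone-hol-1`, and the choice of the high-toned form "hól" (from the message quoted below) are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("tei", TEI_NS)


def feature_structure(fs_id, fs_type, features):
    """Build an <fs> element holding one <f>/<symbol> pair per feature."""
    fs = ET.Element(f"{{{TEI_NS}}}fs", {f"{{{XML_NS}}}id": fs_id, "type": fs_type})
    for name, value in features.items():
        f = ET.SubElement(fs, f"{{{TEI_NS}}}f", {"name": name})
        ET.SubElement(f, f"{{{TEI_NS}}}symbol", {"value": value})
    return fs


# Hypothetical feature names and values for the lexical high tone of "hól".
fs = feature_structure("tone-hol-1", "tone", {"level": "high", "function": "lexical"})

# The word points at the feature structure through @ana.
w = ET.Element(f"{{{TEI_NS}}}w", {"ana": "#tone-hol-1"})
w.text = "h\u00F3l"

print(ET.tostring(fs, encoding="unicode"))
print(ET.tostring(w, encoding="unicode"))
```

Because the word only carries a pointer, the same feature structure could be shared by every token with that analysis.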

If you'd like to pursue this further, you may be interested in joining
the TEI Linguistics SIG mailing list at

https://listserv.brown.edu/archives/cgi-bin/wa?A0=TEI-LINGUISTICS

where we have just talked about cases slightly more complex than the
kind of attribute-based markup discussed here.

Best regards,

   Piotr



On 02/03/17 17:38, Emmanuel NGUE UM wrote:

> Hi,
>
> I am an Africa-based linguist, and I have been following many of the
> discussions on the TEI mailing list.
>
> I am not a TEI practitioner per se, but I am aware of the many
> application scenarios of this technology, including text corpora building.
>
> A couple of months ago, I sent an e-mail to the TEI mailing list
> asking whether anyone knew of a TEI-based or TEI-inspired framework for
> encoding prosodic phenomena such as tone, especially in African tone
> languages. I got one or two responses from members. Unfortunately, these
> responses did not address my specific concern.
>
> I wish to join the ongoing discussion about 'standardizing linguistic
> encoding' to bring the issue of "tone encoding" to the fore of TEI
> standards development.
>
> For the sake of clarification, and given that not everyone is
> necessarily an expert in tone languages, let me explain by example what
> tone is in African tone languages.
>
> Consider the following tokens from Basaa, a Bantu language spoken in Cameroon:
>
> (1) hól : to sharpen
>
> (2) hòl : to pay the dowry
>
> (3) hôl (as in /á hôl/): let him sharpen
>
> (4) hŏl : pay the dowry! (imperative)
>
> In (1) through (4), the difference in meaning between these words is
> attributed to the difference in the relative pitch level of the syllable:
> "high" in (1), "low" in (2), contour or two-level "high-low" in (3),
> contour or two-level "low-high" in (4).
>
> While the semantics associated with the tone levels in (1) and (2) are
> lexically encoded, those in (3) and (4) are complemented with
> grammatical information, namely hortative in (3) and imperative in (4),
> thus resulting in complex (contour) tone shapes in writing.
>
> Tone representation in the above examples is graphical, and is meant
> simply to anchor pitch melody; this form of representation says little
> about the semantics associated with a specific pitch level within and
> across words. This is mostly because pitch 'labels' (high, low,
> low-high, high-low) do not encode persistent meaning, but may instead
> trigger an array of grammatical information such as tense, mood,
> aspect, negation, etc., depending on the context.
>
> I personally believe that, for better processability and representation
> of textual information in tone languages, there is a need to develop an
> unambiguous encoding framework that does not depend on the graphical
> representation of tones, and I believe TEI to be one possible response to this.
>
> Because TEI is an open standard which is meant to be tailored to the
> specific needs of users, I think it is our responsibility as Africanists
> and Bantuists to raise the TEI community's awareness of the
> specificities of the languages we are working on when it comes to
> standardizing linguistic encoding.
>
> Best
>
> Emmanuel Ngué Um
> Language Archivist for ALORA
>

--
Piotr Bański, Ph.D.
Senior Researcher,
Institut für Deutsche Sprache,
R5 6-13
68-161 Mannheim, Germany

Re: standardizing linguistic encoding

Emmanuel NGUE UM
Hi Piotr,

Thank you very much for your informed advice about the encoding possibilities offered by TEI for lesser-described languages, including the tone languages of Africa, which are my field of research.

I have gone through chapter 18 of the Guidelines. Although I find that the "feature structure" element provides a suitable matrix framework for the granular encoding of linguistic information, down to the level of sound description, I did not come across any such matrix framework for the encoding of tone-as-morpheme units. I am not implying that the overall TEI infrastructure lacks such a framework, but rather that perhaps no one has hitherto attempted to develop a standardized repertoire of labels for element and attribute names suitable for describing tone-related phenomena in African languages.

Because I firmly believe that a sustainable and workable response to the structural limitations of existing models of linguistic analysis (the Leipzig Glossing Rules, for example) is likely to come from TEI, and because I am about to propose an alternative model based on TEI, I wish to make sure that my model does not conflict with another initiative or project that might be building on a similar approach. Since I will be presenting this model publicly soon, and since I am hoping to bring other Africanists and other linguists into the initiative to develop the model further, it is of the utmost importance that I get feedback from the TEI community, especially from TEI's long-standing practitioners.

Here is a synopsis of the model, leaving aside matters of namespace conformity and other structural constraints.

Consider the following data from Basaa (a Bantu language spoken in Cameroon).

IPA transcription

mùt "person"

lɔ̀ "come"

mùt à ǹlɔ̂ "a person/somebody has come"

Morpheme-by-morpheme analysis following the Leipzig Glossing Rules (LGR):

m-    ùt       à     ǹ-     lɔ̂
CL1-  person   SM    PRES-  come

The above analysis assumes a one-grid description of the linguistic information, because it maps grammatical information onto the word's building blocks. The analysis does not explicitly bring out the contribution of tones (graphically marked as accents in the data) as autonomous meaningful linguistic units, as propounded in the auto-segmental approach to tone analysis (Goldsmith, 1990).

Depending on the quality of the grammatical description available for a given language, one might consider representing the linguistic information encoded by tones. To do so, however, the LGR stipulate the clustering of glossing labels into binary or ternary bundles. Therefore, if I add the linguistic information at the tone level, the previous analysis looks as follows:

m-    ùt       à     ǹ-          lɔ̂
CL1-  person   SM    PRES.PERF-  PERF.come

Again assuming that the linguistic analysis is correct (it might, however, be subject to debate!), not only are some tone elements left without an analysis (the low tones on the roots "ùt" and "lɔ̂" and on the subject marker "à"), but the assignment of tone glosses is also inconsistent with the sequence of characters in the string. In other words, in the analysis of the token "ǹlɔ̂", for example, linguistic information is fuzzily distributed, and no parsing algorithm can accurately map linguistic information onto linguistic form. This is partly so, I believe, because the graphical representation of tone in mainstream text layout is biased towards a "suprasegmental" representation rather than an auto-segmental one. Consequently, tone is inherently treated as conveying secondary and sometimes unnecessary information, even though the auto-segmental model puts tones on a par with their segmental counterparts.
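
To make the point about graphical tone marking concrete, here is a small Python sketch (an illustration only, not part of the proposed model) that uses Unicode canonical decomposition to pull the accents of "ǹlɔ̂" apart from its segments. The mapping from combining accents to tone labels is an assumption made for this example.

```python
import unicodedata

# Assumed mapping from combining diacritics to tone labels (illustrative only).
TONE_LABELS = {
    "\u0300": "low",       # combining grave accent
    "\u0301": "high",      # combining acute accent
    "\u0302": "high-low",  # combining circumflex (falling contour)
    "\u030C": "low-high",  # combining caron (rising contour)
}


def split_tiers(token):
    """Return the segmental string and a list of (segment index, tone label)."""
    segments = []
    tones = []
    for char in unicodedata.normalize("NFD", token):
        if char in TONE_LABELS:
            # Attach the tone to the most recent segment.
            tones.append((len(segments) - 1, TONE_LABELS[char]))
        else:
            segments.append(char)
    return "".join(segments), tones


segmental, tonal = split_tiers("\u01F9l\u0254\u0302")  # the token ǹlɔ̂
print(segmental)  # nlɔ
print(tonal)      # [(0, 'low'), (2, 'high-low')]
```

The decomposition recovers where each accent sits, but the circumflex still comes out as a single contour label: nothing in the character string says which part of the contour is lexical and which is grammatical. That is the information an explicit tone tier would have to carry.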

Luckily, TEI provides a possible solution to this issue. In the following encoding model, I represent only word-level information and tags. I have tentatively defined the following attributes, which I describe first:

@type describes a word or morpheme type. Word types can be "nouns", "verbs", "adjectives", etc. Morpheme types can be "prefixes", "roots", "suffixes", "segmental", "tonal", etc.
@gloss describes the semantic value of a morpheme. This may apply to tense, aspect, plural, noun class markers, etc. Gloss values are labels of linguistic information which conform to a standard glossing scheme such as the LGR.
@category describes the linguistic class to which a specific morpheme relates. A morpheme may fit into the "lexical", "grammatical", "syntactic", or "prosodic" classes.
@nounClass gives the noun class of a noun prefix as an integer, following standard Bantu grammatical reconstructions for noun classes (Meeussen 1976, etc.)
@segmental describes segmental morphemes, that is, word building blocks stripped of their associated tones.
@tonal describes tone morphemes.

The content of a tone element is represented by the Unicode code point for the superscript whose shape depicts the pitch level of the tone. Thus, the acute accent standing for high tone is represented in the model by the Unicode code point 02CA. However, for the sake of economy and consistency, the content of tone elements is assumed to have been declared earlier, alongside the schema, as the entities &high; and &low;, each expanding to the corresponding Unicode code point.

Implementation of the model

<w type="noun">
  <m type="prefix" nounClass="1">m</m>
  <m type="root">
    <m type="segmental">ut</m>
    <m type="tone" category="lexical">&low;</m>
  </m>
</w>

<w type="nounParticle">
  <m type="subjectMarker">
    <m type="segmental">a</m>
    <m type="tone" category="grammatical">&low;</m>
  </m>
</w>

<w type="verb">
  <m type="prefix">
    <m type="segmental" gloss="present">n</m>
    <m type="tone" gloss="perfect">&low;</m>
  </m>
  <m type="root">
    <m type="segmental">lɔ</m>
    <m type="tone" gloss="perfect">&high;</m>
    <m type="tone" category="lexical">&low;</m>
  </m>
</w>


Note: the metalanguage used in the model for attribute names, attribute values, and entity names is NOT yet standardized. I have used it for the sake of demonstration. Should the model appeal to other linguists and Africanists, I hope that we can organize ourselves into a group of experts, perhaps within the framework of existing TEI Expert Groups, and work towards standardizing the model, its related vocabulary, and other structural aspects.

Looking forward to your feedback.
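
For anyone who wants to experiment with the model, here is a minimal Python sketch of how the verb "ǹlɔ̂" could be read back into a segmental tier and a tone tier, with closing tags written as `</m>`. It assumes the entities are declared in an internal DTD subset. U+02CA for high is the code point given above, while U+02CB for low is only an assumption for this example. The helper name `tiers` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# The entity declarations live in an internal DTD subset.
# U+02CA (high) is stated in the text; U+02CB (low) is assumed here.
DOC = """<!DOCTYPE w [
  <!ENTITY high "&#x02CA;">
  <!ENTITY low "&#x02CB;">
]>
<w type="verb">
  <m type="prefix">
    <m type="segmental" gloss="present">n</m>
    <m type="tone" gloss="perfect">&low;</m>
  </m>
  <m type="root">
    <m type="segmental">l\u0254</m>
    <m type="tone" gloss="perfect">&high;</m>
    <m type="tone" category="lexical">&low;</m>
  </m>
</w>"""


def tiers(word):
    """Collect the segmental string and the tone annotations of one <w>."""
    segmental = "".join(
        m.text for m in word.iter("m") if m.get("type") == "segmental"
    )
    tonal = [
        (m.text, m.get("gloss") or m.get("category"))
        for m in word.iter("m")
        if m.get("type") == "tone"
    ]
    return segmental, tonal


word = ET.fromstring(DOC)
segmental, tonal = tiers(word)
print(segmental)  # the segments without tone marks
print(tonal)      # one (tone character, label) pair per tone morpheme
```

Each tone morpheme keeps its own gloss or category, so the falling contour on the root is recoverable as two separate units (a high "perfect" tone followed by a low "lexical" tone) instead of one accent.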

Best,

Emmanuel Ngué Um (University of Yaoundé I - Cameroon)



>> 2017-02-02 14:03 GMT+01:00 Eduard Drenth <[hidden email]
>> <mailto:[hidden email]>>:
>>
>>     Thanks for your response! Standard in my case means practical,
>>     usable way for encoding linguistic information in corpora using TEI.
>>
>>     Indeed the theme is covered by
>>     https://github.com/LingSIG/wordAttributes/wiki
>>     <https://github.com/LingSIG/wordAttributes/wiki>. Good to know of
>>     this http://wiki.tei-c.org/index.php/SIG:TEI_for_Linguists
>>     <http://wiki.tei-c.org/index.php/SIG:TEI_for_Linguists> as well.
>>
>>     We choose to continue along the choosen path, it doesn't deviate too
>>     much from uncustomized TEI, offers good support for editing and
>>     querying, satisfies our linguists, adheres to
>>     http://universaldependencies.org <http://universaldependencies.org>
>>     and is easy to convert to the 'real standard' when it is released.
>>
>>     Perhaps our approach can be useful input for
>>     https://github.com/LingSIG/wordAttributes; it is the result of
>>     quite extensive testing and discussion.
>>
>>     Eduard Drenth, Software Architekt
>>
>>     [hidden email]
>>
>>     Doelestrjitte 8
>>     8911 DX  Ljouwert
>>     +31 58 234 30 47
>>
>>     gpg:
>>     https://sks-keyservers.net/pks/lookup?op=get&search=0x065EF82A1E02CC43
>>
>>     ________________________________________
>>     From: Piotr Bański <[hidden email]>
>>     Sent: Thursday, February 2, 2017 12:24 PM
>>     To: Eduard Drenth; [hidden email]; Phillip Ströbel
>>
>>     Subject: Re: standardizing linguistic encoding
>>
>>     Dear Eduard, [also addressing Phillip and actually all... ]
>>
>>     It's probably my conditioning as a member of various standardization
>>     bodies that makes red lights flash in my head upon reading that you are
>>     "developing a standard"... :-) I believe that standards are better
>>     "developed" (or, more precisely, codified) on the basis of existing best
>>     practices or other existing standards. As it is, I can observe that your
>>     proposed encoding mixes up the level of tokens with the level of word
>>     forms (in ISO MAF terminology[1]), and while it can be suitable for your
>>     purposes, it is far from optimal in standardization terms.
>>
>>     [at this point, the camera pans out]
>>
>>     This year promises to be quite exciting for the TEI Linguistics SIG[2],
>>     given that:
>>
>>     (1) ISO LMF [3] is up for renewal and restructuring, and that several
>>     teams (among others, from ENeL, PARTHENOS, CLARIN, and LingSIG) are
>>     currently working on various modules for it,
>>
>>     (2) ISO Tiger [4] is nearing publication (as in: weeks rather than
>>     months) and opening a way for ISO TEIger, a TEI serialization of the ISO
>>     model for syntactic encoding,
>>
>>     (3) there is a rising push for streamlining inline linguistic markup,
>>     coming from, among others, Martin Mueller's Early Print Project, BBAW's
>>     existing practice (presented by Susanne Haaf at various TEI meetings),
>>     the Ancient Greek Dependency Treebank (represented in this mailing list
>>     by Giuseppe Celano, I believe), and now we learn of Phillip Ströbel's
>>     project and yours. And there are others. A tiny reflex of that is
>>     contained at the LingSIG GitHub space [5], which is only meant as a
>>     _seed_ for collaborative effort rather than any personal statement.
>>
>>     Andreas Witt and I are thinking of how to address and channel this
>>     boiling mass of initiatives. One possibility could be to target the
>>     upcoming TEI Members Meeting[6] and have a focused pre-conference
>>     workshop designed to formulate a very precise and very concrete proposal
>>     for grammatical encoding synchronized across inline, standoff and
>>     dictionary markup, a proposal that we could submit to the TEI Technical
>>     Council at the end of the day. "The day" seems distant, but if we want
>>     to have a serious proposal at the end of it, work should start about
>>     now.
>>
>>     May I invite all interested parties to join the Linguistics SIG mailing
>>     list (by going to [7]) and GitHub space (by sending me, off-list, your
>>     github username), and to, well, have a go at it... :-)
>>
>>     Best regards,
>>
>>        Piotr
>>
>>     [1]: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=51934
>>     [2]: http://wiki.tei-c.org/index.php/SIG:TEI_for_Linguists
>>     [3]: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=68516
>>     [4]: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=62491
>>     [5]: https://github.com/LingSIG/wordAttributes/wiki
>>     [6]: http://members.tei-c.org/Events/meetings
>>
>>     On 02/01/17 16:51, Eduard Drenth wrote:
>>     > Dear all,
>>     >
>>     > Here in Holland we are developing a standard to encode linguistic and
>>     > lemma information for various word situations using TEI. We have been
>>     > trying several solutions (tei:fs/tei:f, tei:interp, tei:span, ...) and
>>     > finally chose TEI customization, which gives us standard XSD
>>     > validation, editor support and a simple focused solution. For linguistic
>>     > terminology we use http://universaldependencies.org/ as much as possible.
>>     >
>>     >
>>     > We are curious as to what you think; see below for details. We hope
>>     > this solution may be of use for those who want to encode linguistic
>>     > information using TEI. It may also help standardize linguistic
>>     > encoding in TEI.
>>     >
>>     >
>>     > If all this is worthwhile I would like to donate/publish the solution
>>     > somewhere.
>>     >
>>     >
>>     > snippet customization:
>>     >
>>     >
>>     >             <schemaSpec ident="tdb" docLang="en" prefix="tei_"
>>     > xml:lang="en">
>>     >
>>     >                 ..
>>     >
>>     >                 ..
>>     >
>>     >                 <classSpec type="atts" ident="att.linguistics"
>>     > module="analytics">
>>     >
>>     >                     <attList>
>>     >                         <attDef ident="linguistics"
>>     > ns="http://www.fryske-akademy.org/grammar/1.0">
>>     >                             <desc>
>>     >                                 documentation....
>>     >                             </desc>
>>     >                             <datatype maxOccurs="unbounded">
>>     >                                 <dataRef key="teidata.enumerated"/>
>>     >                             </datatype>
>>     >                             <valList type="closed">
>>     >                                 <valItem ident="Features.Abbr">
>>     >                                     <desc>Boolean feature. Is this an
>>     > abbreviation?</desc>
>>     >                                 </valItem>
>>     >                                 <valItem ident="Features.Poss">
>>     >                                     <desc>Boolean feature of pronouns,
>>     > determiners or adjectives. It tells whether the word is possessive.</desc>
>>     >                                 </valItem>
>>     >                                 <valItem ident="PronType.Prs">
>>     >                                     <desc>personal pronoun or
>>     > determiner</desc>
>>     >                                 </valItem>
>>     >
>>     >                                  ..
>>     >
>>     >                                  ..
>>     >
>>     >
>>     > example word encoding:
>>     >
>>     >
>>     > <tei:w fa:linguistics="Pos.NOUN "
>>     > lemmaRef="inprogress://lemmasystem/Hollands/frik/1"
>>     > lemma="frik">Frik</tei:w>
>>     >
>>     >
>>     > example split word encoding:
>>     >
>>     >
>>     > <tei:w xml:id="staet-op-176" rendition="#split">staet</tei:w>
>>     >
>>     > <tei:w fa:linguistics="Pos.ADV "
>>     > lemmaRef="inprogress://lemmasystem/Hollands/al/3"
>>     > lemma="al">al</tei:w><tei:w>wringende</tei:w>
>>     >
>>     > <tei:w xml:id="staet-op-179" rendition="#split">op</tei:w>
>>     >
>>     > <tei:join result="w" scope="root" lemma="opstean"
>>     > target="#staet-op-176 #staet-op-179"
>>     > lemmaRef="inprogress://lemmasystem/Hollands/opstean/1"
>>     > fa:linguistics="th-si-pa Pos.VERB "/>
>>     >
>>     >
>>     > example of a word consisting of several lemmas (we don't use this yet):
>>     >
>>     >
>>     > <tei:choice>
>>     >   <tei:orig>
>>     >     <tei:w fa:linguistics=".....">aint</tei:w>
>>     >   </tei:orig>
>>     >   <tei:reg>
>>     >     <tei:w lemma="be" fa:linguistics="...">am</tei:w>
>>     >     <tei:w lemma="not" fa:linguistics="...">not</tei:w>
>>     >   </tei:reg>
>>     > </tei:choice>
>>     >
>>     >
>>     > Bye,
>>     >
>>     >
>>     > Eduard Drenth, Software Architekt
>>     >
>>     >
>>     > [hidden email]
>>     >
>>     >
>>     > Doelestrjitte 8
>>     >
>>     > 8911 DX  Ljouwert
>>     >
>>     > +31 58 234 30 47
>>     >
>>     >
>>     > gpg:
>>     > https://sks-keyservers.net/pks/lookup?op=get&search=0x065EF82A1E02CC43
>>     >
>>
>>     --
>>     Piotr Bański, Ph.D.
>>     Senior Researcher,
>>     Institut für Deutsche Sprache,
>>     R5 6-13
>>     68-161 Mannheim, Germany
>>
>>
>
> --
> Piotr Bański, Ph.D.
> Senior Researcher,
> Institut für Deutsche Sprache,
> R5 6-13
> 68-161 Mannheim, Germany
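
As a rough illustration of how the split-word encoding quoted above might be consumed, the following Python sketch resolves each tei:join back to the tei:w parts it targets and reads the lemma and feature values from the join. Only the tei:w/tei:join markup is taken from the thread. The wrapping tei:s element, the binding of the fa prefix to the namespace given in the attDef, the sample string and the function name resolve_joins are assumptions made for this example, not part of the proposed customization.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"                 # standard TEI namespace
FA = "http://www.fryske-akademy.org/grammar/1.0"    # assumed binding of the fa: prefix
XML = "http://www.w3.org/XML/1998/namespace"        # namespace of xml:id

# Hypothetical sample: the split-word example from the thread, wrapped in an
# assumed <tei:s> so that it forms a single well-formed document.
SAMPLE = """<tei:s xmlns:tei="http://www.tei-c.org/ns/1.0"
       xmlns:fa="http://www.fryske-akademy.org/grammar/1.0">
  <tei:w xml:id="staet-op-176" rendition="#split">staet</tei:w>
  <tei:w fa:linguistics="Pos.ADV " lemma="al">al</tei:w>
  <tei:w xml:id="staet-op-179" rendition="#split">op</tei:w>
  <tei:join result="w" scope="root" lemma="opstean"
            target="#staet-op-176 #staet-op-179"
            fa:linguistics="th-si-pa Pos.VERB "/>
</tei:s>"""


def resolve_joins(root):
    """Return one dict per tei:join: its lemma, the joined surface parts, its features."""
    # Index every tei:w that carries an xml:id so join targets can be looked up.
    by_id = {}
    for w in root.iter(f"{{{TEI}}}w"):
        w_id = w.get(f"{{{XML}}}id")
        if w_id:
            by_id[w_id] = w

    results = []
    for join in root.iter(f"{{{TEI}}}join"):
        # target is a whitespace-separated list of "#id" pointers.
        refs = [ref.lstrip("#") for ref in join.get("target", "").split()]
        parts = [by_id[ref].text for ref in refs if ref in by_id]
        # The attribute is a whitespace-separated value list; split() also
        # drops the trailing space seen in the examples.
        features = join.get(f"{{{FA}}}linguistics", "").split()
        results.append({"lemma": join.get("lemma"), "parts": parts, "features": features})
    return results


joined = resolve_joins(ET.fromstring(SAMPLE))
```

Here `joined` holds a single entry whose parts are the two split tokens in target order, which is one way a query layer could treat "staet ... op" as the single verb "opstean" without touching the inline tokens.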