PoS tagging in <w> with @ana: pointer?


Paolo Monella (2 January 2018)
Dear all,

I ran a lemmatizer/PoS tagger (TreeTagger) on a TEI P5-encoded file and
want to encode the result in attributes of <w>.

I searched the TEI-L archives and the Internet. I found that
MorphAdorner [1] uses @lemma for lemmata and @ana for the PoS output
(e.g. "adjective, positive genitive plural masculine"):

<w lemma="in" ana="#p-acp" reg="in" xml:id="A88624-000740">in</w>

I had tried this encoding:

<w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>

The main difference is that MorphAdorner prepends a "#" to the value of
@ana because this value should be a teidata.pointer [2].

In any case, also "#p-acp" is no valid pointer (no valid URI), so do you
think I should leave my encoding as it is, or prepend "#" as in
@ana="#4-S--------"?

Thank you,
Paolo

[1] See paragraph "Simplified TEI P5-like output" in
http://morphadorner.northwestern.edu/morphadorner/documentation/xmloutput/
[2]
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.analytic.html

Re: PoS tagging in <w> with @ana: pointer?

Eduard Drenth
Perhaps this is of interest: https://bitbucket.org/fryske-akademy/tei-encoding (sometimes I have to refresh the page to get the layout right). It is an ODD customization for using Universal Dependencies-based PoS and features.

At the Fryske Akademy we are also working on a tagger that uses this customization, though the tagger is not ready yet.

Eduard Drenth, Software Architekt



Re: PoS tagging in <w> with @ana: pointer?

Piotr Bański (5 January 2018)
Dear Paolo,

Please have a look at the proposal addressing this at
https://github.com/TEIC/TEI/issues/1670

It avoids the "POS-in-@ana" issue, and provides arguments for that. You
will also see there a list of projects that use the proposed format,
some of them based on MorphAdorner.

The practical question for you now, I guess, is whether to keep the
existing TEI skeleton and disobey the @ana datatype, or to adopt the
changes we have suggested in the ticket and put the POS information
where it belongs, hoping that the Council will address the issue before
the end of the world. It's a gamble... :-)

Best wishes,

   Piotr



Re: PoS tagging in <w> with @ana: pointer?

Piotr Bański (5 January 2018)
Dear Paolo,

One more question/nitpick. You say:

 > "#p-acp" is no valid pointer (no valid URI)

Well, it is not, but it's a valid fragment identifier (see [1]), and
RFC 3986 itself (section 4.4, on same-document references) covers the
interpretation of bare fragment identifiers: they are resolved against
the URI of the current document, yielding a correct (longer) URI. So I
think that you are fine syntactically (or have you actually got a failed
validation result? I'd be very curious to see a test case then), but
obviously not semantically. We address this "pretend that POS values are
fragIDs, just for the sake of the teidata.pointer datatype" issue in the
text of the GitHub ticket to which I pointed you, alongside other
arguments against using @ana for this purpose.
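
The resolution step described above can be sketched in Python with the
standard library's urllib.parse.urljoin. This is only an illustration:
the base URI is a made-up example, not the address of any real file,
and the helper name resolve is hypothetical.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical base URI of the TEI document (an assumption for this sketch).
BASE = "http://example.org/corpus/text.xml"


def resolve(value, base=BASE):
    """Resolve an @ana value against the document URI, as a URI reference."""
    return urljoin(base, value)


for value in ("#p-acp", "#4-S--------", "4-S--------"):
    resolved = resolve(value)
    parts = urlsplit(resolved)
    # A value starting with "#" keeps the document path and only adds a fragment;
    # a bare value replaces the last path segment, i.e. it names a sibling file.
    print(value, "->", resolved, "| path:", parts.path, "| fragment:", parts.fragment)
```

Both forms resolve without error, which is why validation does not
complain; whether anything exists at the resulting address is a
separate (semantic) question.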

Best regards,

   Piotr

[1]: https://tools.ietf.org/html/rfc3986#appendix-A




Re: PoS tagging in <w> with @ana: pointer?

Paolo Monella (5 January 2018)
Dear Eduard and Piotr,

thank you for your insights. I do hope that the proposal of the LingSIG
[1] is accepted. If useful, you might mention my own Ursus project [2]
as a use case, but I am sure that there are plenty of already existing
use cases.

I am currently encoding as follows:

<w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>

so I am not prepending a "#" to "4-S--------". It would take only a
quick find/replace in vi to prepend the "#", and minor changes to the JS
and Python scripts so that they strip it again when processing.
But I am reluctant to do so because I agree with the argument in the
ticket that it is a kludge.
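
For illustration, a reading script can be made indifferent to the
choice. The following is a minimal sketch, not the actual Ursus code:
the function name pos_tags and the wrapping <s> element are assumptions,
and only the <w> markup is taken from this thread.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Two <w> elements as in this thread, one with and one without the "#".
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>
  <w ana="#4-S--------" lemma="in" n="in" xml:id="w316">in</w>
</s>"""


def pos_tags(xml_text):
    """Return (form, lemma, tags) per <w>, dropping one leading '#' from each tag."""
    root = ET.fromstring(xml_text)
    result = []
    for w in root.iter(TEI_NS + "w"):
        # @ana may hold several whitespace-separated values.
        tokens = w.get("ana", "").split()
        tags = [t[1:] if t.startswith("#") else t for t in tokens]
        result.append((w.text, w.get("lemma"), tags))
    return result


for form, lemma, tags in pos_tags(SAMPLE):
    print(form, lemma, tags)
```

With a reader like this, both encodings yield the same tag string, so
the decision can be made on other grounds.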

No linter or parser reported a validation failure because of this.

Do you still suggest that I prepend the "#"?

Best,
Paolo

[1] Ticket https://github.com/TEIC/TEI/issues/1670
[2] http://www.unipa.it/paolo.monella/ursus




Re: PoS tagging in <w> with @ana: pointer?

Piotr Bański (6 January 2018)
Dear Paolo,

Thanks for the link, impressive work! It's going to be a handy reference.

As for your question, on whether or not to prepend the '#', I would say
that it's a kludge either way, for different reasons, and I think in
such cases it's the practical factors that come to the fore. If it's
more work and maintenance for you to prepend the '#' only to cut it off
for querying/visualization, then I'd say don't bother...

It's a perfect illustration for part of our motivation for creating the
ticket: a corpus creator, upon looking at this sort of "dilemma" on
which kludge to use, may simply decide not to use the TEI at all, or
will hack it his way, and we're going to see yet another variation where
there could be a simple standardized approach. But maybe we need 15 more
cases of a similar sort to begin to sound convincing? I wonder.

Best regards,

   Piotr




Re: PoS tagging in <w> with @ana: pointer?

James Cummings (9 January 2018)


Hi Paolo and Piotr, etc., 


When using @ana, do remember that its datatype is one or more teidata.pointer values. Thus when you write '4-S--------' you are really saying that there is a file of that name in the same directory as the document. I know that you know this, just noting it for completeness.


I agree with the intent of the referenced issue 1670 in saying that the use of @ana in the linguistic examples is a kludge (though if I were doing that kind of thing I'd be pointing to a <category> of a <taxonomy> rather than an <interp>, but that is probably because I'm not a linguist and like the hierarchical flexibility of nested <category> elements).
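
As a rough sketch of the pointing approach, the Python below resolves a local @ana pointer to a <category> by its xml:id. Everything in the sample is hypothetical: the fragment is simplified rather than schema-valid TEI, the category ids and <catDesc> wording are invented for illustration, and resolve_ana is not part of any library.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, hypothetical fragment: nested categories plus one pointing <w>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <taxonomy xml:id="pos">
    <category xml:id="prep">
      <catDesc>preposition</catDesc>
      <category xml:id="p-acp">
        <catDesc>illustrative subcategory of preposition</catDesc>
      </category>
    </category>
  </taxonomy>
  <w ana="#p-acp" lemma="in">in</w>
</TEI>"""


def resolve_ana(xml_text):
    """Return (form, pointer, description) for each pointer in each <w>/@ana."""
    root = ET.fromstring(xml_text)
    # Index every element that carries an xml:id.
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    out = []
    for w in root.iter(TEI + "w"):
        for pointer in w.get("ana", "").split():
            # Only same-document pointers ("#id") are handled in this sketch.
            target = by_id.get(pointer[1:]) if pointer.startswith("#") else None
            desc_el = target.find(TEI + "catDesc") if target is not None else None
            out.append((w.text, pointer, desc_el.text if desc_el is not None else None))
    return out


print(resolve_ana(SAMPLE))
```

One practical constraint: an xml:id must be an XML name and so cannot begin with a digit, which means a tag such as '4-S--------' could not serve as an xml:id unchanged and would need a prefix or a mapping to a different identifier.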

   

Since you mention it, there was significant discussion of issue 1670 at the Council Face2Face meeting in Victoria, but the ticket wasn't updated then because it wasn't handled as part of the ticket-processing sessions but as a main discussion item (as we recognise its importance ... there are much older, thornier tickets out there!). The ticket owner should update it when he gets time.

My unreliable memory of this is that att.linguistic was strongly supported, including having @lemma and @lemmaRef in it, and that @pos and @msd were also thought OK. I seem to remember that the concept of @join was acceptable but people wondered whether there was a better name (and I think I wondered what happens if two adjacent words have some form of conflicting @join, i.e. is this an error and should we add Schematron for it or something).

From my recollection most of the discussion was about the proposed @reg (whose name I certainly don't like for historical reasons). I'm sure I would have argued against the reintroduction of a @reg attribute, fearing people would abuse this for what <reg> was created for in editorial transcription, negating the whole war on text-bearing attributes and the creation of the <choice> element. I know from the ticket that you think imposing use of <choice> creates too much of a burden for regularisation, but you actually argue more in favour of it when you note that the proposed @reg might need to store multi-word sequences... exactly what we don't want in an attribute! Though your @reg attribute issue 2 on that ticket seems to ignore that <w> can self-nest? Surely that would be the solution for multi-word units needing a single @reg?

And I'm not against the introduction of new linguistic attributes, though I think this often ignores the power of XML child hierarchies. Personally, I want to avoid the storage of any free text of any sort in any attribute; that is, I like attribute values to be strongly tied to processable, checkable datatypes. (Thus I dislike @lemma for the same reason and think @lemmaRef should be used instead wherever feasible!)

So my memory of this ticket is that it was going to be moved to status Go (or this and Needs Discussion simultaneously, to reflect a need to change a couple of aspects of it).
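
If the proposed attributes do become available, migrating existing data could be a small mechanical step. The sketch below is an assumption-laden illustration: it takes the attribute name @msd from the proposal as recalled above, assumes the whole tagger string belongs in @msd (for some tagsets part of it may belong in @pos instead), and uses a hypothetical helper name, ana_to_msd.

```python
import xml.etree.ElementTree as ET


def ana_to_msd(w):
    """Move the value of @ana to @msd on one <w> element, dropping any leading '#'."""
    ana = w.attrib.pop("ana", None)
    if ana is not None:
        w.set("msd", ana.lstrip("#"))
    return w


# The <w> from this thread, without a namespace to keep the sketch short.
w = ET.fromstring('<w ana="4-S--------" lemma="in" n="in">in</w>')
ana_to_msd(w)
print(ET.tostring(w, encoding="unicode"))
```

Elements without @ana are left untouched, so the function can be applied to every <w> in a file.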


Best wishes,

James 


--

Dr James Cummings, [hidden email]

School of English Literature, Language, and Linguistics, Newcastle University



Re: PoS tagging in <w> with @ana: pointer?

Paolo Monella
Thank you James: I hadn't thought of the fact that '4-S--------' is as
valid a pointer as '#4-S--------', though both of them are lies, since
there is no '4-S--------' file just like there is no '#4-S--------'
fragment.

So I'll accept the suggestion of Piotr, not to bother about prepending a
'#' only to cut it off later.

I'm happy to hear that the Council Face2Face meeting liked @msd.

All best,
Paolo

