TEI conformance strategies

Jonathan Robie
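I have developed a system for querying Greek syntax trees in Jupyter notebooks, using XML optimized to make queries straightforward and to support useful indexes. Here's an overview:

http://jonathanrobie.biblicalhumanities.org/blog/2017/12/08/jupyter-tutorial/
http://jonathanrobie.biblicalhumanities.org/assets/greeksyntax-tutorial-proiel.html

And here is the data:

https://github.com/biblicalhumanities/greek-new-testament/tree/master/syntax-trees/nestle1904-lowfat/xml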

TEI conformance has not been a requirement, but it may soon become one.  I have a fair amount of background in XML, XQuery, XPath, etc., and have queried a boatload of TEI, but I don't have much experience creating TEI that doesn't conform to an existing schema.

Here is a sample of what I currently have.  I would love to add TEI headers and such, but I would hate to trade in my morphology attributes for cryptic parse code strings that require queries to select based on the nth character of an attribute, especially because there probably isn't an index on that.

Can I have my cake and eat it too?  Is there a way to get to TEI conformance with XML that is as intuitive and efficient to query as what I currently have?

Jonathan

   <sentence>
      <milestone unit="verse" id="John.11.35">John.11.35</milestone>
      <p>ἐδάκρυσεν ὁ Ἰησοῦς.</p>
      <wg class="cl">
         <wg class="cl" n="430110350010030">
            <w role="v"
               class="verb"
               osisId="John.11.35!1"
               n="430110350010010"
               lemma="δακρύω"
               normalized="ἐδάκρυσεν"
               strong="1145"
               number="singular"
               person="third"
               tense="aorist"
               voice="active"
               mood="indicative"
               head="true"
               gloss="Wept">ἐδάκρυσεν</w>
            <wg role="s" class="np" n="430110350020020" articular="true">
               <w class="det"
                  osisId="John.11.35!2"
                  n="430110350020010"
                  lemma="ὁ"
                  normalized="ὁ"
                  strong="3588"
                  number="singular"
                  gender="masculine"
                  case="nominative"
                  gloss="-">ὁ</w>
               <w class="noun"
                  type="proper"
                  osisId="John.11.35!3"
                  n="430110350030010"
                  lemma="Ἰησοῦς"
                  normalized="Ἰησοῦς"
                  strong="2424"
                  number="singular"
                  gender="masculine"
                  case="nominative"
                  head="true"
                  gloss="Jesus">Ἰησοῦς.</w>
            </wg>
         </wg>
      </wg>
   </sentence>
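
As a rough illustration of the kind of query this layout is meant to support, here is a minimal Python sketch that uses the standard library's ElementTree on a trimmed copy of the sample above. The variable names are illustrative only, and the project's own notebooks may well query the data differently (for example with XQuery).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above, keeping only the attributes used below.
SAMPLE = """
<sentence>
  <wg class="cl">
    <w role="v" class="verb" lemma="δακρύω" tense="aorist" mood="indicative">ἐδάκρυσεν</w>
    <wg role="s" class="np">
      <w class="det" lemma="ὁ" case="nominative">ὁ</w>
      <w class="noun" lemma="Ἰησοῦς" case="nominative">Ἰησοῦς.</w>
    </wg>
  </wg>
</sentence>
"""

root = ET.fromstring(SAMPLE)

# One attribute per feature: each predicate names the feature it tests.
aorist_indicatives = [
    w.text for w in root.findall(".//w[@tense='aorist'][@mood='indicative']")
]

# Nominative words inside the word group that fills the subject role.
subject_words = [
    w.text for w in root.findall(".//wg[@role='s']/w[@case='nominative']")
]

print(aorist_indicatives)
print(subject_words)
```

Neither query has to look inside an attribute value, which is the property the question above is trying to preserve.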

Re: TEI conformance strategies

Toma Tasovac-3
Hi Jonathan,

If you don't want to go for compact morphosyntactic annotation of the MULTEXT type (which would force you to query attribute values by the position of certain characters within them), you could definitely customize your TEI schema to allow additional attributes on <w>.
In that case, you should probably rename some of your elements to get as much as possible out of the plain TEI schema, so that you customize as little as possible. Something like this:

<text>
      <body>
         <p>
            <s>
               <milestone unit="verse" xml:id="John.11.35"/>
               <seg>ἐδάκρυσεν ὁ Ἰησοῦς.</seg>
               <seg type="cl">
                  <seg type="cl" n="430110350010030">
                     <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
                        lemma="δακρύω" normalized="ἐδάκρυσεν" strong="1145" number="singular"
                        person="third" tense="aorist" voice="active" mood="indicative" head="true"
                        gloss="Wept">ἐδάκρυσεν</w>
                  </seg>
               </seg>
            </s></p>
      </body>
   </text>


In the above segment, everything is kosher except attributes normalized, strong, number, person, tense, voice, mood, head and gloss. 

These you could add to <w> either using Roma http://www.tei-c.org/Roma/ or directly in the ODD file, generate a new schema and then have your XML conformant with your customized schema, which will, of course, also include teiHeader etc.
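
The renaming suggested above could be scripted rather than done by hand. The following is a minimal Python sketch, assuming the correspondences visible in the two samples (sentence to s, wg to seg, class to type or subtype, role to type, osisId to corresp). The function name `to_tei_names` and the two mapping tables are hypothetical, and the sketch does not handle the milestone id or the plain-text <p>.

```python
import xml.etree.ElementTree as ET

# Element renames, read off the two samples above.
ELEMENT_MAP = {"sentence": "s", "wg": "seg"}

# Attribute renames; keys are (original element name, attribute name).
ATTRIBUTE_MAP = {
    ("wg", "class"): "type",
    ("w", "role"): "type",
    ("w", "class"): "subtype",
    ("w", "osisId"): "corresp",
}


def to_tei_names(elem):
    """Rename elements and attributes in place; other attributes are kept as-is.

    Caveat: a <w> carrying both role and type would collide on type,
    and the later attribute would silently win.
    """
    original = elem.tag
    renamed = {}
    for name, value in elem.attrib.items():
        renamed[ATTRIBUTE_MAP.get((original, name), name)] = value
    elem.attrib.clear()
    elem.attrib.update(renamed)
    elem.tag = ELEMENT_MAP.get(original, original)
    for child in elem:
        to_tei_names(child)
    return elem


source = ET.fromstring(
    '<sentence><wg class="cl">'
    '<w role="v" class="verb" osisId="John.11.35!1" tense="aorist">ἐδάκρυσεν</w>'
    "</wg></sentence>"
)
print(ET.tostring(to_tei_names(source), encoding="unicode"))
```

Attributes that have no TEI counterpart (tense, mood and so on) pass through unchanged, so they would still need to be declared in the customized schema.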

By the way, there is also a TEI element <cl> for clauses, but since you had <wg class="cl">, which I guess means word group or something like that, I assumed that you will have things other than clauses there, so I went for the more generic <seg>.

It's also not ideal, I think, to mark up the actual Greek text and the interpretation of it with the same element (wg in your case or seg above); you could solve this by turning the interpretative seg into a <note> or something like that. Also, I'm not sure what the role of the nested <seg type="cl"> is in your example, but you probably have them for a reason.

All best,
Toma

--
Belgrade Center for Digital Humanities

Re: TEI conformance strategies

Piotr Bański
Hi all,

Jonathan says,

 >> Can I have my cake and eat it too?  Is there a way to get to TEI
 >> conformance with XML that is as intuitive and efficient to query as
 >> what I currently have?

"Intuitive" is naturally a subjective judgement, and efficiency to a
large extent depends on the assumed technology, so I'm not going into
that, merely noting that you haven't shown much of your syntactic
approach -- if I guess right, you indicate the clausal boundaries
(potentially redundantly: sentence > main_clause) and you also indicate
noun phrases.

A minimal approach towards greater conformance would probably involve
looking at the TEI element repertoire for tree hierarchies

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/AI.html#AILC


and bearing in mind that non-TEI attributes should be in a non-TEI
namespace (rather than in the null namespace, as in your example). So,
since you state that you don't want to parse a single string with a
morphosyntactic description, you would minimally want to go for
something like

<w jr:class="det"
        jr:osisId="John.11.35!2"
        n="430110350020010"
        lemma="ὁ"
        jr:normalized="ὁ"
        jr:strong="3588"
        jr:number="singular"
        jr:gender="masculine"
        jr:case="nominative"
        jr:gloss="-">ὁ</w>

(where I took "jr" as a nearly random prefix that would have to be bound
to your project's namespace)
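
To show what the namespaced attributes mean for a query, here is a minimal Python sketch using the standard library's ElementTree. The namespace URI `http://example.org/ns/jr` is a placeholder invented for this example; a real project would bind the prefix to its own URI.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption, not a real project namespace).
JR = "http://example.org/ns/jr"

SAMPLE = f"""
<w xmlns:jr="{JR}"
   jr:class="det" n="430110350020010" lemma="ὁ"
   jr:number="singular" jr:gender="masculine" jr:case="nominative">ὁ</w>
"""

w = ET.fromstring(SAMPLE)

# ElementTree spells a namespaced attribute as "{uri}localname".
case = w.get(f"{{{JR}}}case")

# Unprefixed attributes are in no namespace and are read by plain name.
lemma = w.get("lemma")

print(case, lemma)
```

The feature is still one attribute per value, so no string parsing is needed; the cost is the longer qualified name in each query.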

There is a freshly announced robust approach to syntactic description
offered by colleagues at Lyon, which I mention in case you were willing
to consider modifying your description above the word level:

https://groupes.renater.fr/wiki/txm-info/public/annotation/specs_annotation_analec

(in French).

I am cross-posting this to the [hidden email] list,
in case you cared to join it at

http://listserv.brown.edu/archives/cgi-bin/wa?A0=TEI-LINGUISTICS

and possibly continue the discussion there.



Now as a kind of post-scriptum: you mention, Jonathan, that you wouldn't
like to parse positional morphosyntactic attributes, EAGLES/MULTEXT
style. If you however cared to consider parsing attribute-value pairs,
concatenated as in

msd="number:singular|gender:masculine|case:nominative"

then you might want to pursue an approach that might eventually (pending
the Council's very thoroughly thought-out decision) offer a greater
degree of conformance, using the extended set of
grammatical/language-technology attributes that colleagues and I have
suggested in a ticket at

https://github.com/TEIC/TEI/issues/1670

The approach there would probably not cater to all your needs (I am not
sure what e.g. @strong is, and you clearly want to have an extra ID
there, etc.), but it would be a compromise, restricting the number of
non-TEI attributes in your schema, if you care about restricting them.

My primary motivation to mention that last issue is not so much any
pressure to reduce the number of non-TEI attributes (it's your choice)
as rather a light that blinked when reading your remark concerning
parsing positional indexes -- you don't need to follow where EAGLES
flew, and could just parse the @msd string above with XPath functions.
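
For illustration, splitting such an @msd string takes only a few lines. The sketch below is in Python rather than XPath (where `tokenize()` would do the equivalent splitting), and the function name `parse_msd` is made up for this example.

```python
def parse_msd(msd):
    """Split 'name:value|name:value' into a dict of feature names to values."""
    features = {}
    for pair in msd.split("|"):
        if not pair:
            continue  # tolerate an empty string or a stray separator
        name, _, value = pair.partition(":")
        features[name] = value
    return features


features = parse_msd("number:singular|gender:masculine|case:nominative")
print(features["case"])
```

Unlike a positional code, each value is looked up by feature name, so the order of the pairs in the string does not matter.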

Best,

    Piotr


Re: TEI conformance strategies

Jonathan Robie
In reply to this post by Toma Tasovac-3
Thanks, Toma, for this detailed and helpful answer.
  • Using <p/> and <s/> is clearly an improvement.
  • <seg/> instead of <wg/> is fine; the semantics feel a bit off, but you know ...
  • I prefer <seg type='cl'/> to <cl/> because many queries treat phrases and clauses the same way and I want to capture that commonality.  There are several kinds of phrases.
  • The outermost clause is an artifact that is not terribly useful.  In our ultimate format, it will probably not exist.
  • In this representation, the syntax tree is the actual text, and the plain-text representation is a note to make it easier to read.  So I might put that in a note instead? This isn't too bad, modulo the element note for the plain text representation.

<text>
      <body>
         <p>
            <s>
               <milestone unit="verse" xml:id="John.11.35"/>
               <note>ἐδάκρυσεν ὁ Ἰησοῦς.</note>
               <seg type="cl">
                  <seg type="cl" n="430110350010030">
                     <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
                        lemma="δακρύω" normalized="ἐδάκρυσεν" strong="1145" number="singular"
                        person="third" tense="aorist" voice="active" mood="indicative" head="true"
                        gloss="Wept">ἐδάκρυσεν</w>
                  </seg>
               </seg>
            </s></p>
      </body>
   </text>

I gather the other thing I need to do is put the attributes in my own namespace, though the first thing I would do when importing into the database is strip out the namespaces for easier queries.  So it could be stored like this:

<text>
      <body>
         <p>
            <s>
               <milestone unit="verse" xml:id="John.11.35"/>
               <note>ἐδάκρυσεν ὁ Ἰησοῦς.</note>
               <seg type="cl">
                  <seg type="cl" n="430110350010030">
                     <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
                        b:lemma="δακρύω" b:normalized="ἐδάκρυσεν" b:strong="1145" b:number="singular"
                        b:person="third" b:tense="aorist" b:voice="active" b:mood="indicative" b:head="true"
                        b:gloss="Wept">ἐδάκρυσεν</w>
                  </seg>
               </seg>
            </s></p>
      </body>
   </text>

But honestly, I don't want my users to have to worry about namespaces, and I would probably strip them out in the query environment.
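
The stripping step could look roughly like the following Python sketch, run before the data is loaded into the query environment. The function name `strip_attribute_namespaces` and the namespace URI in the usage lines are assumptions made for this example; the sketch deliberately leaves xml:id and xml:lang alone.

```python
import xml.etree.ElementTree as ET

# Attributes in the reserved XML namespace (xml:id, xml:lang) are kept as they are.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def strip_attribute_namespaces(root):
    """Rewrite '{uri}name' attribute keys to plain 'name' on every element, in place.

    Caveat: if an element already has a plain attribute with the same local
    name, the namespaced value overwrites it.
    """
    for elem in root.iter():
        for key in list(elem.attrib):
            if key.startswith("{") and not key.startswith(XML_NS):
                local = key.split("}", 1)[1]
                elem.attrib[local] = elem.attrib.pop(key)
    return root


doc = ET.fromstring(
    '<s xmlns:b="http://example.org/ns/b">'
    '<w xml:id="w1" n="1" b:tense="aorist" b:mood="indicative">ἐδάκρυσεν</w>'
    "</s>"
)
strip_attribute_namespaces(doc)
print([w.text for w in doc.findall(".//w[@tense='aorist']")])
```

After the step, queries can use the short attribute names again, while the files kept for interchange retain the namespaced form.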

How does that look to you?

Jonathan

Re: TEI conformance strategies

Jonathan Robie
In reply to this post by Jonathan Robie
Thanks, Piotr,

Yes, intuitive is subjective and depends on a lot of things, including the query environment. In my environment, people write queries in XPath or XQuery, and anything that requires them to parse the insides of an attribute makes that much harder. One of my goals is to just import this into BaseX or eXist-db and have it run efficiently. That doesn't work well with anything buried inside an attribute, because the database doesn't know that it needs to build an index on it. I also like to be able to treat word groups in the same way: many queries will look at a group of words that may be a phrase, a clause, or an individual word that has a particular semantic role, and I like as much commonality as possible in such queries.
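
The indexing point can be illustrated with a toy in-memory index. This Python sketch is only an analogy and makes no claim about how BaseX or eXist-db build their indexes; the function name `build_attribute_index` is hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def build_attribute_index(root, names):
    """Map (attribute name, whole attribute value) to the <w> elements carrying it."""
    index = defaultdict(list)
    for w in root.iter("w"):
        for name in names:
            value = w.get(name)
            if value is not None:
                index[(name, value)].append(w)
    return index


root = ET.fromstring(
    "<s>"
    '<w tense="aorist" mood="indicative">ἐδάκρυσεν</w>'
    '<w case="nominative">ὁ</w>'
    '<w case="nominative">Ἰησοῦς.</w>'
    "</s>"
)
index = build_attribute_index(root, ["tense", "mood", "case"])

# A lookup by whole value is a single dictionary access.
print([w.text for w in index[("case", "nominative")]])
```

An index keyed on whole attribute values answers `@case = 'nominative'` directly, whereas a feature packed inside a longer string can only be found by scanning and parsing each value.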

My use cases and requirements all involve doing specific queries on this dataset and other related datasets. I want to simplify those queries and make them run efficiently, under the assumption that sophisticated users are writing those queries in a Jupyter Notebook environment.  So that's what I am optimizing for.  Actually, I am also optimizing for one other thing: the ability to create and edit query trees using a little language called Treedown (http://jonathanrobie.biblicalhumanities.org/blog/2017/05/12/lowfat-treebanks-visualizing/), then use a parser to create an XML representation.  So it's quite possible that my use cases and requirements are substantially different from those of the other initiatives you pointed out.

I already responded to Toma in a separate message. Any thoughts on the markup I suggested there?  Obviously, if I strip the namespaces when importing into a database, I eliminate the advantage of having them except for TEI conformance.  But TEI conformance is the main advantage I am looking for here.

Thanks for pointing me to that mailing list; I will sign up.

Re: TEI conformance strategies

Lou Burnard-6
In reply to this post by Jonathan Robie
A couple of remarks with my conformance hat on.

(1) Most of your proposed attributes have values which are clearly not textual strings but enumerable encoded values (@type, @n, @strong, @tense), but others have values which look very much like textual strings (@normalized, @gloss, and arguably @lemma). I wonder if you've considered making the latter into parallel child elements instead? Something like

    <seg type="cl" n="430110350010030">
       <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
          lemma="δακρύω" strong="1145" number="singular"
          person="third" tense="aorist" voice="active" mood="indicative" head="true">ἐδάκρυσεν</w>
       <w type="normalised">ἐδάκρυσεν</w>
       <w type="gloss" xml:lang="en">Wept</w>
       <w type="gloss" xml:lang="fr">Pleura</w>
    </seg>


This would
(a) seem semantically more accurate: the gloss isn't a property of the word form, but something added to it in parallel
(b) permit more complex markup (e.g. of variant glyphs or subcomponents) within the normalisation or gloss
(c) permit glosses in multiple languages (as above), or multiple normalisations adopting different principles
(d) avoid getting caught up in skirmishes about whether or not the value of @xml:lang applicable to the content (Greek presumably) applies to the attribute values as well. ("active" isn't a Greek word, but since it's presumably an enumerable token rather than a string of text that's fine)

(2) TEI-defined attributes (unlike TEI-defined elements) don't belong to any namespace, so the conformance situation is less clear than it is for elements. I don't think the TEI has expressed a view as to whether non-namespaced non-TEI-defined attributes affect conformance or not. Attaching them to a different namespace and just naming them in some consistent way seem equally valid and effective as ways of making clear to a processor or end user what their status is.
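
To make remark (1) concrete, here is a minimal Python sketch of how the element-based glosses could be selected by language with the standard library's ElementTree. It runs on a trimmed copy of the fragment above, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """
<seg type="cl" n="430110350010030">
  <w type="v" subtype="verb" lemma="δακρύω" tense="aorist">ἐδάκρυσεν</w>
  <w type="normalised">ἐδάκρυσεν</w>
  <w type="gloss" xml:lang="en">Wept</w>
  <w type="gloss" xml:lang="fr">Pleura</w>
</seg>
"""

seg = ET.fromstring(SAMPLE)

# Collect every gloss, keyed by its language code.
glosses = {w.get(XML_LANG): w.text for w in seg.findall("w[@type='gloss']")}

print(glosses["fr"])
```

The trade-off is that a query for the word form itself now has to filter on @type, since the gloss and the normalised form are <w> elements too.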


On 19/12/17 20:37, Jonathan Robie wrote:
Thanks, Toma, for this detailed and helpful answer.
  • Using <p/> and <s/> is clearly an improvement. 
  • <seg/> instead of <wg/> is fine, the semantics feel a bit off, but you know ...
  • I prefer <seg wg='cl'/> to <cl/> because many queries treat phrases and clauses the same way and I want to capture that commonality.  There are several kinds of phrases.
  • The outermost clause is an artifact that is not terribly useful.  In our ultimate format, it will probably not exist.
  • In this representation, the syntax tree is the actual text, and the plain-text representation is a note to make it easier to read.  So I might put that in a note instead? This isn't too bad, modulo the element note for the plain text representation.
<text>
      <body>
         <p>
            <s>
               <milestone unit="verse" xml:id="John.11.35"/>
               <note>ἐδάκρυσεν ὁ Ἰησοῦς.</note>
               <seg type="cl">
                  <seg type="cl" n="430110350010030">
                     <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
                        lemma="δακρύω" normalized="ἐδάκρυσεν" strong="1145" number="singular"
                        person="third" tense="aorist" voice="active" mood="indicative" head="true"
                        gloss="Wept">ἐδάκρυσεν</w>
                  </seg>
               </seg>
            </s></p>
      </body>
   </text>

I gather the other thing I need to do is put the attributes in my own namespace, though the first thing I would do when importing into the database is strip out the namespaces for easier queries.  So it could be stored like this:

<text>
      <body>
         <p>
            <s>
               <milestone unit="verse" xml:id="John.11.35"/>
               <note>ἐδάκρυσεν ὁ Ἰησοῦς.</note>
               <seg type="cl">
                  <seg type="cl" n="430110350010030">
                     <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"

                        b:lemma="δακρύω" b:normalized="ἐδάκρυσεν" b:strong="1145" b:number="singular"
                        b:person="third" b:tense="aorist" b:voice="active" b:mood="indicative" b:head="true"
                        b:gloss="Wept">ἐδάκρυσεν</w>
                  </seg>
               </seg>
            </s></p>
      </body>
   </text>

But honestly, I don't want my users to have to worry about namespaces, and I would probably strip them out in the query environment.

How does that look to you?

Jonathan

On Tue, Dec 19, 2017 at 3:18 AM, Toma Tasovac <[hidden email]> wrote:
Hi Jonathan,

if you don't want to go for the compact morphosyntactic annotation of the MULTEXT type (which would make you query attribute values based on the position of certain strings within it), you could definitely customize your TEI schema to allow additional attributes on <w>.
In that case, you should probably do some renaming of your elements to get as much out of the plain TEI schema, in order to customize as little as possible. Something like this:

<text>
      <body>
         <p>
            <s>
               <milestone unit="verse" xml:id="John.11.35"/>
               <seg>ἐδάκρυσεν ὁ Ἰησοῦς.</seg>
               <seg type="cl">
                  <seg type="cl" n="430110350010030">
                     <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
                        lemma="δακρύω" normalized="ἐδάκρυσεν" strong="1145" number="singular"
                        person="third" tense="aorist" voice="active" mood="indicative" head="true"
                        gloss="Wept">ἐδάκρυσεν</w>
                  </seg>
               </seg>
            </s></p>
      </body>
   </text>


In the above segment, everything is kosher except attributes normalized, strong, number, person, tense, voice, mood, head and gloss. 

These you could add to <w> either using Roma http://www.tei-c.org/Roma/ or directly in the ODD file, generate a new schema and then have your XML conformant with your customized schema, which will, of course, also include teiHeader etc.

By the way, there is also a TEI element <cl> for clauses, but since you had <wg type="cl">, whichI guess means word group or something like that, I assumed that you will have other things than clauses there, so I went for the more generic <seg>. 

It's also not ideal, I think, to mark up the actual Greek text and the interpretation of it with the same element (wg in your case or seg above); this you could solve by turning the interpretative seg into a <note> or something like that. Also, I'm not sure what the role of the nested <seg type="cl"> in your example is, but you probably have it for a reason.

All best,
Toma

--
Belgrade Center for Digital Humanities



Re: TEI conformance strategies

Piotr Bański
[cross-posting again]

Reacting to two points extracted from Lou's message below:

1. "[consecutive <w>] would (a) seem semantically more accurate"

Markup semantics is on the whole yet another box of wonders. I see why
Lou would like to see some of the attributes promoted, but on the other
hand, since Jonathan rather explicitly assumes that his markup is to
reflect the assumed syntactic structure, we end up with a <seg>ment
consisting of 4 <w> elements, and it's not at all obvious that @typing
them helps, since (a) no such markup-semantic claim is standard or even
at the best-practice level (to say that within an XML structure meant to
be homomorphic with some modelling construct, @type creates multiple
dimensions or "delamination"[1] within that structure) and (b) Jonathan
uses @type for part-of-speech in his markup, and that even more strongly
speaks against "delamination" in this case.


2. Interesting point about the null namespace, thanks. I've spent most
of my adult life ;-) under an impression that foreign attributes
injected into the TEI *must* be namespaced.

Best,

    Piotr

[1] I'm sleep-deprived. There must exist a better term for multiple
planes hooked up to the same skeleton.




On 12/20/17 12:46, Lou Burnard wrote:

> A couple of remarks with my conformance hat on.
>
> (1) Most of your proposed attributes have values which are clearly not
> textual strings but enumerable encoded values (@type, @n, @strong,
> @tense) but
> others have values which look very much like textual strings
> (@normalised, @gloss, and arguably @lemma). I wonder if you've
> considered making the latter into parallel child elements instead?
> Something like
>
> <seg type="cl" n="430110350010030">
>   <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
>      lemma="δακρύω" strong="1145" number="singular"
>      person="third" tense="aorist" voice="active" mood="indicative" head="true">ἐδάκρυσεν</w>
>   <w type="normalised">ἐδάκρυσεν</w>
>   <w type="gloss" xml:lang="en">Wept</w>
>   <w type="gloss" xml:lang="fr">Pleura</w>
> </seg>
>
> This would
> (a) seem semantically more accurate : the gloss isn't a property of the
> word form, but something added to it in parallel
> (b) permit more complex markup (e.g. of variant glyphs or sub
> components) within the normalization or gloss
> (c) permit glosses in multiple languages (as above), or  multiple
> normalisations adopting different principles
> (d) avoid getting caught up in skirmishes about whether or not the value
> of @xml:lang applicable to the content (Greek presumably) applies to the
> value as well. ("active" isn't a Greek word, but since it's presumably
> an enumerable token rather than a string of text that's fine)
>
> (2) TEI-defined attributes (unlike TEI-defined elements) don't belong to
> any namespace so the conformance situation is less clear than it is for
> elements. I don't think the TEI has expressed a view as to whether
> non-namespaced non-TEI-defined attributes affect conformance or not.
> Attaching them to a different namespace, or just naming them in some
> consistent way,  seem equally valid or effective as a way of making
> clearer to a processor or end user what their status is.
>
>
> On 19/12/17 20:37, Jonathan Robie wrote:
>> Thanks, Toma, for this detailed and helpful answer.
>>
>>   * Using <p/> and <s/> is clearly an improvement.
>>   * <seg/> instead of <wg/> is fine, the semantics feel a bit off, but
>>     you know ...
>>   * I prefer <seg wg='cl'/> to <cl/> because many queries treat
>>     phrases and clauses the same way and I want to capture that
>>     commonality.  There are several kinds of phrases.
>>   * The outermost clause is an artifact that is not terribly useful.
>>     In our ultimate format, it will probably not exist.
>>   * In this representation, the syntax tree is the actual text, and
>>     the plain-text representation is a note to make it easier to
>>     read.  So I might put that in a note instead? This isn't too bad,
>>     modulo the element note for the plain text representation.
>>
>> <text>
>> <body>
>> <p>
>> <s>
>> <milestone unit="verse" xml:id="John.11.35"/>
>> <note>ἐδάκρυσεν ὁ Ἰησοῦς.</note>
>> <seg type="cl">
>> <seg type="cl" n="430110350010030">
>> <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
>>     lemma="δακρύω" normalized="ἐδάκρυσεν" strong="1145" number="singular"
>>     person="third" tense="aorist" voice="active" mood="indicative" head="true"
>>     gloss="Wept">ἐδάκρυσεν</w>
>> </seg>
>> </seg>
>> </s></p>
>> </body>
>> </text>
>>
>> I gather the other thing I need to do is put the attributes in my own
>> namespace, though the first thing I would do when importing into the
>> database is strip out the namespaces for easier queries.  So it could
>> be stored like this:
>>
>> <text>
>> <body>
>> <p>
>> <s>
>> <milestone unit="verse" xml:id="John.11.35"/>
>> <note>ἐδάκρυσεν ὁ Ἰησοῦς.</note>
>> <seg type="cl">
>> <seg type="cl" n="430110350010030">
>> <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
>>     b:lemma="δακρύω" b:normalized="ἐδάκρυσεν" b:strong="1145" b:number="singular"
>>     b:person="third" b:tense="aorist" b:voice="active" b:mood="indicative" b:head="true"
>>     b:gloss="Wept">ἐδάκρυσεν</w>
>> </seg>
>> </seg>
>> </s></p>
>> </body>
>> </text>
>>
>> But honestly, I don't want my users to have to worry about namespaces,
>> and I would probably strip them out in the query environment.
>>
>> How does that look to you?
>>
>> Jonathan
>>
>> On Tue, Dec 19, 2017 at 3:18 AM, Toma Tasovac
>> <[hidden email]> wrote:
>>
>>     Hi Jonathan,
>>
>>     if you don't want to go for the compact morphosyntactic annotation
>>     of the MULTEXT type (which would make you query attribute values
>>     based on the position of certain strings within it), you could
>>     definitely customize your TEI schema to allow additional
>>     attributes on <w>.
>>     In that case, you should probably do some renaming of your
>>     elements to get as much out of the plain TEI schema, in order to
>>     customize as little as possible. Something like this:
>>
>>     <text>
>>     <body>
>>     <p>
>>     <s>
>>     <milestone unit="verse" xml:id="John.11.35"/>
>>     <seg>ἐδάκρυσεν ὁ Ἰησοῦς.</seg>
>>     <seg type="cl">
>>     <seg type="cl" n="430110350010030">
>>     <w type="v" subtype="verb" corresp="John.11.35!1" n="430110350010010"
>>         lemma="δακρύω" normalized="ἐδάκρυσεν" strong="1145" number="singular"
>>         person="third" tense="aorist" voice="active" mood="indicative" head="true"
>>         gloss="Wept">ἐδάκρυσεν</w>
>>     </seg>
>>     </seg>
>>     </s></p>
>>     </body>
>>     </text>
>>
>>
>>     In the above segment, everything is kosher except attributes
>>     normalized, strong, number, person, tense, voice, mood, head and
>>     gloss.
>>
>>     These you could add to <w> either using Roma
>>     http://www.tei-c.org/Roma/ or
>>     directly in the ODD file, generate a new schema and then have your
>>     XML conformant with your customized schema, which will, of course,
>>     also include teiHeader etc.
>>
>>     By the way, there is also a TEI element <cl> for clauses, but
>>     since you had <wg type="cl">, which I guess means word group or
>>     something like that, I assumed that you will have other things
>>     than clauses there, so I went for the more generic <seg>.
>>
>>     It's also not ideal, I think, to mark up the actual Greek text and
>>     the interpretation of it with the same element (wg in your case or
>>     seg above) —  this you could solve by turning the interpretative
>>     seg into a <note> or something like that. Also I'm not sure what's
>>     the role of nested <seg type="cl"> in your example, but you
>>     probably have them for a reason.
>>
>>     All best,
>>     Toma
>>
>>     --
>>     Belgrade Center for Digital Humanities
>>     http://humanistika.org
>>
>>>     On 18 Dec 2017, at 22:42, Jonathan Robie
>>>     <[hidden email]> wrote:
>>>
>>>     I have developed a system for querying Greek syntax trees in
>>>     Jupyter notebooks, using XML optimized to make queries
>>>     straightforward and to support useful indexes. Here's an overview:
>>>
>>>     http://jonathanrobie.biblicalhumanities.org/blog/2017/12/08/jupyter-tutorial/
>>>     http://jonathanrobie.biblicalhumanities.org/assets/greeksyntax-tutorial-proiel.html
>>>
>>>     And here is the data:
>>>
>>>     https://github.com/biblicalhumanities/greek-new-testament/tree/master/syntax-trees/nestle1904-lowfat/xml
>>>
>>>     TEI conformance has not been a requirement, but it may soon
>>>     become one.  I have a fair amount of background in XML, XQuery,
>>>     XPath, etc., and have queried a boatload of TEI, but I don't have
>>>     much experience creating TEI that doesn't conform to an existing
>>>     schema.
>>>
>>>     Here is a sample of what I currently have.  I would love to add
>>>     TEI headers and such, I would hate to trade in my morphology
>>>     attributes for cryptic parse code strings that require queries to
>>>     select based on the nth character of an attribute, especially
>>>     because there probably isn't an index on that.
>>>
>>>     Can I have my cake and eat it too?  Is there a way to get to TEI
>>>     conformance with XML that is as intuitive and efficient to query
>>>     as what I currently have?
>>>
>>>     Jonathan
>>>
>>>        <sentence>
>>>           <milestone unit="verse" id="John.11.35">John.11.35</milestone>
>>>           <p>ἐδάκρυσεν ὁ Ἰησοῦς.</p>
>>>           <wg class="cl">
>>>              <wg class="cl" n="430110350010030">
>>>                 <w role="v"
>>>                    class="verb"
>>>                    osisId="John.11.35!1"
>>>                    n="430110350010010"
>>>                    lemma="δακρύω"
>>>                    normalized="ἐδάκρυσεν"
>>>                    strong="1145"
>>>                    number="singular"
>>>                    person="third"
>>>                    tense="aorist"
>>>                    voice="active"
>>>                    mood="indicative"
>>>                    head="true"
>>>                    gloss="Wept">ἐδάκρυσεν</w>
>>>                 <wg role="s" class="np" n="430110350020020" articular="true">
>>>                    <w class="det"
>>>                       osisId="John.11.35!2"
>>>                       n="430110350020010"
>>>                       lemma="ὁ"
>>>                       normalized="ὁ"
>>>                       strong="3588"
>>>                       number="singular"
>>>                       gender="masculine"
>>>                       case="nominative"
>>>                       gloss="-">ὁ</w>
>>>                    <w class="noun"
>>>                       type="proper"
>>>                       osisId="John.11.35!3"
>>>                       n="430110350030010"
>>>                       lemma="Ἰησοῦς"
>>>                       normalized="Ἰησοῦς"
>>>                       strong="2424"
>>>                       number="singular"
>>>                       gender="masculine"
>>>                       case="nominative"
>>>                       head="true"
>>>                       gloss="Jesus">Ἰησοῦς.</w>
>>>                 </wg>
>>>              </wg>
>>>           </wg>
>>>        </sentence>
>>
>>
>