Returning fragments from TEI documents?

Returning fragments from TEI documents?

Jonathan Robie
When querying documents or doing full text search, I need to return the portion of a document that matches, which may or may not be well-formed XML.

Is there a standard wrapper element for doing this?

Jonathan

Re: Returning fragments from TEI documents?

C. M. Sperberg-McQueen
> On Sep 14, 2017, at 7:41 AM, Jonathan Robie <[hidden email]> wrote:
>
> When querying documents or doing full text search, I need to return the portion of a document that matches, which may or may not be well-formed XML.
>
> Is there a standard wrapper element for doing this?

Assuming it’s well-balanced XML with no unexpanded entities,
I think tei:ab might be what you’re looking for.

If it were me, my inclination would be to extend the vocabulary
with a ‘hit’ or ‘result’ element in a separate namespace.  (And
wrap the entire thing in a ‘hits’ or ‘results’ element which also
records metadata like time and date of query and what the
server understood the query to be.  Perhaps that’s just
because I so often find myself needing to check to see whether
the query issued by the user has in fact arrived at the server
in the expected form, but it doesn’t take much space or time
and can be very handy.)

But then I seem to have less hesitation to customize TEI than
some other users, so YMMV.
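
As a rough illustration of the 'results' wrapper idea, here is a minimal Python sketch. The namespace URI, the element names q:results and q:result, and the attribute names query and when are all hypothetical, chosen only for this example; the sketch also assumes each fragment is well-balanced and uses unprefixed TEI element names.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

TEI_NS = "http://www.tei-c.org/ns/1.0"
Q_NS = "http://example.org/ns/query"  # hypothetical namespace, not part of TEI

ET.register_namespace("tei", TEI_NS)
ET.register_namespace("q", Q_NS)


def wrap_results(query, fragments, when=None):
    """Wrap well-balanced TEI fragments in q:result elements under one q:results."""
    when = when or datetime.now(timezone.utc)
    # Record what the server understood the query to be, and when it ran.
    results = ET.Element(f"{{{Q_NS}}}results",
                         {"query": query, "when": when.isoformat()})
    for fragment in fragments:
        # Parsing the fragment inside the q:result wrapper lets the parser
        # accept content that has no single root element; the default
        # namespace declaration puts unprefixed tags in the TEI namespace.
        hit = ET.fromstring(
            f'<q:result xmlns:q="{Q_NS}" xmlns="{TEI_NS}">{fragment}</q:result>')
        results.append(hit)
    return results


results = wrap_results("goddess", ["Sing, <hi>goddess</hi>, the wrath"])
print(ET.tostring(results, encoding="unicode"))
```

A fragment that is not well-balanced makes ET.fromstring raise ParseError, so this only covers the well-balanced case.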

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
[hidden email]
http://www.blackmesatech.com
********************************************

Re: Returning fragments from TEI documents?

Toma Tasovac-3
In reply to this post by Jonathan Robie
I don’t think there is a standard wrapper for this in TEI. 

In our dictionary API (http://docs.raskovnikapi.apiary.io/#), which serves TEI-encoded data, we have two types of wrapper, both of which are optional and can be controlled with parameters when we make the call to the API:

1. <exist:result>, which is in a different namespace, wraps the entire result
2. <tei:div>, which we use to group results per dictionary (with the dictionary ID in the xml:id of each div), or for some other types of grouping, for instance the current, left and right contexts of an entry in the dictionary macrostructure… (http://docs.raskovnikapi.apiary.io/#reference/0/context/list-context)

We put pagination information and links to the first, previous, next and last pages (for multi-page results) in the header.

All best,
Toma
--
Belgrade Center for Digital Humanities


Re: Returning fragments from TEI documents?

Torsten Schassan-2
Jonathan,

If I understand correctly, there may be one more problem, and it is not about TEI. You wrote that your result might be a "portion of a document that matches, which may or may not be well-formed XML". If your portion is not well-formed XML, there is *no* way to wrap it in any TEI element, because the result will still not be well-formed and thus not XML at all. You should first make sure your result is well-formed and then apply any of the proposed solutions.

Best,
Torsten

-
Torsten Schassan - Digital Editions - Manuscript and Special Collections
Herzog August Bibliothek, D-38299 Wolfenbuettel, Tel.: +49 5331 808-130 Fax -165
Manuscript database <http://diglib.hab.de/?db=mss>



Re: Returning fragments from TEI documents?

Lou Burnard-6
In reply to this post by C. M. Sperberg-McQueen

Jonathan doesn't expound much on the context in which this requirement arises, but I'm supposing that he's thinking of producing something like a KWIC index, in which an arbitrary chunk of text (say 10 words to the left and 10 to the right of a specified search term) is to be returned as the result of a search. As others have noted,  in the general case this is not straightforward since the fragment you want to return may not be well formed. Your search term might, for example, be the first word in a <div> containing hundreds of words. When dealing with this issue for Xaira, we identified two possible solutions, and (typically) tried both of them.

a) make the fragment well formed by adding tags as necessary

b) be honest: you are now returning an arbitrary string derived from a marked up source: either eliminate the markup or represent it as empty tags

So, if my algorithm for finding the arbitrary chunk of text around WORD is going to produce  something like this

"lots of other words</div> <div> WORD lots of other words</div>"

Option (a) is to generate instead

<my:result><div><gap/>lots of other words</div><div>WORD lots of other words</div></my:result>

And option (b) is to generate instead

<my:result>lots of other words<tag n="div" type="end-tag"/><tag n="div" type="start-tag"/>WORD lots of other words<tag n="div" type="end-tag"/></my:result>

As you can see, both approaches take for granted Michael's suggestion of using a personalised schema rather than contorting existing TEI elements to provide the wrapper. The text we're modelling here is the output from an algorithm for finding contextual chunks; it's not really a representation of (a fragment of) the source document. I think I'd also add an element <my:hit> to wrap the hit WORD or search term. Note that these options also work if the search is aware of the markup (e.g. "find WORD as the first word in a div element").
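
The two options can be sketched in Python roughly as follows. This is not how Xaira did it, only an illustration under stated assumptions: the fragment was cut from a well-formed source, it contains no comments, CDATA sections or processing instructions, and the my: namespace URI and the helper names balance, flatten and wrap are invented for the example. Attributes on tags are dropped in option (b), and start-tags re-created for option (a) carry no attributes because the fragment does not contain them.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

MY_NS = "http://example.org/ns/my"  # hypothetical namespace for my:result

# Matches start-tags, end-tags and empty-element tags (a simplification).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)([^<>]*?)(/?)>")


def balance(fragment: str) -> str:
    """Option (a): add the start- and end-tags the fragment is missing."""
    open_stack = []      # elements opened inside the fragment, not yet closed
    missing_starts = []  # end-tags whose start-tag lies before the fragment
    for m in TAG.finditer(fragment):
        closing, name, _attrs, empty = m.groups()
        if empty:
            continue
        if not closing:
            open_stack.append(name)
        elif open_stack:
            if open_stack[-1] != name:
                raise ValueError(f"unexpected end-tag </{name}>")
            open_stack.pop()
        else:
            missing_starts.append(name)
    # The outermost missing element was closed last, so reverse the order.
    prefix = "".join(f"<{n}>" for n in reversed(missing_starts))
    if missing_starts:
        prefix += "<gap/>"  # marks text omitted before the fragment
    suffix = "".join(f"</{n}>" for n in reversed(open_stack))
    return prefix + fragment + suffix


def flatten(fragment: str) -> str:
    """Option (b): replace every tag with an empty <tag/> marker."""
    def marker(m):
        closing, name, _attrs, empty = m.groups()
        kind = "empty-tag" if empty else "end-tag" if closing else "start-tag"
        return f"<tag n={quoteattr(name)} type={quoteattr(kind)}/>"
    return TAG.sub(marker, fragment)


def wrap(content: str) -> str:
    return f'<my:result xmlns:my="{MY_NS}">{content}</my:result>'


raw = "lots of other words</div> <div> WORD lots of other words</div>"
print(wrap(balance(raw)))
print(wrap(flatten(raw)))
```

Both outputs parse as XML, which the raw string does not.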



Re: Returning fragments from TEI documents?

Peter Boot-3
> As others have noted, in the general case this is not straightforward since the fragment you want to 
> return may not be well formed. Your search term might, for example, be the first word in a <div> 
> containing hundreds of words. When dealing with this issue for Xaira, we identified two possible 
> solutions, and (typically) tried both of them.
> a) make the fragment well formed by adding tags as necessary
> b) be honest: you are now returning an arbitrary string derived from a marked up source: either eliminate the markup or represent it as empty tags


Rather than trying to make the result well-formed, which involves messing with the result string, why not embed it within a CDATA section:

<my:result><![CDATA[lots of other words</div> <div> WORD lots of other words</div>]]></my:result>
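
A minimal Python sketch of that wrapping, assuming a made-up my: namespace URI and a hypothetical helper name cdata_wrap. The one thing to watch is that the raw string must not contain "]]>" unescaped, so the sketch splits any such sequence across two CDATA sections.

```python
import xml.etree.ElementTree as ET

MY_NS = "http://example.org/ns/my"  # hypothetical namespace for my:result


def cdata_wrap(raw: str) -> str:
    """Embed an arbitrary string in my:result as CDATA."""
    # "]]>" would end the CDATA section early, so split it over two sections.
    safe = raw.replace("]]>", "]]]]><![CDATA[>")
    return f'<my:result xmlns:my="{MY_NS}"><![CDATA[{safe}]]></my:result>'


raw = "lots of other words</div> <div> WORD lots of other words</div>"
result = ET.fromstring(cdata_wrap(raw))

# The parser hands back the original string as character data;
# the div tags inside it are text, not elements.
print(len(result))          # number of child elements
print(result.text == raw)   # whether the string round-trips unchanged
```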



Re: Returning fragments from TEI documents?

Lou Burnard-6

Yes, this would work. But then you couldn't do any useful XML processing on the result string (e.g. display the hit word in a particular way).



Re: Returning fragments from TEI documents?

Piotr Banski
I'd say that the question is pretty badly stated because if it assumes
full-text search and returning non-XML, and if it assumes any kind of
systematic handling of results, then it has to assume a non-XML way of
handling these results (including highlighting, sorting, etc.)  and,
with all due respect, asking for a TEI-based wrapper in such a case
resembles beating around the bush rather than searching for a good
across-the-board solution.

HTH,

   P.


Re: Returning fragments from TEI documents?

Hugh Cayless-2
I think that's a rather uncharitable interpretation, but allow me to clarify, as I'm working on this with Jonathan. We're looking at defining an API for text retrieval, one of the functions of which would be to return document chunks (think chapter, section, line, etc.). These obviously would often be well-formed, especially at larger sizes, but non-verse lines especially might not be. They'd be balanced, but not well-formed without a wrapper element.

One of our collaborators asked me if there was such a standard wrapper in TEI, and I said "No, but maybe there should be." The question, restated a bit, is this:

Is there any agreed-on standard for representing TEI fragments in such a way that they're clearly TEI, but not misrepresented as complete? Wrapping them in an <ab>, for example, risks making the assertion that the <ab> is part of the original document or that the fragment is complete.

Hugh

Sent from my phone.


Re: Returning fragments from TEI documents?

Piotr Banski
Hi Hugh,

Apologies for letting the local weather affect my mood and sounding
'uncharitable', and thanks for the restatement of the question.

I was wondering about what possibly stood behind the question, and what
could be going on in terms of information needs and query processing,
wondering also about whether the "agent" (broadly speaking) on the
querying side has control of the queried resource (in terms of being
able to access the schema, and somewhat crucially, to effect potential
corrections in the source -- so that the query system needn't compensate
for any breakage but can safely die with an appropriate message and wait
for the source to get fixed). If it is a closed system then you probably
don't need to worry about signalling that it's TEI, since you are always
able to recover the context and even to validate the context.

 > but non-verse lines especially might not be

But these are surely wrapped inside a larger well-formed chunk, and if
you need to single out the exact hit then it will (especially when you
use full-text search) usually be an entity separable from the XML. It
seems to me that you are looking at such a dichotomy from the beginning:
XML context on the one hand (which just happens to be TEI) and a
sequence of characters co-extensive with (part of) the string value of
that context.

What I'm driving at is that perhaps you want to maintain this dichotomy
in your search results, keeping XML in the context field for all sorts
of fanciness, and in another field, keeping the character sequence
together with its metadata (which, among other things, enables anchoring
the hit in the context provided, by pointer magic that I know you are
deeply familiar with).
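
One way to picture such a two-field result is the Python sketch below. The class name, the field names and the choice of a plain character offset as the "pointer" are all assumptions made for illustration; a real system might well use a richer pointing scheme.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchResult:
    context_xml: str  # well-formed XML context (which happens to be TEI)
    hit: str          # the matched character sequence
    offset: int       # where the hit starts in the string value of the context


def make_result(context_xml: str, term: str) -> Optional[SearchResult]:
    """Find term in the string value of the context and record its offset."""
    root = ET.fromstring(context_xml)
    string_value = "".join(root.itertext())
    offset = string_value.find(term)
    if offset < 0:
        return None
    return SearchResult(context_xml, term, offset)


# The hit may straddle markup: "goddess" is split by the hi element here.
print(make_result("<l>Sing, <hi>god</hi>dess, the wrath</l>", "goddess"))
```

The context stays available for XML processing, while the hit is anchored only by its position in the text.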

But if so, we're not strictly in the TEI universe any longer, and what
seems to be needed is information about the schema for the sake of the
context XML fragment, and maybe also the MIME type with a suitable
parameter, if applicable (see [1]), and of course some application of
the pointer magic, but I am not so sure that the TEI wants to deal with
the nature and structure of such a result envelope, because the TEI is
more or less a coincidence in such a picture.

[ I keep thinking back to the egXML treatment of fragments (feasible,
valid, etc.) that I have never been able to fully grasp, but this is
probably only a very distant and useless analogy in this case. ]

[1]: https://github.com/TEIC/TEI/issues/1483

Best,

   Piotr



Re: Returning fragments from TEI documents?

Martin Holmes
In reply to this post by Hugh Cayless-2
In the CodeSharing project I use <egXML> for wrapping search results,
but that's probably not appropriate for your case. It does have the
advantage of coming with fewer expectations about the validity of the
content, but of course it shifts things into a different namespace.

Cheers,
Martin


Re: Returning fragments from TEI documents?

Lou Burnard-6
In reply to this post by Hugh Cayless-2

I don't see why using <ab> carries any implication about completeness, though it arguably carries the implication that this is a block identified in the TEI source text. Hence the suggestion of making the constructed nature of the fragment explicit by using a special tag. But you could just say <ab type="extractedFragment">.

If you're going to return XML chunks, the only possible ill-formedness will be the lack of a wrapper, which it's easy enough to add. In my earlier response I was thinking about the more general case where your extraction is likely to return arbitrary strings from the input document, with arbitrary tags in them. I am somewhat relieved to gather that this is not the issue.
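
For the balanced-chunk case, adding the wrapper might look like this in Python. The function name wrap_chunk is invented for the example, and the sketch assumes the chunk uses unprefixed TEI element names so that the default namespace on the wrapper applies to them.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def wrap_chunk(chunk: str) -> str:
    """Wrap a well-balanced chunk in <ab type="extractedFragment">."""
    wrapped = f'<ab xmlns="{TEI_NS}" type="extractedFragment">{chunk}</ab>'
    # Parsing checks the assumption: a chunk that is not well-balanced
    # raises xml.etree.ElementTree.ParseError here.
    ET.fromstring(wrapped)
    return wrapped


# A line with mixed content has no single root element of its own,
# but parses once the wrapper is added.
print(wrap_chunk("Sing, <hi>goddess</hi>, the wrath"))
```

An arbitrary string such as "words</div>" fails the check, which is the more general case that needs one of the other techniques.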



Re: Returning fragments from TEI documents?

C. M. Sperberg-McQueen
In reply to this post by Hugh Cayless-2
I’m grateful to Hugh for posting this clarification, although I think
that much of the background should have been clear already to a
reasonably attentive and informed reader of Jonathan Robie’s initial
inquiry.

(Perhaps I should explain, for those who do not know him. Jonathan
Robie is the editor of the XQuery spec and the lead implementor of
more SGML and XML database engines than I've had hot dinners this
week.  The latter fact may not be widely publicized, but the first is
reasonably easy to learn.  It may safely be assumed that he is
reasonably conversant with the concept of well-formed XML and ways to
wrestle non-well-formed data into an XML context.  He asked about a
TEI wrapper element because that was the piece of information he
needed.)


> On Sep 15, 2017, at 8:29 AM, Hugh Cayless <[hidden email]> wrote:
>
> I think that's a rather uncharitable interpretation, but allow me to clarify, as I'm working on this with Jonathan. We're looking at defining an API for text retrieval, one of the functions of which would be to return document chunks (think chapter, section, line, etc.). These obviously would often be well-formed, especially at larger sizes, but non-verse lines especially might not be. They'd be balanced, but not well-formed without a wrapper element.
>
> One of our collaborators asked me if there was such a standard wrapper in TEI, and I said "No, but maybe there should be." The question, re-stated a bit is this:
>
> Is there any agreed on standard for representing TEI fragments in such a way that they're clearly TEI, but not misrepresented as complete? Wrapping them in an <ab> for example, risks making the assertion that the <ab> is part of the original document or that the fragment is complete.

I think that if the ‘ab’ is either the child of a
my-query-system:results element, or the outermost element of a result
returned by the my-query-system:ftquery() function, the ‘ab’ can be
assumed to have been supplied by the query.  That is, if the result
happened to be an already present tei:ab element, I’d wrap it in
another tei:ab element to avoid any ambiguity, even if any other
single-node result is just returned without a wrapper.  I think that
should also prevent confusion about completeness (as do both of the
techniques Lou mentions in connection with Xaira).

But fwiw I think a non-tei:result element really is a simpler solution.

Michael


********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
[hidden email]
http://www.blackmesatech.com
********************************************

Re: Returning fragments from TEI documents?

Piotr Banski
I stand doubly chastised! :-)

Best,

   P.


Re: Returning fragments from TEI documents?

Hugh Cayless-2
In reply to this post by Martin Holmes
I'd wondered about egXML...apart from the problem that it uses the example namespace, it's almost exactly the thing for the job. I was toying with the idea of proposing an element for this sort of thing. To clarify a bit further, since this thing is an API, it raises the possibility that we'll end up with bits of document displayed out of context. So wrapping the fragments in an element that explicitly means "my contents are out of their original context" has some appeal.

Hugh
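
A minimal sketch of the wrapper idea described above, in Python. The namespace URI, the element name `fragment`, and the `source` attribute are all hypothetical, invented only for illustration; nothing here is an existing TEI element. It assumes the fragment is well-balanced and uses unprefixed TEI element names.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Hypothetical namespace for the wrapper; not defined by TEI.
FRAGMENT_NS = "http://example.org/ns/fragment"


def wrap_fragment(fragment: str, source: str) -> ET.Element:
    """Wrap a well-balanced TEI fragment in a non-TEI wrapper element.

    The wrapper declares the TEI namespace as the default, so the
    unprefixed elements inside the fragment are parsed as TEI elements.
    """
    wrapped = (
        f'<f:fragment xmlns:f="{FRAGMENT_NS}" xmlns="{TEI_NS}">'
        f"{fragment}</f:fragment>"
    )
    root = ET.fromstring(wrapped)
    # Hypothetical attribute recording where the fragment came from.
    root.set("source", source)
    return root


if __name__ == "__main__":
    # Mixed content: balanced, but not well-formed without a wrapper.
    root = wrap_fragment("a line with <hi>mixed</hi> content<lb/>and more", "doc1")
    print(ET.tostring(root, encoding="unicode"))
```

Because the wrapper sits outside the TEI namespace, a consumer can tell at a glance that it was supplied by the API and not by the source document.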

On Fri, Sep 15, 2017 at 12:09 PM, Martin Holmes <[hidden email]> wrote:
In the CodeSharing project I use <egXML> for wrapping search results, but that's probably not appropriate for your case. It does have the advantage of coming with fewer expectations about the validity of the content, but of course it shifts things into a different namespace.

Cheers,
Martin


On 2017-09-15 07:29 AM, Hugh Cayless wrote:
I think that's a rather uncharitable interpretation, but allow me to clarify, as I'm working on this with Jonathan. We're looking at defining an API for text retrieval, one of the functions of which would be to return document chunks (think chapter, section, line, etc.). These obviously would often be well-formed, especially at larger sizes, but non-verse lines especially might not be. They'd be balanced, but not well-formed without a wrapper element.

One of our collaborators asked me if there was such a standard wrapper in TEI, and I said "No, but maybe there should be." The question, re-stated a bit is this:

Is there any agreed on standard for representing TEI fragments in such a way that they're clearly TEI, but not misrepresented as complete? Wrapping them in an <ab> for example, risks making the assertion that the <ab> is part of the original document or that the fragment is complete.

Hugh

Sent from my phone.

On Sep 15, 2017, at 09:59, Piotr Bański <[hidden email]> wrote:

I'd say that the question is pretty badly stated because if it assumes full-text search and returning non-XML, and if it assumes any kind of systematic handling of results, then it has to assume a non-XML way of handling these results (including highlighting, sorting, etc.)  and, with all due respect, asking for a TEI-based wrapper in such a case resembles beating around the bush rather than searching for a good across-the-board solution.

HTH,

  P.

On 09/15/17 15:19, Lou Burnard wrote:
Yes, this would work. But then you couldn't do any useful XML processing on the result string (e.g. display the hit word in a particular way)
On 15/09/17 11:18, Peter Boot wrote:
As others have noted, in the general case this is not straightforward since the fragment you want to
return may not be well formed. Your search term might, for example, be the first word in a <div>
containing hundreds of words. When dealing with this issue for Xaira, we identified two possible
solutions, and (typically) tried both of them.
a) make the fragment well formed by adding tags as necessary
b) be honest: you are not returning an arbitrary string derived from a marked up source: either eliminate the markup or represent it as empty tags
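
The two options above can be sketched roughly as follows, in Python. This is not how Xaira did it; it is a simplified illustration that assumes the source was properly nested, uses a regular expression where a real tokenizer would be safer (comments, CDATA and ">" inside attribute values are not handled), and cannot recover the attributes of start tags that fall outside the fragment.

```python
import re

# Matches start tags, end tags and empty-element tags (simplified).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)([^<>]*?)(/?)>")


def balance(fragment: str) -> str:
    """Option (a): add the tags needed to make the fragment balanced."""
    open_stack = []      # start tags seen in the fragment, not yet closed
    missing_starts = []  # end tags whose start tag lies before the fragment
    for match in TAG.finditer(fragment):
        closing, name, _attrs, empty = match.groups()
        if empty:
            continue  # an empty-element tag needs no partner
        if closing:
            if open_stack and open_stack[-1] == name:
                open_stack.pop()
            else:
                missing_starts.append(name)
        else:
            open_stack.append(name)
    # End tags arrive innermost first, so the outermost start tag goes first.
    prefix = "".join(f"<{name}>" for name in reversed(missing_starts))
    # Unclosed start tags are closed innermost first.
    suffix = "".join(f"</{name}>" for name in reversed(open_stack))
    return prefix + fragment + suffix


def strip_markup(fragment: str) -> str:
    """Option (b), first variant: eliminate the markup altogether."""
    return TAG.sub("", fragment)


if __name__ == "__main__":
    hit = "lots of other words</div> <div> WORD lots of other words</div>"
    print(balance(hit))
    print(strip_markup(hit))
```

Note that the balanced output may still contain several top-level nodes, so it needs a wrapper element before it can be parsed as a document.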

Rather than trying to make the result well-formed, which involves messing with the result string, why not embed it within a CDATA section:

<my:result><![CDATA[lots of other words</div> <div> WORD lots of other words</div>]]></my:result>

?



Re: Returning fragments from TEI documents?

Jonathan Robie
In reply to this post by Lou Burnard-6
Lou Burnard-6 wrote
> Jonathan doesn't expound much on the context in which this requirement
> arises, but I'm supposing that he's thinking of producing something like
> a KWIC index, in which an arbitrary chunk of text (say 10 words to the
> left and 10 to the right of a specified search term) is to be returned
> as the result of a search.

Sorry to ask the question and disappear. No, I really am expecting elements
to be returned; this does not describe my use case.

I do assume well-balanced XML, with no unexpanded entities.



--
Sent from: http://tei-l.970651.n3.nabble.com/

Re: Returning fragments from TEI documents?

Jonathan Robie
In reply to this post by Peter Boot-3
Peter Boot-3 wrote
> Rather than trying to make the result well-formed, which involves messing
> with the result string, why not embed it within a CDATA section?

I'm thinking of results as elements rather than strings.  I could extract
them from a CDATA section and parse them, of course, but I would rather not
prevent them from being parsed as XML.
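
To illustrate the trade-off, here is a small Python sketch. The namespace URI bound to the `my` prefix is hypothetical, as is the temporary `wrapper` element. It shows that a CDATA payload reaches the consumer as a plain string, so getting elements out of it takes a second parse, and that second parse only succeeds when the payload is well-balanced.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the my:result element.
MY_NS = "http://example.org/ns/my-query-system"


def parse_cdata_payload(result_xml: str):
    """Return the CDATA payload re-parsed as elements, or None if it is not balanced."""
    result = ET.fromstring(result_xml)
    # The parser hands CDATA content back as character data, not as child elements.
    payload = result.text or ""
    try:
        # Second parse: a temporary wrapper makes balanced content well-formed.
        return ET.fromstring(f"<wrapper>{payload}</wrapper>")
    except ET.ParseError:
        return None


BALANCED = (
    f'<my:result xmlns:my="{MY_NS}">'
    "<![CDATA[some <hi>WORD</hi> here]]></my:result>"
)
UNBALANCED = (
    f'<my:result xmlns:my="{MY_NS}">'
    "<![CDATA[lots of other words</div> <div> WORD lots of other words</div>]]>"
    "</my:result>"
)

if __name__ == "__main__":
    print(len(ET.fromstring(BALANCED)))     # number of child elements after the first parse
    print(parse_cdata_payload(BALANCED))    # an Element after the second parse
    print(parse_cdata_payload(UNBALANCED))  # the unbalanced payload cannot be re-parsed
```

Returning the fragment as real child elements of a wrapper avoids the second step entirely, which is the reason for preferring elements when the results are known to be well-balanced.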




Re: Returning fragments from TEI documents?

Jonathan Robie
In reply to this post by Hugh Cayless-2
Hugh Cayless-2 wrote

> We're looking at defining an API for text retrieval, one of the functions
> of which would be to return document chunks (think chapter, section, line,
> etc.). These obviously would often be well-formed, especially at larger
> sizes, but non-verse lines especially might not be. They'd be balanced,
> but not well-formed without a wrapper element.
>
> One of our collaborators asked me if there was such a standard wrapper in
> TEI, and I said "No, but maybe there should be." The question, re-stated a
> bit is this:
>
> Is there any agreed on standard for representing TEI fragments in such a
> way that they're clearly TEI, but not misrepresented as complete? Wrapping
> them in an <ab> for example, risks making the assertion that the <ab> is
> part of the original document or that the fragment is complete.

Exactly.

Jonathan




Re: Returning fragments from TEI documents?

Jonathan Robie
In reply to this post by C. M. Sperberg-McQueen
C. M. Sperberg-McQueen wrote
> It may safely be assumed that he is reasonably conversant with the concept
> of well-formed XML and ways to wrestle non-well-formed data into an XML
> context.  He asked about a TEI wrapper element because that was the piece
> of information he needed.)

Thanks - that's exactly the case.

Apologies for stepping away and letting this bubble without responding; I
thought I had the answer and stopped paying attention. Oops!

Jonathan


