corpus query language

corpus query language

Eduard Drenth
Dear all,

Why should one, according to you, consider using CQL on a TEI corpus (with some form of linguistic markup) when XQuery/XPath is available?

Eduard Drenth, Software Architekt

[hidden email]

Doelestrjitte 8
8911 DX  Ljouwert
+31 58 234 30 47
+31 62 094 34 28 (privé)

gpg: https://sks-keyservers.net/pks/lookup?op=get&search=0x065EF82A1E02CC43

Re: corpus query language

C. M. Sperberg-McQueen
> On Oct 6, 2017, at 1:06 PM, Eduard Drenth <[hidden email]> wrote:
>
> Dear all,
>
> Why should one, according to you, consider using CQL on a TEI corpus (with some form of linguistic markup) when XQuery/XPath is available?
>
> Eduard Drenth, Software Architekt

Several reasons to consider it occur to me.  There may be others.

- The intended users of the corpus may be familiar with CQL but not
  with XPath or XQuery.

- CQL may have more convenient ways to refer to linguistic entities
  like word forms, lemmas, morphological properties, and so on than
  XPath or XQuery formulated against the form of TEI used in the
  corpus.  By “more convenient” here, I mean “more convenient for
  the intended users of the corpus”, which in turn generally means
  “currently more familiar to the intended users of the corpus”.

- CQL may have more convenient ways to search for patterns occurring
  among siblings (this kind of word, followed immediately by that kind
  of word, followed by zero or more words with this third set of
  properties).

- The intended users of the corpus may be familiar with both CQL and
  XQuery but not with the markup structures used in your corpus (and
  not be interested in learning).

- If you have a CQL engine not based on an XQuery engine, the CQL
  engine might take fewer resources or be preferable in some other
  way.  

Note that these are reasons to consider using CQL, not reasons for
using CQL.  Each of them, if true, would probably count as a reason to
use CQL.  I don't know whether any of them are true: I don't know the intended
users, and 'learn more about CQL' has been on my to-do list for longer
than I'd like to admit, but has not yet been crossed off.

In the meantime, everything I do with language corpora I do with
XQuery and XSLT.  (That is, I have never gotten around to a serious
consideration of whether I should use CQL or something else instead of
XQuery.  So, you're ahead of me there!  If I discovered that the
target audience for a corpus was happy with CQL, I would be seriously
tempted to try implementing an interpreter for a subset of CQL by
translating it into XQuery, but that may be an eccentric suggestion.)
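
The idea of translating a CQL subset into an XML query can be sketched roughly as follows. This is a minimal illustration in Python that only builds an XPath string. It assumes (none of this comes from the thread) that every token is a `<w>` element with no namespace, that the words of a match are siblings, and that values are matched exactly rather than as regular expressions. The names `cql_to_xpath` and `token_test` are hypothetical.

```python
import re

# One token of the toy CQL subset: either [attr="value"] or a bare "word".
TOKEN = re.compile(r'\[\s*(\w+)\s*=\s*"([^"]*)"\s*\]|"([^"]*)"')


def token_test(match):
    """Turn one parsed CQL token into an XPath predicate on a <w> element."""
    attr, value, word = match.groups()
    if attr is not None:
        return f"[@{attr}='{value}']"
    return f"[.='{word}']"


def cql_to_xpath(cql):
    """Return an XPath selecting the first <w> of every match of the token sequence."""
    tests = [token_test(m) for m in TOKEN.finditer(cql)]
    if not tests:
        raise ValueError("no CQL tokens found")
    # Work backwards from the last token, nesting each later token inside
    # a following-sibling predicate of the token before it.
    nested = tests[-1]
    for test in reversed(tests[:-1]):
        nested = f"{test}[following-sibling::w[1]{nested}]"
    return "//w" + nested


print(cql_to_xpath('[pos="j"] [pos="j"] "and" [pos="j"]'))
```

The sketch silently skips input it does not recognise and does not escape apostrophes in values, so a real translator would need a proper parser and a decision about how CQL regular expressions map onto the target language.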

The first several items are relevant only if the user interface is one
in which the user types in queries in the chosen syntax.  If the role
of CQL or XQuery is to provide a back-end implementation
of a Web-based search interface (other than one in which the user
types in arbitrary queries in the chosen syntax), then I think only the
final item in my list applies.  (That is, if the search interface is a
set of Web forms in which the user types selected values, then the
users’ familiarity with, or the relative convenience of, the two query
languages is pretty much irrelevant.  The only relevant concerns
are then resource consumption, and the most important resources
are likely to be programmer time and programmer ability to
understand and debug queries that are going wrong.)

I look forward to others’ responses to your query.


********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
[hidden email]
http://www.blackmesatech.com
********************************************

Re: corpus query language

Martin Mueller
That is a masterly set of reasons, to which I add the following considerations from a person like me who doesn’t know a whole lot about programming and has a very hard time wrapping his brain around XSLT and XQuery, but who has found the introduction to the corpus query language of the old Stuttgart workbench a model of clarity and has with some success used the BlackLab implementation of CQL on the basis of a Lucene index.

There is an experimental implementation of it at <https://classify.at.northwestern.edu/corpussearch/pubsearch/>. The interface is minimal, but it lets you look for phrases like “handsome, clever, and rich” by entering a search string like [pos="j"][pos="j"]"and"[pos="j"]. Executed against a set of early plays, it retrieves strings like “{The Scottish grows} dull, Frostie and wayward.”

I could imagine teaching an English colleague with no interest in computing how to carry out such a search. It would take fifteen minutes. There is no way such a colleague could learn how to use XQuery in that period. With a proper interface you can hide a lot of complexity, but always at the cost of flexibility. The CQL query syntax seems quite straightforward and paratactic, so you can present users with relatively raw command strings.

I don’t know about performance, but I could imagine that a properly indexed CQL implementation would be faster than XQuery.


Re: corpus query language

Lou Burnard-6
In reply to this post by C. M. Sperberg-McQueen

For those curious about CQL (Corpus Query Language, as invented in Stuttgart many years ago, taken up by Sketch Engine, and now being reimplemented at the Institute for Dutch Language INL) I found a useful site at http://inl.github.io/BlackLab/corpus-query-language.html

It's not clear how well or how consistently the three implementations of the language implement searching of XML structures, just as it's not clear how easily XML-based query languages allow you to search for tokens or strings independently of XML structures. For example, using XQuery to find a sequence of words X Y across an element boundary (with X at the end of one sentence and Y at the start of another, say) is not so easy. Not impossible, but not easy.  Likewise, finding a word X inside an element Y with a particular value for its attribute Z is not easy in any implementation of CQL that I have looked at (a notable exception being the CQL we invented for Xaira, but that's another story).
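
To make the element-boundary case concrete, here is a small sketch in Python (not XQuery) of the token-stream view that CQL-style engines take: the words are flattened into one list in document order, so the sentence boundary no longer gets in the way. The sample markup, with un-namespaced `<s>` and `<w>` elements, and the function name `find_bigram` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two sentences, each word wrapped in <w>.
SAMPLE = """<p>
  <s><w>It</w> <w>was</w> <w>late</w></s>
  <s><w>Nobody</w> <w>came</w></s>
</p>"""


def find_bigram(root, first, second):
    """Return positions in the flat word stream where `first` is directly followed by `second`."""
    # iter() walks the tree in document order, so the <s> boundaries
    # disappear from the resulting list of words.
    words = [w.text for w in root.iter("w")]
    return [i for i in range(len(words) - 1)
            if words[i] == first and words[i + 1] == second]


root = ET.fromstring(SAMPLE)
print(find_bigram(root, "late", "Nobody"))
```

The price of this flattening is the one described above: once the stream is flat, asking which element a word sits in requires keeping extra information alongside each token.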

And then, if you care about such things, although all three current CQL implementations share a common core of features, each of them also has non-overlapping extensions, which may mean that you are effectively tied into using the CQL engine which does what you want, even if it's not the one you'd prefer or the one you can afford. Cue boring talk on the need for standardisation.

I bet some bright spark at eXist or BaseX is already working on a CQL translator; if not, perhaps they should be...



Re: corpus query language

Toma Tasovac-3
On 7 Oct 2017, at 10:36, Lou Burnard <[hidden email]> wrote:

> I bet some bright spark at eXist or baseX is already working on a CQL translator; if not, perhaps they should be...

I don’t know about the bright spark part, but we’ve already created a CQL module for eXist-db. It parses CQL expressions such as [lemma='table' & ana='V'] or 'confus.*' []{2} 'by' into something that eXist can actually deal with.
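
For readers who have not seen such expressions taken apart, the following Python sketch shows one way the two examples above could be parsed into a list of per-token constraints. It is not the eXist-db module's implementation, which is not shown in this thread. `TokenSpec` and `parse_cql` are hypothetical names, and the grammar covered is only what the two examples need: bracketed tokens, bare quoted words, and `{n}` or `{n,m}` repetition.

```python
import re
from dataclasses import dataclass


@dataclass
class TokenSpec:
    conditions: dict    # attribute name -> regular expression; empty means "any token"
    min_count: int = 1
    max_count: int = 1


# A bracketed token or a bare quoted word.
TOKEN = re.compile(r"\[(?P<body>[^\]]*)\]|'(?P<word>[^']*)'")
# An optional repetition suffix such as {2} or {1,3}.
REPEAT = re.compile(r"\{(\d+)(?:,(\d+))?\}")
# One attribute='regex' condition inside the brackets.
CONDITION = re.compile(r"(\w+)\s*=\s*'([^']*)'")


def parse_cql(query):
    """Parse a small CQL subset into a list of TokenSpec objects."""
    specs, pos = [], 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(query, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}: {query[pos:]!r}")
        if m.group("word") is not None:
            # A bare quoted string is shorthand for a condition on the word form.
            conditions = {"word": m.group("word")}
        else:
            # All conditions found are treated as a conjunction (the & case only).
            conditions = dict(CONDITION.findall(m.group("body")))
        pos = m.end()
        spec = TokenSpec(conditions)
        r = REPEAT.match(query, pos)
        if r:
            spec.min_count = int(r.group(1))
            spec.max_count = int(r.group(2) or r.group(1))
            pos = r.end()
        specs.append(spec)
    return specs


for query in ("[lemma='table' & ana='V']", "'confus.*' []{2} 'by'"):
    print(parse_cql(query))
```

The sketch ignores the operator between conditions, so `|` would wrongly be read as `&`. Anything beyond a demonstration would need a real grammar.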


All best,
Toma

--
Belgrade Center for Digital Humanities



Re: corpus query language

Eduard Drenth

Thanks for all your input! I'm going to try to understand it all, recap, and see what all this means for strategy and architecture at the Fryske Akademy.


There is also this: https://meertensinstituut.github.io/mtas/search_cql.html


bye,

Eduard Drenth, Software Architekt

[hidden email]

Doelestrjitte 8
8911 DX  Ljouwert
+31 58 234 30 47
+31 62 094 34 28 (privé)

gpg: https://sks-keyservers.net/pks/lookup?op=get&search=0x065EF82A1E02CC43





Re: corpus query language

Serge Heiden-2
In reply to this post by Eduard Drenth
Hi Eduard,

On 06/10/2017 at 21:06, Eduard Drenth wrote:
> Why should one, according to you, consider using CQL on a TEI corpus (with some form of linguistic markup) when XQuery/XPath is available?

It may be easier to answer if you characterize the users who are expected to query the corpus (their scientific interests in digital texts and the methodology they use, or their digital literacy) and what kinds of queries are planned.

The original Stuttgart CQL (the query language of the CQP search engine) was first designed for linguists to query sequences of words with properties in text corpora annotated at the word level. Later, a further but limited development added some capabilities for finding words inside structures with properties (this predates the XML age, so the syntax is awkward and the capabilities are limited).
All subsequent implementations of CQL, such as Manatee, the first open-source implementation, used in the Sketch Engine, implement different parts of the original search engine syntax and were developed in the XML era. Their features for querying tree nodes may be better than the original's, but you should check which CQL syntactic features are actually available in each implementation, and how each performs.

The goal of the CQL query language is primarily to concisely express a search for varying sequences of words qualified by properties (word form, pos, lemma...), expressed with character-sequence values, possibly contained in some structure. The CQL syntax is thus organized as two nested levels of regular expressions: a first level over the character sequences of word property values, and a second level over the word sequences occurring in texts. Regular expressions are really nice for expressing varying sequence patterns (of words and of characters), and words and word sequences are first-class citizens to search for in digital texts for content or style analysis (to build word lists, KWIC displays, co-occurring word lists, bag-of-words vector models, etc.).
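
The two nested levels can be illustrated with a short Python sketch. It is not how CQP or any other engine is implemented. The token representation (one dict of properties per word), the pattern format (tuples of conditions, minimum and maximum repetitions) and all function names are assumptions chosen for the example. Level 1 applies an ordinary regular expression to each property value. Level 2 matches a sequence of such token tests, with bounded repetition, against the word stream.

```python
import re


def token_matches(token, conditions):
    """Level 1: each condition is a regex that must match the whole property value."""
    return all(re.fullmatch(rx, token.get(attr, "")) for attr, rx in conditions.items())


def match_at(tokens, start, pattern):
    """Level 2: match the token pattern at `start`; return the end index or None."""
    if not pattern:
        return start
    (conditions, lo, hi), rest = pattern[0], pattern[1:]
    pos = start
    # Consume the mandatory repetitions first.
    for _ in range(lo):
        if pos >= len(tokens) or not token_matches(tokens[pos], conditions):
            return None
        pos += 1
    # Then try the optional repetitions, shortest first, backtracking on failure.
    for _ in range(hi - lo + 1):
        end = match_at(tokens, pos, rest)
        if end is not None:
            return end
        if pos >= len(tokens) or not token_matches(tokens[pos], conditions):
            return None
        pos += 1
    return None


def find_all(tokens, pattern):
    """Return every non-empty token span that matches the pattern."""
    hits = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, pattern)
        if end is not None and end > start:
            hits.append(tokens[start:end])
    return hits


# Hypothetical annotated text: "a very old wooden table".
TOKENS = [
    {"word": "a", "pos": "DT"},
    {"word": "very", "pos": "RB"},
    {"word": "old", "pos": "JJ"},
    {"word": "wooden", "pos": "JJ"},
    {"word": "table", "pos": "NN"},
]

# Roughly [pos="JJ.*"] []{0,1} [pos="NN.*"] in CQL notation.
PATTERN = [({"pos": "JJ.*"}, 1, 1), ({}, 0, 1), ({"pos": "NN.*"}, 1, 1)]

for hit in find_all(TOKENS, PATTERN):
    print(" ".join(t["word"] for t in hit))
```

A real engine would work from an index rather than scanning every start position, but the division of labour between the two levels is the same.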

The goal of the XQuery language is to express general queries on a forest of trees. Some leaves or intermediate nodes could represent words, but I don't know of any XQuery syntax that expresses the nesting of two levels of regular expressions, on the character sequences of word properties combined with the word sequences of texts. So for that aspect CQL may be interesting. For querying tree nodes in general, CQL cannot compete with XQuery, which is far better in syntax and performance.

We have been using the original CQP search engine in the TXM platform since its beginning in 2008, as a core component of an implementation of the textometry text analysis methodology, and it is really appreciated by end users (see http://textometrie.ens-lyon.fr/?lang=en). TXM basically crunches corpora of XML-TEI-encoded texts to feed CQP, on which distant reading tools are built, and to build advanced HTML-CSS-JavaScript text editions for close reading tools.

Query systems for various kinds of text annotation have been requested since then, and our strategy is currently to integrate a new dedicated search engine component for each annotation system, developed as open source by third parties. For example, there is a TXM extension called TIGERSearch which allows the user to import syntactic annotations on texts (basically directed acyclic graphs) represented in a specific XML format, and to query those annotations with the TIGERSearch engine, which has its own query syntax dedicated to querying syntactic trees (the TIGERSearch query syntax looks like the CQL syntax).
The user can then choose the best query syntax for each annotation type when needed, TIGERSearch or CQL, and possibly combine their results through constraints on words (the two engines share the same stream of word tokens).

As examples of further developments, we may have to integrate a triple store search engine (semantic annotations), the ANNIS search engine (linguistic annotations) or possibly an XQuery search engine as a general back-end solution to query all XML-based transport representations at once. For close reading components, we may also have to integrate an XQuery search engine to host text publication frameworks based on it, such as TEI Publisher <https://teipublisher.com> or MaX <http://www.unicaen.fr/recherche/mrsh/document_numerique/outils/max>.

Best,
Serge

-- 
Dr. Serge Heiden, [hidden email], http://textometrie.ens-lyon.fr
ENS de Lyon - IHRIM UMR5317
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883