<c> tag

Ciarán Ó Duibhín
Hi,
 
I'm looking for a tag to mark a character so that it can be ignored or not, depending on what processing is happening, e.g. the character would be ignored when making a word-list, but not ignored when displaying text.  I'm considering the <c></c> tag — "nonlexical character" — for this purpose.  (Most of the characters to be so marked would be alphabetic, and only a small proportion of their occurrences would be marked in this way.)
 
1. What DTD should I use in order to have <c> available?  It doesn't seem to come with teixlite.dtd or with tei_corpus.dtd.
 
2. It seems from the documentation that <c> can be used only within a level of segmentation of <s> or below.  I'd like to use it directly under <body>, <div> or <p>.
 
3. Is <c> recognised by Xaira, and implemented (i.e. ignored in indexing)?
 
Thanks for your ideas,
Ciarán Ó Duibhín.

Re: <c> tag

Martin Mueller

You can use <c> or <w> within <p> or similar elements. You can't use it in <div> or higher-level elements because those can't bear text directly. I believe that <c> is present in TEI Simple, or whatever it is called now.

It's not clear to me which characters you would want to make direct children of <body> or <div>.

 


Re: <c> tag

Ciarán Ó Duibhín

Thanks to Martin and to Ioana.
 
1° For the DTD, Roma is a good solution.
 
2° My fault for not seeing <p> among the parents of <c> on p 900 of the P5 guidelines.  I agree that it is not sensible to have <c> as a child of <body> or <div> in a finished document.  I am thinking rather of the process of preparing a text, where marking these "non-lexical characters" is something I would do at an early stage, before the text is marked up into divisions of any kind.  This would leave the text non-conformant until the divisions into <p> or smaller were later marked.
But in fact, I think I will continue to develop my texts in a terse non-XML markup (where for example I mark a non-lexical character by placing a ^ before it); my applications work on this non-TEI text.  But I will also think about making a program to convert this markup to TEI.
 
3° It would be reassuring to confirm that <c> could work as I intend, by trying it out with a program for creating indexes or wordlists or concordances from TEI texts — if not Xaira, can anyone suggest such a program?
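Coming back to 2° for a moment: the core of such a caret-to-<c> conversion can be sketched in a few lines of XSLT 2.0 using unparsed-text() and xsl:analyze-string. This is only an illustration, assuming the caret always immediately precedes a single non-lexical character; the input filename is a placeholder and the whole text is wrapped in one <p> for simplicity:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://www.tei-c.org/ns/1.0"
  version="2.0">

  <!-- Run with an initial named template, e.g. Saxon's -it:main option. -->
  <xsl:template name="main">
    <!-- 'input.txt' stands in for the terse non-XML source. -->
    <xsl:variable name="raw" select="unparsed-text('input.txt')"/>
    <p>
      <!-- Every '^X' becomes <c>X</c>; everything else is copied through as text. -->
      <xsl:analyze-string select="$raw" regex="\^(.)">
        <xsl:matching-substring>
          <c><xsl:value-of select="regex-group(1)"/></c>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </p>
  </xsl:template>

</xsl:stylesheet>

So "an b^hean" would come out as <p>an b<c>h</c>ean</p>; the later division into <div> and <p> elements would of course still have to be added.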

Re: <c> tag

Ciarán Ó Duibhín
In reply to this post by Ciarán Ó Duibhín

May I repeat this request, hopefully more clearly.
 
I would like to locate any program (preferably for Windows) for making indexes, word lists, or concordances from TEI text, and which will interpret the <c> tag in the following way, which I hope is in accordance with its description as "non-lexical character":  the content of the <c> tag is to be dropped in extracting tokens, but is to be included in displaying segments of text. 
 
For example, the text "an b<c>h</c>ean" should yield tokens "an" and "bean", but should be displayed as "an bhean".
 

Re: <c> tag

Martin Holmes
Hi Ciarán,

I think your best approach would be a simple XSLT transformation. What
kind of output format do you want? What will you be using to display the
results?

Cheers,
Martin


Re: <c> tag

Syd Bauman-10
Ciarán --

I think Martin is basically right. I don't know of any software out
in the world, let alone Windows-based software, that will on its own
interpret the <c> element as you want. (Although perhaps some could
be configured to do so.) But running your data through an XSLT
pre-processor would likely yield quite satisfactory results.

Here is an example of one:

--------- begin program c_is_for_Ciarán.xslt ---------
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  exclude-result-prefixes="#all"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0"
  version="3.0">
 
  <!--
    c_is_for_Ciarán.xslt
    Copyleft 2018 Syd Bauman and the Women Writers Project, few rights reserved.
    Feel free to copy, modify, run, use this pgm pretty much however you want,
    just please leave attribution to me somewhere, and the result must be copyleft.
   
    Demo program to read in a TEI P5 document, and write out 2 similar documents:
     - one is a copy *except* that <c> elements have been summarily dropped
     - one is a copy *except* that <c> *tags* have been dropped, but the content
       has been retained.
    See the thread "<c> tag" on TEI-L that started 2018-02-24T15:22Z.
  -->

  <!-- Explicitly state we're writing out XML: -->
  <xsl:output method="xml"/>
  <!-- Anything not matched is just copied over: -->
  <!-- (Why can't I get this to work with <xsl:mode on-no-match="shallow-copy">?) -->
  <xsl:template match="@*|node()" mode="#all">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>
  <!-- Get name of input file w/o ending ".xml": -->
  <xsl:param name="baseName" select="substring( document-uri(/), 1, string-length(document-uri(/)) - 4 )"/>
  <!-- (BTW, if input filename does not end in ".xml" this pgm may not work) -->

  <!-- Match the document root, and ... -->
  <xsl:template match="/">
    <!-- ... generate both output URIs -->
    <xsl:variable name="name4token_extraction" select="concat( $baseName, '_extractTokens.xml')"/>
    <xsl:variable name="name4display" select="concat( $baseName, '_display.xml')"/>
    <!-- putting output into file for token extraction ... -->
    <xsl:result-document href="{$name4token_extraction}">
      <!-- ... process all child nodes for token extraction -->
      <xsl:apply-templates select="node()" mode="extractTokens"/>
    </xsl:result-document>
    <!-- putting output into file for display ... -->
    <xsl:result-document href="{$name4display}">
      <!-- ... process all child nodes for display -->
      <xsl:apply-templates select="node()" mode="display"/>
    </xsl:result-document>
  </xsl:template>

  <!-- Remember, processing of any node other than the document root or those listed below
       just results in a copy of said node. Thus all we do here is copy <c> differently,
       and the result is two output files that are the same except where <c>s occurred in
       the input. -->
  <!-- For token extraction, drop the entire <c> element. -->
  <xsl:template match="c" mode="extractTokens"/>
  <!-- For display keep the *content* of the <c>, but drop the tags. -->
  <xsl:template match="c" mode="display">
    <xsl:apply-templates select="node()"/>
  </xsl:template>
 
</xsl:stylesheet>
--------- end program c_is_for_Ciarán.xslt ---------

BTW, I realize this is not the XSLT list, but if someone could show
me how to do the same thing using the new <xsl:mode
on-no-match="shallow-copy"/>, I'd appreciate it.


Re: <c> tag

Peter Flynn-8
In reply to this post by Ciarán Ó Duibhín

I appear to have missed your first post on this, sorry.

Can you please clarify "extracting tokens" vs "displaying segments of
text" a little more?

In your example, if the normalized character data content is tokenized
on a space, it yields the two tokens "an" and "bhean".

In XSLT2, it is fairly trivial to pre-parse the original content before
normalization in order to omit the c element, yielding "an" and "bean".

But there is nothing to suggest that "an b<c>h</c>ean" is itself
contained in such a way as to create a token "an bhean" (for example, if
it was <p>an b<c>h</c>ean</p> rather than a paragraph containing a much
longer phrase of which "an b<c>h</c>ean" was merely a part).

What you appear to want to do is easy with XSLT2 but it would be useful
to see a much larger example if you can post one (or send it privately).

///Peter
--
Peter Flynn | Human Factors Research Group | School of Applied
Psychology | 🏫 University College Cork | 🇮🇪 Ireland | ☎ +353 21 490
2609 | ✉ [hidden email] | 🌍 research.ucc.ie/A011/[hidden email]

Re: <c> tag

Ciarán Ó Duibhín
In reply to this post by Ciarán Ó Duibhín

Grateful thanks to Peter, Syd and Martin for taking the trouble to answer, but I seem to have given everyone the impression that I want to transform a TEI text containing <c> tags into another text, or even two other texts.  That wasn't what I had in mind at all.
 
What I envisage is inputting a text containing <c> tags to a TEI-aware indexing or concordancing program.  Xaira is a program of this type, but, when it is extracting indexing terms (tokenising), I haven't been able to make it handle the <c> tags in the way which I might expect "non-lexical characters" to be handled, even when it is informed that the text is TEI-conformant, not just XML-conformant.
 
Briefly, a concordancing program (for example), written in a programming language, will read a text, extracting each token (dropping non-lexical characters within it) and noting the token's offset within the text, and putting a record into a file, which is then sorted alphabetically on the tokens.  This sorted file is then read back, and for each record, we display the token (still without non-lexical characters) and, going back to the text, display a segment from around the offset (this time retaining the non-lexical characters).  The output is a concordance, not another XML version of the text.
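By way of illustration, the two parallel forms of each token (the indexing form with the <c> content dropped, the display form with it kept) can be pulled from the same TEI source with a little XSLT 2.0. The sketch below is not a concordancer (offsets, sorting and context display are left out); it assumes tokens are separated by whitespace, that <c> only ever wraps characters inside a word, and that the running text sits in <p> elements:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0"
  version="2.0">

  <xsl:output method="text"/>

  <!-- Indexing form: drop the content of <c>. -->
  <xsl:template match="c" mode="index"/>
  <!-- Display form: keep the content of <c>. -->
  <xsl:template match="c" mode="display">
    <xsl:value-of select="."/>
  </xsl:template>

  <xsl:template match="/">
    <xsl:for-each select="//text/body//p">
      <!-- Flatten the paragraph twice, once per mode. -->
      <xsl:variable name="indexForm">
        <xsl:apply-templates select="." mode="index"/>
      </xsl:variable>
      <xsl:variable name="displayForm">
        <xsl:apply-templates select="." mode="display"/>
      </xsl:variable>
      <!-- Since <c> never contains whitespace, both strings split into the same
           number of chunks, so position i of one lines up with position i of the other. -->
      <xsl:variable name="tokens" select="tokenize(normalize-space($indexForm), ' ')"/>
      <xsl:variable name="displays" select="tokenize(normalize-space($displayForm), ' ')"/>
      <xsl:for-each select="1 to count($tokens)">
        <xsl:variable name="i" select="."/>
        <!-- One record per token: indexing form, tab, display form. -->
        <xsl:value-of select="concat($tokens[$i], '&#9;', $displays[$i], '&#10;')"/>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>

For <p>an b<c>h</c>ean</p> this prints two tab-separated records, pairing "an" with "an" and "bean" with "bhean", which is the pairing a concordancer would need.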
 
Concordance programs, which have been around for many decades, routinely handle non-lexical characters, which they call "padding".  The OCP manual (1979) states concisely "padding letters will be printed but otherwise ignored".  With these concordance programs, for any character declared as padding (a hyphen, say), *every* occurrence of the character is so treated.
 
With TEI markup, we can declare each instance of the character individually to be non-lexical or not, which is something I need to be able to do.  But few concordance programs can handle TEI markup, other than by stripping out the tags altogether.
 
A TEI-aware concordance program would do "the right thing" with every tag, including <c>.  If "non-lexical character" means anything, the right thing with <c> must be to omit or include the content depending on the operation.  Tokenisation demands omission, display of context demands inclusion, at different points in the concordancing or indexing process. 
 
The background to all this is that I have texts in non-TEI markup, and programs which index and retrieve them, an essential feature of which is to take account of non-lexical characters.  I was considering writing a conversion from my own markup to TEI, with the object of making the texts more widely usable.  But unless there is a TEI construct for non-lexical characters, and off-the-shelf TEI-aware programs for indexing, concording, etc. that implement it, not only outside <w> but also within <w>, there would be little point in such a markup conversion.
 
 

Re: <c> tag

Peter Flynn-8
On 04/03/18 14:59, Ciarán Ó Duibhín wrote:
> Grateful thanks to Peter, Syd and Martin for taking the trouble to
> answer, but I seem to have given everyone the impression that I want
> to transform a TEI text containing <c> tags into another text, or
> even two other texts.  That wasn't what I had in mind at all.

I suspected that might be the case, hence my cagey wording.

WARNING: you need to use a fixed-width font like Courier to read the
examples I give below.

> What I envisage is inputting a text containing <c> tags

Can we be clear; do you mean a valid (or at least well-formed) TEI XML
document which allows character-level linguistic markup? Or do you mean
just a chunk of text with pointy brackets around the letter 'c'?

> to a TEI-aware indexing or concordancing program.  Xaira is a program
> of this type, but, when it is extracting indexing terms (tokenising),
OK, another point of clarity needed. "Tokenising" in that sense may or
may not be the same thing as the operation performed by the XSLT2
function tokenize(). The XSLT2 function returns a sequence of atomic
objects which were identified because they were separated by the
specified delimiter. So tokenize($string,' ') when $string is the
sentence "All is discovered. Flee at once!" will return six words,
keeping the case and the punctuation. This may not be what Xaira means.
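In XSLT terms the call being described is just (illustrative only):

  <xsl:sequence select="tokenize('All is discovered. Flee at once!', ' ')"/>
  <!-- yields the six strings "All", "is", "discovered.", "Flee", "at", "once!" -->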

> I haven't been able to make it handle the <c> tags in the way which I
> might expect "non-lexical characters" to be handled, even when it is
> informed that the text is TEI-conformant, not just XML-conformant.
To be frank, I'd give up on what appears now to be an unsupported
utility if it isn't possible to do what you want. You just need to
define sufficiently for (eg) XSLT2 what you want to do.

> Briefly, a concordancing program (for example), written in a programming
> language, will read a text, extracting each token

"Token" being what in this case? A word?

> (dropping non-lexical characters within it)

OK, those identified by the c element type, or a list of characters to skip?

> and noting the token's offset within the text,

Ah. That's an entirely different <insert your own cultural meme: mine is
a kettle of fish or a pair of sleeves>. Is the text normalised (all
multiple spaces and newlines converted to single spaces) first? Is the
presence of preceding non-lexical characters to be included in the
offset or not (presumably yes, otherwise it will never align)? And is
the additional space occupied by the TEI markup itself also to be taken
into account? Does the offset re-zero itself at points in the document
(eg start of new sections)?

> and putting a record into a file,

What kind of record is this? A single line of unmarked characters? What
determines the start and end of a record?

> which is then sorted alphabetically on the tokens.

You mean the *content* of the record (presumably tokens with their
associated offsets) is sorted? Or the records themselves (on what)?

> This sorted file is then read back, and for each record, we display
> the token (still without non-lexical characters)

The implication here is that 1 record = 1 token = 1 word. Is that
correct? In other words, for my earlier example, sorted:

1,All
25,at
8,discovered.
20,Flee
5,is
28,once!

> and, going back to the text, display a segment from around the offset
> (this time retaining the non-lexical characters). The output is a
> concordance, not another XML version of the text.

OK, now we are getting somewhere. This is called KWIC format (KeyWord In
Context), and was (is?) the standard output of text searches in the days
of unmarked text, and into SGML days (in the CELT project we used PAT
for searching SGML TEI P2; it was [a] blindingly fast, and [b] returned
KWIC). In the above example, with a span of 20 characters either side,
we would get

1. All:        ...ng is the sentence "All is discovered. Fl...
2. at :        ...is discovered. Flee at once!" will return...
3. discovered: ...he sentence "All is discovered. Flee at o...
etc.

> Concordance programs, which have been around for many decades, routinely
> handle non-lexical characters, which they call "padding".

Normally you would define a list of these: comma, period, semicolon,
etc. I think what confused the issue was that you were giving an
alphabetic letter in the c element.

> With TEI markup, we can declare each instance of the character
> individually to be non-lexical or not, which is something I need to be
> able to do.  But few concordance programs can handle TEI markup, other
> than by stripping out the tags altogether.

Right. But it doesn't sound terribly difficult, and XSLT2 is IMNSHO
ideal for the purpose.

> A TEI-aware concordance program would do "the right thing" with every
> tag, including <c>.

I suspect the definition of "the right thing" is different for every TEI
project of any significant magnitude. The CELT project has gazillions of
instances of the character-level element types used in linguistic markup
combined with the standard TEI features for editorial intervention,
semantic correction, lemmatisation and parallel readings, and physical
aspects like the rest of the name has been gnawed by rats. And every
project has its own list of "weird stuff", like we need lg within head
because some titles include fragments of poetry.

> If "non-lexical character" means anything, the right thing with <c>
> must be to omit or include the content depending on the operation.
> Tokenisation demands omission, display of context demands inclusion,
> at different points in the concordancing or indexing process.

Yep. All doable once "the right thing" has been defined.

> The background to all this is that I have texts in non-TEI markup,
> and programs which index and retrieve them, an essential feature of
> which is to take account of non-lexical characters.

I haven't had to do this at corpus level for many years; I would be
surprised if someone hasn't already done this in XSLT2 for TEI.

> I was considering writing a conversion from my own markup to TEI,
> with the object of making the texts more widely usable.

That would be a very generous and public-spirited action.

> But unless there is a TEI construct for non-lexical characters, and
> off-the-shelf TEI-aware programs for indexing, concording, etc. that
> implement it, not only outside <w> but also within <w>, there would
> be little point in such a markup conversion.

Apart from Xaira I don't know of anything off-the-shelf. But as Syd
implied, handling the text is not the problem; the problem is defining
what needs to be done for every element type in the TEI schema/DTD that
you are using.

///Peter

Why can catDesc take date children, but not have att.datable attributes?

Paterson, Duncan
Dear all, 

my apologies if this question has come up before, but I couldn't find it in the mailing list archives or GitHub issues.
Here is my situation:
I'm encoding the bureaucratic hierarchy for the official titles that occur in a set of historical documents.
This org chart appears in a taxonomy in the header. Now in 1392 an office was renamed. I would like to signal this change by having two catDesc elements in the same category (nothing else changes about the office, or its children):
element category {
element catDesc {attribute notAfter {'1392'}, 'oldName'}
element catDesc {attribute notBefore {'1392'}, 'newName'}
}

But classDecl doesn't take any att.datable attributes (neither does category). My questions therefore: a) why not, and b) how is the following valid category preferable?

element category {
element catDesc {element date { attribute notAfter {'1392'}}, 'oldName'}
element catDesc {element date { attribute notBefore {'1392'}}, 'newName'}
}
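Serialized as TEI XML, that second (valid) pattern would look something like this, with the placeholder names and dates from above:

<category>
  <catDesc><date notAfter="1392"/>oldName</catDesc>
  <catDesc><date notBefore="1392"/>newName</catDesc>
</category>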

Re: <c> tag

Serge Heiden-2
In reply to this post by Ciarán Ó Duibhín
Hi Ciarán,

TXM is a TEI-aware indexing and concordancing program: https://wiki.tei-c.org/index.php/TXM. Developed in the continuity of Xaira (Lou Burnard was a partner in the founding project, with other software builders), it provides what you are looking for; not by declaring tag or character categories, but by tuning how the information available in the sources is transferred into its different components (indexing, editing, etc.).

Various TEI usage flavors are processed through XSLT stylesheet adaptors.
One of TXM's import modules, the XTZ+CSV import module, interprets some TEI tags in the sources directly (<w> but not <c>); see http://textometrie.ens-lyon.fr/html/doc/manual/0.7.9/fr/manual32.xhtml#toc136 (in French).
In that module, source importation is phased into several steps, and those steps are designed to help you get what you want:
- the "3-posttok" step will help you tune your tokens for indexing if necessary, for example to get "an" followed by "bean";
- the "4-edition" step will help you display your tokens differently in text editions, for example to read "an bhean" or "an b[h]ean" etc., in relation to the "an" and "bean" tokens used for indexing or concordancing (when you get back to the full text from a concordance of the "bean" pivot, the whole "b[h]ean" occurrence is highlighted).

As a live example, if you open the following URL in a browser, http://portal.textometrie.org/demo/?command=concordance&path=/GRAAL&query=%22sainte%22, you will get a concordance of the word "sainte" in an edition of the Holy Grail in Old French.
If you then double-click on the first line of the concordance "... lessa traveillier en la    sainte    veraie croiz por delivrer ...", the TXM portal will split the display in two and open the full text at that pivot occurrence (displaying the Holy Grail text edition), which you can read as:
"... lessa
traveillier en la sai[n]te veraie croiz por delivrer
..."

TXM documentation is currently in French, but a pre-release of the TXM manual in English (illustrations are not translated yet) has been available since last week: http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf.

The current version of the TXM manual (in French), documents the XTZ+CSV module: http://textometrie.ens-lyon.fr/files/documentation/Manuel%20de%20TXM%200.7%20FR.pdf.

A sample XML-TEI corpus configured for XTZ+CSV import is available (Leviathan by Thomas Hobbes, 1588-1679): https://sourceforge.net/projects/txm/files/corpora/leviathan.

All the best,
Serge



-- 
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883

Re: Why can catDesc take date children, but not have att.datable attributes?

Martin Holmes
In reply to this post by Paterson, Duncan
Hi Duncan,

I would be inclined to do this at a slightly lower level, but using only
one catDesc:

<catDesc>The original role name of
  <name type="officialTitle" notAfter="1392">Spongrilizer</name>
was later renamed to
  <name type="officialTitle" notBefore="1392">Artisinal Spongrilizer</name>
in 1392.</catDesc>

That would enable you to handle other types of change in a similar way.

Cheers,
martin


on no match shallow copy (was Re: <c> tag)

Elisa Beshero-Bondar-2
In reply to this post by Syd Bauman-10
Hi Syd — For <xsl:mode on-no-match="shallow-copy">, I think you have to transform with version 3.0 rather than 2.0. It has been working for me in oXygen.

Cheers,
Elisa
--
Elisa Beshero-Bondar, PhD 
Director, Center for the Digital Text
Associate Professor of English 
University of Pittsburgh at Greensburg
150 Finoli Drive, Greensburg, PA 15601 USA
E-mail: [hidden email] | Development site: http://newtfire.org

Typeset by hand on my iPad
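For the record, the XSLT 3.0 way is to declare the modes rather than write an identity template. A minimal sketch against Syd's stylesheet above (the mode names are the ones Syd already uses):

  <!-- These three declarations replace the match="@*|node()" identity template. -->
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:mode name="extractTokens" on-no-match="shallow-copy"/>
  <xsl:mode name="display" on-no-match="shallow-copy"/>

The unnamed mode needs declaring too, because the display-mode template for <c> calls <xsl:apply-templates select="node()"/> without a mode attribute and so drops back into the unnamed mode; adding mode="#current" to that instruction would be the alternative.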


Re: <c> tag

Ciarán Ó Duibhín-2
In reply to this post by Serge Heiden-2

Just to sign off this thread before the end of the month...  I was looking for an indexing/retrieval program to act on TEI text.  If I could find one, I would consider converting my corpus of texts to TEI, so they could be used by a range of applications and on a range of platforms, including the web (currently they are only available for retrieval in Windows through a specially-written program).
 
When the thread started, I was concentrating on how to express certain aspects of my mark-up in TEI terms — whence the thread title — and whether there were programs in existence which could be made to interpret that TEI mark-up in the manner I intended.  I've meanwhile been discussing the possibilities on other lists (TXM, CWB).
 
The two main text indexing/retrieval programs are Corpus Workbench (CWB) and Sketch/NoSketch Engine.  These don't act on TEI text, but on verticalized text (i.e. one token and its attributes per line).  I'm close to getting my texts running on Sketch Engine (web platform), which will require their conversion to verticalized text.
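For anyone who has not met it, "vertical" (.vrt) format gives each token one line of tab-separated attributes, with structural markup as XML-like tags between them. A hypothetical sketch of how this thread's running example might be verticalized, with the display form in the first column and a de-padded search form as a second positional attribute (which columns exist, and what they are called, is entirely a matter of corpus design):

<s>
an      an
bhean   bean
</s>

Queries and indexing would then use the second column while concordance display shows the first, which is essentially the behaviour asked of <c> at the start of this thread.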
 
After that, I intend to look again at TXM, which can be used as a front-end to CWB, and accepts TEI text.  This is looking hopeful, although some small problems remain.  There may also be a tool-chain which will deliver TEI text to Sketch Engine, but I'm not looking at that.
 
Overall my conclusion is that there is little point in converting to TEI a corpus of texts intended for indexing/retrieval, as it does not mean they can be easily used with more applications and on more platforms.  If Xaira had continued to be developed, this might have been different.
 
Thanks to all those who helped, both on and off TEI-L, particularly Peter Flynn and Serge Heiden.
 
Ciarán Ó Duibhín

Re: <c> tag

Serge Heiden-2
Hi Ciarán,

On 29/03/2018 at 00:50, Ciarán Ó Duibhín wrote:
...
Overall my conclusion is that there is little point in converting to TEI a corpus of texts intended for indexing/retrieval, as it does not mean they can be easily used with more applications and on more platforms.  If Xaira had continued to be developed, this might have been different.
...

Thank you for the report on the applications.

What would help a lot would be an explicit list of the services or features of Xaira that are useful or necessary for you and that are not found in the software discussed. In a similar way, the relevant features of XML editors for teaching have been discussed and synthesized here:
https://wiki.tei-c.org/index.php/Editor_for_teaching_TEI_-_features

Best,
Serge

--
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883

Re: <c> tag

Martin Mueller

Ciarán’s observation does not square with our experience in the EarlyPrint project.  Consider ‘handsome, clever, and rich’ from the opening sentence of Emma. There may be occasions where you want to identify phrases like that in some other corpus.

 

Well, go to http://blacklab.earlyprint.org/corpussearch/ and enter the search term 

 

[pos="j"][pos="j"]["and"][pos="j"]

 

Within seconds it will retrieve 8,291 matches from texts between 1640 and 1660, my personal favourite being “the Scottish growes dulle, Frostie, and wayward.”

 

I am told by Phil Burns, who knows a lot about these things, that the Blacklab search engine is relatively easy to install. It also supports incremental indexing, which is a big help.  The current user interface is very Spartan, and a user has to know the tag set on which the searches are based. Blacklab is element aware in simple ways that will support many of the uses that come up in literary scholarship. For instance, you can look for adjectives before ‘liberty’ in poetry. And so on.
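Read token by token, that query is a sequence of four single-token constraints (assuming "j" is the adjective tag in the tag set EarlyPrint uses):

[pos="j"]   a token whose part-of-speech tag is "j", i.e. an adjective
[pos="j"]   another adjective
["and"]     a token whose word form is "and"
[pos="j"]   a third adjective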

 

 

 
