TEI Corpus of literary texts: Corpus of Máirtín Ó Cadhain's Literary Idiolect

Hynek Janoušek
I am looking for editorial solutions for encoding literary texts by the Irish-language writer Máirtín Ó Cadhain. I acquired txt documents containing some of the texts from the Royal Irish Academy; the texts will be included in the corpus 'Corpas Chanúint Liteartha Mháirtín Uí Chadhain' (Corpus of Máirtín Ó Cadhain's Literary Idiolect). However, those text documents that were produced using OCR software need to be checked against the printed texts, in this case the first edition of each collection of short stories or novel.
How do I indicate this in the headers of the individual corpus texts? That is to say, how do I state that I acquired the text files from the Royal Irish Academy and that I subsequently checked and corrected them against the printed texts? In addition, I decided that some minor changes had to be made to the original texts of the printed versions (mostly supplying missing punctuation or correcting obvious orthographical errors). Could you advise me on the best ways of conveying these facts about my project? The minor editorial changes were made because the corpus will need to be tagged for parts of speech and lemmatised by NLP tools for Irish. The resulting corpus will serve as the basis for a Dictionary of Máirtín Ó Cadhain's Literary Idiom.

Re: TEI Corpus of literary texts: Corpus of Máirtín Ó Cadhain's Literary Idiolect

Paul Schaffner
One quick and easy way to do what you want is to regard the
OCR-generated text as the 'original' text, to which all further
changes are applied. If you think of the text that way, all the
changes that you describe amount to revisions of that original
text (correcting it against the print, adjusting punctuation, etc.),
which can be expressed in the <revisionDesc> within the header.
If you find it easier, you can break this down into a set of
discrete and dateable changes, and use <listChange> to enumerate
the changes; if that is not feasible, simply pasting your email
(below) into the <revisionDesc> would be a good start.
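
As a rough illustration of the <listChange> approach, the sketch below
enumerates the three steps described in the original question. The dates
and the #HJ pointer are placeholders invented for this example, not facts
about the project, and the pointer assumes a matching xml:id is declared
elsewhere in the header.

```xml
<revisionDesc>
  <listChange>
    <!-- Placeholder dates and a hypothetical #HJ pointer; replace with real values -->
    <change when="2017-10-01" who="#HJ">Text file (OCR-generated) acquired from
      the Royal Irish Academy.</change>
    <change when="2017-10-10" who="#HJ">Checked and corrected against the first
      printed edition.</change>
    <change when="2017-10-15" who="#HJ">Supplied missing punctuation and corrected
      obvious orthographical errors in preparation for part-of-speech tagging
      and lemmatisation.</change>
  </listChange>
</revisionDesc>
```

Many projects list the most recent change first; either order should work
as long as it is applied consistently across the corpus.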

Some of your changes may reflect an underlying editorial
policy. Eventually, once you've codified those, you can (and
probably should) express them in the <encodingDesc> element
(also in the header), which "documents the relationship between an
electronic text and the source or sources from which it was derived,
[either in the form of] paragraphs of text, marked up using the p
element, [or] with more specialized elements, [or both]."
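
A minimal sketch of what such a policy statement might look like, using
<editorialDecl> with prose paragraphs. The wording of the paragraphs is
only a paraphrase of the original question, and method="silent" is an
assumption: it applies only if the corrections are not individually
marked up in the text.

```xml
<encodingDesc>
  <editorialDecl>
    <!-- method="silent" is assumed here: corrections are not tagged in the text -->
    <correction method="silent">
      <p>OCR errors were corrected against the first printed edition.
        Obvious orthographical errors in the printed text were also
        corrected.</p>
    </correction>
    <punctuation>
      <p>Punctuation missing from the printed text was supplied.</p>
    </punctuation>
  </editorialDecl>
</encodingDesc>
```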

pfs

On Wed, Oct 18, 2017, at 17:10, Hynek Janoušek wrote:

> I am looking for solutions of an editorial nature for the encoding of
> literary texts by the Irish-language writer Máirtín Ó Cadhain. I acquired txt
> documents containing some of the texts from the Royal Irish Academy; the
> texts will be included in the corpus 'Corpas Chanúint Liteartha Mháirtín
> Uí Chadhain' (Corpus of Máirtín Ó Cadhain's Literary Idiolect). However,
> those text documents that have been produced using OCR software need to
> be checked against the printed texts, in this case, the first editions of
> each individual collection of short stories or novel.
> How do I indicate this in the headers of the individual corpus texts,
> that is to say, how do I state that I acquired the text files from the
> Royal Irish Academy and that I subsequently checked and corrected them
> according to the version contained in the printed texts? As well as that,
> I decided that there have to be some minor changes made to the original
> texts of the printed versions (mostly missing punctuation or obvious
> orthographical errors). Could you advise me on the best ways of conveying
> these facts of my project? The minor editorial changes have been made
> because the corpus will need to be tagged for parts of speech and
> lemmatised by NLP Tools for Irish. The resultant corpus will serve as a
> basis for a Dictionary of Máirtín Ó Cadhain's Literary Idiom.


--
Paul Schaffner  Digital Content & Collections
University of Michigan Libraries
http://www.umich.edu/~pfs/

Re: TEI Corpus of literary texts: Corpus of Máirtín Ó Cadhain's Literary Idiolect

Kevin Hawkins
To be clear, if you treat the OCR-generated text as the original, then
you'll want to provide a bibliographic description of the OCR-generated
text in the <sourceDesc>. Within <sourceDesc>, you can use <bibl>,
<biblStruct>, or <biblFull>. While few people choose <biblFull>, you
might have a very good use case for it: the description of the
OCR-generated text needs a <sourceDesc> of its own that describes the
original printed document.
Thus, you'd have this basic structure (omitting many other header
elements that you'll probably want to use as well):

<teiHeader>
   <fileDesc>
     <!-- bibliographic description of TEI document -->
     <sourceDesc>
       <biblFull>
         <fileDesc>
           <!-- bibliographic description of the OCR-generated text -->
           <sourceDesc>
             <!-- bibliographic description of the printed document
                  using bibl, biblStruct, or biblFull -->
           </sourceDesc>
         </fileDesc>
         <!-- more metadata about OCR-generated text -->
       </biblFull>
     </sourceDesc>
   </fileDesc>
   <!-- more metadata about TEI document -->
</teiHeader>
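
For illustration, here is one way the skeleton might be filled in. This
is a sketch under several assumptions: the bracketed titles are
placeholders, recording the Royal Irish Academy as <distributor> is only
one possible reading of "acquired from", and this variant places
<titleStmt> and <publicationStmt> directly inside <biblFull> rather than
inside a nested <fileDesc>, which the TEI content model also appears to
permit. Validate against your own schema before relying on it.

```xml
<teiHeader>
  <fileDesc>
    <!-- description of the corrected TEI document -->
    <titleStmt>
      <title>[Title of the work]: corpus edition</title>
      <respStmt>
        <resp>checked against the first edition and corrected by</resp>
        <name>Hynek Janoušek</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <p>Part of Corpas Chanúint Liteartha Mháirtín Uí Chadhain.</p>
    </publicationStmt>
    <sourceDesc>
      <biblFull>
        <!-- description of the OCR-generated text file -->
        <titleStmt>
          <title>[Title of the work]: OCR-generated text file</title>
        </titleStmt>
        <publicationStmt>
          <!-- assumption: the Academy is recorded as distributor of the file -->
          <distributor>Royal Irish Academy</distributor>
        </publicationStmt>
        <sourceDesc>
          <!-- description of the printed first edition -->
          <bibl>
            <author>Máirtín Ó Cadhain</author>
            <title>[Title of the first edition]</title>
            <!-- place, publisher, and date of the first edition go here -->
          </bibl>
        </sourceDesc>
      </biblFull>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```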

--Kevin

On 10/20/17 1:46 PM, Paul Schaffner wrote:

> One quick and easy way to do what you want is to regard the
> OCR-generated text as the 'original' text, to which all further
> changes are applied. If you think of the text that way, all the
> changes that you describe amount to revisions of that original
> text (correcting it against the print, adjusting punctuation, etc.),
> which can be expressed in the <revisionDesc> within the header.