Hierarchy of XML files for <text> transcription and <msDesc>

Sebastiaan Verweij-3
**re-posting my earlier question -- thanks for your time!**

Dear TEI list

I’m coming back to TEI after a long spell of other things, and am wondering how best to approach a new project. I have IT support at my university, though we are building this sort of DH resource here for the first time. We are using an eXist database. 

We will be digitising British Library manuscripts; creating bibliographical descriptions for these; and creating transcriptions of selected texts. The texts are so-called ‘manuscript pamphlets’, political in nature, often copied in multiple manuscripts, and appearing in different kinds of manuscript (e.g., letter books, journals, bound separates, etc.). Most users will come to our website wanting to read the texts alongside the MS images. Secondarily, however, they will want to find info about all the witnesses containing the same text, and associated info (e.g., who made the copy, when, and where, so that we can also trace networks of dissemination). In addition, we’d like to list manuscript content for each witness separately (though not exhaustively). 

I am wondering how best to organise my corpus. I’m very used to writing single-witness XML, with a <teiHeader> containing an <msDesc>, and a <text> element with full transcription, and/or <facsimile>. We now have research questions around two major ‘units’. The first unit is the texts themselves: their content (in full diplomatic transcription, lightly marked up for names, dates, places), the associated text types (e.g., a prophecy, a speech, a proclamation), and their subject (e.g., royal succession, taxation, death of Prince Henry). The second ‘unit’ of interest is the manuscript itself: even if we only select a single manuscript from which to transcribe, we still need to capture information about all the others, in the usual style of <msDesc>, including (selected) contents.

We now have three possible ways forward: 

One: 
We encode the info about all manuscript witnesses in a single <teiHeader> in the same doc that contains the <text> transcription, using multiple instances of <msDesc> under <listBibl>. The huge downside to this is that some texts may have, say, 20 manuscript witnesses (with a lot of associated bibliographical detail); moreover, there will also be considerable duplication of <msDesc> where different texts are shared by the same manuscript (so we would have many identical <msDesc> sections across different XML docs that transcribe different texts). 

Two:
We separate out XML docs for manuscripts and texts. For the former, we’d create a single XML for each MS perhaps nested like this:
<text>
    <body>
       <listBibl>
          <msDesc>
etc. 

The upside here would be less duplication, but the text XMLs themselves would not contain the MS description data, but only references to these other docs. 

Three:
We are also considering a half-way solution, where the transcribed texts have a single witness (from which the text is taken) described in <msDesc>, and further references to other witnesses that are themselves described in separate XML docs. (I don’t much like this option). 

Do any of these approaches violate the spirit of TEI? Are there more obvious ways to do this? If you have experience creating a corpus like this, I’d be very grateful for advice on how best to organise the TEI infrastructure! Huge thanks. 

Sebastiaan

-- 
Dr Sebastiaan Verweij
Lecturer in Late-Medieval and Early Modern English Literature
University of Bristol 
(+44) (0) 117 92 88090

Re: Hierarchy of XML files for <text> transcription and <msDesc>

Torsten Schassan-2
Dear Sebastiaan,

I can describe our approach to your question rather than giving advice:

- First of all, we distinguish between "the manuscript" and
"descriptions": A TEI-XML file containing an <msDesc> inside the
<teiHeader> would represent the manuscript. Additional information could
be the <facsimile> and/or some transcription inside of <text>. In
practice, though, we store both types of additional content in separate
files. The contents of <msDesc> are subject to change, as new
information on the manuscript may become available at any time.

e.g. <http://diglib.hab.de/?db=mss&list=ms&id=46-noviss-2f>

Additionally, we would store "published" descriptions (e.g. in print) in
separate TEI-XML files. This time the <msDesc> would be stored in <text>,
as this is a representation of the catalogue. These files are stable and
must not be changed at all, since the text of the published description
does not change either.

e.g. <http://diglib.hab.de/?db=mss&list=ms&id=46-noviss-2f&catalog=Butzmann>


- Secondly, if we consider transcriptions of single manuscripts, we
store them alongside the digital images of the respective
manuscript. If it were a complex digital object, e.g. an edition, we
would store it as such in a distinct place. The line between a
"transcription" and an "edition" is a fine one, though. We consider a
transcription to be a "diplomatic edition". ;-)

e.g. <http://diglib.hab.de/mss/46-noviss-2f/start.htm?distype=start>


- Third, as mentioned before, we store all these files separately for
the sake of reusability. By means of XInclude or XLink mechanisms
it is easy to grab any content for a special purpose, e.g. the
description of the textual transmission. Concerning your units, I would
consider them to be constituents of such a "complex digital object"
which (c|sh)ould pull either texts or description and text together.
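
A minimal sketch of how the XInclude mechanism mentioned above could pull a separately stored description into a transcription file. Everything specific here is an assumption for illustration: the file names, the relative path, and the xml:id value "ms-example-a" are hypothetical, and the sketch presumes that the manuscript file contains an <msDesc> carrying that xml:id.

```xml
<!-- Transcription file (hypothetical name: texts/example-text.xml). -->
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:xi="http://www.w3.org/2001/XInclude">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Example text: transcription</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished working file.</p>
      </publicationStmt>
      <sourceDesc>
        <!-- Pull in the element whose xml:id is "ms-example-a" from the
             separately maintained manuscript file. The path and the id
             are assumptions made for this illustration. -->
        <xi:include href="../manuscripts/ms-example-a.xml"
                    xpointer="ms-example-a">
          <!-- Used only if the include cannot be resolved. -->
          <xi:fallback>
            <bibl>Description of manuscript A (include not resolved)</bibl>
          </xi:fallback>
        </xi:include>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Transcription goes here.</p>
    </body>
  </text>
</TEI>
```

The description then lives in one place only, and each transcription file that needs it repeats just the include. When and how the include is resolved depends on the processor in use, and a standard TEI schema will normally not accept the xi:include element itself, so validation would usually be run on the expanded document.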


To conclude: I would vote for option three, with separate files for
manuscripts, descriptions, transcriptions, editions, maybe even
introductory texts for the editions.
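
As a rough illustration of one such separate file, a manuscript file with the <msDesc> inside the <teiHeader> might look like the sketch below. The file name, the xml:id, and the bracketed placeholders are hypothetical and are not taken from any real record; only the holding library follows the project described in the question.

```xml
<!-- Manuscript file (hypothetical name: manuscripts/ms-example-a.xml). -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Description of manuscript A</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished working file.</p>
      </publicationStmt>
      <sourceDesc>
        <!-- The xml:id lets other files point at this description. -->
        <msDesc xml:id="ms-example-a">
          <msIdentifier>
            <settlement>London</settlement>
            <repository>British Library</repository>
            <idno>[shelfmark]</idno>
          </msIdentifier>
          <msContents>
            <!-- One msItem per selected text; the list need not be exhaustive. -->
            <msItem>
              <locus>[folio range]</locus>
              <title>[title of the pamphlet text]</title>
            </msItem>
          </msContents>
        </msDesc>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>No transcription in this file; see the separate transcription files.</p>
    </body>
  </text>
</TEI>
```

A file representing a published catalogue entry would instead carry its <msDesc> inside <text>, as described in the first point above.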

I am happy to discuss any aspect in more detail.


Best,
Torsten

--
Torsten Schassan - Digitale Editionen - Abteilung Handschriften und
Sondersammlungen
Herzog August Bibliothek, Postfach 1364, D-38299 Wolfenbuettel, Tel.:
+49-5331-808-130 (Fax -165)
Handschriftendatenbank: http://diglib.hab.de/?db=mss



(In reply to: Sebastiaan Verweij <[hidden email]>, to <[hidden email]>, sent 20.06.2017 10:48, subject: Hierarchy of XML files for <text> transcription and <msDesc>)

Re: Hierarchy of XML files for <text> transcription and <msDesc>

Sebastiaan Verweij-3
Huge thanks to Peter, Duncan, and Torsten. 

I will check out the Textual Communities resources; Peter, you are indeed dealing with fully comparable situations (in a sense, nothing much changes between medieval and early modern manuscript corpora in how we capture them in TEI). Duncan, my long-winded question must have shown that I didn't know about XInclude or how to implement it, but I like how neat this option is. I've read up on it now, and it looks perfect for the job. Torsten, I agree that ours are like your 'complex digital objects'. Thanks so much for sending links to your XML files, which perfectly model what we are trying to accomplish. I like the option of storing files separately to avoid duplication and make re-use more intuitive. 

Thanks once more -- I'm meeting with our IT developer this week and we will bash out our XML templates on the basis of your sound advice. I may come back with a few more quick Qs, but thanks to you all for coming up trumps. 

All best, Sebastiaan 

(In reply to Torsten Schassan's message of 20 June 2017 at 15:54.)