Draft TEI export from FromThePage

Ben Brumfield
Dear TEI-L,

For years I've admired the TEI from afar, scouring the Guidelines for standards I could follow for my own manuscript encoding work despite my guilt at not actually using TEI XML itself.  More recently, I came to the conclusion (with the help of some on this list) that it's possible to support TEI without using TEI-XML internally within a tool or externally as a user interface presented to transcribers.

To that end, I'm adding a TEI export feature to FromThePage, an open-source tool for transcribing, indexing, and annotating handwritten material.  This export mines the internal relational database FromThePage uses to record revisions, subject articles, notes, and page-to-subject links and combines those with transformation of the internal XML I use for transcripts to produce a single XML file.

But is it any good?  The XML validates against TEI P5 version 2.3.0, but the contents are programmatically generated from the application and user-generated content.  I suspect that the results may contain some real howlers when compared against hand-encoded TEI.  Before I deploy the export into the FromThePage codebase, I'd really appreciate some advice.

Here are two files generated by the exporter:

https://gist.github.com/benwbrum/6933615
Zenas Matthews' Mexican War Diary was scanned and posted by Southwestern University's Smith Library Special Collections.  It was transcribed, indexed, and annotated by Scott Patrick, a retired petroleum worker from Houston.

https://gist.github.com/benwbrum/6933603
Julia Brumfield's 1919 Diary was scanned and posted by me, transcribed largely by volunteer Linda Tucker, and indexed and annotated by me.

I'd love to hear suggestions for improvements and corrections.  I'm particularly interested in any issues with my use of RS and all the PERSON and PLACE elements in the source description. 

I'd also welcome any criticism of "supporting TEI" by letting volunteers edit in a wiki, storing the results in an RDBMS, and only producing TEI on export.

With many thanks,

Ben W. Brumfield
http://manuscripttranscription.blogspot.com/

Re: Draft TEI export from FromThePage

Hugh Cayless-2
Hi Ben. I'd say this looks pretty freakin' good.

At quick first glance, I think you're overusing <p> and I don't like <div1> etc. (I prefer <div>, but opinion is somewhat divided). I wouldn't wrap the contents of a page in <p>, since they aren't really paragraphs. <ab> would be better, if you need a container. I might also use just <persName> and <note> in <person>, instead of wrapping the contents in <p>s.

I'm not sure I understand the occasional uses of <s> that I see, and we (by we, I mean the EpiDoc folks) might be able to help/advise on converting things like [?] and [ordered?] into <gap reason="illegible"/> and <supplied>, etc.

Still, I think this is fantastic!

Hugh

On Oct 11, 2013, at 8:12, Ben Brumfield <[hidden email]> wrote:

> [...]

Re: Draft TEI export from FromThePage

Lou Burnard-6
I am saving up a closer look at this for my next train journey (in a few
hours' time), but just to say I (almost) entirely agree with Hugh! The
only negative points I spotted so far are

-- numbered divs are so last century
-- this is TEI-world: we use <p> to mean "paragraph" not "um bunch of
stuff I think I might want to format as a block"
-- <s> is meant to be used for end-to-end segmentation (i.e. where the
whole text is divided up into non-overlapping snippets): if you want a
tag for "arbitrary segment" use <seg>
-- The TEI first commandment precludes using things like [?]  -- you
should represent the uncertainty in  your TEI tagging

Thanks for persevering in this excellent work!

Lou



On 11/10/13 14:38, Hugh Cayless wrote:
> [...]

Re: Draft TEI export from FromThePage

Ben Brumfield
In reply to this post by Ben Brumfield
Thank you for the kind words, Hugh.

Re. div1: I'd seen so many <div1>s in texts I'd worked with that I
assumed that plain ol' <div> had fallen out of fashion.  I'm glad to
hear from you and Lou that I can go back to <div>.


<p> is a problem for me.  FromThePage users are encoding a single page
at a time, and are using a single carriage return to indicate a line-break
in the MS, and a double carriage return to indicate some form of
paragraph, whether marked by indentation, vertical whitespace, or
something else.  As a result, I have no way to determine programmatically
whether the beginning of a page of transcripts is the beginning of a paragraph
(as is the case in the Julia Brumfield diary, which is written on a pre-printed
diary book) or whether it's the continuation of a sentence from the previous
page (as in the case of Zenas Matthews).

It would be trivial, however, to convert all my <p>s to <ab>s -- that would
eliminate the implicit assertion about paragraphality at the beginning and end
of pages, though it would obliterate the (correct) assertion about paragraphs
in the middle of a page.  What do you think?
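
As a rough illustration of the convention described above (a single carriage return for a manuscript line break, a double one for a block boundary), the conversion to <ab> might be sketched like this in Python. This is not FromThePage's code: the function name `page_to_blocks` and its `block` parameter are hypothetical, and the sketch assumes the transcript is plain text, so any embedded tags would simply be escaped.

```python
import re
from xml.sax.saxutils import escape


def page_to_blocks(page_text: str, block: str = "ab") -> str:
    """Turn one transcribed page into TEI blocks (hypothetical helper).

    A blank line (double carriage return) starts a new block; a single
    newline inside a block becomes an <lb/> line-break milestone.
    """
    # Normalise Windows and old Mac line endings before splitting.
    text = page_text.replace("\r\n", "\n").replace("\r", "\n").strip()
    if not text:
        return ""
    blocks = []
    for chunk in re.split(r"\n\s*\n", text):
        lines = [escape(line.strip()) for line in chunk.split("\n") if line.strip()]
        if not lines:
            continue
        body = "".join(f"<lb/>{line}\n" for line in lines)
        blocks.append(f"<{block}>\n{body}</{block}>")
    return "\n".join(blocks)
```

Passing `block="p"` instead would reproduce the paragraph assertion, which is the trade-off under discussion: <ab> gives up the correct mid-page paragraphs in exchange for not making a false claim at page edges.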

Thanks for recommending the switch from <p> to <note> in my <person>
definitions -- that makes a lot of sense.


I was also confused by <s> when I saw it in Zenas Matthews.   FromThePage
will accept HTML tags in transcripts if users want to hand-enter them.  When
I wrote the TEI export feature, I was only aware of the use of <u>, <strike>,
and <table> with its child elements.  I'm converting <u> and <strike> to <hi>
and punting on <table> until I get a database dump of the FromThePage
installation that's using that.  

It appears that <s> is an old-fashioned HTML tag for <strike> which
was entered by the transcriber.  It's also valid TEI, so xmllint didn't catch
it.  I'll try to substitute it with <del>.
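
A minimal sketch of that kind of tag substitution, in Python rather than in FromThePage's own code. The mapping table, the function name `html_tags_to_tei`, and the `rend` values are assumptions made for illustration; here struck-through text goes to <del>, as proposed for <s>, and underlining goes to <hi>.

```python
import re

# Hypothetical mapping from hand-entered HTML tags to TEI elements.
# The rend values are illustrative, not taken from the exporter.
HTML_TO_TEI = {
    "u": ("hi", ' rend="underline"'),
    "strike": ("del", ' rend="overstrike"'),
    "s": ("del", ' rend="overstrike"'),
}

# Matches only attribute-free opening or closing <u>, <strike>, <s> tags.
_TAG = re.compile(r"<(/?)(u|strike|s)\s*>", re.IGNORECASE)


def html_tags_to_tei(fragment: str) -> str:
    """Replace the listed HTML tags with their TEI counterparts."""
    def swap(match):
        closing, name = match.group(1), match.group(2).lower()
        tei_name, attrs = HTML_TO_TEI[name]
        return f"</{tei_name}>" if closing else f"<{tei_name}{attrs}>"

    return _TAG.sub(swap, fragment)
```

Because the pattern requires the tag name to be followed directly by `>`, other tags that merely start with the same letter (such as <sub>) are left untouched.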


[?] is similar to <s> in that it's manually entered by transcribers.  I actually
have strong opinions in favor of print-style apparatus as a user interface,
and am in favor of continuing to allow editorial notations in single square braces.
That said, I'd really like to do an analysis of the places where users have written
"[Jones?]" or "Jones[?]" or "James [possibly Jones?]" etc. in their transcripts
to see if I can convert the most common vernacular notations into legitimate
TEI <unclear> or <gap> tags and suggest those formats in the application's
transcription conventions.
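
To make the idea concrete, here is a small Python sketch that handles only the simplest of those notations. It assumes that "[Jones?]" and "Jones[?]" both mean an uncertain reading of one word and that a free-standing "[?]" means nothing could be read; those interpretations, and the function name `brackets_to_tei`, are assumptions, which is exactly why an analysis of real transcripts would need to come first. Multi-word notes such as "[possibly Jones?]" are deliberately left alone for a human to review.

```python
import re


def brackets_to_tei(text: str) -> str:
    """Convert a few common bracket notations to TEI (illustrative only)."""
    # "[Jones?]": a single bracketed word with a trailing question mark.
    text = re.sub(r"\[([^\[\]?\s]+)\?\]", r"<unclear>\1</unclear>", text)
    # "Jones[?]": a word immediately followed by "[?]".
    text = re.sub(r"(\w+)\[\?\]", r"<unclear>\1</unclear>", text)
    # Any remaining free-standing "[?]": nothing legible at all.
    text = re.sub(r"\[\?\]", '<gap reason="illegible"/>', text)
    return text
```

The order of the three substitutions matters: the free-standing "[?]" rule runs last so that it only sees markers the first two rules did not claim.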

That project bears a lot of similarities to your work with Leiden -> Leiden+,
so I'd be really interested to know how you worked it out.  I still don't want
to create a prescriptive notation language within FromThePage, as users
range from unschooled amateurs to literary scholars to archivists to
bioinformaticians, so I'm worried about clashing disciplinary conventions.

I'm sure I'm not the first person on this list to have attempted to convert
print notations into TEI programmatically, however, and as usual would
welcome suggestions.


Thanks again for your ideas and enthusiasm.

Ben Brumfield
http://manuscripttranscription.blogspot.com/

Re: Draft TEI export from FromThePage

Ben Brumfield
In reply to this post by Ben Brumfield
Thanks for your comments, Lou, and for your encouragement
in College Station last year.

Building on my reply to Hugh,
-- I'll swap out the <div1> elements with <div>.
-- If it's bad to use <p> generically, is it better to
use <ab> both in cases where you mean "some kind
of block" and in cases where you really are dealing with
a paragraph?  If you must pick one or the other, which
choice is the least bad?
-- In researching <s> in Zenas Matthews, it looks like it was
intended to be an HTML <strike> in the original, thus I should
replace it with <del>.  Unfortunately I didn't have access to
HTML references on the plane where I coded this feature, so
I wasn't able to figure out the appropriate analogue in TEI.
-- Good point on [?].  That's all user-generated content, so I'm
probably going to have to take a statistical approach to figure out
what they mean before I know how to do the appropriate
transformations.  I originally hoped to let users indicate unclear
text by selecting regions of the facsimile to mark as uncertain;
however, it's never risen to the top of the list due, I don't doubt,
to my shakiness with JavaScript.

Thanks again for your encouragement.  Do please pass along
any further suggestions/criticism after your train ride.

Ben Brumfield
http://manuscripttranscription.blogspot.com/

Re: Draft TEI export from FromThePage

Lou Burnard-6
On 12/10/13 04:06, Ben Brumfield wrote:
> -- If it's bad to use <p> generically, is it better to
> use <ab> both in cases where you mean "some kind
> of block" and in cases where you really are dealing with
> a paragraph?  If you must pick one or the other, which
> choice is the least bad?

Here's an annoying suggestion. Why bother to treat the whole page as a
block at all? You have a <pb/> to mark its start which you can also use
to tie in the page image, and to show where someone started work on
capturing that bit. Doesn't wrapping the page in a div or a p or an ab
cause more problems than it solves? In these documents surely the most
appropriate/useful block to mark up would be the entry (plus maybe
paragraphs or lists within that) -- would it be hard to get your
data-capturers (hunters?) to flag that boundary in some way you could
then transform to useful markup?

Re: Draft TEI export from FromThePage

Kevin Hawkins
In reply to this post by Ben Brumfield
Ben,

On 10/12/13 9:18 AM, Lou Burnard wrote:

> On 12/10/13 04:06, Ben Brumfield wrote:
>> -- If it's bad to use <p> generically, is it better to
>> use <ab> both in cases where you mean "some kind
>> of block" and in cases where you really are dealing with
>> a paragraph?  If you must pick one or the other, which
>> choice is the least bad?
>
> Here's an annoying suggestion. Why bother to treat the whole page as a
> block at all? You have a <pb/> to mark its start which you can also use
> to tie in the page image, and to show where someone started work on
> capturing that bit. Doesn't wrapping the page in a div or a p or an ab
> cause more problems than it solves? In these documents surely the most
> appropriate/useful block to mark up would be the entry (plus maybe
> paragraphs or lists within that) -- would it be hard to get your
> data-capturers (hunters?) to flag that boundary in some way you could
> then transform to useful markup?

I see <ab> as a block-level item unmarked (in the linguistic sense of
the term) for "paragraphness"; thus, I would use <ab> when you don't
know which of the two to pick.

In any case, if you are unable to follow Lou's suggestion to tag
further, I would use a single <ab> for the whole text rather than one
for each page.  See:

http://www.tei-c.org/SIG/Libraries/teiinlibraries/main-driver.html#L1AHex

I also notice that you've used <head> for the page numbers; these should
instead be in <fw>.

--Kevin

Re: Draft TEI export from FromThePage

Ben Brumfield
In reply to this post by Ben Brumfield
Thanks to you all for the comments and suggestions you've given me
both on-list and off-list.  

I believe that I've addressed most of the issues that were raised in
the thread in these commits:
 - sorted changes in descending order
 - Link changes to the pages that were changed
 - Converted HEAD to FW elements for page titles.
 - convert obsolete HTML S tags to DEL
 - Converted div1 to div
 - converted person and place to use note
(Full commit log here: https://github.com/emacadie/fromthepage-rails_3/commits/master )

I've updated the gists with the second draft export results:
 - Brumfield Diary: https://gist.github.com/benwbrum/6933603
 - Matthews Diary: https://gist.github.com/benwbrum/6933615
(Clicking on the "Revisions" link in the left navigation bar will load a highlighted
diff between the first and second drafts of each file.)


There is one huge issue that most of you raised, which is the issue with P, PB,
and page-level DIVs.  

Here I'm a bit stumped.  The reason for the present structure is that I developed
FromThePage around Julia Brumfield's diaries,  in which physical pages, logical entries,
page facsimiles, and containers for real paragraphs are all the same thing.
(cf. http://fromthepage.com/display/display_page?page_id=754 for a single
page view and http://fromthepage.com/display/read_work?work_id=3 for
the layout of the whole text.)

In that idyllic situation, a DIV-per-page organization makes sense.
The pre-printed page heading (now in FW elements) is meaningful.
The mark-up of terms only once inside each page is adequate for
generating an index.  Even better, there is little chance that a logical
paragraph will be broken by the page break, since the diarist confined
each entry to a single page.   Finally, it's easy for the software to
present each page in isolation from the rest of the text.

However, once we generalize and apply the tool to something like
Zenas Matthews' Mexican War Diary, we find that the approach leaves
a lot to be desired.  Physical pages are not coterminous with diary
entries, paragraphs may span page boundaries, and the page titles
needed by the transcription tool really aren't meaningful enough to
go into FW elements.

Fixing this requires some changes to the transcription system.  One
option is to allow the administrator uploading the work to choose from
a set of options describing how the work should be organized--as a continuous
text with meaningless page breaks (as with Matthews) or a page-per-entry
format.  The export could then read that configuration and adjust the
TEI XML it generates accordingly.

Another option is to abandon page-level organization altogether and
introduce section headings to the UI to describe the beginning of an entry
or section.  MediaWiki has a pretty simple and popular syntax for section
headings, and as I've borrowed interface elements from them before,
that may be a good approach.  (I know that Wikisource has dealt with the
paragraph vs. page clash through their transclusion feature, which is another
hopeful sign.)  

Letting users specify section headings would also help with documents
in which an entry may span pages or a page may contain multiple entries.
There are lots of things I could do with that outside of TEI goals, such as contextualizing
subjects with dates or places to mine the results for timelines and maps.
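
The section-heading option could look something like the following Python sketch. It assumes transcribers mark the start of an entry with a MediaWiki-style level-2 heading (`== May 26th 1846 ==`) and that pages arrive as ordered (page number, text) pairs; the function name `pages_to_entries` and the choice of <ab> as the inner container are hypothetical. Each entry becomes a <div>, and each page break becomes a <pb/> milestone that is free to fall inside an entry.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# MediaWiki-style level-2 heading, e.g. "== May 26th 1846 ==".
HEADING = re.compile(r"^==\s*(.+?)\s*==$")


def pages_to_entries(pages) -> str:
    """Build entry-level <div>s from page-by-page transcripts.

    pages: iterable of (page_number, transcript_text) pairs, in order.
    """
    out = []
    in_entry = False

    def open_entry(title=None):
        attr = f" n={quoteattr(title)}" if title else ""
        out.append(f"<div{attr}>")
        out.append("<ab>")

    def close_entry():
        out.append("</ab>")
        out.append("</div>")

    for number, text in pages:
        # A milestone, not a container, so entries can span pages.
        out.append(f"<pb n={quoteattr(str(number))}/>")
        for raw in text.splitlines():
            line = raw.strip()
            if not line:
                continue
            heading = HEADING.match(line)
            if heading:
                if in_entry:
                    close_entry()
                open_entry(heading.group(1))
                in_entry = True
            else:
                if not in_entry:
                    # Text before any heading goes in an untitled entry.
                    open_entry()
                    in_entry = True
                out.append(f"<lb/>{escape(line)}")
    if in_entry:
        close_entry()
    return "\n".join(out)
```

For a diary like Julia Brumfield's, where every page holds exactly one entry, the output is nearly the same as a div-per-page export; the difference only shows up in texts like Zenas Matthews', where an entry runs across a page break.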

Thanks again for all of your suggestions.  As before, I welcome any
further comments.

Ben Brumfield
http://manuscripttranscription.blogspot.com/

Re: Draft TEI export from FromThePage

James Cummings-4
Hi Ben,

On 16/10/13 00:07, Ben Brumfield wrote:
> There is one huge issue that most of you raised, which is the issue with P, PB,
> and page-level DIVs.

This is the choice a lot of people make about choosing the
intellectual structure over the physical structure. It used
to be much more difficult to switch between hierarchies, but for
the majority of encoders choosing the intellectual structure is
usually the right choice.  In the TEI-C Stylesheets we have a
tool 'processpb.xsl' which deals with simple cases of switching
from a standard flowing div structure with <pb/> elements to
having pages enclosed.

https://github.com/TEIC/Stylesheets/blob/master/tools/processpb.xsl

I mention it just in case others find this useful (or wish to
improve upon it, because it certainly isn't perfect).

> Here I'm a bit stumped.  The reason for the present structure is that I developed
> FromThePage around Julia Brumfield's diaries,  in which physical pages, logical entries,
> page facsimiles, and containers for real paragraphs are all the same thing.
> (cf. http://fromthepage.com/display/display_page?page_id=754 for a single
> page view and http://fromthepage.com/display/read_work?work_id=3 for
> the layout of the whole text.)

This was true for me for William Godwin's Diary which meant that
breaking up things by year
http://godwindiary.bodleian.ox.ac.uk/diary/1797.html or month
http://godwindiary.bodleian.ox.ac.uk/diary/1797-08.html or day
http://godwindiary.bodleian.ox.ac.uk/diary/1797-08-30.html was
particularly easy.
But even there, where I treated years as individual files,
because the diary structure was (for the most part) divided into
pre-ruled entries (e.g.
http://godwindiary.bodleian.ox.ac.uk/folio/e.203_0025v ) I could
have encapsulated pages in some element, perhaps <div>, but I
chose not to so that I didn't run into problems in those few
cases where there was some cross-page entry. I.e. since I was
displaying by calendar date (year, month, day) rather than pages,
at the express request of the project, I didn't include that as
an option.

> Another option is to abandon page-level organization altogether and
> introduce section headings to the UI to describe the beginning of an entry
> or section.  MediaWiki has a pretty simple and popular syntax for section
> headings, and as I've borrowed interface elements from them before,
> that may be a good approach.  (I know that Wikisource has dealt with the
> paragraph vs. page clash through their transclusion feature, which is another
> hopeful sign.)

I'd suggest that catering for two systems is problematic and you
should choose one or the other, allowing for export to the
other once transcription is complete.

Just some random thoughts,
-James


--
Dr James Cummings, [hidden email]
Academic IT Services, University of Oxford

Re: Draft TEI export from FromThePage

Robinson, Peter
In reply to this post by Ben Brumfield
Just for fun, and because I had a little spare time, I converted Ben's Matthews War Diary file (https://gist.github.com/benwbrum/6933615) into something like the form advocated by several of us -- that is, with <div> elements holding separate diary entries, and thus running across page breaks in the classic overlapping hierarchy mode, as in this example:

<div n="May 26th 1846">
<p n="1">[Sunday?] 26 1846
<lb/>[This?][morning?] at the Navidad we <note type="ed" resp="#U93">The black jack and post oak mentioned are of course the Blackjack Oak, and the Post Oak trees. </note>
<pb xml:id="F2644" n="5" facs="wPage04.jpg"/>
<fw place="tm" type="pageNum">4</fw>
<lb/><del rend="overstrike">overtook</del> came up with 2
<lb/>persons going to the army crossed the
<lb/>Navidad thence [through?] [postoaks?] some
etc..

You can see the XML for this at http://www.sd-editions.com/FromThePage/.

I then took that XML and put it into our textual communities system, along with the images, which is designed precisely to handle the overlapping hierarchies.  You can see this at
Click on a page number in the left side and you'll see the transcription and image in the right.  If you want to see it without all the XML, click on preview (this will also parse the page against the P5 DTD).  You can change it and save it even -- but the changes will appear to be done by me (please don't try the commit button -- we are still testing that).

A few things to note:
-- this shows the page view.  You can view the document by diary entry if you click on 'Collations' in the left hand panel; you can then drill down in each diary entry down to the individual p elements and then inspect the text of each <p>, inclusive of whatever <pb/> elements it might cross. (similarly, you can navigate to a page by diary entry if you go to document>items)

-- at the beginning of many transcription pages you'll see something like:
<text>
<body>
<div n="June 1st 1846" prev="urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31:entity=June 1st 1846">
<p n="1" prev="urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31:entity=June 1st 1846:para=1">

<text> and <body> are standard TEI of course.  Now, the magic comes in the use of @prev on div and p: these indicate that the text in these particular elements is continuing the text from a previous page.  Thus, on the first div:
prev="urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31:entity=June 1st 1846" -- tells us that this fragment contains text of the div 'June 1st 1846' which began on Folio 7 Line 31 -- ie the previous page.  The same for the <p> element.

-- the TC system is here using the n attributes on <pb/>  <div> and <div>/<p> to construct these urn identifiers (according to xpath expressions given in the refsDecl in the header).
These urns are the real heart of the system, so:
urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31 -- points to the text contained in line 31 of folio 7 of the document '1846', inside the ZMD community in the University of Saskatchewan collection of textual communities (this is a 'document' in our system)
urn:det:TCUSask:ZMD:entity=June 1st 1846:para=1 -- points to the text contained in the first paragraph of the diary entry for June 1st 1846, inside the ZMD community (there could be many different documents with text of this diary entry, in fact, though here we have just one) -- this is an 'entity' in our system
urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31:entity=June 1st 1846:para=1 -- points to the text of the diary entry for June 1st 1846 as contained in line 31 of folio 7 of document 1846. (this is a 'text' in our system)
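
For readers who want to take these identifiers apart, the following Python sketch splits a urn into its document and entity parts, going only by the three examples above. The function name `parse_det_urn` and the dictionary keys `collection` and `community` are labels invented here; the actual grammar of the scheme is not specified in this message, so treat the parsing rules as an assumption.

```python
def parse_det_urn(urn: str) -> dict:
    """Split a 'urn:det:...' identifier into document and entity parts.

    Assumes colon-separated key=value pairs, where 'document=' and
    'entity=' each open a new section of the identifier.
    """
    parts = urn.split(":")
    if len(parts) < 5 or parts[:2] != ["urn", "det"]:
        raise ValueError(f"not a det URN: {urn!r}")
    parsed = {
        "collection": parts[2],
        "community": parts[3],
        "document": {},
        "entity": {},
    }
    section = None
    for part in parts[4:]:
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        if key in ("document", "entity"):
            section = key
        if section is None:
            raise ValueError(f"{part!r} appears before document= or entity=")
        parsed[section][key] = value
    # Both parts present: a 'text'; otherwise a 'document' or an 'entity'.
    if parsed["document"] and parsed["entity"]:
        parsed["kind"] = "text"
    elif parsed["document"]:
        parsed["kind"] = "document"
    else:
        parsed["kind"] = "entity"
    return parsed
```

Splitting on the colon works for these examples because values such as "June 1st 1846" contain spaces but no colons; a value containing a colon would need an escaping rule that the examples do not show.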

This has many advantages.  You could use these urns to attach annotations, images, alternative transcriptions, etc etc. Those familiar with the 'canonical text system' used by the Homer multitext will see it's very similar to theirs (but different in that ours is built on explicit distinction and linking of documents/entities/texts).

Now, here is a cunning bit.  This is built now with an api, and all text within the system is available under CC attribution share-alike (but NOT the images).  So anyone can go make their own interface to these texts through the api, so:
http://textualcommunities.usask.ca/api/communities/ brings us back a list of communities in the USask system
from this we learn that this Matthews diary is community 91, and http://textualcommunities.usask.ca/api/communities/91/ gives information about it
http://textualcommunities.usask.ca/api/communities/91/docs/ tells us all the documents in this community -- only one, with id 1497295
http://textualcommunities.usask.ca/api/docs/1497295/has_parts/ lists all the pages in the document; for example id 1497299 is folio 4

http://textualcommunities.usask.ca/api/communities/91/entities/  tells us all the entities in this community (ie, all the diary entries), eg 620366  is May 25th 1846
http://textualcommunities.usask.ca/api/entities/620366/has_parts/ tells us the paragraphs in this entity (three of them)
http://textualcommunities.usask.ca/api/entities/620366/has_text_of/ tells us what texts there are of this diary entry (only one, with the id 3006519)
http://textualcommunities.usask.ca/api/texts/3006519/xml/ gives us the xml for the whole of this diary entry in the document 1846, all through paras (could be a pb in this page, actually there isn't)

That should get you started.  I will personally buy a small drink for the first person who uses these calls to make their own version of this diary. Have fun.  (hint, the file indexajax.html, buried in the Liferay system, shows all this at work.  hint2, use jQuery and $.getJSON calls to stay sane.  But you all know these things already)
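
For anyone who would rather start from a script than from jQuery, the calls listed above could be wrapped roughly as follows in Python. Only the URL patterns come from this message; the helper names are hypothetical, the responses are assumed to be UTF-8 and (except for the `xml` call) JSON, and nothing here is known about the shape of the returned data or whether the service is still reachable.

```python
import json
from urllib.request import urlopen

# Root of the API as given in the URLs above.
API_ROOT = "http://textualcommunities.usask.ca/api"


def api_url(*segments) -> str:
    """Join path segments onto the API root, keeping the trailing slash."""
    return API_ROOT + "/" + "/".join(str(s) for s in segments) + "/"


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return the body as text (UTF-8 is an assumption)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode the body as JSON (assumed response format)."""
    return json.loads(fetch_text(url, timeout))


# Example calls, using the ids quoted above (these need network access):
#   fetch_json(api_url("communities", 91, "docs"))
#   fetch_json(api_url("entities", 620366, "has_parts"))
#   fetch_text(api_url("texts", 3006519, "xml"))
```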

all the best
Peter



On 15 Oct 2013, at 17:07, Ben Brumfield wrote:

[...]

Peter Robinson
Bateman Professor of English
#311, Arts Building, 9 Campus Drive, University of Saskatchewan
Saskatoon SK S7N 5A5, Canada
ph. (+1) 306 966 5491








Re: Draft TEI export from FromThePage

Ben Brumfield
Thank you, Peter -- this is incredibly generous.

I don't have as much time as I'd like (and that your effort deserves) at the
moment, so rather than bottling a massive response up for later, I'd like
to reply in parts as I can.

I'm starting with your hand-converted XML, then will move to the Textual
Communities application (in another email) and the APIs (which
I'm drooling over).

Comparing https://gist.github.com/benwbrum/6933615 with
http://www.sd-editions.com/FromThePage/Matthews.xml , the following
things struck me as notable.

msIdentifier: Thanks so much for expanding this -- I had no idea that
it could be composed of something other than a call number, and the
elements you used are fields I need to be collecting anyway to support
the libraries, archives, and museums using FromThePage who don't
have clear ways to point transcription/edition users back to their institutional
presences.  None of the child elements were clear to me from the
examples in the Guidelines.

refsDecl: I'm not clear what the "det:"-prefixed attributes refer to, how
anyone would use them, or why they're there.  Can you give me any
pointers?

encodingDesc: Unless I misunderstand the documentation, this
element--which had escaped my notice until you made your conversion--
would be the perfect place to 1) dump the transcription conventions
which FromThePage presents to the user for this text (those conventions
being defined by the organization that uploaded the manuscript
facsimiles and started the transcription project, and 2) describe the
export process used to convert FromThePage's RDBMS+XML to TEI.
Is that right?  Because I would LOVE to be able to make all of that
transparent.
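
If that reading is right, the header fragment might be generated along
these lines.  This is only a sketch in Python, not FromThePage code:
the function name, the sample convention text, and the version number
are placeholders I made up, and I am assuming that editorialDecl is the
right home for the conventions and appInfo/application for the export
description.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_encoding_desc(conventions, tool_version, export_note):
    """Return an <encodingDesc> element holding the project's transcription
    conventions and a description of the export process."""
    ET.register_namespace("", TEI_NS)
    enc = ET.Element(f"{{{TEI_NS}}}encodingDesc")

    # 1) Transcription conventions shown to transcribers, one <p> each.
    editorial = ET.SubElement(enc, f"{{{TEI_NS}}}editorialDecl")
    for convention in conventions:
        ET.SubElement(editorial, f"{{{TEI_NS}}}p").text = convention

    # 2) The software that produced the file, with a prose note on how.
    app_info = ET.SubElement(enc, f"{{{TEI_NS}}}appInfo")
    app = ET.SubElement(
        app_info,
        f"{{{TEI_NS}}}application",
        {"ident": "FromThePage", "version": tool_version},
    )
    ET.SubElement(app, f"{{{TEI_NS}}}label").text = "FromThePage TEI export"
    ET.SubElement(app, f"{{{TEI_NS}}}p").text = export_note
    return enc


if __name__ == "__main__":
    enc = build_encoding_desc(
        # Invented sample text; real conventions come from the project owner.
        ["Uncertain readings are placed in square brackets."],
        "0.1",  # placeholder version number
        "Generated from relational records combined with per-page XML.",
    )
    print(ET.tostring(enc, encoding="unicode"))
```

Keeping the conventions as plain paragraphs means whatever the project
owner typed can be passed through without interpretation.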

I see that you eliminated the "change" elements and the "person" and
"place" elements from the header.  The former is no biggie, but I'm a bit
concerned about the latter.  Marking up persons, places, things and events
is the heart of FromThePage, so I feel that it's important to get that right
in the TEI export.  (I've been worried that there's no "thingology" to match
entries for persons and places, so that the user-generated notes associated
with subjects like "cutting match" or "firing tobacco" don't make it
to the export)

Did you skip the listPerson and listPlace conversion for lack of time--in which
case I'll breathe a sigh of relief--or because I was doing something seriously
incorrect which needs to be addressed?

I've run out of time, and have only gotten through the header of your
hand-converted TEI file.  We'll have to save the next set of questions
and comments for another day.

Thank you again for doing this!

Ben Brumfield



On Thu, Oct 17, 2013 at 8:31 AM, Peter Robinson <[hidden email]> wrote:

> Just for fun, and because I had a little spare time, I converted Ben's
> Matthews War Diary file (https://gist.github.com/benwbrum/6933615) into
> something like the form advocated by several of us -- that is, with <div>
> elements holding separate diary entries, and thus running across page breaks
> in the classic overlapping hierarchy mode, as in this example:
>
> <div n="May 26th 1846">
> <p n="1">[Sunday?] 26 1846
> <lb/>[This?][morning?] at the Navidad we <note type="ed" resp="#U93">The
> black jack and post oak mentioned are of course the Blackjack Oak, and the
> Post Oak trees. </note>
> <pb xml:id="F2644" n="5" facs="wPage04.jpg"/>
> <fw place="tm" type="pageNum">4</fw>
> <lb/><del rend="overstrike">overtook</del> came up with 2
> <lb/>persons going to the army crossed the
> <lb/>Navidad thence [through?] [postoaks?] some
> etc..
>
> You can see the XML for this at http://www.sd-editions.com/FromThePage/.
>
> I then took that XML and put it into our textual communities system, along
> with the images, which is designed precisely to handle the overlapping
> hierarchies.  You can see this at
> http://www.textualcommunities.usask.ca/web/matthews/viewer.
> Click on a page number in the left side and you'll see the transcription and
> image in the right.  If you want to see it without all the XML, click on
> preview (this will also parse the page against the P5 DTD).  You can change
> it and save it even -- but the changes will appear to be done by me (please
> don't try the commit button -- we are still testing that).
>
> A few things to note:
> -- this shows the page view.  You can view the document by diary entry if
> you click on 'Collations' in the left hand panel; you can then drill down in
> each diary entry down to the individual p elements and then inspect the text
> of each <p>, inclusive of whatever <pb/> elements it might cross.
> (similarly, you can navigate to a page by diary entry if you go to
> document>items)
>
> -- at the beginning of many transcription pages you'll see something like:
> <text>
> <body>
> <div n="June 1st 1846"
> prev="urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31:entity=June 1st 1846">
> <p n="1" prev="urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31:entity=June 1st 1846:para=1">
>
> <text> and <body> are standard TEI of course.  Now, the magic comes in the
> use of @prev on div and p: these indicate that the text in these particular
> elements is continuing the text from a previous page.  Thus, on the first
> div:
> prev="urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31:entity=June 1st
> 1846" -- tells us that this fragment contains text of the div 'June 1st
> 1846' which began on Folio 7 Line 31 -- ie the previous page.  The same for
> the <p> element.
>
> -- the TC system is here using the n attributes on <pb/>, <div>, and
> <div>/<p> to construct these urn identifiers (according to XPath expressions
> given in the refsDecl in the header).
> These urns are the real heart of the system, so:
> urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31 -- points to the text
> contained in line 31 of folio 7 of the document '1846', inside the ZMD
> community in the University of Saskatchewan collection of textual
> communities (this is a 'document' in our system)
> urn:det:TCUSask:ZMD:entity=June 1st 1846:para=1 -- points to the text
> contained in the first paragraph of the diary entry for June 1st 1846,
> inside the ZMD community (there could be many different documents with text
> of this diary entry, in fact, though here we have just one) -- this is an
> 'entity' in our system
> urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31:entity=June 1st
> 1846:para=1 -- points to the text of the diary entry for June 1st 1846 as
> contained in line 31 of folio 7 of document 1846. (this is a 'text' in our
> system)
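
To check my own understanding of the three kinds of identifier, here is a
small Python sketch that splits one of these urns and classifies it.  It
is my reading of the examples above, not the Textual Communities
implementation: I am assuming that the first two fields after "urn:det"
are always the collection and the community, that every later segment is
a key=value pair, and that values never contain a colon.  The function
name is made up.

```python
def parse_det_urn(urn):
    """Split a urn:det: identifier into its parts and classify it.

    Returns a dict with 'authority', 'community', 'parts' (ordered
    (key, value) tuples) and 'kind' ('document', 'entity' or 'text').
    """
    # Collapse whitespace so a urn wrapped across lines still parses.
    fields = " ".join(urn.split()).split(":")
    if len(fields) < 5 or fields[:2] != ["urn", "det"]:
        raise ValueError(f"not a urn:det identifier: {urn!r}")
    authority, community = fields[2], fields[3]

    parts = []
    for segment in fields[4:]:
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"segment without '=': {segment!r}")
        parts.append((key, value))

    # document only -> 'document'; entity only -> 'entity'; both -> 'text'.
    keys = {key for key, _ in parts}
    if {"document", "entity"} <= keys:
        kind = "text"
    elif "document" in keys:
        kind = "document"
    elif "entity" in keys:
        kind = "entity"
    else:
        raise ValueError(f"neither document nor entity in: {urn!r}")

    return {
        "authority": authority,
        "community": community,
        "parts": parts,
        "kind": kind,
    }


if __name__ == "__main__":
    for example in (
        "urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31",
        "urn:det:TCUSask:ZMD:entity=June 1st 1846:para=1",
        "urn:det:TCUSask:ZMD:document=1846:Folio=7:Line=31:"
        "entity=June 1st 1846:para=1",
    ):
        print(parse_det_urn(example)["kind"], example)
```

The same function should decode the values of the prev attributes on the
continued div and p elements, since those are 'text' urns as well.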
>
> This has many advantages.  You could use these urns to attach annotations,
> images, alternative transcriptions, etc etc. Those familiar with the
> 'canonical text system' used by the Homer multitext will see it's very
> similar to theirs (but different in that ours is built on explicit
> distinction and linking of documents/entities/texts).
>
> Now, here is a cunning bit.  This is built now with an api, and all text
> within the system is available under CC attribution share-alike (but NOT the
> images).  So anyone can go make their own interface to these texts through
> the api, so:
> http://textualcommunities.usask.ca/api/communities/ brings us back a list of
> communities in the USask system
> from this we learn that this Matthews diary is community 91, and
> http://textualcommunities.usask.ca/api/communities/91/ gives information
> about it
> http://textualcommunities.usask.ca/api/communities/91/docs/ tells us all the
> documents in this community -- only one, with id 1497295
> http://textualcommunities.usask.ca/api/docs/1497295/ tells us more about the
> document
> http://textualcommunities.usask.ca/api/docs/1497295/has_parts/ lists all the
> pages in the document; for example id 1497299 is folio 4
> http://textualcommunities.usask.ca/api/docs/1497299/xml/ gives us the xml
> for this page
>
> http://textualcommunities.usask.ca/api/communities/91/entities/  tells us
> all the entities in this community (ie, all the diary entries), eg 620366
> is May 25th 1846
> http://textualcommunities.usask.ca/api/entities/620366/has_parts/ tells us
> the paragraphs in this entity (three of them)
> http://textualcommunities.usask.ca/api/entities/620366/has_text_of/ tells us
> what texts there are of this diary entry (only one, with the id 3006519)
> http://textualcommunities.usask.ca/api/texts/3006519/xml/ gives us the xml
> for the whole of this diary entry in the document 1846, across all of its
> paragraphs (there could be a <pb/> inside it, though here there isn't)
>
> That should get you started.  I will personally buy a small drink for the
> first person who uses these calls to make their own version of this diary.
> Have fun.  (hint, the file indexajax.html, buried in the Liferay system,
> shows all this at work.  hint2, use jQuery and $.getJSON calls to stay sane.  But
> you all know these things already)
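
For anyone who would rather script this than click through it, the chain
of calls above can be written down as a small Python sketch.  The URL
patterns and the ids (91, 620366, 3006519) are taken from the list above;
the helper names are made up, and I am not assuming anything about the
shape of the responses or about whether the service still answers, so the
fetch helper just returns the body as text.

```python
from urllib.request import urlopen

BASE = "http://textualcommunities.usask.ca/api"


def api_url(*segments):
    """Join path segments onto the API root, keeping the trailing slash."""
    return "/".join([BASE, *(str(s) for s in segments), ""])


def entry_urls(community_id, entity_id, text_id):
    """URLs visited, in order, to reach the XML of one diary entry."""
    return [
        api_url("communities"),
        api_url("communities", community_id),
        api_url("communities", community_id, "entities"),
        api_url("entities", entity_id, "has_parts"),
        api_url("entities", entity_id, "has_text_of"),
        api_url("texts", text_id, "xml"),
    ]


def fetch(url):
    """Return the response body as text (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Prints the URLs only; call fetch(url) on each to retrieve the data.
    for url in entry_urls(91, 620366, 3006519):
        print(url)
```

In a real client the ids would be read out of each response rather than
passed in by hand.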
>
> all the best
> Peter
>
>
>
> On 15 Oct 2013, at 17:07, Ben Brumfield wrote:
>
> Thanks to you all for the comments and suggestions you've given me
> both on-list and off-list.
>
> I believe that I've addressed most of the issues that were raised in
> the thread in these comments:
> - Sorted changes in descending order
> - Linked changes to the pages that were changed
> - Converted HEAD to FW elements for page titles
> - Converted obsolete HTML S tags to DEL
> - Converted div1 to div
> - Converted person and place to use note
> (Full commit log here:
> https://github.com/emacadie/fromthepage-rails_3/commits/master )
>
> I've updated the gists with the second draft export results:
> - Brumfield Diary: https://gist.github.com/benwbrum/6933603
> - Matthews Diary: https://gist.github.com/benwbrum/6933615
> (Clicking on the "Revisions" link in the left navigation bar will load a
> highlighted diff between the first and second drafts of each file.)
>
>
>
> There is one huge issue that most of you raised, which is the issue with
> P, PB, and page-level DIVs.
>
> Here I'm a bit stumped.  The reason for the present structure is that I
> developed FromThePage around Julia Brumfield's diaries, in which physical
> pages, logical entries, page facsimiles, and containers for real paragraphs
> are all the same thing.
> (cf. http://fromthepage.com/display/display_page?page_id=754 for a single
> page view and http://fromthepage.com/display/read_work?work_id=3 for
> the layout of the whole text.)
>
> In that idyllic situation, a DIV-per-page organization makes sense.
> The pre-printed page heading (now in FW elements) is meaningful.
> The mark-up of terms only once inside each page is adequate for
> generating an index.  Even better, there is little chance that a logical
> paragraph will be broken by the page break, since the diarist confined
> each entry to a single page.   Finally, it's easy for the software to
> present each page in isolation from the rest of the text.
>
> However, once we generalize and apply the tool to something like
> Zenas Matthews' Mexican War Diary, we find that the approach leaves
> a lot to be desired.  Physical pages are not coterminous with diary
> entries, paragraphs may span page boundaries, and the page titles
> needed by the transcription tool really aren't meaningful enough to
> go into FW elements.
>
> Fixing this requires some changes to the transcription system.  One
> option is to allow the administrator uploading the work to choose from
> a set of options describing how the work should be organized--as a
> continuous text with meaningless page breaks (as with Matthews) or a
> page-per-entry format.  The export could then read that configuration
> and adjust the TEI XML it generates accordingly.
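
A rough Python sketch of what such a switch could look like follows.  It
is not FromThePage code: the option names, the page dictionary keys, and
the decision to put the whole continuous text in one paragraph are all
simplifications I chose for illustration.

```python
import xml.etree.ElementTree as ET


def export_body(pages, layout):
    """Build a <body> from page records according to the chosen layout.

    pages  -- list of dicts with 'n', 'facs', 'title' and 'text' keys
    layout -- 'page_per_entry' or 'continuous' (hypothetical option names)
    """
    body = ET.Element("body")
    if layout == "page_per_entry":
        # Each page is its own division, with its heading kept as <fw>.
        for page in pages:
            div = ET.SubElement(body, "div")
            ET.SubElement(div, "pb", n=page["n"], facs=page["facs"])
            ET.SubElement(div, "fw").text = page["title"]
            ET.SubElement(div, "p").text = page["text"]
    elif layout == "continuous":
        # Pages shrink to <pb/> milestones inside one running text.
        paragraph = ET.SubElement(ET.SubElement(body, "div"), "p")
        for page in pages:
            pb = ET.SubElement(paragraph, "pb", n=page["n"], facs=page["facs"])
            pb.tail = page["text"]
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    return body


if __name__ == "__main__":
    sample = [
        {"n": "1", "facs": "p1.jpg", "title": "Page 1", "text": "first "},
        {"n": "2", "facs": "p2.jpg", "title": "Page 2", "text": "second"},
    ]
    for layout in ("page_per_entry", "continuous"):
        print(ET.tostring(export_body(sample, layout), encoding="unicode"))
```

A real continuous export would also need entry and paragraph boundaries,
which the page records alone do not provide.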
>
>
> Another option is to abandon page-level organization altogether and
> introduce section headings to the UI to describe the beginning of an entry
> or section.  MediaWiki has a pretty simple and popular syntax for section
> headings, and as I've borrowed interface elements from them before,
> that may be a good approach.  (I know that Wikisource has dealt with the
> paragraph vs. page clash through their transclusion feature, which is
> another hopeful sign.)
>
> Letting users specify section headings would also help with documents
> in which an entry may span pages or a page may contain multiple entries.
> There are lots of things I could do with that outside of TEI goals,
> contextualizing subjects with dates or places to mine the results for
> timelines and maps.
>
> Thanks again for all of your suggestions.  As before, I welcome any
> further comments.
>
> Ben Brumfield
> http://manuscripttranscription.blogspot.com/
>
>
> Peter Robinson
> Bateman Professor of English
> #311, Arts Building, 9 Campus Drive, University of Saskatchewan
> Saskatoon SK S7N 5A5, Canada
> ph. (+1) 306 966 5491