looking for advice on creating external information resources for an edition

looking for advice on creating external information resources for an edition

ron.vandenbranden
Administrator
Hi,

A digital edition project we're assisting in is trying to engage a
limited group of volunteers for transcribing letters, creating basic
annotations, and identifying named entities in the texts. The project
context is quite challenging, since budget is limited, the project is
building on previous efforts which had produced basic Word
transcriptions, and the volunteers are domain experts without any desire
to extend their human-computer interaction beyond basic office software.
Hence, we've tried to accommodate this in a word processor-based
workflow for the "volunteer phase", after which the docs will be
transformed and the TEI life of these texts begins.

Against this backdrop, I'm looking for a way to enable the volunteers to
identify named entities in the texts. Transcription-wise, I think this
is feasible in a word processor by marking them up as hyperlinks and
having the URL point at least to a unique ID code, or ideally to a valid
URL where the available information can be viewed. Either the ID code or
a field in each record can then contain type information that could help
in the transformation to TEI. Of course, this requires an external data
source to link to, and this is still a major concern. Given the fact
that a lot of these names won't feature in any existing resources, and
that they will probably require specific information tailored to this
edition anyhow, it seems to make most sense to construct a
project-specific resource which can hold the required information for
the different entities (persons, organizations, places, titles,...), of
course providing space for pointing to existing resources when
available. Ideally, the volunteers would be able to look up whether an
entry exists already, copy the ID/URL and use it in the transcription;
or create a new one for people, places, ... that haven't been described
yet. Also, if needed, it should be possible to edit existing
descriptions if e.g. more information becomes available along the way
(of course, without touching the original ID/URL). On top of that,
querying and entering new information should ideally be as intuitive as
possible for the volunteers. Summing up, the main requirements would be:
     -collaborative: volunteers should be able to create and/or modify
entries when needed
     -intuitive input/query form
     -ability to import existing data + export (to CSV or XLSX)
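
To make the hyperlink idea above concrete: below is a minimal sketch (my
own illustration, not project code) of how such hyperlinks could later be
harvested from the DOCX files with Python's standard library. It assumes
an invented URL pattern like https://example.org/entity/person/p0001,
from which the type and ID can be parsed:

    import zipfile
    import xml.etree.ElementTree as ET

    # WordprocessingML and OPC relationship namespaces.
    W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
    REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"

    def entity_links(docx_path):
        # Yield (display text, target URL) for each hyperlink in a DOCX.
        with zipfile.ZipFile(docx_path) as z:
            # Hyperlink targets live in the relationships part, keyed by r:id.
            rels = ET.fromstring(z.read("word/_rels/document.xml.rels"))
            targets = {rel.get("Id"): rel.get("Target")
                       for rel in rels.iter(REL + "Relationship")}
            doc = ET.fromstring(z.read("word/document.xml"))
            for link in doc.iter(W + "hyperlink"):
                url = targets.get(link.get(R + "id"))
                text = "".join(t.text or "" for t in link.iter(W + "t"))
                if url:
                    yield text, url

    # Invented URL pattern: https://example.org/entity/<type>/<id>
    for text, url in entity_links("letter_001.docx"):
        _, ent_type, ent_id = url.rstrip("/").rsplit("/", 2)
        print(text, ent_type, ent_id)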

Since a basic spreadsheet could be sufficient to store this information
(e.g. different sheets per name type, with name-specific information
fields in separate columns), I've been looking into Google Sheets, but
I'm not sure whether it allows viewing individual "records" (i.e. rows in
a sheet), or whether the Forms component offers everything needed to
query/create/edit existing records.
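
For what it's worth, even a plain CSV export from such a sheet would
already get us most of the way to TEI. A rough sketch, with invented
column names (id, name, note, external_uri):

    import csv
    from xml.sax.saxutils import escape, quoteattr

    def csv_to_listperson(path):
        # Build a TEI <listPerson> from a CSV export of the person sheet.
        # Invented columns: id, name, note, external_uri.
        out = ["<listPerson>"]
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                out.append("  <person xml:id=%s>" % quoteattr(row["id"]))
                if row.get("external_uri"):
                    out.append("    <persName ref=%s>%s</persName>"
                               % (quoteattr(row["external_uri"]),
                                  escape(row["name"])))
                else:
                    out.append("    <persName>%s</persName>" % escape(row["name"]))
                if row.get("note"):
                    out.append("    <note>%s</note>" % escape(row["note"]))
                out.append("  </person>")
        out.append("</listPerson>")
        return "\n".join(out)

    print(csv_to_listperson("persons.csv"))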

I realize this is probably a terribly basic and peripheral question
which I've long hesitated to ask here, but how do others do this (after
all, it's such a basic component of any edition project)? We've been
advised to look in a MediaWiki direction, but that seems too
unstructured: hard to navigate existing information, and quite complex
for entering new information.

As might be clear at this point, (non-XML) databases are not my field of
expertise; we don't have any IT-departmental backup, and I'm a bit at a
loss. Are there any known lightweight (and preferably free) solutions
available for facilitating this task? Or what would be the most sensible
direction to look into?

Many thanks for any advice,

Ron

Re: looking for advice on creating external information resources for an edition

Mandell, Laura
Dear Ron:

I don't know exactly what you are looking for, but CWRC-Writer takes in TEI documents and then allows users to tag entities using named-authority look-ups, including VIAF. The entities are saved, I believe, as standoff markup, or as markup embedded in the TEI when no hierarchies overlap. You could easily train students to use the tool and load your documents into it. You could contact Susan Brown and Kim Martin about this; they are copied above.

Best, Laura

--
Laura Mandell
Professor of English
Interim Director, Melbern G. Glasscock Center for Humanities Research
http://glasscock.tamu.edu/
Director, Initiative for Digital Humanities, Media, and Culture
http://idhmc.tamu.edu
513-560-7860

Re: looking for advice on creating external information resources for an edition

Omar Siam-2
In reply to this post by ron.vandenbranden
Hi,

For a project of ours (https://mecmua.acdh.oeaw.ac.at) I used an
approach where I (ab)used Word's comments function for annotation data
and character styles for the type of entity. The DOCX documents were
processed using a customized variant of the TEI XSL stylesheets
(https://github.com/simar0at/TEI-Stylesheets/tree/mecmua) in oXygen XML.
There was no central data source for the entities, only the rule that an
entity should be annotated with all available data at its first
occurrence in the document, and at any further occurrence with as much
information as is needed to distinguish it from an entity with the same
name.
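
For anyone who wants to experiment without the full stylesheets, here is
a stripped-down sketch (not our project's actual code) of pairing each
Word comment with the text it annotates, using only the Python standard
library:

    import zipfile
    import xml.etree.ElementTree as ET

    W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

    def comments_with_anchors(docx_path):
        # Pair each comment's text with the document text it annotates.
        # Assumes the document actually contains comments (word/comments.xml).
        with zipfile.ZipFile(docx_path) as z:
            comments = ET.fromstring(z.read("word/comments.xml"))
            notes = {c.get(W + "id"): "".join(t.text or "" for t in c.iter(W + "t"))
                     for c in comments.iter(W + "comment")}
            doc = ET.fromstring(z.read("word/document.xml"))
            open_ids, anchors = set(), {}
            # Walk in document order, collecting runs between range markers.
            for el in doc.iter():
                if el.tag == W + "commentRangeStart":
                    open_ids.add(el.get(W + "id"))
                elif el.tag == W + "commentRangeEnd":
                    open_ids.discard(el.get(W + "id"))
                elif el.tag == W + "t":
                    for cid in open_ids:
                        anchors[cid] = anchors.get(cid, "") + (el.text or "")
            return [(anchors.get(cid, ""), note) for cid, note in notes.items()]

    for anchor, note in comments_with_anchors("annotated.docx"):
        print(repr(anchor), "->", note)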

This approach worked somewhat for two domain experts annotating in Word,
but in the end a lot of proofreading of the annotations was necessary,
because the annotators made mistakes that were hardly visible.

In hindsight, I would have invested more time in customizing Word for
stricter input checking and suggestions.

If you find this useful, please take it as inspiration.

Best regards

Omar Siam

Re: looking for advice on creating external information resources for an edition

Ben Brumfield
In reply to this post by ron.vandenbranden
Dear Ron,

Let me offer three projects from my own experience that did low-cost entity mark-up, which might be helpful.

The Civil War Governors of Kentucky Digital Documentary Edition had transcribed correspondence in TEI-XML created using a combination of DocTracker and Oxygen.  These were available for further mark-up in an early access website[1] (based on Omeka, with the documents lightly converted into HTML) as well as in XML source residing in a GitHub repository.  Their goal was to mark up the people, places, organizations and geographic features mentioned within the documents, to identify and document those entities, and to connect the references within the documents to the entries on the entities themselves.

We had graduate students use Hypothes.is to mark up the entities within each document on the Early Access site.  We then wrote an open-source system [2][3] to programmatically ingest the Hypothes.is annotations and present them for identification and documentation.  We are in the very last stages of the project now, publishing the entities, their biographies and bibliographies, the links between documents and entities, and the network visualization of entities and their relationships.  You can see more detail about the project at the presentation we gave at DH2017 this year.[4]

I don't know enough about your project's resources, but Hypothes.is was an easy, inexpensive way to do the mark-up itself, and if installing Mashbill (and modifying it to remove the CWGK-specific code) is too much, you might ask your users to put URIs for entities hosted elsewhere into the annotation bodies.
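
The ingest side of this is small, for what it's worth. A minimal sketch
(not Mashbill itself) of pulling the public annotations on a page via the
Hypothes.is search API, with a made-up document URL:

    import json
    import urllib.parse
    import urllib.request

    API = "https://api.hypothes.is/api/search"

    def fetch_annotations(page_uri, limit=50):
        # Query the public search endpoint for annotations on one page.
        query = urllib.parse.urlencode({"uri": page_uri, "limit": limit})
        with urllib.request.urlopen(API + "?" + query) as resp:
            return json.load(resp)["rows"]

    # Made-up document URL, for illustration only.
    for row in fetch_annotations("https://example.org/early-access/doc-1"):
        # Each annotation target may carry a TextQuoteSelector holding the
        # exact string the annotator highlighted.
        quotes = [sel["exact"]
                  for target in row.get("target", [])
                  for sel in target.get("selector", [])
                  if sel.get("type") == "TextQuoteSelector"]
        print(quotes, "->", row.get("text", ""))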

======

FromThePage[5] is an open-source[6] collaborative transcription and annotation platform I developed to do almost exactly what your project is attempting.  Users are presented with document facsimiles on a webpage and transcribe them into a data-entry box next to the facsimile image.  The mark-up allowed is limited compared with the richness of TEI, but the system is optimized for entity tagging, identification, documentation and indexing.  Users use wiki-links to mark up entities mentioned within a transcript, specifying a canonical name for the entity and the verbatim text within the document referring to it[7], as [[canonical name|verbatim text]] (e.g. [[Sally Smith Jones (1756-1823)|Rev. Jones wife]]).  When users save a page containing linked subjects, a database record for the subject is created if it does not already exist, and an index entry is created linking the page to the subject.  All of these are visible as HTML links[8] and are transformed into rs and person tags in the TEI export.
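
As a rough illustration (not FromThePage's actual implementation), the
wiki-link convention can be expanded into TEI rs references in a few
lines of Python; the slug scheme here is invented:

    import re

    LINK = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")

    def slug(name):
        # Invented ID scheme: lowercase the canonical name, hyphenate.
        return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

    def wiki_links_to_tei(text):
        # Replace each [[canonical|verbatim]] link with a TEI <rs> element.
        return LINK.sub(
            lambda m: '<rs ref="#%s">%s</rs>' % (slug(m.group(1)), m.group(2)),
            text)

    print(wiki_links_to_tei(
        "Dined with [[Sally Smith Jones (1756-1823)|Rev. Jones wife]]."))
    # -> Dined with <rs ref="#sally-smith-jones-1756-1823">Rev. Jones wife</rs>.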

I am biased, of course, but I'd think this platform solves the use cases you've described.  I'm not sure how you'd convince your transcribers to move from Word to the web, however.

======

Another option might be to cut-and-paste the existing transcripts into hosted wiki sites like PBworks or Wikia.  The transcripts could be linked to articles about subjects using wiki-links (as in FromThePage).  This would be pretty low cost, but the big challenge there would be getting the data back out again, so you'd want to figure that out first.  You'd also face the challenge of getting your users to start using the web.

Best of luck,

Ben

[1] Early Access publication: http://discovery.civilwargovernors.org/
[2] CWGK description of Mashbill: http://discovery.civilwargovernors.org/mashbill
[3] Source code for Mashbill: https://github.com/CivilWarGovernorsOfKentucky/Mashbill
[4] "Beyond Coocurrence: Network Visualization in the Civil War Governors of Kentucky Digital Documentary Edition" http://manuscripttranscription.blogspot.com/2017/08/beyond-coocurrence.html
[5] Commercially hosted version: https://fromthepage.com/
[6] Source code for FromThePage: https://github.com/benwbrum/fromthepage
[7] More detail on wiki-links at "Wiki-links in FromThePage": http://manuscripttranscription.blogspot.com/2014/03/wiki-links-in-fromthepage.html
[8] See links at https://fromthepage.com/yaquinalights/1871-1900-yaquina-head-lighthouse-letter-books/vol-439-cook-appt-1875/display/17170

--
Ben W. Brumfield
Partner. Brumfield Labs LLC
Creators of FromThePage

Re: looking for advice on creating external information resources for an edition

ron.vandenbranden
Administrator
Hi all,

First off, sincere apologies for my late response, which does not imply
any lack of appreciation at all for your kind replies. It's nice to see
such a wide variety of "input workflows" for TEI projects in only a
couple of reactions!

Concerning the "non-Word perspectives": thanks, Laura, for putting
CWRC-Writer on my radar! The demo at http://208.75.74.217 looks
impressive (with very elegant in-browser solutions for XML editing
operations and a potential for intuitive and rich in-browser TEI
markup), and the documentation of the GitHub repository/ies suggests it
should be possible to self-host the system. I'll keep an eye on the
project, and hope to find the time to experiment. Ditto for Ben's wealth
of links and projects, which I'm still digesting, many thanks!
Hypothes.is looks like a wonderful tool; it's great to see a clever
example of how to use it in a project. From your description of Mashbill
and FromThePage, these seem spot-on and make me want to try them all at
once. It's amazing and refreshing to see such great annotation-oriented
projects integrating annotations and identification/description of named
entities. In the current project, we're still in a pilot phase for a
limited corpus, so if we can find better ways for the input workflow,
this is definitely worth investigating. For one thing, even though the
project budget is limited, we do have access to a dedicated server, so
there is some hope here...

Last but not least, thanks, Omar, for sharing your code and thoughts on
the Word approach in your project. Your "discovery" method seems clever,
though probably only usable in an even more tightly controlled input
scenario than ours. I realize your words of caution hold for any
approach trying to get structured output from a completely loose input
environment like a word processor.

All this said, I guess what I'm looking for in the short run is an
easy/intuitive way to create an external data source for identifying and
describing persons, places, etc. (which in a TEI workflow would
typically take the shape of person, place, ... listings in separate TEI
files), whose records can be linked to from Word transcriptions. A
collaborative spreadsheet would probably be sufficient for representing
the data, but falls short w.r.t. ease of input and ways to query and
view individual records. Still investigating!

Kind regards,

Ron