Test projects needed for OCR->TEI

Ben Brumfield

I'm getting ready to extend FromThePage to support OCR correction, and am looking for collaborators to test and inform the design of this feature. 

Background

FromThePage originated as an open-source tool for simple editions, supporting transcribing, indexing and annotating manuscript material.  In 2010, I added support for works hosted on the Internet Archive, and in 2013 created a draft TEI-XML export.

Over the next few weeks, I'll be adding the ability for FromThePage to ingest the OCR automatically performed by the Internet Archive, so that editors of print sources won't need to start with a blank transcript.  I'll also be revisiting the TEI-XML export format to incorporate some of the suggestions made on this list and add support for the needs of printed sources.

Projects Needed

I'm looking for people who are creating TEI-XML editions from print sources to try out the new feature.

You'll provide:

  • Out-of-copyright, public, printed source texts that have already been scanned.  (It would help if the text were already on the Internet Archive, but I can help you with the upload if you have your own image files.)
  • Time and labor to correct OCR errors in the text within FromThePage.
  • Opinions about TEI-XML and the appropriate mark-up for your text.
  • Patience for dealing with a feature in development.

I'll provide:

  • A platform for getting your edition from facsimile to TEI-XML.
  • Help dealing with scanned images and the Internet Archive.
  • A mechanism for incorporating your ideas about the right way to do TEI into an open-source tool.
  • Technical support using FromThePage.
  • TEI-XML P5-compliant versions of your text.

Anyone interested in participating (or in offering advice from the side-lines) should contact me directly at [hidden email]

Of course, I welcome comments here on the list as well.

Ben W. Brumfield
http://manuscripttranscription.blogspot.com/
http://fromthepage.com/