TEI + spaCy


Andrew Janco
Following our spaCy workshop at DH2019, David Lassner ([hidden email]) and I are interested in developing new tools to facilitate work with TEI and the spaCy NLP library.  We don't want to replicate existing tools and we would like to meet community needs as best we can. We would welcome all comments, critique, and requests from the TEI community.  Thank you in advance and we look forward to hearing from you!  

Our initial pitch:
spacy-tei bridges the gap between TEI XML and the spaCy processing pipeline. We plan for spacy-tei to be either a library for loading, processing, and saving TEI documents, or a fine-tuned spaCy model with a custom component for loading and storing TEI XML.

Either way, when spacy-tei loads a TEI file, it automatically updates the spaCy language model with the TEI tags and attributes found in the file. Document metadata and headers are added to the spaCy doc object. Tokens and spans are updated to include relevant data from the TEI markup. Following named entity recognition, text categorization, and other operations, the updated doc object can then be saved back to valid TEI.
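
To make the loading step concrete, here is a minimal sketch of how TEI markup might be flattened into plain text plus character offsets, which is the form spaCy needs in order to attach spans. It is not the spacy-tei implementation (none exists yet). The names `load_tei`, `flatten`, `ENTITY_TAGS`, and the sample document are all hypothetical, and the sketch uses only the Python standard library. It reads only the title and the first paragraph of the body.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical choice of TEI elements to treat as entity-like spans.
ENTITY_TAGS = {"persName", "placeName", "orgName"}

# Hypothetical sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><persName>Ada Lovelace</persName> wrote to <placeName>London</placeName>.</p>
    </body>
  </text>
</TEI>"""


def flatten(elem, parts, spans, pos=0):
    """Walk elem depth-first, collecting its text into parts and
    recording (start, end, tag) character offsets for every element."""
    start = pos
    if elem.text:
        parts.append(elem.text)
        pos += len(elem.text)
    for child in elem:
        pos = flatten(child, parts, spans, pos)
        if child.tail:  # text that follows the child inside elem
            parts.append(child.tail)
            pos += len(child.tail)
    tag = elem.tag.split("}")[-1]  # drop the namespace prefix
    spans.append((start, pos, tag))
    return pos


def load_tei(xml_string):
    """Return the title, plain text, and entity spans of the first <p>."""
    root = ET.fromstring(xml_string)
    title = root.find(f".//{{{TEI_NS}}}titleStmt/{{{TEI_NS}}}title")
    para = root.find(f".//{{{TEI_NS}}}body/{{{TEI_NS}}}p")
    parts, spans = [], []
    flatten(para, parts, spans)
    return {
        "title": title.text if title is not None else None,
        "text": "".join(parts),
        "spans": [s for s in spans if s[2] in ENTITY_TAGS],
    }


doc_data = load_tei(SAMPLE)
print(doc_data)
```

With spaCy installed, the resulting text could be passed to a pipeline, each `(start, end, tag)` triple handed to `Doc.char_span(start, end, label=tag)`, and the header metadata stored in `doc.user_data`. Note that `char_span` returns `None` when the offsets do not line up with token boundaries, so a real tool would need a policy for TEI elements that split tokens.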

         Andy Janco & David Lassner

Andrew Paul Janco
Digital Scholarship Librarian
Haverford College