Source TEI documents with listPerson data

Source TEI documents with listPerson data

Richard Light

Hi,

Following on from the previous discussion (and turning things on their head), I'm thinking of developing a service which harvests listPerson data and publishes it as a Linked Data resource.  I've found the TEI examples page:

https://wiki.tei-c.org/index.php/Samples_of_TEI_texts

but this isn't massively helpful in giving me programmatic access to a set of freely licensed TEI documents to scan.  Is there a 'VoID sitemap' of the TEI world, and if not, shouldn't there be?

Richard

--
Richard Light

Re: Source TEI documents with listPerson data

Richard Light

Lou,

Fair question: it's something from the Linked Data world (which I would like the TEI community to be a part of, in some sense):

https://en.wikipedia.org/wiki/VoID

https://www.w3.org/TR/void/

They're talking there about referencing RDF datasets, but the principle could equally apply to TEI documents.  Think XML Catalogs with additional useful metadata about each document.
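
To make that concrete, here is a rough sketch (in Python, using rdflib) of the kind of VoID entry I have in mind for a single TEI dataset; the dataset URI, title, licence and dump location are all invented for illustration:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
g.bind("void", VOID)
g.bind("dcterms", DCTERMS)

# A hypothetical TEI dataset, described much as VoID would describe an RDF dataset.
ds = URIRef("http://example.org/tei/letters#dataset")
g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCTERMS.title, Literal("Example TEI P5 edition of nineteenth-century letters")))
g.add((ds, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.add((ds, DCTERMS.format, Literal("application/tei+xml")))
# Where a harvester can fetch the documents themselves.
g.add((ds, VOID.dataDump, URIRef("http://example.org/tei/letters.tar.gz")))

print(g.serialize(format="turtle"))

A 'VoID sitemap' of the TEI world would then be little more than a collection of such descriptions, one per published dataset.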

The core idea with Linked Data is that entities of interest have fixed, dereferenceable URIs. Dereferenceable means that you can request the URI (possibly in a special way) and get machine-processible data back.  I would like to see TEI documents having this level of accessibility.  It needn't be hard: just post the XML document on the web and publish its URL.
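
The 'special way' is usually nothing more exotic than content negotiation; a minimal sketch, with an invented URI:

import requests

# A hypothetical dereferenceable URI for a TEI document.
uri = "http://example.org/texts/letter-1821-03-04"

# Content negotiation: ask for TEI XML rather than an HTML landing page.
resp = requests.get(uri, headers={"Accept": "application/tei+xml, application/xml;q=0.9"})
resp.raise_for_status()
print(resp.headers.get("Content-Type"))
print(resp.text[:200])  # should be machine-processible TEI, not HTML, if the server cooperates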

When I went looking for examples of <listPerson> last night, I found the TEI samples page [1]. I have no idea which of the resources listed there contains <listPerson> examples, so I had to go down the page and:

  • read the descriptive text, to judge whether it was worth having a look at the resource
  • check the licence information (where given) to see if I could use it, were it to be useful
  • follow the link to the resource
  • [typically] unzip the .gz archive it points to, and then extract the tar archive it contains
  • explore the directory structure and files I have now found, opening each individually in my favourite XML editor
  • scan each XML document for <listPerson> elements (mostly without success)

This is hardly 'programmatic access'.  A key idea with Linked Data is that you (or rather a software agent) can 'follow its nose', looking up Linked Data URIs, parsing the resources they point to, finding mentions of further resources within the first result and moving on to those.  It's treating the Web as a distributed database, and I would like TEI documents to be BLOB fields within that distributed database.  One obvious use case would be Open Annotation [2].
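
A minimal 'follow its nose' sketch, assuming (these are my assumptions for illustration, not a description of any existing corpus) that the documents are plain TEI XML on the web and that onward links live in the @target attributes of <ptr> and <ref>:

from collections import deque

import requests
from lxml import etree

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

def crawl(seed_uris, limit=50):
    """Dereference TEI documents, harvest <listPerson> elements, follow @target links."""
    seen, queue, harvested = set(), deque(seed_uris), []
    while queue and len(seen) < limit:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        try:
            resp = requests.get(uri, headers={"Accept": "application/tei+xml, application/xml"}, timeout=30)
            resp.raise_for_status()
            doc = etree.fromstring(resp.content)
        except (requests.RequestException, etree.XMLSyntaxError):
            continue
        harvested.extend(doc.xpath("//tei:listPerson", namespaces=TEI))
        # Follow our nose: any @target on <ptr> or <ref> may point to a further resource.
        for target in doc.xpath("//tei:ptr/@target | //tei:ref/@target", namespaces=TEI):
            if target.startswith("http"):
                queue.append(target)
    return harvested

# e.g. lists = crawl(["http://example.org/corpus/index.xml"])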

Best wishes,

Richard

[1] https://wiki.tei-c.org/index.php/Samples_of_TEI_texts

[2] https://www.w3.org/community/openannotation/


On 21/11/2017 23:17, Lou Burnard wrote:
On 21/11/17 21:30, Richard Light wrote:
  Is there a 'VoID sitemap' of the TEI world, and if not, shouldn't there be?


Hello Richard

OK, I'll bite. WTF is a "VoID sitemap" ?



--
Richard Light

Re: Source TEI documents with listPerson data

Lou Burnard



On 22/11/17 09:18, Richard Light wrote:

The core idea with Linked Data is that entities of interest have fixed,
dereferenceable URIs. Dereferenceable means that you can request the URI
(possibly in a special way) and get machine-processible data back.  I
would like to see TEI documents having this level of accessibility.  It
needn't be hard: just post the XML document on the web and publish its URL.

I vaguely remember that Stuart Yeates had a scripted procedure for automatically detecting TEI files available on the internet a few years back (his project at https://github.com/stuartyeates/sampler is mentioned on the wiki page you quote) -- he might have some helpful comments here. The major problem I see, in practice, is the strange reluctance many TEI projects still have to expose their TEI source directly.


When I went looking for examples of <listPerson> last night, I found the
TEI samples page [1]. I have no idea which of the resources listed there
contains <listPerson> examples,

Well, of course, if you download the source you could always run an XPath query to find this pretty quickly. You don't even need to download the whole resource, if you can believe what its <tagUsage> element says.
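
Something like the following, for instance (the directory name is arbitrary, and whether to trust <tagUsage> rather than counting is up to you):

from pathlib import Path

from lxml import etree

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

def listperson_count(path):
    """Prefer the header's <tagUsage gi="listPerson"> claim; otherwise count the elements."""
    doc = etree.parse(str(path))
    declared = doc.xpath("//tei:tagsDecl//tei:tagUsage[@gi='listPerson']/@occurs", namespaces=TEI)
    if declared:
        return int(declared[0])
    return int(doc.xpath("count(//tei:listPerson)", namespaces=TEI))

for xml_file in Path("unpacked-samples").rglob("*.xml"):
    try:
        n = listperson_count(xml_file)
    except etree.XMLSyntaxError:
        continue
    if n:
        print(f"{xml_file}: {n} listPerson element(s)")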
 

Re: Source TEI documents with listPerson data

Paterson, Duncan
In reply to this post by Richard Light
Wouldn’t Tapas http://www.tapasproject.org be a natural candidate for a VoID sitemap?
I would very much welcome something along these lines.

Re: Source TEI documents with listPerson data

Richard Light

On 24/11/2017 11:42, Paterson, Duncan wrote:
Wouldn’t Tapas http://www.tapasproject.org be a natural candidate for a VoID sitemap?
I would very much welcome something along these lines.
Duncan,

One would think so.  Though a quick glance at this resource suggests that its name is sadly indicative of the content - lots of 'samples' in there.  Also the advanced search falls over when asked to find things.

Richard

--
Richard Light

Re: Source TEI documents with listPerson data

Elisa Beshero-Bondar
Dear Richard and list,
There has been some incentive to work on interchangeability of data from digital edition projects lately, and without getting into details I can think of an Andrew Mellon grant opportunity whose deadline I missed last July that might have incentivized the construction of shared practices within a community of datasets—at least that was how I’d have interpreted the grant. As I understood it, we might consider a community to be a topical area that works with data from digital editions and archives in much the same way and relies on each other’s work for its scholarly discourse, where federated searching would make sense. The problem with imagining the entire TEI community’s prosopography data as part of a single linked open data set is that we don’t always store the same kinds of data in the same fashion—TEI gives us perhaps a little too much freedom for that, although it’s certainly possible to build a kind of digital scaffolding that might communicate by building “cross-walks”. I imagine that most of us are committed by institutions and funding models to concentrate on building our own archives in our own ways, and while we recognize the benefits of federated searching and linked open data, we haven’t project-by-project expressed a practical commitment to it, though we should.

This is something I know that I want to work on in the projects I’m part of. The Digital Mitford project, my most distributed project (shared with participants at multiple institutions and no single institution as its home), has as its backbone, and probably its most compelling raison d’être, a prosopography list that we’re developing from named entities we pull from letters, poetry, drama, and other literary texts—with historical persons, fictional and archetypal characters, and even named animals as “persons”, as well as places, named events, and many kinds of named documents: a set of thousands of entries that serves as the central “nexus” point for studying and interlinking the digital editions we’re preparing. In some ways, we’ve probably spent just as much time compiling, pruning, correcting, de-duping, and disambiguating this giant mess of a prosopography “spinal column” as we’ve been able to devote to our TEI representation of manuscripts and published documents. (See, for my part, http://digitalmitford.org/si.xml .) I’m sure our project isn’t alone in the effort we’ve devoted, or in the inconsistencies within it. What’s specifically scary to me about RDF is the kind of commitment it requires—one’s data, it seems, must ideally be “finished” and reliable to contribute reliably to the LOD world, when we’re always in progress and revising and checking for errors. I’d like to be building the scaffolding to make my data linkable, with the freedom to make repairs—and perhaps that’s the reason I study what I need to do to contribute to RDF and then step back a while until I’ve identified the data that I know is most reliable for that linking.

Were we somehow to incentivize interchangeability, I would caution us against requiring all TEI projects to express linkages the same way. There’s a lot of diversity in our code base, related to the many different eras and kinds of “documents” (or marked structures, monuments, etc.) that we work with. I’d rather see regions and topics emerge that share ways of expressing the linkability of their data—in ways that make it easier for us to work together within, say, our “kinship network” of related projects. I wonder, Richard, if you have advice or thoughts on this!

Cheers,
Elisa
-- 
Elisa Beshero-Bondar, PhD
Director, Center for the Digital Text | Associate Professor of English
University of Pittsburgh at Greensburg | Humanities Division
150 Finoli Drive
Greensburg, PA  15601  USA
E-mail:[hidden email]
Development site: http://newtfire.org


Re: Source TEI documents with listPerson data

Richard Light

On 25/11/2017 15:20, Elisa Beshero-Bondar wrote:
Dear Richard and list,
There has been some incentive to work on interchangeability of data from digital edition projects lately, and without getting into details I can think of an Andrew Mellon grant opportunity whose deadline I missed last July that might have incentivized the construction of shared practices within a community of datasets—at least that was how I’d have interpreted the grant. As I understood it, we might consider a community to be a topical area that works with data from digital editions and archives in much the same way and relies on each other’s work for its scholarly discourse, where federated searching would make sense. The problem with imagining the entire TEI community’s prosopography data as part of a single linked open data set is that we don’t always store the same kinds of data in the same fashion—TEI gives us perhaps a little too much freedom for that, although it’s certainly possible to build a kind of digital scaffolding that might communicate by building “cross-walks”. I imagine that most of us are committed by institutions and funding models to concentrate on building our own archives in our own ways, and while we recognize the benefits of federated searching and linked open data, we haven’t project-by-project expressed a practical commitment to it, though we should.
I think that diversity is a strength, not a problem.  The challenge, as I see it, is to establish as much of a shared frame of reference as is feasible and desirable. 

In this case, we want enough information about each person to enable us to make a judgement as to whether the 'George Abbot' mentioned by another project is the same person as 'our' George Abbot.  Ideally the biographical details will be provided in a machine-processible format, such that software agents can make that judgement (at least on a probabilistic basis).  Deathless prose would, however, be a good second best.  The commitment to a common framework should, I suggest, be stronger where there is shared interest in the material. 
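
Purely by way of illustration (a toy heuristic, not a proposal for any particular matching algorithm), the sort of probabilistic judgement I have in mind might start out as crudely as this:

from difflib import SequenceMatcher

def match_score(ours, theirs):
    """Crude likelihood that two person records describe the same individual.

    Each record is a dict with at least a 'name', and optionally 'birth'/'death' years.
    """
    score = SequenceMatcher(None, ours["name"].lower(), theirs["name"].lower()).ratio()
    for key in ("birth", "death"):
        if ours.get(key) and theirs.get(key):
            score += 0.5 if ours[key] == theirs[key] else -0.5
    return score

a = {"name": "George Abbot", "birth": 1562, "death": 1633}
b = {"name": "Abbot, George (Archbishop of Canterbury)", "birth": 1562}
print(match_score(a, b))  # the higher the score, the more likely these are the same George Abbot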

The world of literature, and the people who inhabit it, would be an obvious use case for sharing in this way.

This is something I know that I want to work on in the projects I’m part of. The Digital Mitford project, my most distributed project (shared with participants at multiple institutions and no single institution as its home), has as its backbone, and probably its most compelling raison d’être, a prosopography list that we’re developing from named entities we pull from letters, poetry, drama, and other literary texts—with historical persons, fictional and archetypal characters, and even named animals as “persons”, as well as places, named events, and many kinds of named documents: a set of thousands of entries that serves as the central “nexus” point for studying and interlinking the digital editions we’re preparing. In some ways, we’ve probably spent just as much time compiling, pruning, correcting, de-duping, and disambiguating this giant mess of a prosopography “spinal column” as we’ve been able to devote to our TEI representation of manuscripts and published documents. (See, for my part, http://digitalmitford.org/si.xml .) I’m sure our project isn’t alone in the effort we’ve devoted, or in the inconsistencies within it. What’s specifically scary to me about RDF is the kind of commitment it requires—one’s data, it seems, must ideally be “finished” and reliable to contribute reliably to the LOD world, when we’re always in progress and revising and checking for errors. I’d like to be building the scaffolding to make my data linkable, with the freedom to make repairs—and perhaps that’s the reason I study what I need to do to contribute to RDF and then step back a while until I’ve identified the data that I know is most reliable for that linking.
Thanks for the link: I will have a go at processing it when time permits.

There is no need for a death-or-glory leap in the direction of RDF.  What I am currently doing is to grab a <listPerson> element as an XML document, and then import its content into our Modes database, converting each <person> element into a free-standing XML record/document in a biographical format.  An XSLT transform converts the TEI markup into our 'person' application markup.  As and when I look to publish this resource as Linked Data, another XSLT transform will convert this XML to suitably structured RDF on the fly.  That transform can be adjusted over time as and when the requirements for RDF evolve.  So the data has only made a transition from one XML format to another.
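
For what the final step might look like, here is a sketch of one possible TEI-to-RDF mapping (not the Modes transform described above, and with invented person data and URIs), rendered in Python rather than XSLT for brevity:

from lxml import etree
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
SCHEMA = Namespace("http://schema.org/")

# An invented <person>, of the kind found inside a TEI <listPerson>.
tei_person = etree.fromstring("""
<person xmlns="http://www.tei-c.org/ns/1.0" xml:id="GAbbot">
  <persName>George Abbot</persName>
  <birth when="1562"/>
  <death when="1633"/>
</person>""")

g = Graph()
g.bind("foaf", FOAF)
g.bind("schema", SCHEMA)

xml_id = tei_person.get("{http://www.w3.org/XML/1998/namespace}id")
subject = URIRef("http://example.org/people/" + xml_id)
g.add((subject, RDF.type, FOAF.Person))
g.add((subject, FOAF.name, Literal(tei_person.findtext("tei:persName", namespaces=TEI))))
for tag, prop in (("tei:birth", SCHEMA.birthDate), ("tei:death", SCHEMA.deathDate)):
    el = tei_person.find(tag, namespaces=TEI)
    if el is not None and el.get("when"):
        g.add((subject, prop, Literal(el.get("when"))))

print(g.serialize(format="turtle"))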

Were we somehow to incentivize interchangeability, I would caution us against requiring all TEI projects to express linkages the same way. There’s a lot of diversity in our code base, related to the many different eras and kinds of “documents” (or marked structures, monuments, etc.) that we work with. I’d rather see regions and topics emerge that share ways of expressing the linkability of their data—in ways that make it easier for us to work together within, say, our “kinship network” of related projects. I wonder, Richard, if you have advice or thoughts on this!
I'm involved with the Linked Pasts project/workgroup (an outgrowth of the Pelagios project), which is aiming to develop an 'interconnection format' for person data to complement the format it is successfully using for historical gazetteers. This works on the assumption that each project will work in its own way, but that there is a core set of data which it is highly desirable to have for the purposes of disambiguating places (or people).

Best wishes,

Richard

--
Richard Light