Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Michael.Dahnke
Dear honourable list,

as part of http://kallimachos.de/kallimachos/index.php/Narragonien we're about to create lists of all people (fictional or non-fictional), places, and sources that are used by the authors or otherwise mentioned in the context of any copy. So far, so good: usual editorial practice. Our approach at this stage is to store the text (and the metadata for the header, of course) of every single copy of each edition in a separate XML file with its own header etc., i.e. one XML file is one copy.

Is it possible to write <listPerson>, <listPlace>, <listBibl> (or any further <list***> the Guidelines offer) in one separate XML file and to reference the entries within it from every ›copy‹ XML file?

Maybe, joint paragon of wisdom and knowledge, this seems a laughably simple, silly question not worth noticing, yet I'd be grateful for any hint of yours.

Humble knight thanks in advance

Michael Dahnke
--
Dr. Michael Dahnke
WMA zur Vermittlung von Digitalisierungskompetenz
Abteilung Digitalisierung
Kallimachos-Zentrum für Digital Humanities

Universitätsbibliothek Würzburg
Am Hubland
D-97074 Würzburg

Ruf: +49-931-31-88562
michael.dahnke@bibliothek.uni-wuerzburg.de

Re: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Elisa Beshero-Bondar-2
Dear Michael,
It sounds like what you plan is very similar to our procedure on the Digital Mitford project, and it seems a fairly common and useful practice--that is, to store prosopography and bibliography information in a single file (or perhaps a single cluster of files). Each entry in your various lists would be given a unique xml:id, and your encoding of mentions of people, places, books, etc. would point with @ref (and/or other pointing attributes) to that file as a canonical set of identifiers and source of centralized information. Your list entries will also likely refer to other entries in the file, and also point outward to good resources of information on the web. Here's a working, in-progress example of a file I designed, in case it's helpful: http://digitalmitford.org/si.xml

That file is sort of the "backbone" of our project, integrating our files together--it's a basis for designing our schema rules that ensure the correct id's are being referenced across our project files. Its individual entries are pulled into displays of annotations on editions published on our site. And when we want to study information about, say, how many artists were part of our network, or how many publications appeared from X location, we pull data and make graphs right out of that file. So such a file has many functions to serve in a big project.
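
To make the pointing mechanism concrete, here is a minimal sketch in Python. It is not the Digital Mitford setup: the file name lists.xml, the ids personA, placeA and personZ, and the helper names are all hypothetical. It collects every xml:id from a central list file and reports each @ref in a copy file that points to an id the central file does not define.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical central file (imagine it saved as lists.xml).
LISTS_XML = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <listPerson>
      <person xml:id="personA"><persName>A.</persName></person>
    </listPerson>
    <listPlace>
      <place xml:id="placeA"><placeName>B.</placeName></place>
    </listPlace>
  </body></text>
</TEI>"""

# Hypothetical copy file: one file per copy, pointing into lists.xml.
COPY_XML = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName ref="lists.xml#personA">A.</persName> went to
       <placeName ref="lists.xml#placeA">B.</placeName> with
       <persName ref="lists.xml#personZ">Z.</persName>.</p>
  </body></text>
</TEI>"""


def collect_ids(lists_xml):
    """Return the set of all xml:id values defined in the central file."""
    root = ET.fromstring(lists_xml)
    return {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}


def dangling_refs(copy_xml, known_ids, lists_file="lists.xml"):
    """Return every pointer into lists_file whose target id is unknown."""
    root = ET.fromstring(copy_xml)
    missing = []
    for el in root.iter():
        ref = el.get("ref")
        if ref is None:
            continue
        # @ref may hold several whitespace-separated pointers.
        for pointer in ref.split():
            target_file, _, fragment = pointer.partition("#")
            if target_file == lists_file and fragment not in known_ids:
                missing.append(pointer)
    return missing


if __name__ == "__main__":
    print(dangling_refs(COPY_XML, collect_ids(LISTS_XML)))
```

In a real project the same check is usually expressed as a schema or Schematron rule, so that it runs while the encoder is typing; the script only illustrates the idea.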

The work you are planning here is also, by the way, an excellent basis for contributing to the web of Linked Open Data, since the pointing of files to your lists constitutes expressing relationships that can be expressed in various "triple store" vocabularies: X is identical to Y. A is a member of B, etc. Many kinds of expressions of relationship can be pulled from the encoding you build here, and there are lots of great possibilities for visualizing and analyzing those relationships, too.  
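
As a rough illustration of how such statements could be pulled out of a central list file, the following Python sketch turns @sameAs attributes and <relation> elements into simple (subject, predicate, object) tuples. The file name lists.xml, the ids, the example.org address and the relation name memberOf are assumptions made for this example, and a real project would map the predicates to an established vocabulary.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical skeleton of a central list file; ids and URI are made up.
LISTS_XML = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <listPerson>
      <person xml:id="personA" sameAs="http://example.org/authority/123"/>
      <listRelation>
        <relation name="memberOf" active="#personA" passive="#orgB"/>
      </listRelation>
    </listPerson>
    <listOrg><org xml:id="orgB"/></listOrg>
  </body></text>
</TEI>"""


def triples(lists_xml, base="lists.xml"):
    """Derive (subject, predicate, object) tuples from the list file."""
    root = ET.fromstring(lists_xml)
    out = []
    # "X is identical to Y": an entry carrying a @sameAs pointer.
    for el in root.iter():
        ident, same = el.get(XML_ID), el.get("sameAs")
        if ident and same:
            out.append((f"{base}#{ident}", "sameAs", same))
    # "A is a member of B": a <relation> with one @active and one @passive
    # pointer (a simplification; both may hold several pointers).
    for rel in root.iter(TEI + "relation"):
        out.append((base + rel.get("active"),
                    rel.get("name"),
                    base + rel.get("passive")))
    return out


if __name__ == "__main__":
    for triple in triples(LISTS_XML):
        print(triple)
```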

Hope that helps! 
Elisa

--
Elisa Beshero-Bondar, PhD 
Director, Center for the Digital Text
Associate Professor of English 
University of Pittsburgh at Greensburg
150 Finoli Drive, Greensburg, PA 15601 USA
E-mail: [hidden email] | Development site: http://newtfire.org

Typeset by hand on my iPad

In reply to the post by Michael.Dahnke of Aug 16, 2017, 10:15 AM

Re: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Michael.Dahnke
Dear Elisa,

thanks for such an extraordinarily rapid reply, so to speak ;-)

Let me try to say thanks and to explain why I'm in need of help, in as small a nutshell as possible.

1. I want to, and will, study your XML file and your e-mail closely tomorrow, when I'm awake and alert again (we're some six hours ahead).

2. So far I've developed merely a single, simple freelance project: a TEI P5 catalogue of all radio and television recordings of an author, treating each recording as a single work within a corpus, for http://www.uwe-johnson-werkausgabe.de/ In other words, I'm still a bloody beginner myself.

Now I am, obviously, the sole sod here who has ever seen any TEI code at all (in this new position since march17), and I have stumbled by chance into an edition project that has already been running for some years. Thus I'm still trying to figure out the EdGuidelines on the one hand, whilst on the other hand supporting a workmate who is writing a Java converter for taking material from another format to TEI P5. And once more the TEI community turns out to be quick and supportive ;-)

»It sounds like what you plan is very similar to our procedure on the Digital Mitford project, and it seems a fairly common and useful practice--that is, to store prosopography and bibliography information in a single file (or perhaps a single cluster of files).«

Elisa, such sentences are exactly what I need to assure my colleague (inexperienced with respect to TEI) as well as the scholars in charge of the digital edition itself that my visceral hands-on approach is absolutely state of the art and sincerely recommended by TEI experts.

»Each entry in your various lists would be given a unique xml:id, and your encoding of mentions of people, places, books, etc. would point with @ref (and/or other pointing attributes) to that file as a canonical set of identifiers and source of centralized information.«

Again, that's exactly what I consider the aim of the game; however, given the scepticism of my inexperienced workmate and of the old-school, non-digital scholars around me, I hardly dared to dream of it in my wildest dreams.

»Your list entries will also likely refer to other entries in the file, and also point outward to good resources of information on the web. Here's a working, in progress example of a file I designed, in case it's helpful: http://digitalmitford.org/si.xml«

Thanks, this is the section of your reply I have to study closely tomorrow. Lastly, the following paragraph I can, for the time being, comment on solely with a silent »Hä?« (which means: I haven't got it yet ;-( )

»The work you are planning here is also, by the way, an excellent basis for contributing to the web of Linked Open Data, since the pointing of files to your lists constitutes expressing relationships that can be expressed in various "triple store" vocabularies: X is identical to Y. A is a member of B, etc. Many kinds of expressions of relationship can be pulled from the encoding you build here, and there are lots of great possibilities for visualizing and analyzing those relationships, too.«

A last attempt to make this look at least a bit like a nutshell:

Again, thanks a lot and enjoy your home time!

Michael

In reply to the post by Elisa Beshero-Bondar-2 of 16 Aug 2017, 16:49

Re: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Michael.Dahnke
In reply to this post by Elisa Beshero-Bondar-2
Dear Elisa and entire TEI community,

while skating home and considering the structure of the ids in the file you linked in your reply, I stumbled upon two things (even if you surely did it this way for good reason), namely a) using solely letters and b) creating ids of different lengths, e.g.

<person xml:id="ab"> [l. 58], whilst Mrs. Colombo receives three letters, <person xml:id="ajc"> [l. 113], poor Alison ;-(

Here comes my counterproposal, and I'd be really happy if you could explain to me, after reading it, 1. why you have done what you have done and 2. in which way I'm awfully wrong.

1. If ids consist solely of digits, they are much easier for any computer program to process; cf. item 3. Are they not?

2. Fixed length, padding with some kind of digit (ideally always the same one, but by all means pre-defined) if necessary. This makes a first check of the input possible: does what the user typed in have the correct length at all?

3. Append a check digit, computed from the given digits, as a second safety check (we humans are so fallible, are we not?). Calculating a check digit is much easier if only digits are given rather than a string of letters, or of both letters and digits. Besides, how would one interpret letters when computing a check digit?

A possible way of creating an ID for the humble author of these lines, using both his forename and surname:

»MichaelDahnke«

a) michaeldahnke (or, vice versa, turning it all into capitals would be possible too, I guess. Theoretically you don't need to transform the letters this way at all, but it seems clearer for humans)

b) Turn the letters into digits by their position in the alphabet, two digits per letter (alternatively, one could use their position in the ASCII table for the transformation):

13090308010512040108141105

c) Given a pre-defined id length of 30 digits, one has to add four further digits, in this case zeros for the sake of brevity:

130903080105120401081411050000

d) Add up all 30 digits (or use any other simply computable procedure, e.g. multiplying):

1+3+0+9+0+3+0+8+0+1+0+5+1+2+0+4+0+1+0+8+1+4+1+1+0+5+0+0+0+0 = 58

After that, whether one uses the entire resulting number ›58‹ or merely its first or second digit as check digit is really a mere matter of taste, IMHO.

What is not a matter of taste at all, however, is that this way one

i) saves time, because after step b) the system computes solely with integers, which is much faster than with letters; in addition,

ii) how would one calculate a check digit from a string of letters at all?

iii) the use of a check digit prevents transposed digits when searching for an ID, because your input is checked automatically beforehand. Fallibility! Fallibility! Fallibility!
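
The steps a) to d) above can be sketched in a few lines of Python. This is only an illustration of the proposal under stated assumptions: the function names are made up here, and keeping the last digit of the sum as the check digit is one arbitrary choice among those mentioned. Note that mapping all thirteen letters of "michaeldahnke" (both h's included) gives 26 digits, so four zeros pad the string to 30 and the digit sum comes out as 58.

```python
ID_LENGTH = 30  # fixed length from step c)


def letters_to_digits(name):
    """Steps a) and b): lower-case the name, then map a..z to 01..26."""
    digits = []
    for ch in name.lower():
        if not ("a" <= ch <= "z"):
            raise ValueError(f"only the letters a-z are handled here: {ch!r}")
        digits.append(f"{ord(ch) - ord('a') + 1:02d}")
    return "".join(digits)


def pad(digits, length=ID_LENGTH):
    """Step c): right-pad with zeros up to the fixed length."""
    if len(digits) > length:
        raise ValueError("name too long for the fixed id length")
    return digits.ljust(length, "0")


def digit_sum(digits):
    """Step d): add up all digits."""
    return sum(int(d) for d in digits)


def make_id(name):
    """Padded digit string plus one check digit (assumption: sum mod 10)."""
    body = pad(letters_to_digits(name))
    return f"{body}{digit_sum(body) % 10}"


if __name__ == "__main__":
    body = pad(letters_to_digits("MichaelDahnke"))
    print(body, digit_sum(body))   # 130903080105120401081411050000 58
    print(make_id("MichaelDahnke"))
```

Two observations follow from the sketch. Letters are no obstacle to a checksum, since every character already has a numeric code (ord above). And a plain digit sum cannot detect two swapped digits, because addition ignores order; weighted schemes such as the one used for ISBN-10 are designed to catch that case.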

This has merely been typed in a hurry, and in case I got it wrong (confusing the simple craftsmanship above with serious science), and I'm almost totally convinced I got something profound profoundly wrong, please let me know.

Dear entire TEI community, I really want to learn, so please, have your say!

Michael

Re: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Birnbaum, David J
Dear TEI-L,

@xml:id values are of type NCName and therefore are not permitted to begin with a digit. 

Best,

David

In reply to the post by Michael.Dahnke of Wednesday, August 16, 2017 at 2:40 PM

Re: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Elisa Beshero-Bondar
Dear Michael and all,
You can certainly use digits in @xml:id values, but as David indicated, the value can't begin with a digit, so it can't consist only of digits either. It can't contain whitespace, and it needs to be unique within the document. I know that many projects use automatically, randomly generated ids, and there are good reasons to do that! In our project, we decided to make the ids human-readable because our coders frequently search for particular strings of characters or select ids from drop-down menus. It's just our decision to work that way, and we rely on validation to make sure the ids we propose are actually unique.
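
These three rules can be checked mechanically. The Python sketch below is a simplification, not a full implementation of the NCName definition: the pattern covers ASCII characters only (real NCNames also allow many non-ASCII letters), and the function names are invented for this example. It also shows that a purely numeric id becomes acceptable once it is given a letter prefix.

```python
import re
from collections import Counter

# ASCII-only approximation of an NCName: starts with a letter or
# underscore, continues with letters, digits, '.', '-' or '_'.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_id(value):
    """True if value fits the simplified pattern (no leading digit, no spaces)."""
    return NCNAME_ASCII.fullmatch(value) is not None


def duplicates(ids):
    """Return the ids that occur more than once, sorted."""
    return sorted(v for v, n in Counter(ids).items() if n > 1)


if __name__ == "__main__":
    for candidate in ["ajc", "1309", "p1309", "a b"]:
        print(candidate, is_valid_id(candidate))
    print(duplicates(["ab", "ajc", "ab"]))
```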

Hope that helps!
Elisa

On Wed, Aug 16, 2017 at 2:54 PM, Birnbaum, David J <[hidden email]> wrote:
Dear TEI-L,

@xml:id values are of type NCName and therefore are not permitted to begin with a digit. 

Best,

David

From: TEI-L <[hidden email]> on behalf of "Michael.Dahnke" <[hidden email]>
Reply-To: "Michael.Dahnke" <[hidden email]>
Date: Wednesday, August 16, 2017 at 2:40 PM
To: TEI-L <[hidden email]>
Subject: Re: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Dear Elisa and entire TEI community,

while skating home and considering structure of id's in file you've linked in your reply I've stumbled upon two things—even if you've done it this way surely by good reason—namely, a) using solely letters and b) creating id's of different length; eg.

<person xml:id="ab"> [l. 58], whilst Mrs. Colombo receives three letters <person xml:id="ajc">, [l.113], poor Alison ;-(

Here comes my counterproposal and I'd be really happy U could explain me after your reading 1. why you have done what you have done & 2. in which way I'm awfully wrong.

1. If id consists sole of numbers, they are much easier computable by any pc prog; cf. 3. item. Are they not?

2. Fixed length, filling with any—best identical but by all means pref-defined—kind of digit if necessary. Makes first check of input possible by asking user: Is this correct length of phrase which has to be tipped in at all?

3. Append check-digit, computed by given numbers, for second safety check (we humans so are fallible, are we not?). Calculation of check digit is much easier if solely numbers are given and not a string or both letters and digits. Besides, how interpreting letters for computing check digit?

Possible way creating ID for humble self of author's of this lines using both his fore- and surname:

»MichaelDahnke«

a) michaeldahnke (or, vice versa, turning it all into capitals would probably work too, I guess. Theoretically you don't need to transform the letters this way at all, but it seems clearer for humans)

b) Turning the letters into digits by their position in the alphabet (alternatively you could use their position in the ASCII code):

13090308010512040108141105

c) Given a predefined ID length of 30 digits, one has to add four further digits, in this case zeros for the sake of brevity:

130903080105120401081411050000

d) Adding up all 30 digits (or any other easily computable procedure, such as multiplying):

1+3+0+9+0+3+0+8+0+1+0+5+1+2+0+4+0+1+0+8+1+4+1+1+0+5+0+0+0+0 = 58

After that, whether one uses the entire resulting number ›58‹ or merely its first or second digit as the check digit is really a mere matter of taste, IMHO.
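
Steps a) to d) can be sketched in a few lines of Python. The function names are hypothetical and chosen only for illustration. Encoding all thirteen letters of »michaeldahnke« yields 26 digits, so four zeros complete the 30, and the digit sum is 58. (Leaving out the 08 for the h would give 24 digits and a sum of 50.) An all-digit string like this would still need a leading letter or underscore before it could serve as an @xml:id value, since such values may not begin with a digit.

```python
def letters_to_digits(name: str) -> str:
    """Map each letter a-z to its two-digit alphabet position (a=01 ... z=26)."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    return "".join(f"{ord(ch) - ord('a') + 1:02d}" for ch in letters)


def pad(digits: str, length: int = 30, fill: str = "0") -> str:
    """Right-pad the digit string to a fixed length with a predefined digit."""
    if len(digits) > length:
        raise ValueError("digit string is longer than the fixed ID length")
    return digits.ljust(length, fill)


def digit_sum(digits: str) -> int:
    """Add up all digits, as in step d)."""
    return sum(int(d) for d in digits)


encoded = letters_to_digits("MichaelDahnke")
padded = pad(encoded)
print(encoded)            # 26 digits
print(padded)             # 30 digits
print(digit_sum(padded))  # 58
```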

But it is not at all a matter of taste that this way

i) one saves time, because after b) the system computes solely with integers, which is much faster than with letters; and besides,

ii) how would one calculate a check digit from a string of letters at all?

iii) Using a check digit guards against transposed digits when searching for an ID, because the input is checked automatically beforehand. Fallibility! Fallibility! Fallibility!

This was merely typed in a hurry, and in case I got it wrong (confusing the simple craftsmanship above with serious science), and I'm almost totally convinced I got something profound profoundly wrong, please let me know.

Dear entire TEI community, I really want to learn, so please, have your say!

Michael
--
Dr. Michael Dahnke
WMA zur Vermittlung von Digitalisierungskompetenz
Abteilung Digitalisierung
Kallimachos-Zentrum für Digitial Humanities

Universitätsbibliothek Würzburg
Am Hubland
D-97074 Würzburg

Ruf: +49-931-31-88562
[hidden email]

--
Elisa Beshero-Bondar, PhD
Director, Center for the Digital Text | Associate Professor of English
University of Pittsburgh at Greensburg | Humanities Division
150 Finoli Drive
Greensburg, PA  15601  USA
E-mail:[hidden email]
Development site: http://newtfire.org

Re: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Elisa Beshero-Bondar
PS: To read more about this and see more examples, take a look at the TEI Guidelines examples of xml:ids in Ch. 3: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CORS1

Elisa

On Wed, Aug 16, 2017 at 2:59 PM, Elisa Beshero-Bondar <[hidden email]> wrote:
Dear Michael and all,
You can certainly use digits in @xml:id values, but as David indicated, the value can't begin with a digit. It can't contain white space either, and it needs to be unique within the document. I know that many projects use automatic, randomly generated ids, and there are good reasons to do that! In our project, we decided to make the ids human-readable because our coders are frequently searching for particular strings of characters or selecting ids from drop-down menus. It's just our decision to work that way, and we rely on validating the ids we propose to make sure they are actually unique.
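
A rough pre-check of proposed ids along these lines might look like the following Python sketch. It is only an approximation under stated assumptions: the pattern covers the ASCII subset of NCName (the real production also admits many non-ASCII letters), and the function names are hypothetical. A schema validator remains the authoritative test.

```python
import re
from collections import Counter

# ASCII-only approximation of an NCName: starts with a letter or underscore,
# continues with letters, digits, underscore, hyphen or full stop.
# No colon, no white space, no leading digit.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_plausible_xml_id(value: str) -> bool:
    """True if the value fits the ASCII subset of NCName."""
    return NCNAME_ASCII.fullmatch(value) is not None


def duplicates(ids):
    """Return, sorted, every id that occurs more than once."""
    return sorted(v for v, n in Counter(ids).items() if n > 1)


for candidate in ["ajc", "p1309", "1309", "a b", "tei:ab"]:
    print(candidate, is_plausible_xml_id(candidate))

print(duplicates(["ab", "ajc", "ab"]))
```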

Hope that helps!
Elisa

On Wed, Aug 16, 2017 at 2:54 PM, Birnbaum, David J <[hidden email]> wrote:
Dear TEI-L,

@xml:id values are of type NCName and therefore are not permitted to begin with a digit. 

Best,

David

--
Elisa Beshero-Bondar, PhD
Director, Center for the Digital Text | Associate Professor of English
University of Pittsburgh at Greensburg | Humanities Division
150 Finoli Drive
Greensburg, PA  15601  USA
E-mail:[hidden email]
Development site: http://newtfire.org

Re: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Syd Bauman-10
In reply to this post by Michael.Dahnke
Michael --

First thing to realize about IDs is that the computer (i.e., your XML
system) does not care. The suggestions you make (only digits, same
length, use check digit) might help the *humans* and programs they
write, but XML does not give a hoot, and thus TEI does not give a
hoot. (EXCEPT see below about the digits.)

Attaching a check character can be, in fact, quite helpful. But it's
not that much easier to calculate one using only digits than any
constrained set of alphanumerics. (E.g., the WWP system uses a base
27 number whose 'digits' are the 26 English letters A-Z plus one
extra character which is used for anything else.) However, it is
worth noting that we (the WWP) almost never use those check
characters anymore, anyway. Instead of asking "is this a potentially
valid reference -- i.e., is it the right length with a valid check
digit?" I usually just follow the reference and ask "does it point to
the right kind of element?". We used to use the check digit every day
back in the days when it was hard to follow the pointer and find out
what it pointed to. Now, it's just so easy.

As for using all digits, it's not as if you have a choice. @xml:id is
defined as being of type ID, and the W3C (not the TEI) defines that
as meaning it *must* start with a letter, an underscore, or a colon
(bad idea). Note that the letter can be from most any alphabet.
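
To illustrate that a check character over letters is not much harder than one over digits, here is a minimal Python sketch. It is not the WWP algorithm (which is only described above as base 27); the 37-symbol alphabet, the position weights and the function names are assumptions made for this example. Because 37 is prime, the weighted sum catches any single wrong character, and any swap of two neighbouring characters, in a body of up to 36 characters.

```python
import string

# Hypothetical alphabet: 26 lower-case letters, 10 digits, underscore = 37 symbols.
ALPHABET = string.ascii_lowercase + string.digits + "_"


def check_char(body: str) -> str:
    """Weighted sum of symbol values (weight = 1-based position), modulo 37."""
    total = 0
    for position, ch in enumerate(body, start=1):
        total += position * ALPHABET.index(ch)  # ValueError for foreign symbols
    return ALPHABET[total % len(ALPHABET)]


def with_check(body: str) -> str:
    """Append the check character to an id body."""
    return body + check_char(body)


def is_valid(candidate: str) -> bool:
    """True if the last character is the correct check character for the rest."""
    if len(candidate) < 2 or any(ch not in ALPHABET for ch in candidate):
        return False
    return check_char(candidate[:-1]) == candidate[-1]


print(with_check("ajc"))   # ajcy
print(is_valid("ajcy"))    # True
print(is_valid("jacy"))    # False: first two letters swapped
```

The sketch assumes lower-case ids; anything outside the alphabet is simply rejected.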



Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Peter Stadler
Hi Syd,

just wanted to say that I still think check digits are useful, especially when hand-editing TEI documents. I’ve seen a lot of transposed digits in ids, and those are really hard to find because mostly they create a valid reference (and will point at something).
I wonder how you check "does it point to the right kind of element?"
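
Whether a check digit actually catches a transposition depends on how it is computed. The following Python sketch (function names and sample digits are made up for illustration) compares a plain digit sum with a position-weighted sum modulo 11: the plain sum is identical for both orderings, whereas the weighted sum always differs when two distinct neighbouring digits are swapped.

```python
def plain_sum(digits: str) -> int:
    """Unweighted digit sum: the order of the digits does not matter."""
    return sum(int(d) for d in digits)


def weighted_sum(digits: str, modulus: int = 11) -> int:
    """Each digit is multiplied by its 1-based position before summing."""
    return sum(i * int(d) for i, d in enumerate(digits, start=1)) % modulus


original = "130903"
swapped = "130930"  # last two digits transposed

print(plain_sum(original), plain_sum(swapped))        # 16 16
print(weighted_sum(original), weighted_sum(swapped))  # 6 3
```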

Best
Peter



Re: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Michael.Dahnke
In reply to this post by Syd Bauman-10
Salu Syd,

thanks for your contribution!

> As for using all digits, it's not as if you have a choice. @xml:id is defined as being of type ID, and the W3C (not the TEI) defines that
> as meaning it *must* start with a letter, an underscore, or a colon (bad idea). Note that the letter can be from most any alphabet.


Sure, this was definitely a rookie mistake, maybe due to the time of day of my reply. But what I don't understand is what makes following the pointer so easy now. First, yes on two points: neither XML nor TEI gives a hoot whether an ID has a predefined length and a check digit or not, or whether the check digit is valid or not. But is the creation of XML an end in itself? As we both know, and as Elisa laid out briefly in the last paragraph of her first reply,

»That file [http://digitalmitford.org/si.xml M.D.] is sort of the "backbone" of our project, integrating our files together--it's a basis for designing our schema rules that ensure the correct id's are being referenced across our project files. Its individual entries are pulled into displays of annotations on editions published on our site. And when we want to study information about, say, how many artists were part of our network, or how many publications appeared from X location, we pull data and make graphs right out of that file. So such a file has many functions to serve in a big project.«

the data will afterwards be ›used‹ mainly by programs, not by humans in the first place. Humans wish to obtain information from the files, but in most cases they won't deal with them directly; instead the data will be processed by machines. Won't it?

> Now, it's just so easy [following the pointer].


Why is it so easy now, and for whom: humans or machines? Can you please explain why I'm on the wrong track here, and help me back onto the right one?

Michael
--
Dr. Michael Dahnke
WMA zur Vermittlung von Digitalisierungskompetenz
Abteilung Digitalisierung
Kallimachos-Zentrum für Digitial Humanities

Universitätsbibliothek Würzburg
Am Hubland
D-97074 Würzburg

Ruf: +49-931-31-88562
[hidden email]

On 16 Aug 2017, at 21:02, Syd Bauman-10 [via tei-l] wrote:

> Michael --
>
> First thing to realize about IDs is that the computer (i.e., your XML
> system) does not care. The suggestions you make (only digits, same
> length, use check digit) might help the *humans* and programs they
> write, but XML does not give a hoot, and thus TEI does not give a
> hoot. (EXCEPT see below about the digits.)
>
> Attaching a check character can be, in fact, quite helpful. But it's
> not that much easier to calculate one using only digits than any
> constrained set of alphanumerics. (E.g., the WWP system uses a base
> 27 number whose 'digits' are the 26 English letters A-Z plus one
> extra character which is used for anything else.) However, it is
> worth noting that we (the WWP) almost never use those check
> characters anymore, anyway. Instead of asking "is this a potentially
> valid reference -- i.e., is it the right length with a valid check
> digit?" I usually just follow the reference and ask "does it point to
> the right kind of element?". We used to use the check digit every day
> back in the days when it was hard to follow the pointer and find out
> what it pointed to. Now, it's just so easy.
>
> As for using all digits, it's not as if you have a choice. @xml:id is
> defined as being of type ID, and the W3C (not the TEI) defines that
> as meaning it *must* start with a letter, an underscore, or a colon
> (bad idea). Note that the letter can be from most any alphabet.
>
>
> > while skating home and considering structure of id's in file you've linked in your reply I've stumbled upon two things—even if you've done it this way surely by good reason—namely, a) using solely letters and b) creating id's of different length; eg.
> >
> > <person xml:id="ab"> [l. 58], whilst Mrs. Colombo receives three letters <person xml:id="ajc">, [l.113], poor Alison ;-(
> >
> > Here comes my counterproposal and I'd be really happy U could explain me after your reading 1. why you have done what you have done & 2. in which way I'm awfully wrong.
> >

Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Elisa Beshero-Bondar-2
In reply to this post by Peter Stadler
Peter-- I have a set of Schematron rules that check that, if a value is a member of, say, the list of fictional characters, it has (whatever) distinct property. (I am not using check digits, but I imagine the answer to your question is a job for Schematron and testing XPath relationships.)

--Elisa

--
Elisa Beshero-Bondar, PhD 
Director, Center for the Digital Text
Associate Professor of English 
University of Pittsburgh at Greensburg
150 Finoli Drive, Greensburg, PA 15601 USA
E-mail: [hidden email] | Development site: http://newtfire.org

Typeset by hand on my iPad

On Aug 17, 2017, at 6:12 AM, Peter Stadler <[hidden email]> wrote:

Hi Syd,

just wanted to say that I still think check digits are useful, especially when hand-editing TEI documents. I’ve seen a lot of transposed digits in IDs, and those are really hard to find because they mostly create a valid reference (and will point at something).
I wonder how you check "does it point to the right kind of element?"

Best
Peter


Am 16.08.2017 um 21:02 schrieb Syd Bauman <[hidden email]>:

Michael --

First thing to realize about IDs is that the computer (i.e., your XML
system) does not care. The suggestions you make (only digits, same
length, use check digit) might help the *humans* and programs they
write, but XML does not give a hoot, and thus TEI does not give a
hoot. (EXCEPT see below about the digits.)

Attaching a check character can be, in fact, quite helpful. But it's
not that much easier to calculate one using only digits than any
constrained set of alphanumerics. (E.g., the WWP system uses a base
27 number whose 'digits' are the 26 English letters A-Z plus one
extra character which is used for anything else.) However, it is
worth noting that we (the WWP) almost never use those check
characters anymore, anyway. Instead of asking "is this a potentially
valid reference -- i.e., is it the right length with a valid check
digit?" I usually just follow the reference and ask "does it point to
the right kind of element?". We used to use the check digit every day
back in the days when it was hard to follow the pointer and find out
what it pointed to. Now, it's just so easy.
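
A check character over letters is no harder to compute than one over digits, as the following minimal sketch suggests. It is not the WWP's actual algorithm, which the message does not spell out; the position weighting, the `0` symbol for "anything else" and all function names are assumptions made for illustration.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
SYMBOLS = "0" + LETTERS  # 27 symbols; "0" stands in for "anything else"


def char_value(ch: str) -> int:
    """Map A-Z (case-insensitive) to 1..26 and every other character to 0."""
    return LETTERS.find(ch.upper()) + 1


def check_char(identifier: str) -> str:
    """Position-weighted sum of the character values, reduced modulo 27."""
    total = sum((pos + 1) * char_value(ch) for pos, ch in enumerate(identifier))
    return SYMBOLS[total % 27]


def append_check(identifier: str) -> str:
    """Return the identifier with its check character attached."""
    return identifier + check_char(identifier)


def verify(candidate: str) -> bool:
    """True if the last character is the check character of the rest."""
    return len(candidate) >= 2 and check_char(candidate[:-1]) == candidate[-1]
```

Because each position carries a different weight, swapping two adjacent characters with different values changes the weighted sum, so that kind of typo is flagged. Other error types are not all guaranteed to be caught by this simple scheme.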

As for using all digits, it's not as if you have a choice. @xml:id is
defined as being of type ID, and the W3C (not the TEI) defines that
as meaning it *must* start with a letter or an underscore. (XML's ID
type on its own would also allow a leading colon, but xml:id values
must be NCNames, which rules colons out.) Note that the letter can be
from most any alphabet.
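
The constraint on the first character can be checked mechanically. The sketch below is a deliberately ASCII-only approximation of the NCName rule (real NCNames also admit letters from most other alphabets, as noted above); the function name is made up for illustration.

```python
import re

# ASCII-only approximation: a letter or underscore first, then letters,
# digits, underscores, hyphens or full stops. No colons anywhere.
ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_plausible_xml_id(value: str) -> bool:
    """Return True if value fits the ASCII subset of the NCName pattern."""
    return ASCII_NCNAME.fullmatch(value) is not None
```

Under this check an all-digit string is rejected, while the same digits behind a letter or underscore prefix pass.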


While skating home and considering the structure of the IDs in the file you linked in your reply, I stumbled upon two things (even if you surely did it this way for good reason), namely a) using solely letters and b) creating IDs of different lengths; e.g.

<person xml:id="ab"> [l. 58], whilst Mrs. Colombo receives three letters, <person xml:id="ajc"> [l. 113], poor Alison ;-(

Here comes my counterproposal, and I'd be really happy if you could explain to me, after reading it, 1. why you have done what you have done and 2. in which way I'm awfully wrong.

1. If an ID consists solely of digits, it is much easier for any computer program to process; cf. item 3. Is it not?

2. Fixed length, padded with some predefined filler digit (ideally always the same one) if necessary. This makes a first check of the input possible by asking the user: is this the correct length for the string that has to be typed in at all?

3. Append a check digit, computed from the given digits, as a second safety check (we humans are so fallible, are we not?). Calculating a check digit is much easier if only digits are given and not a string of letters or a mix of letters and digits. Besides, how would one interpret letters when computing a check digit?

A possible way of creating an ID for the humble author of these lines, using both his forename and surname:

»MichaelDahnke«

a) michaeldahnke (or, vice versa, turning it all into capitals would likely be possible too, I guess. Theoretically, you don't need to transform the letters at all, but it seems clearer for humans)

b) Turning the letters into digits by their position in the alphabet (alternatively, you could use their position in the ASCII code):

13090308010512040108141105

c) Given a predefined ID length of 30 digits, one has to add four further filler digits, in this case zeros for the sake of brevity:

130903080105120401081411050000

d) Adding up all 30 digits (or any other simply computable procedure, e.g. multiplying):

1+3+0+9+0+3+0+8+0+1+0+5+1+2+0+4+0+1+0+8+1+4+1+1+0+5+0+0+0+0 = 58

After that, using either the entire resulting number ›58‹ or merely its first or second digit as the check digit is really just a matter of taste, IMHO.
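
Steps a) to d) can be sketched in a few lines. This is only an illustration of the proposal as described; the function names, the ASCII-only letter handling and the zero filler are assumptions. The sketch encodes every letter of the name, including both occurrences of "h".

```python
def letters_to_digits(name: str) -> str:
    """Steps a) and b): lower-case the name, then write each letter a-z
    as its two-digit position in the alphabet (a=01 ... z=26)."""
    return "".join(
        f"{ord(ch) - ord('a') + 1:02d}" for ch in name.lower() if "a" <= ch <= "z"
    )


def pad_to_length(digits: str, length: int = 30, filler: str = "0") -> str:
    """Step c): right-pad with a filler digit up to the fixed length."""
    if len(digits) > length:
        raise ValueError("encoded name is longer than the fixed ID length")
    return digits.ljust(length, filler)


def digit_sum(digits: str) -> int:
    """Step d): add up all digits."""
    return sum(int(d) for d in digits)


encoded = letters_to_digits("MichaelDahnke")
padded = pad_to_length(encoded)
checksum = digit_sum(padded)
```

An ID built this way still consists of digits only, so it would need a leading letter or underscore before it could serve as an xml:id value.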

But it is not at all a matter of taste that in this way

i) you save time, because after b) the system computes solely with integers, which is much faster than with letters; and, additionally,

ii) how would one calculate a check digit from a string of letters at all?

iii) Using a check digit prevents transposed digits when searching for an ID, because your input is checked automatically beforehand. Fallibility! Fallibility! Fallibility!

I have merely typed this in a hurry, and in case I got it wrong (confusing simple craftsmanship with serious science), and I'm almost totally convinced I got something profound profoundly wrong, please let me know.

Dear entire TEI community, I really want to learn, so please, your say!

Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Michael.Dahnke
Cool, Ma'am,

telepathy? Just a minute ago I was about to ask both Syd and you exactly that question.

Thanks!

Michael


Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Peter Stadler
In reply to this post by Elisa Beshero-Bondar-2
OK, let’s assume we have a listPerson with some person entries, each with an xml:id from "person01" to "person99" (fixed length). How do you manage to tell whether "person10" is a (mistakenly) transposed "person01" in some reference like <persName ref="#person10">? Both will point at valid entries in your personography.
If you add a check digit, the IDs will have an added character; let’s assume "person01" becomes "person013" and "person10" becomes "person109", depending on your algorithm. Now (most) permutations like "person103" or "person301" are invalid IDs and will be flagged by your check digit validation.

I really can’t imagine any other way to prevent these permutations, so I would be very curious to hear more about your and Syd’s approach.
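
To make the idea concrete, here is a minimal sketch that uses the well-known Luhn algorithm as the check digit scheme. That choice is an assumption for illustration, not the algorithm used in any project mentioned in this thread, and the check digits it yields need not match the illustrative ones above. The ID layout (class prefix, two payload digits, one check digit) and the function names are likewise hypothetical. Note that a plain digit sum would not do here: it stays the same when two digits swap places, so it cannot flag a transposition.

```python
import re


def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk the payload right to left, doubling every second digit,
    # starting with the rightmost one.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


# Hypothetical layout: class prefix, two payload digits, one check digit.
ID_PATTERN = re.compile(r"(person|place)([0-9]{2})([0-9])")


def is_valid_id(xml_id: str) -> bool:
    """True if the ID has the expected shape and a matching check digit."""
    m = ID_PATTERN.fullmatch(xml_id)
    if m is None:
        return False
    return luhn_check_digit(m.group(2)) == int(m.group(3))
```

With this scheme the two payloads "01" and "10" receive different check digits, so swapping the two payload digits while keeping the old check digit produces an ID that fails validation. Luhn catches most adjacent transpositions, though famously not 09 versus 90.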

Best
Pete



Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Michael.Dahnke
> I really can’t imagine any other way to prevent these permutations so would be very curious to hear more about your and Syd’s approach.


Honestly, I do!


Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Elisa Beshero-Bondar-2
In reply to this post by Peter Stadler
Hi Peter--
Well, I don't set fixed lengths on my xml:ids (this method has never seemed necessary to me--but I am curious, if it's what lots of people do, why they do it). From what you show here, I do see the problem. I guess this is why, for my validation work, the string of characters carries the *human* factor of immediately recognizable distinctiveness when pointing at references: a human being can tell quickly if a value of @ref just looks like it must be pointing at the wrong person. Our standing project rule for xml:id definition is that we start with the most individually distinctive part of a name. With persons in the Western European context, that is usually (but not always!) the surname or a distinctive simplification of it, like so:
BronteC would be a Digital Mitford style xml:id for Charlotte Brontë. And TalfourdThos is one in active use for Mitford's friend, Thomas Noon Talfourd. 

Character lengths of xml:ids aren't fixed in our project, but in a relative way, they are meaningful to human coders along with capitalization. In our project, we have a listPerson of only us editors on the project, and our xml:ids are always lower-cased and 2 to 4 characters at most (usually our initials). Historical persons referenced by Mitford and her contemporaries get longer ids because we agreed on the project team that we want to be able to read them. I typically try to keep these under, say, 15 characters, and TalfourdThos is a pretty long one. Few of our ids (if any?) actually have numbers.

The Schematron constraints I mentioned are designed to prevent pointing to the wrong id for, say, a bibl entry or a placeName on a persName of an historical personage. So if someone accidentally pops in a reference to Russell Square in London instead of George Russell (let's say that the ids are RussellSq vs RussellG), the Schematron rule reports that RussellSq is from the listPlace and must not be referenced on a persName.
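
The project's rules are written in Schematron and are not reproduced in this message. Purely as an illustration of the same check, here is a sketch in Python using the standard library's ElementTree. The two sample documents, the function names and the message wording are assumptions; only the ids RussellG and RussellSq come from the example above.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical authority file with one person and one place.
AUTHORITY = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person xml:id="RussellG"><persName>George Russell</persName></person>
  </listPerson>
  <listPlace>
    <place xml:id="RussellSq"><placeName>Russell Square</placeName></place>
  </listPlace>
</TEI>"""

# Hypothetical document in which the second persName points at a place.
LETTER = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <p><persName ref="#RussellG">Mr. Russell</persName> called at
  <persName ref="#RussellSq">Russell</persName> today.</p>
</TEI>"""


def ids_of(root, element_name):
    """Collect the xml:id values of all elements with the given TEI name."""
    return {el.get(XML_ID) for el in root.iter(TEI + element_name) if el.get(XML_ID)}


def misplaced_person_refs(document_xml, authority_xml):
    """Report persName/@ref values that do not point into the listPerson."""
    authority = ET.fromstring(authority_xml)
    person_ids = ids_of(authority, "person")
    place_ids = ids_of(authority, "place")
    problems = []
    for name in ET.fromstring(document_xml).iter(TEI + "persName"):
        ref = name.get("ref")
        if ref is None:
            continue
        target = ref.rpartition("#")[2]  # also handles "file.xml#id"
        if target in place_ids:
            problems.append(
                f"{target} is from the listPlace and must not be referenced on a persName"
            )
        elif target not in person_ids:
            problems.append(f"{target} is not a known person id")
    return problems


report = misplaced_person_refs(LETTER, AUTHORITY)
```

For the sample data, only the reference to RussellSq is reported; the reference to RussellG passes.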

So, that is how we work on Digital Mitford ids, of which we've accumulated around 1500 I think. It's one way of working. I never learned the check digit method, and really we just concentrated on the benefits of generating distinct, human readable ids for the benefit of our coding team.

Elisa



Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Elisa Beshero-Bondar-2
In reply to this post by Michael.Dahnke
PS: Michael and Peter-- The problem we run into on our project is the generation of duplicate entries for the same person, particularly when there's uncertainty over what part of a name is really most distinctive. All members of our team can submit proposals for new entries in the canonical list, and I want to build some more reliable ways (e.g. good search forms that guide a systematic checking process) for them to make sure they (and I) are finding an entry when it already exists. Authors writing under pseudonyms and women whose surnames changed over time have caused us trouble when we are in a hurry to add new entries to the lists, and I've got a little team of student assistants at work helping us to find dupes, and fix any problems/confusions they've generated in the document markup.
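
One possible building block for such a checking process is a fuzzy lookup that runs before a new entry is proposed. The sketch below uses Python's standard difflib; the sample entries, the cutoff value and the function name are assumptions, and it would not by itself catch pseudonyms or changed surnames unless those variants are also recorded.

```python
import difflib

# Hypothetical extract of the canonical list: xml:id -> main name form.
ENTRIES = {
    "TalfourdThos": "Thomas Noon Talfourd",
    "BronteC": "Charlotte Brontë",
}


def possible_duplicates(proposed_name, entries, cutoff=0.8):
    """Return xml:ids whose recorded name closely resembles the proposed one."""
    by_name = {name.casefold(): xml_id for xml_id, name in entries.items()}
    hits = difflib.get_close_matches(
        proposed_name.casefold(), list(by_name), n=5, cutoff=cutoff
    )
    return [by_name[hit] for hit in hits]
```

A proposer typing "Thomas Talfourd" would be pointed at the existing TalfourdThos entry, while a name unlike anything on the list returns no candidates.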

Elisa



Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Peter Stadler
Hi Elisa,

thanks for this detailed explanation!
It probably depends on the size of the project, but I find it more convenient to have fixed length IDs that carry information about the class (e.g. place or person) not the individual entry (e.g. person01 and place01 rather than RussellG and RussellSq). That way I can do the checks as a simple string comparison and do not need any lookups. I can even enforce that in my ODD as a special datatype and do not need any schematron constraints ;)
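
The string comparison described here can be sketched as a pattern per referencing element. The patterns below (class prefix plus two digits) and the function name are assumptions based on the person01/place01 example; in an ODD the same idea would be expressed as a datatype with a pattern rather than in Python.

```python
import re

# Hypothetical mapping from referencing element to the ID shape it may use.
REF_PATTERNS = {
    "persName": re.compile(r"#person[0-9]{2}"),
    "placeName": re.compile(r"#place[0-9]{2}"),
}


def ref_matches_class(element: str, ref: str) -> bool:
    """True if the @ref value has the shape expected for this element.

    No lookup in the authority file is needed: the class is read off
    the ID itself.
    """
    pattern = REF_PATTERNS.get(element)
    return pattern is not None and pattern.fullmatch(ref) is not None
```

This catches a place ID on a persName without opening the authority file, but it cannot tell person01 from person10.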

Many thanks again
Peter

PS: Don’t get me wrong, I love schematron!



Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Elisa Beshero-Bondar-2
Hi Peter-- We’ve got string comparisons, too, and I can actually do that checking entirely with Schematron, which is easier than trying to update my ODD with new values when we generate them. But the ODD system is helpful because it gives the coders drop-down lists in oXygen to select from. Your system seems to be concentrated on categories first, and then numbers for distinct identities, but as you’ve noticed, it’s really easy to confuse individual entities in your lists. There must be a middle ground here to make this easier to check. Your eyeballs can tell you if someone’s reference is mis-categorized, but not the more likely phenomenon of mis-identified. How do we deal with that? My only thought, once again, is…Schematron! You could perhaps check the text contents of the persName tag holding an @ref, and if no part of that name is listed in the canonical definition entry holding the xml:id that matches, flag a warning. Either, then, a variation on a name hasn’t been recorded in the xml:id file, or the @ref is pointing to the wrong entry!

Embedding Schematron in ODD can be tricky, but of course the two work together all the time, and in my current projects I do work them simultaneously. So I guess I still think this is a job for Schematron! :-)
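
The name-matching idea can be illustrated outside Schematron as well. In the sketch below, written in Python rather than Schematron, the recorded name variants, the ids and the function names are all hypothetical; a mention is flagged when it shares no word with any name recorded for the entry its @ref points to.

```python
import re

# Hypothetical canonical entries: xml:id -> recorded name variants.
CANONICAL_NAMES = {
    "person01": ["Charlotte Brontë", "Currer Bell"],
    "person10": ["Thomas Noon Talfourd"],
}


def words(text):
    """Split text into case-folded word tokens."""
    return {token.casefold() for token in re.findall(r"\w+", text)}


def ref_looks_wrong(ref_id, mention_text):
    """True if no word of the mention occurs in any recorded name for ref_id."""
    recorded = set()
    for name in CANONICAL_NAMES.get(ref_id, []):
        recorded |= words(name)
    return not (words(mention_text) & recorded)
```

A warning then means one of two things, as described above: either a name variant is missing from the canonical entry, or the @ref points to the wrong entry.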

Elisa
— 
Elisa Beshero-Bondar, PhD
Director, Center for the Digital Text | Associate Professor of English
University of Pittsburgh at Greensburg | Humanities Division
150 Finoli Drive
Greensburg, PA  15601  USA
E-mail:[hidden email]
Development site: http://newtfire.org






On Aug 17, 2017, at 11:28 AM, Peter Stadler <[hidden email]> wrote:

Hi Elisa,

thanks for this detailed explanation!
It probably depends on the size of the project, but I find it more convenient to have fixed length IDs that carry information about the class (e.g. place or person) not the individual entry (e.g. person01 and place01 rather than RussellG and RussellSq). That way I can do the checks as a simple string comparison and do not need any lookups. I can even enforce that in my ODD as a special datatype and do not need any schematron constraints ;)

Many thanks again
Peter

PS: Don’t get me wrong, I love schematron!


Am 17.08.2017 um 16:35 schrieb Elisa <[hidden email]>:

PS: Michael and Peter-- The problem we run into on our project is the generation of duplicate entries for the same person, particularly when there's uncertainty over what part of a name is really most distinctive. All members of our team can submit proposals for new entries in the canonical list, and I want to build some more reliable ways (e.g. good search forms that guide a systematic checking process) for them to make sure they (and I) are finding an entry when it already exists. Authors writing under pseudonyms and women whose surnames changed over time have caused us trouble when we are in a hurry to add new entries to the lists, and I've got a little team of student assistants at work helping us to find dupes, and fix any problems/confusions they've generated in the document markup.

Elisa

--
Elisa Beshero-Bondar, PhD
Director, Center for the Digital Text
Associate Professor of English
University of Pittsburgh at Greensburg
150 Finoli Drive, Greensburg, PA 15601 USA
E-mail: [hidden email] | Development site: http://newtfire.org

Typeset by hand on my iPad

On Aug 17, 2017, at 9:37 AM, Michael.Dahnke <[hidden email]> wrote:

> I really can’t imagine any other way to prevent these permutations so would be very curious to hear more about your and Syd’s approach.


Honestly, I do!
--
Dr. Michael Dahnke
WMA zur Vermittlung von Digitalisierungskompetenz
Abteilung Digitalisierung
Kallimachos-Zentrum für Digitial Humanities

Universitätsbibliothek Würzburg
Am Hubland
D-97074 Würzburg

Ruf: +49-931-31-88562
[hidden email]

On 17 Aug 2017, at 15:12, Peter Stadler [via tei-l] wrote:

ok, let’s assume we have a listPerson, with some person entries, each with an xml:id "person01" to "person99" (fixed length). How do you manage to tell whether "person10" is a (mistakenly) transposed "person01" in some reference like <persName ref="#person10">? Both will point at valid entries in your personography.
If you add a check digit, the IDs will have an added character, let’s assume "person01" will become "person013" and "person10" will become "person109", depending on your algorithm. Now, (most) permutations like "person103" or "person301" are invalid IDs and will be flagged by your check digit validation.
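
A minimal sketch of the check-digit idea described above, in Python. The weighting scheme (weights 1, 2, 3, ... modulo 11, with "X" standing for 10) is only one possible choice and an assumption of this sketch, not the algorithm used in any of the projects mentioned, so the resulting check characters differ from the "person013"/"person109" values in the example. The function names are hypothetical.

```python
def check_char(digits: str) -> str:
    """Weighted sum modulo 11, weights 1, 2, 3, ... from the left (illustrative choice)."""
    total = sum((pos + 1) * int(d) for pos, d in enumerate(digits))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


def make_id(prefix: str, number: int, width: int = 2) -> str:
    """Build a fixed-length ID such as 'person01' plus one check character."""
    digits = f"{number:0{width}d}"
    if number < 0 or len(digits) != width:
        raise ValueError(f"{number} does not fit into {width} digits")
    return f"{prefix}{digits}{check_char(digits)}"


def is_valid(identifier: str, prefix: str = "person", width: int = 2) -> bool:
    """True only if prefix, length and check character are all consistent."""
    if not identifier.startswith(prefix):
        return False
    body = identifier[len(prefix):]
    if len(body) != width + 1:
        return False
    digits, check = body[:width], body[width:]
    if not (digits.isascii() and digits.isdigit()):
        return False
    return check_char(digits) == check


if __name__ == "__main__":
    print(make_id("person", 1), make_id("person", 10))
    # A transposed number that keeps the old check character is rejected.
    print(is_valid("person012"), is_valid("person102"))
```

With adjacent weights that differ by one and a prime modulus, swapping two neighbouring digits of the number always changes the expected check character, so a reference like "person10" typed for "person01" no longer passes silently. Swaps that involve the check character itself are not all caught, which matches the "(most) permutations" caveat above.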

I really can’t imagine any other way to prevent these permutations so would be very curious to hear more about your and Syd’s approach.

Best
Pete


Am 17.08.2017 um 14:10 schrieb Elisa <[hidden email]>:

Peter-- I have a set of Schematron rules that check that, if a value is a member of, say, the list of fictional characters, it has (whatever) distinct property. (I am not using check digits, but I imagine the answer to your question is a job for Schematron and testing XPath relationships.)



Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Michael.Dahnke
In reply to this post by Elisa Beshero-Bondar-2
Salu Elisa,

there's a German word for which I can't find an appropriate translation right now (what's the verb for what someone you'd call a smart aleck/ass does: w(e)isenheimering?). Anyway, I don't wanna be called a smart-ass, but (›but‹ is too confrontational, let's say: and)

and (jumpin' into your other post and back)

> I guess this is why, for my validation work, the string of characters carries the *human* factor of immediately recognizable distinctiveness when pointing at references: a human being can tell quickly if a value of @ref just looks like it must be pointing at the wrong person. Our standing project rule for xml:id definition is that we start with the most individually distinctive part of a name. With persons in the Western European context, that is usually (but not always!) the surname or a distinctive simplification of it, like so: BronteC would be a Digital Mitford style xml:id for Charlotte Brontë. And TalfourdThos is one in active use for Mitford's friend, Thomas Noon Talfourd.


Roughly:

1. *human* factor: On the one hand, it seems charming to make XML as readable for humans as possible. A non-digital prof asked me the other day: How can I proofread the text of a copy after it is encoded in TEI? Can I create a paper printout with just the text of the edition, without all the confusing tags? Sure you can, thanks to XSLT, but that's not really the idea of tagging, nor what you really want at the end of the day. Say the following, obviously partly corrupted, text is given:

<p>Trump is great and makes America grateful again. Thanks to the great dealer of America's psycho.</p>

Your student assistants (committed ›Bernie's‹, so to speak, so eventually, totally frustrated, they voted for Mrs. J. Stein) correct it as follows:

<p>Trump is great and makes America gr<choice><sic>ateful</sic><corr>eat</corr></choice> again. Thanks to the <choice><sic>great</sic><corr>greatest</corr></choice> <choice><sic>d</sic><corr>h</corr></choice>ealer of America's <choice><sic>psycho</sic><corr>soul</corr></choice>.</p>

Stripping all the tagging results in:

Trump is great and makes America gratefuleat again. Thanks to the greatgreatest dhealer of America's psychosoul.
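
To illustrate the point about stripping tags, here is a small Python sketch (a stand-in for the XSLT mentioned above, not anyone's project code). It contrasts naive tag removal with a walk that keeps only one branch of each <choice>. The sample sentence is made up for this sketch, and it assumes un-namespaced elements; real TEI files put them in the TEI namespace, so the tag names would need adjusting.

```python
import xml.etree.ElementTree as ET

SAMPLE = "<p>The qu<choice><sic>ikc</sic><corr>ick</corr></choice> brown fox.</p>"


def reading_text(elem, prefer="corr"):
    """Return the text of elem, keeping only the preferred child of each <choice>."""
    if elem.tag == "choice":
        chosen = elem.find(prefer)
        return reading_text(chosen, prefer) if chosen is not None else ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(reading_text(child, prefer))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print("".join(root.itertext()))     # naive stripping keeps both readings
    print(reading_text(root, "corr"))   # corrected reading
    print(reading_text(root, "sic"))    # reading as found in the source
```

The same walk can produce either a proofreading copy of the source text or the corrected reading text, depending on which child of <choice> is preferred.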

> a human being can tell quickly if a value of @ref just looks like it must be pointing at the wrong person.

Why not let XSLT check that? If you follow an automated algorithm as suggested last night, you can check as follows:

a) Test the suspicious ID for both length and check digit. Can it be a valid ID at all? If so:

b) Which entity (person, place, etc.) does the ID point to? Feed that entity's name into the ID creation algorithm and see whether the newly created ID is identical to the one in doubt.

2.

> Our standing project rule for xml:id definition is that we start with the most individually distinctive part of a name.

And this, sorry to say, still doesn't seem precise enough. I'm afraid you're simply running into the brick wall of the limits of a much too easy hands-on approach, figuratively speaking. Constructively:

> With persons in the Western European context, that is usually (but not always!) the surname or a distinctive simplification of it, like so: BronteC would be a Digital Mitford style xml:id for Charlotte Brontë. And TalfourdThos is one in active use for Mitford's friend, Thomas Noon Talfourd.

2.1. Distinguish between a name, as the natural pointer to a person or place or whatever, and the entity itself. That means: Stalingrad or Volgograd, different names but the same city. Yael Kushner, formerly Ivanka Marie etc.; I guess you know her former surname better than I do. Use a fixed length for the part of the name you want to use, the first seven letters or five letters, but fixed, and keep this length constant for all IDs. Last, an ID cannot (and I really mean: must not) be changed within your identification list. Practically:

2.2. Say you created the following entry for the lady mentioned above:

<person xml:id="trumpIvankaMarie"><forename n="1">Ivanka</forename> <forename n="2">Marie</forename> <surname n="1">Trump</surname></person>

For the sake of unambiguous identification you now have to write:

<person xml:id="trumpIvankaMarie"><forename n="1">Yael</forename> <surname n="1">Kushner</surname></person>

You see the breakdown between the former distinctive simplification of the surname and the current name of that person, don't you? An easy and easily comprehensible association between name and ID simply isn't working any longer.

> All members of our team can submit proposals for new entries in the canonical list,

Can they suggest who shall be listed, or what the ID shall look like? Assuming the latter: I'm really a committed democrat (no donkey; I mean with respect to the political system per se, I guess I'd have voted for Mrs. Stein), but this doesn't seem a good procedure in this context. You need an algorithm that, by all means and no matter which human runs it, results in an unambiguous ID.

> and I want to build some more reliable ways (e.g. good search forms that guide a systematic checking process) for them to make sure they (and I) are finding an entry when it already exists.


> Authors writing under pseudonyms and women whose surnames changed over time have caused us trouble when we are in a hurry to add new entries to the lists, and I've got a little team of student assistants at work helping us to find dupes, and fix any problems/confusions they've generated in the document markup.


2.3. Last comment from the clever dick:

No matter which algorithm you create (and, again, no matter who is using it, the same name of an entity has to result in an identical form of ID), in your position I would pursue the following search strategy:

Flush down, sorry, feed into that algorithm the form of the name of the entity (person, place, org etc.) that you want to check for in your list.

a) »Hit.« Job done. Three crosses or »Hail Mary!«

b) »No hit.«

This can mean: the entity hasn't been listed yet. Alternatively, it can mean that a lady has married (which is, thank the Lord, a really rare event) and changed her surname. For your search this means: try to figure out all maiden names or other formerly used parts of that woman's name and create IDs from those parts of the name. Retry. Boring? Exhausting?

Best of all, add another distinctive property to the creation of the ID, like the date of birth or the date of foundation/building (organisation, house, compound). That's what I did when I wrote an algorithm to create unambiguous IDs for people, places etc. for a list of German-language literary awards. I'd like to send you the little Python app, but all its user prompts are in German. If you're interested, though, let me know!
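
The search strategy above can be sketched in Python roughly as follows. This is not the app mentioned in the message, whose details are not given here. The ID layout (first five letters of a surname, padded with "x", followed by the birth date as YYYYMMDD), the function names and the tiny registry are all assumptions made for illustration.

```python
import unicodedata
from datetime import date


def normalise(name: str) -> str:
    """Lower-case ASCII letters only: 'Brontë' becomes 'bronte'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if c.isascii() and c.isalpha()).lower()


def make_entity_id(surname: str, birth_date: str, length: int = 5) -> str:
    """Fixed-length name stem plus ISO birth date, e.g. 'bront18160421'."""
    stem = normalise(surname)[:length].ljust(length, "x")
    d = date.fromisoformat(birth_date)  # raises ValueError on a malformed date
    return f"{stem}{d.year:04d}{d.month:02d}{d.day:02d}"


def find_existing(name_variants, birth_date, registry):
    """Try every known form of the name; return the first ID already listed."""
    for variant in name_variants:
        candidate = make_entity_id(variant, birth_date)
        if candidate in registry:
            return candidate  # "Hit."
    return None  # "No hit.": not listed yet, or listed under a name not tried


if __name__ == "__main__":
    registry = {make_entity_id("Brontë", "1816-04-21"): "Charlotte Brontë"}
    # Searching under the married name alone misses; adding the earlier name hits.
    print(find_existing(["Nicholls"], "1816-04-21", registry))
    print(find_existing(["Nicholls", "Brontë"], "1816-04-21", registry))
```

Because the ID is derived once and never changed, a later name change does not touch the entry; it only means that a search has to run through the earlier forms of the name as well.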

Wrapped up:

> systematic checking process) for them to make sure they (and I) are finding an entry when it already exists. Authors writing under pseudonyms and women whose surnames changed over time have caused us trouble when we are in a hurry to add new entries to the lists, and I've got a little team of student assistants at work helping us to find dupes, and fix any problems/confusions they've generated in the document markup.


This cries out for an automated, algorithm-like solution, doesn't it?

Michael
--
Dr. Michael Dahnke
WMA zur Vermittlung von Digitalisierungskompetenz
Abteilung Digitalisierung
Kallimachos-Zentrum für Digitial Humanities

Universitätsbibliothek Würzburg
Am Hubland
D-97074 Würzburg

Ruf: +49-931-31-88562
[hidden email]


Re: check digits, was: Storing <listPerson>, <listPlace>, <listBibl> or any further <list***> in separate xml file instead of header of corpus xml file

Elisa Beshero-Bondar
Dear Michael,
Our system on the Mitford project is designed to be human-readable *as well as* systematically checked with Schematron and Relax-NG. By calling for something automated, you are, I take it, advocating for a machine generated set of ids automatically applied. That could be done, actually, by selecting portions of names that we deem the most distinctive.

Just because we run into difficulties doesn't mean the system is broken and unworthy of pursuing. Ours has worked for years now, and we aren't about to change the method--because human-readable xml:ids are important to our team. Indeed, we use automated checking on our project to help us identify problems, and we can continually refine it. This is why I am a huge fan of Schematron.

When we run into the problems that I raised and that you exemplified, we make a decision, and in our project, we decide based on the following factors, documented in our code manual:
1) Once we define an id and it is in prevalent use, we continue with that id. If that id starts with the patronymic of a woman, that simply means we continue with it. 

2) If it's a new person and they have lots of names to choose from, we decide which to prioritize as the opening of the string based on how that person was typically best known in the project.

The rules aren't hard and fast--for royalty, we go with simple things like Chas1 for King Charles I. We keep it as simple and memorable as we can. But even if not all xml:ids follow the same system of prioritizing a surname of some kind, they are quite simply each distinct from all the others.

We do, in fact, have something like check digits in place to help distinguish ids for fictional characters from those for historic figures. For Alice Liddell (the young friend of Charles Lutwidge Dodgson, pseudonym Lewis Carroll), if Mitford had lived long enough to know her, we'd probably give her historic personage the xml:id LiddellA (regardless of her married name), and the fictional persona from Alice in Wonderland would be xml:id Alice_fic, or even AliceWon_fic.

To some degree what we choose is quite human, but the only thing that really matters here is that we come up with some string that isn't likely to collide with other designations we come up with. And once it's set, it's set. That's all. The machine systems kick in to ensure the distinctiveness and guide the spelling, and help with data extraction, and all that.
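
The kind of machine check described here (every xml:id distinct, every @ref pointing at an existing entry) is done with Schematron and Relax NG on the project. As a rough stand-in only, the same two tests can be sketched in Python with the standard library. The sample documents are invented for this sketch, using ids in the style mentioned in this thread, and the elements are left un-namespaced for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

LIST_FILE = """<listPerson>
  <person xml:id="BronteC"/>
  <person xml:id="TalfourdThos"/>
  <person xml:id="Alice_fic"/>
</listPerson>"""

TEXT_FILE = """<p><persName ref="#BronteC">Charlotte</persName> wrote to
<persName ref="#TalfordThos">Talfourd</persName>.</p>"""


def duplicate_ids(list_root):
    """xml:id values that occur more than once in the central list."""
    counts = Counter(el.get(XML_ID) for el in list_root.iter() if el.get(XML_ID))
    return sorted(value for value, n in counts.items() if n > 1)


def dangling_refs(text_root, known_ids):
    """@ref tokens whose fragment identifier is not in the central list."""
    missing = []
    for el in text_root.iter():
        for token in (el.get("ref") or "").split():
            target = token.rsplit("#", 1)[-1]  # works for '#id' and 'file.xml#id'
            if target not in known_ids:
                missing.append(token)
    return missing


if __name__ == "__main__":
    people = ET.fromstring(LIST_FILE)
    known = {el.get(XML_ID) for el in people.iter() if el.get(XML_ID)}
    print(duplicate_ids(people))
    print(dangling_refs(ET.fromstring(TEXT_FILE), known))
```

A misspelled human-readable id such as "TalfordThos" is reported because nothing in the list carries it, which is the spelling guidance mentioned above; it does not help when a reference points at a valid but wrong entry.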

It has worked for us for 5 years running, so I suppose there's that.

Best,
Elisa





--
Elisa Beshero-Bondar, PhD
Director, Center for the Digital Text | Associate Professor of English
University of Pittsburgh at Greensburg | Humanities Division
150 Finoli Drive
Greensburg, PA  15601  USA
E-mail:[hidden email]
Development site: http://newtfire.org