Encoding combination of two versions of text

Michael.Dahnke
Dear honourable paragon of wisdom,

for »Narragonien« http://kallimachos.de/kallimachos/index.php/Narragonien we
have digitized different versions of the so-called »Ship of Fools«.
Currently we have two versions of the texts: the raw OCR output and an
already normalized version. Is there a common way of encoding them so that
the connection between the two is evident? Our suggestion is the following:

<div rend="mainText">
    <div type="normalized">
        <p>Viel, viel sind meiner Tage
           Durch Sünd entweiht gesunken hinab.
           O großer Richter, frage
           Nicht wie, o lasse ihr Grab
           Erbarmende Vergessenheit
           Laß, Vater der Barmherzigkeit,
           Das Blut des Sohns es decken.</p>
    </div>
    <div type="OCR">
        <p>Ach wenig sind der Tage
           Mit Frömmigkeit gekrönt entflohn,
           Sie sinds, mein Engel, trage
           Sie vor des Ewigen Thron,
           Laß schimmern die geringe Zahl,
           Daß einsten mich des Richters Wahl
           Zu seinen Frommen zähle.</p>
    </div>
</div>

We would be grateful for any suggestions.

Thanks in advance,
Michael



--
Sent from: http://tei-l.970651.n3.nabble.com/

Re: Encoding combination of two versions of text

Syd Bauman-10
Fast-and-lousy answer (I gotta run): see
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SACS

Re: Encoding combination of two versions of text

Staecker
In reply to this post by Michael.Dahnke
Dear Michael,
not so much as a paragon of wisdom but as a humble TEI encoder, I feel a little uneasy about such a solution, because you leave out the OCR coordinates that let you link sections of the text to the image on which the OCR process was based. By the way, the same goes for the solution that the TEI Libraries SIG offered quite recently on this list, which I consider rather unsatisfying in that respect. I would rather suggest using <sourceDoc> to accommodate and save OCRed texts together with their coordinates. I'd be really curious to hear the opinions of others about this.
Of course, the output of the OCR process, ideally in ALTO or some other standardized format, has to be converted from that format into a form compliant with <sourceDoc>. The original format could be added to the TEI <xenoData> section.
Best,
Thomas

On 20.11.2017 at 14:57, Michael.Dahnke wrote:
[...]

-- 
***************************************
Prof. Dr. Thomas Stäcker
Direktor der
Universitäts- und Landesbibliothek Darmstadt
Magdalenenstr. 8
64289 Darmstadt
+49 (0)6151 16-76200
[hidden email]

Re: Encoding combination of two versions of text

Michael.Dahnke
In reply to this post by Michael.Dahnke
Dear Thomas

»as a humble TEI encoder«

That's exactly my level too, a peer among peers. Second, thanks a lot for
sharing your uneasiness. Could you please lay out your idea of a possible
three-part solution in a bit more detail? This is what I understand so far:

1. »<sourceDoc> to accommodate and save OCRed texts together with their
coordinates«,

2. »output of the OCR process, ideally in ALTO or some other standardized
format, has to be converted from that format into a form compliant with
<sourceDoc>« How? Any hints? Where is further information available?

3. »The original format could be added to the TEI <xenoData> section«.
Again, admittedly, I'm still very much a beginner, so: any
recommendations for in-depth information?

Cordially

Michael




Re: Encoding combination of two versions of text

Kevin Hawkins
In reply to this post by Michael.Dahnke
Thomas,

I'm not sure which solution you're referring to, but I want to point out
that handling OCR text with coordinates is a feature that we were
interested in incorporating into the Best Practices for TEI in
Libraries.  Indeed, a colleague made a similar suggestion a few years ago:

https://github.com/kshawkin/Best-Practices-for-TEI-in-Libraries/issues/27

... but as you can see we labeled this issue as "dormant" because the
colleague who suggested it never provided more details, and none of us
working on revising the BPTL felt knowledgeable enough to come up
with an appropriate solution.

We welcome a proposal from you or others on how to handle this in the BPTL.

Kevin

On 11/24/17 5:11 AM, Staecker wrote:

> [...]

Re: Encoding combination of two versions of text

Staecker
Kevin, Michael,

thanks for coming back to this. Actually, I have no practical example so far that might prove what I have in mind, but the application of <sourceDoc> seems quite obvious to me, for several reasons:

a) The TEI Guidelines state that <sourceDoc> is used to combine a transcription with a facsimile by means of an embedded transcription that depends on the physical form (TEI 11.2.2). This describes exactly what an OCR engine does: it creates a "transcription" on the basis of an actual physical page.

b) Most standard OCR formats (such as PAGE or ALTO) encode the coordinates of the page, the text blocks, non-text blocks, lines, words etc. Therefore, if you take e.g. the code below from an ALTO file, it should not be difficult to convert it to a <sourceDoc>-compliant form using <surface>, <zone> and <line> with @ulx, @uly, @lrx and @lry. Words are a little bit tricky, as coordinates are not allowed on <w>, but all such cases could be handled by something like <zone type="word">. For the example below, we could end up with something like this (abbreviated form):

<a:Page> -> <surface ulx="00" uly="00" lrx="00" lry="00" xml:id="p23">
            <graphic url="image1.jpg" height="2981px" width="2000px"/>

<a:TextLine> -> <line ulx="00" uly="00" lrx="00" lry="00" xml:id="line123">

<a:String CONTENT="ubi"> -> <zone type="word" ulx="00" uly="00" lrx="00" lry="00">ubi</zone>

<a:SP> -> <zone type="space" ulx="00" uly="00" lrx="00" lry="00"/>

The ALTO coordinates are offsets and have to be recalculated to fit the absolute coordinates prescribed by TEI. The ALTO file could additionally be preserved in a <xenoData> container, if considered necessary.
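
To make the recalculation concrete, here is a minimal Python sketch, not a finished converter. It assumes that "offsets" means ALTO's upper-left corner plus size (HPOS, VPOS, WIDTH, HEIGHT), so that lrx = HPOS + WIDTH and lry = VPOS + HEIGHT. The function names, the ALTO v2 namespace URI in the sample, and the rule that an <a:SP> without HEIGHT extends to the bottom of its line are illustrative assumptions, not something prescribed by ALTO or TEI.

```python
import xml.etree.ElementTree as ET

# Three elements copied from the ALTO example in this thread. The namespace
# URI is an assumption (ALTO v2); the thread does not show the declaration.
ALTO_SAMPLE = """
<a:TextLine xmlns:a="http://www.loc.gov/standards/alto/ns-v2#"
            HEIGHT="61.0" WIDTH="1276.0" HPOS="-18.0" VPOS="353.0">
  <a:String HEIGHT="36.0" WIDTH="53.0" VPOS="374.0" HPOS="165.0" CONTENT="ubi"/>
  <a:SP WIDTH="62.0" VPOS="373.5" HPOS="165.0"/>
  <a:String HEIGHT="34.0" WIDTH="106.0" VPOS="373.0" HPOS="227.0" CONTENT="influit"/>
</a:TextLine>
"""


def local_name(element):
    """Return the tag name without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def box(element, fallback_bottom=None):
    """Turn ALTO HPOS/VPOS/WIDTH/HEIGHT into TEI ulx/uly/lrx/lry strings.

    If HEIGHT is missing (as on <a:SP> in the sample), fallback_bottom is
    used as the lower edge; that rule is an assumption of this sketch.
    """
    hpos = float(element.get("HPOS"))
    vpos = float(element.get("VPOS"))
    width = float(element.get("WIDTH"))
    height = element.get("HEIGHT")
    bottom = vpos + float(height) if height is not None else fallback_bottom
    corners = {"ulx": hpos, "uly": vpos, "lrx": hpos + width, "lry": bottom}
    # ':g' drops a trailing '.0' but keeps real fractions such as 373.5.
    return {name: f"{value:g}" for name, value in corners.items()}


def alto_line_to_tei(text_line):
    """Build a TEI <line> with one <zone> per ALTO String or SP child.

    The result is left without the TEI namespace to keep the sketch short.
    """
    line_bottom = float(text_line.get("VPOS")) + float(text_line.get("HEIGHT"))
    tei_line = ET.Element("line", box(text_line))
    for child in text_line:
        kind = local_name(child)
        if kind == "String":
            zone = ET.SubElement(tei_line, "zone", {"type": "word", **box(child)})
            zone.text = child.get("CONTENT")
        elif kind == "SP":
            ET.SubElement(
                tei_line, "zone", {"type": "space", **box(child, line_bottom)}
            )
    return tei_line


if __name__ == "__main__":
    tei_line = alto_line_to_tei(ET.fromstring(ALTO_SAMPLE))
    print(ET.tostring(tei_line, encoding="unicode"))
```

Note that the sample line has a negative HPOS; the sketch passes such values through unchanged, and a real conversion would have to decide whether to clamp them to the surface.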

From the <text> part, links could be provided, e.g. by <pb corresp="#p23"/> or something more granular, to align the OCRed text and other transcriptions.
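
Such links are only useful if they resolve, so a small check may help. The following Python sketch assumes a single TEI file in which <surface> elements carry @xml:id and <pb> elements point to them via @corresp, as in the mapping above. The sample document is deliberately incomplete (no <teiHeader>), and the function name and the second page id "p24" are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Deliberately incomplete TEI sample (no teiHeader); "p24" is a made-up
# dangling target used to show what the check reports.
TEI_SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sourceDoc>
    <surface xml:id="p23" ulx="0" uly="0" lrx="2000" lry="2981"/>
  </sourceDoc>
  <text><body>
    <pb corresp="#p23"/>
    <pb corresp="#p24"/>
  </body></text>
</TEI>
"""


def unresolved_page_links(root):
    """Return the @corresp values of <pb> elements that match no <surface>."""
    surface_ids = {s.get(XML_ID) for s in root.iter(f"{{{TEI_NS}}}surface")}
    missing = []
    for pb in root.iter(f"{{{TEI_NS}}}pb"):
        target = pb.get("corresp", "")
        # A local pointer has the form "#id"; strip the leading "#".
        if target.lstrip("#") not in surface_ids:
            missing.append(target)
    return missing


if __name__ == "__main__":
    print(unresolved_page_links(ET.fromstring(TEI_SAMPLE)))
```

The same idea extends to more granular links, for example from a line or word in <text> to a <line> or <zone> in <sourceDoc>.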

A standard XSLT for that purpose would be most welcome ;-)

Best,
Thomas
 

---------------------------------------
ALTO - Example

<a:Page ID="ID_BIT_d64de28b-577a-4c9a-a891-c2433f1b1fcd" HEIGHT="2981" WIDTH="2000"
            PHYSICAL_IMG_NR="0" QUALITY="OK"
            PROCESSING="ID_BIT_60994a5a-5cc4-41f9-9e1d-a808c5e6fe08">

[...]
<a:TextLine ID="ID_BIT_8e868a02-a324-40b4-92a8-dc4900eb647a" HEIGHT="61.0"
                        WIDTH="1276.0" HPOS="-18.0" VPOS="353.0">
                        <a:String HEIGHT="36.0" WIDTH="163.0" VPOS="378.0" HPOS="-18.0"
                            CONTENT="oriseclitton"/>
                        <a:SP WIDTH="165.0" VPOS="376.5" HPOS="-18.0"/>
                        <a:String HEIGHT="39.0" WIDTH="5.0" VPOS="375.0" HPOS="147.0" CONTENT="."/>
                        <a:SP WIDTH="18.0" VPOS="374.5" HPOS="147.0"/>
                        <a:String HEIGHT="36.0" WIDTH="53.0" VPOS="374.0" HPOS="165.0" CONTENT="ubi"/>
                        <a:SP WIDTH="62.0" VPOS="373.5" HPOS="165.0"/>
                        <a:String HEIGHT="34.0" WIDTH="106.0" VPOS="373.0" HPOS="227.0"
                            CONTENT="influit"/>
                        <a:SP WIDTH="115.0" VPOS="372.0" HPOS="227.0"/>
                        <a:String HEIGHT="36.0" WIDTH="7.0" VPOS="371.0" HPOS="342.0" CONTENT=","/>
                        <a:SP WIDTH="21.0" VPOS="371.0" HPOS="342.0"/>
                        <a:String HEIGHT="35.0" WIDTH="53.0" VPOS="371.0" HPOS="363.0" CONTENT="non"/>
                        <a:SP WIDTH="62.0" VPOS="370.5" HPOS="363.0"/>
                        <a:String HEIGHT="35.0" WIDTH="28.0" VPOS="370.0" HPOS="425.0" CONTENT="in"/>
                        <a:SP WIDTH="37.0" VPOS="369.5" HPOS="425.0"/>
                    [...]


  


On 26.11.2017 at 21:27, Kevin Hawkins wrote:
[...]

Re: Encoding combination of two versions of text

Michael.Dahnke
Hi Kevin, hi Thomas

Briefly: thanks to both of you for your consideration and your input. This
afternoon I'm meeting an experienced expert eager to establish endless…
alliterations? But seriously, again:

@Thomas: We'll discuss your suggestion this afternoon, and if it results in
anything that we think could both withstand critical consideration and
inspire others, we will surely share it.

Best

Michael


