adding or deleting tokens in normalization


Gael Vaamonde
Dear List,

We are working on a collection of private letters (Early Modern Portugal and Spain). We offer both the original text and the corresponding normalized text, so we use <choice>, <orig> and <reg> for tokens that are normalized. For instance:

<w>
     <choice>
           <orig>aver</orig>
           <reg>haber</reg>
     </choice>
</w>

Our problem has to do with tokens that are added or deleted in the normalized version (punctuation, basically), because in these cases we have to deal with one option that has no content (<orig> in additions; <reg> in deletions):

<pc>
     <choice>
           <orig><!-- no content --></orig>
           <reg><!-- punctuation added--></reg>
     </choice>
</pc>

<pc>
     <choice>
           <orig><!--original punctuation --></orig>
           <reg><!-- no content  --></reg>
     </choice>
</pc>

Are there suitable ways of encoding this?

Thanks in advance.

Gael Vaamonde





Re: adding or deleting tokens in normalization

Janelle Jenstad

Is it possible for <orig> to have no content?

 

Re: adding or deleting tokens in normalization

Piotr Banski
In reply to this post by Gael Vaamonde
Hi Gael,

Do you do any processing of the punctuation marks? In particular, would
it hurt your system if you were to do something like:

<p>Some
<choice><orig><w>abbrv</w><pc>.</pc></orig><reg><w>abbreviation</w></reg></choice>
text here.</p>

I realise that this means taking the <choice> bracket one level higher,
as it were -- just speculating about what "suitable" can mean, in your case.
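
A minimal sketch of what this higher-level <choice> means for anyone counting tokens, assuming Python's standard xml.etree.ElementTree and a namespace-free copy of the example above. The helper name marked_tokens is hypothetical; it lists only the <w> and <pc> tokens visible in one layer, so the original and regularized layers can contain different numbers of tokens.

```python
import xml.etree.ElementTree as ET

# The example from this message, without the TEI namespace (an assumption for brevity).
SAMPLE = (
    "<p>Some <choice><orig><w>abbrv</w><pc>.</pc></orig>"
    "<reg><w>abbreviation</w></reg></choice> text here.</p>"
)

def marked_tokens(elem, layer):
    """Yield (tag, text) for every <w>/<pc> visible in the given layer ('orig' or 'reg')."""
    skip = "reg" if layer == "orig" else "orig"
    if elem.tag == skip:
        return  # this branch belongs to the other layer
    if elem.tag in ("w", "pc"):
        yield elem.tag, "".join(elem.itertext())
        return
    for child in elem:
        yield from marked_tokens(child, layer)

root = ET.fromstring(SAMPLE)
orig_tokens = list(marked_tokens(root, "orig"))
reg_tokens = list(marked_tokens(root, "reg"))
print(orig_tokens)
print(reg_tokens)
```

The untagged words ("Some", "text here.") are not reported because the example does not wrap them in <w>.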

Good luck,

   Piotr

On 10/06/17 17:32, Gael Vaamonde wrote:

> Dear List,
>
> We are working on a collection of private letters (Early Modern Portugal
> and Spain). We offer both the original text and the corresponding
> normalized text, so we use <choice>, <orig> and <reg> for tokens that
> are normalized. For instance:
>
> <w>
>       <choice>
>             <orig>aver</orig>
>             <reg>haber</reg>
>       </choice>
> </w>
>
> Our problem has to do with tokens that are added or deleted in the
> normalized version (punctuation, basically), because in these cases we
> have to deal with one option that has no content (<orig> in additions;
> <reg> in deletions):
>
> <pc>
>       <choice>
>             <orig><!-- no content --></orig>
>             <reg><!-- punctuation added--></reg>
>       </choice>
> </pc>
>
> <pc>
>       <choice>
>             <orig><!--original punctuation --></orig>
>             <reg><!-- no content  --></reg>
>       </choice>
> </pc>
>
> Are there suitable ways of encoding this?
>
> Thanks in advance.
>
> Gael Vaamonde

Re: adding or deleting tokens in normalization

Peter Boot
In reply to this post by Gael Vaamonde

Hello Gael,

I would think that 

     <choice>
           <orig><!-- no content --></orig>
           <reg><!-- punctuation added--></reg>
     </choice>

is correct, but means exactly the same as 

     <choice>
           <reg><!-- punctuation added--></reg>
     </choice>

and even as 

           <reg><!-- punctuation added--></reg>

So the choice is between various levels of explicitness / verboseness, all three correct.
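
As a rough illustration of the equivalence claimed here, the following sketch (Python standard library only) extracts the original and the regularized reading from the three variants and compares them. The <pc> wrapper, the comma, and the helper name layer_text are assumptions for the example. It only compares the readings a processor would derive; whether each variant validates against a particular TEI schema is a separate question that it does not check.

```python
import xml.etree.ElementTree as ET

# Three hypothetical encodings of a comma added in the normalized version.
VARIANTS = {
    "choice with empty orig": "<pc><choice><orig/><reg>,</reg></choice></pc>",
    "choice with reg only": "<pc><choice><reg>,</reg></choice></pc>",
    "bare reg": "<pc><reg>,</reg></pc>",
}

def layer_text(elem, layer):
    """Return the text of one layer, skipping the element of the other layer."""
    skip = "reg" if layer == "orig" else "orig"
    if elem.tag == skip:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(layer_text(child, layer))
        parts.append(child.tail or "")  # text following the child stays in both layers
    return "".join(parts)

readings = {}
for name, markup in VARIANTS.items():
    root = ET.fromstring(markup)
    readings[name] = (layer_text(root, "orig"), layer_text(root, "reg"))

for name, pair in readings.items():
    print(name, pair)
```

All three variants give an empty original reading and "," as the regularized reading, so the difference lies in how explicit the markup is.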

Peter



Re: adding or deleting tokens in normalization

Gael Vaamonde
In reply to this post by Gael Vaamonde
Thanks for your answer, Piotr.

> Do you do any processing of the punctuation marks?

We use a web-based platform to apply normalization and linguistic annotation semi-automatically to each token (including punctuation). The manually revised output is then transformed automatically into TEI P5, so the TEI P5 versions are the final step in our workflow; we do no further processing after that.

> just speculating about what "suitable" can mean, in your case.

By "suitable" we have two purposes in mind:

a) suitable for other users who want to use our data (who may be interested, among other things, in recovering or processing punctuation marks in both the original and the normalized versions).

b) suitable in terms of interoperability, so we would prefer to follow the same strategy as other, similar corpora.
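
To make purpose (a) concrete, here is a minimal sketch of how a downstream user might recover both token sequences from token-level <choice> markup, treating an empty <orig> or <reg> as "token absent in this layer". It assumes Python's standard xml.etree.ElementTree; the sample sentence, the missing TEI namespace, and the helper names reading and tokens are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one normalized word, one added comma, one deleted virgule.
SAMPLE = """<s>
  <w><choice><orig>aver</orig><reg>haber</reg></choice></w>
  <pc><choice><orig/><reg>,</reg></choice></pc>
  <w>mas</w>
  <pc><choice><orig>/</orig><reg/></choice></pc>
</s>"""

def reading(token, layer):
    """Return the text of one token in the requested layer ('orig' or 'reg')."""
    choice = token.find("choice")
    if choice is None:
        # No <choice>: the token is identical in both layers.
        return (token.text or "").strip()
    alt = choice.find(layer)
    # A missing or empty alternative means the token is absent in that layer.
    return "" if alt is None else "".join(alt.itertext()).strip()

def tokens(root, layer):
    """List the non-empty tokens of a layer, in document order."""
    result = []
    for token in root:
        text = reading(token, layer)
        if text:
            result.append(text)
    return result

root = ET.fromstring(SAMPLE)
original = tokens(root, "orig")
normalized = tokens(root, "reg")
print(original)
print(normalized)
```

With empty alternatives in place, every added or deleted punctuation mark remains an addressable token, and each layer can be rebuilt by skipping the tokens that are empty in it.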

Re: adding or deleting tokens in normalization

Christian Thomas
In reply to this post by Gael Vaamonde
Dear all, since no one has suggested it yet, I started to wonder, then became unsure, and now have to ask: why not tag missing or superfluous punctuation using empty elements? For example, for a clear misprint that I want to correct:

<choice>
   <sic/>
   <corr>!</corr>
</choice>

<choice>
   <sic>!</sic>
   <corr/>
</choice>

or, if I want to regularise these things (for something that would, e.g., confuse human readers and/or my automated detection of sentence boundaries):

<choice>
   <orig/>
   <reg>!</reg>
</choice>

<choice>
   <orig>!</orig>
   <reg/>
</choice>

Would you approve of that? 

Best wishes
Christian 


Christian Thomas
E-Mail: [hidden email]

On 06.10.2017 at 20:55, Peter Boot <[hidden email]> wrote:

> Hello Gael,
>
> I would think that
>
>      <choice>
>            <orig><!-- no content --></orig>
>            <reg><!-- punctuation added--></reg>
>      </choice>
>
> is correct, but means exactly the same as
>
>      <choice>
>            <reg><!-- punctuation added--></reg>
>      </choice>
>
> and even as
>
>            <reg><!-- punctuation added--></reg>
>
> So the choice is between various levels of explicitness / verboseness, all three correct.
>
> Peter




Re: adding or deleting tokens in normalization

Syd Bauman
Yes, I would certainly approve. I don't think the _Guidelines_ are
very clear on this issue, but I think of

   <choice>
     <sic>!</sic>
     <corr/>
   </choice>

as being *very* similar to just

   <sic>!</sic>

Semantically they are, I suppose, slightly different. The former says
"there was a '!' in the source that I think was odd or incorrect, and
where I (the encoder or editor) think there should have been
nothing". The latter just says "there was a '!' in the source that I
think was odd or incorrect". Thus the latter permits multiple
inferences:
 * that I think there should have been nothing there, or
 * that I do not know what should have been there, or
 * that I don't feel like telling you what I think should have been
   there.

Despite that tiny ambiguity, I just use <sic>!</sic>.


> Dear all, since no one suggested it yet, I wondered, then got
> insecure and now have to ask: why not tag missing or superfluous
> punctuation using empty tags, for example, in case of a clear
> misprint that I want to correct:
>   <choice>
>     <sic/>
>     <corr>!</corr>
>   </choice>
> [or]
>   <choice>
>     <sic>!</sic>
>     <corr/>
>   </choice>
> or, in case I want to regularise these things (for sth. that would
> e.g. confuse human readers and/or my automated detection of
> sentence boundaries):
>   <choice>
>     <orig/>
>     <reg>!</reg>
>   </choice>
> [or]
>   <choice>
>     <orig>!</orig>
>     <reg/>
>   </choice>
> Would you approve of that?

Re: adding or deleting tokens in normalization

Gerrit Brüning
Dear Christian,

For text that should not be there I would use <surplus>.
This is more precise than <sic>, I think.
For text that should be there but is not I would use <supplied>.
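
A small sketch of how these two elements could be read back, assuming Python's standard xml.etree.ElementTree and a hypothetical, namespace-free sample: the source reading keeps <surplus> and drops <supplied>, and the edited reading does the opposite. The helper name flatten and the sample sentence are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a superfluous "!" in the source and a full stop supplied by the editor.
SAMPLE = (
    "<s><w>Hola</w><surplus><pc>!</pc></surplus> "
    "<w>mundo</w><supplied><pc>.</pc></supplied></s>"
)

def flatten(elem, drop):
    """Concatenate all text under elem, leaving out any element named `drop`."""
    if elem.tag == drop:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(flatten(child, drop))
        parts.append(child.tail or "")  # the tail belongs to the parent, so it is always kept
    return "".join(parts)

root = ET.fromstring(SAMPLE)
source_reading = flatten(root, drop="supplied")  # what the document actually has
edited_reading = flatten(root, drop="surplus")   # what the editor thinks it should have
print(source_reading)
print(edited_reading)
```

Unlike a <choice> with one empty alternative, each element here stands alone, so a processor only needs to know which of the two to leave out for a given reading.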

Best,

Gerrit

---
Dr. Gerrit Brüning
Freies Deutsches Hochstift | Historisch-kritische Edition von Goethes Faust | beta.faustedition.net
Goethe-Universität Frankfurt am Main | Institut für deutsche Literatur und ihre Didaktik | IG-Hochhaus 1.155


> -----Original Message-----
> From: TEI (Text Encoding Initiative) public discussion list [mailto:TEI-[hidden email]] On Behalf Of Syd Bauman
> Sent: Saturday, 14 October 2017 17:44
> To: [hidden email]
> Subject: Re: adding or deleting tokens in normalization
>
> Yes, I would certainly approve. I don't think the _Guidelines_ are very clear
> on this issue, but I think of
>
>    <choice>
>      <sic>!</sic>
>      <corr/>
>    </choice>
>
> as being *very* similar to just
>
>    <sic>!</sic>
>
> Semantically they are, I suppose, slightly different. The former says "there
> was a '!' in the source that I think was odd or incorrect, and where I (the
> encoder or editor) think there should have been nothing". The latter just says
> "there was a '!' in the source that I think was odd or incorrect". Thus the
> latter permits multiple inferences:
>  * that I think there should have been nothing there, or
>  * that I do not know what should have been there, or
>  * that I don't feel like telling you what I think should have been
>    there.
>
> Despite that tiny ambiguity, I just use <sic>!</sic>.