TEI element for a grapheme?

TEI element for a grapheme?

Emmanuelle Morlock
Dear TEI list,

Which TEI element would you use to represent a grapheme (as opposed to a
phoneme)?

If I understand correctly, an <fs> would be used for a phoneme; cf.
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html#FSFL

Is using <c> (character) possible when the grapheme is composed of more
than one graphic character?

For example, if I want to say: "Dupond is reading <th> instead of <b>"

- Dupond is reading <g type="grapheme">th</g> instead of <g
type="grapheme">b</g>

- Dupond is reading <c>th</c> instead of <c>b</c>

- Dupond is reading <g type="grapheme">th</g> instead of <c>b</c>
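
The three candidates above can be compared mechanically. The following is only a minimal sketch in Python: it assumes each candidate is wrapped in a hypothetical <p> element (not part of the message) and that quotes and tags are balanced. It checks well-formedness only, not validity against any TEI schema.

```python
import xml.etree.ElementTree as ET

# The three candidate encodings, each wrapped in a hypothetical <p>
# so that it forms a single well-formed XML fragment.
CANDIDATES = [
    '<p>Dupond is reading <g type="grapheme">th</g> instead of '
    '<g type="grapheme">b</g></p>',
    '<p>Dupond is reading <c>th</c> instead of <c>b</c></p>',
    '<p>Dupond is reading <g type="grapheme">th</g> instead of <c>b</c></p>',
]


def marked_units(fragment):
    """Return (element name, type attribute, text) for each child element."""
    root = ET.fromstring(fragment)  # raises ParseError if not well-formed
    return [(el.tag, el.get("type"), el.text) for el in root]


for fragment in CANDIDATES:
    print(marked_units(fragment))
```

A fragment with mismatched quotes or a stray closing tag would make `ET.fromstring` raise `ParseError`, which is a quick way to catch slips before asking which element is semantically right.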

Thanks for your help!

Best,

--
Emmanuelle Morlock
IE CNRS - Digital Humanities & TEI
UMR 5189 HISoMA
http://www.hisoma.mom.fr

06 85 84 69 16
@emma_morlock

----------------------------------------------
Member of the coordinating committee of Humanistica,
    French-language association for Digital Humanities
    <http://www.humanisti.ca>

HiSoMA page: http://www.hisoma.mom.fr/annuaire/morlock-emmanuelle
HAL page: https://cv.archives-ouvertes.fr/emmanuelle-morlock

Re: TEI element for a grapheme?

Martin Holmes
Hi Emmanuelle,

There's one thing I'm not clear on in your question: my understanding of
a grapheme is that it's the smallest unit in a writing system, so I'm
not sure what this means:

<g type="grapheme">th</g>

Is there a single grapheme in this particular script (thorn or eth, for
instance) which is what you're referencing here? I don't see how a
combination of two glyphs (apologies for loose terminology) can
constitute a single grapheme.

Cheers,
Martin


Re: TEI element for a grapheme?

Paul Schaffner
I observe that the Guidelines use the term 'grapheme' once or twice
(e.g. in defining a <c> element as containing a string of graphemes),
without defining it. But from that example I think we can infer the
intended meaning: an 'atom' (or minimal unit) of the formal writing
system, whereas a character is a minimal unit of the semantic/semiotic
writing system. E.g. in Welsh, "ll" and "ch" are usually thought of as
*characters* in their own right: dictionaries, for example, separate
words beginning with "ll-" and "ch-" from those beginning with "l-" and
"c-". But 'll' and 'ch' are not graphemes: they are characters composed
of two graphemes ('l' + 'l' or 'c' + 'h') each. And indeed, in
alternative Welsh orthographies (such as that invented by Salesbury),
the characters in question still exist, but are composed of different
graphic units (an l with a dot over it, for example). Under that
understanding, surely the TEI method would be to capture "ch" as a
character (<c>) and the 'c' and 'h', if we wished to encode them at
all, as opposed to relying on the Unicode inventory, as <g> elements.
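
The dictionary-ordering point can be illustrated with a small sketch in Python. It assumes the modern 29-letter Welsh alphabet and a naive greedy split into digraphs; the names `welsh_letters` and `welsh_key` are made up for this illustration, and the greedy rule is a simplification (it would wrongly merge n + g in a word such as "Bangor").

```python
# Modern Welsh alphabet in dictionary order; digraphs count as single letters.
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h", "i", "j",
    "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s", "t", "th", "u", "w", "y",
]
RANK = {letter: i for i, letter in enumerate(WELSH_ALPHABET)}


def welsh_letters(word):
    """Greedily split a lowercase word into Welsh alphabet letters."""
    letters, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        size = 2 if len(pair) == 2 and pair in RANK else 1
        letters.append(word[i:i + size])
        i += size
    return letters


def welsh_key(word):
    """Sort key: alphabet ranks of the word's letters (KeyError if a letter is unknown)."""
    return [RANK[letter] for letter in welsh_letters(word)]


words = ["llan", "lol", "cwm", "chwech", "cath"]
print(sorted(words))                 # code-point order
print(sorted(words, key=welsh_key))  # "ch" after all "c-", "ll" after all "l-"
```

With the plain sort, "chwech" lands between "cath" and "cwm"; with the Welsh key it follows every word in "c-", which is the behaviour the dictionaries show.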

But I am probably misunderstanding. In my experience, all discussions
of characters, glyphs, graphs, and symbols end in a metaphysical
muddle.

pfs

--
Paul Schaffner  Digital Library Production Service
[hidden email] | http://www.umich.edu/~pfs/

Re: TEI element for a grapheme?

Emmanuelle Morlock
In reply to this post by Martin Holmes
Thanks Martin,

Yes, in my example it's a single grapheme.

But although graphemes are the smallest units in a writing system, they
can be composed of more than one letter. For example, [o] = o, au, eau in
French. The use case comes from the transliteration of an inscribed text;
that's why a single glyph drawn on a page, or inscribed on a stone, may
be represented by a combination of letters in the transliteration...

Is that clearer? Apologies too for the loose terminology; I'd be glad to
hear more from linguists on this topic. I understand that, depending on
the linguistic school of thought, the segmentation of a word into
graphemes might vary....

Would <g type="graphematic">th</g> sound better?

Cheers
Emmanuelle



Re: TEI element for a grapheme?

Martin Holmes
Hi Emmanuelle,

It sounds like what you mean by grapheme might be what I understand by
phoneme, so I would guess that the usage of these terms in your field
differs from what I'm used to.

Cheers,
Martin


Re: TEI element for a grapheme?

Emmanuelle Morlock
Hi Martin, and Paul,

Yes, the definition I used (taken from a French dictionary, http://www.cnrtl.fr/definition/graph%C3%A8me) defines a grapheme as the smallest unit in a writing system that transcribes a phoneme. But it also says that some other linguists consider a grapheme to be aligned with a letter. For example, the word "temps" (time) has 2 graphemes according to the first definition (t-emps) and 4 according to the second (t-em-p-s)...
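
The two segmentations of "temps" can be reproduced with a short sketch in Python. It is only an illustration: the function `segment` and the two one-item inventories are assumptions made for this example, not something taken from the dictionary entry, and a real analysis would need a full inventory for the language.

```python
def segment(word, multi_letter_units):
    """Split word greedily, preferring the longest listed unit at each position."""
    units = sorted(multi_letter_units, key=len, reverse=True)
    result, i = [], 0
    while i < len(word):
        # Fall back to a single letter when no listed unit starts here.
        match = next((u for u in units if word.startswith(u, i)), word[i])
        result.append(match)
        i += len(match)
    return result


# First definition: "emps" as one unit transcribing a single phoneme.
print(segment("temps", {"emps"}))
# Second definition: only "em" is grouped; "p" and "s" stand alone.
print(segment("temps", {"em"}))
```

The same word yields 2 or 4 units depending only on the inventory passed in, which is why the choice of definition has to be stated before any encoding is chosen.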

Seems like the pessimistic conclusion of Paul's answer receives another confirmation?
I guess it is, as usual, a question of defining the categories one uses and trying to be consistent in one's practice...
Thanks a lot!

Cheers



Re: TEI element for a grapheme?

Gioele Barabucci-2
In reply to this post by Paul Schaffner
Hello,

On 27/09/2016 18:32, Paul Schaffner wrote:
> But I am probably misunderstanding. In my experience, all discussions
> of characters, glyphs, graphs, and symbols end in a metaphysical
> muddle.

Unicode has a definition for all these terms. In the case of grapheme,
the key definition is that of _grapheme cluster_. Please see
<http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
especially the part about Hangul and Devanagari. Reasoning over
non-Latin scripts always makes the discussion more scientific and less
metaphysical. :)
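
As a rough illustration of why Hangul and Devanagari are instructive, the sketch below (Python, standard library only) lists the code points behind a few strings that a reader would normally perceive as one unit each. It does not implement the UAX #29 segmentation rules, which the standard library does not provide; the helper name `code_points` is made up for this example.

```python
import unicodedata


def code_points(text):
    """List the code points of text after canonical decomposition (NFD)."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


# Hangul syllable HAN: one precomposed code point, three conjoining jamo in NFD.
print(code_points("\ud55c"))
# Devanagari NA + vowel sign I: two code points, read as one unit.
print(code_points("\u0928\u093f"))
# Latin e with acute: one precomposed code point, two in NFD.
print(code_points("\u00e9"))
```

In each case the count of code points differs from the count of units a reader sees, which is the gap that the grapheme cluster definition is meant to bridge.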

Regards,

--
Gioele Barabucci <[hidden email]>

Re: TEI element for a grapheme?

Janusz S. Bien
On Tue, Sep 27 2016 at 19:17 CEST, [hidden email] writes:
> Hello,
>
> On 27/09/2016 18:32, Paul Schaffner wrote:
>> But I am probably misunderstanding. In my experience, all discussions
>> of characters, glyphs, graphs, and symbols end in a metaphysical
>> muddle.
>
> Unicode has a definition for all these terms.

Really? What is the Unicode definition of graphs? Is there a Unicode
definition of symbols?

> In the case of grapheme,

In the case of grapheme the only Unicode definition I know of is that
from the glossary of Unicode terms (http://www.unicode.org/glossary/),
which I quote below.

> the key definitions is that of _grapheme cluster_.

Exactly. There is even a definition of extended grapheme cluster. The
term "grapheme" does not occur on its own in the standard.

> Please see
> <http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
> especially the part about Hangul and Devanagari. Reasoning over
> non-Latin scripts makes that discussion always more scientific and less
> metaphysical. :)

The title of the quoted Unicode® Standard Annex #29 is "UNICODE TEXT
SEGMENTATION", and grapheme clusters are just fragments of some
specific texts. It's unclear to me whether they are clusters of
(undefined) graphemes or just graphemic clusters in some vague relation
to the non-Unicode meaning of the term "grapheme".

Let me quote my recent posting to the Unicode mailing list in the thread

http://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0068.html

entitled "graphemes":

On Wed, Sep 21 2016 at  7:09 CEST, [hidden email] writes:

[...]

> Let me recall the issues which started the thread:
>
>
> On Sun, Sep 18 2016 at 12:26 CEST, [hidden email] writes:
>> Quote - Christoph Päper <[hidden email]> (Fri, 16
>> Sep 2016, 23:51:38):
>>
>>> Janusz S. Bień <[hidden email]>:
>>>>
>>>> 1. Graphemes, if I understand correctly, are language dependent, …
>>>
>>> That’s true in linguistic terminology – well, at least within the
>>> more popular schools of thought –, but not in technical (i.e.
>>> Unicode) jargon.
>
> And what is "grapheme" in "technical (i.e. Unicode) jargon"?
>
>>
>> From the Unicode glossary:
>>
>> Grapheme. (1) A minimally distinctive unit of writing in the context
>> of a particular writing system.[...] (2) What a user thinks of as a
>> character.
>>
>> As for (2), cf.
>>
>> User-Perceived Character. What everyone thinks of as a character in
>> their script.
>>
>> So we have "a user" versus "everyone...in their script" - is the
>> difference intentional? Probably not. Anyway the definitions are
>> language/locale dependent.
>
> Does 'Grapheme' (2) make sense with "a (single?) user"?
>
> BTW, it is rather well known that the term "phoneme" was first proposed
> by the Polish linguist Jan Niecisław Ignacy Baudouin de Courtenay (13
> March 1845 – 3 November 1929), cf. e.g.
> https://en.wikipedia.org/wiki/Jan_Baudouin_de_Courtenay.  It is much
> less well known that he also proposed the term "grapheme". Let me quote
> Alexander Berg's "English Historical Linguistics vol. I" page 230 from
> Google Books:
>
>        Since the introduction of the term grapheme by Baudouin de
>        Courtenay in 1901 (Ruszkiewicz 1976:24-37, 1981 [1978], 20-34),
>        it has been defined in various ways:
>
>        [...]
>
>        As can be seen from these quotations, the available definitions
>        can be divided into two groups, corresponding to two main senses,
>        and reflecting "conflicting linguistics views of the status of
>        writing" (Henderson 1985:142):
>
>        1. a letter or cluster of letters referring to or corresponding with a
>        single phoneme;
>
>        2. the minimal distinctive unit of a writing system.
>
> For me the first meaning (not mentioned at all in English Wikipedia) is
> the primary, i.e. more useful, meaning, as it has some practical
> applications, e.g. for describing Polish hyphenation rules.

Best regards

Janusz

--
                           ,  
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
[hidden email], [hidden email], http://fleksem.klf.uw.edu.pl/~jsbien/

Re: TEI element for a grapheme?

Martin Holmes
This is a really interesting discussion, but I must confess that this:

"Graphemes are sequences of one or more encoded characters that
correspond to what users think of as characters."

is one of the most unhelpful definitions imaginable. Who is a user, and
who can tell what he or she might choose to think of as a "character"?

Cheers,
Martin


Re: TEI element for a grapheme?

Christian M. Prager
Dear all,

An intriguing discussion of signs! Following Pulgram's sign theory, a grapheme is the smallest distinctive visual unit of a graphic system, just as a phoneme is the smallest contrastive linguistic unit that can bring about a change of meaning, with the phone being its realization as a speech sound. Accordingly, the written realization of a grapheme is a graph, and its variants are allographs.

Symbolism, and thus symbols, are not restricted to the realm of visualization or verbalization (graphic or phonetic percepts), but may also arise from visual, auditory, olfactory, kinesthetic, etc. percepts (according to Dan Sperber). According to this cognitive-semiotic view, symbolism uses as signals elements, acts, or utterances that exist, and are also interpreted, independently of it (Sperber), and these usually have a variety of meanings: the value of a symbol is usually determined by the underlying knowledge of the recipient or creator of the symbol.

Hope this is useful
best, Christian



______________________________
Dr. Christian Prager
"Textdatenbank und Wörterbuch des Klassischen Maya"
Arbeitsstelle der NRW Akademie der
Wissenschaften und der Künste
Universität Bonn
Abteilung für Altamerikanistik
Oxfordstrasse 15
53111 Bonn
Tel. 0228 73 61 63
www.mayawoerterbuch.de



Re: TEI element for a grapheme?

Piotr Banski
In reply to this post by Martin Holmes
On 27/09/16 21:02, Martin Holmes wrote:
> This is a really interesting discussion, but I must confess that this:
>
> "Graphemes are sequences of one or more encoded characters that
> correspond to what users think of as characters."
>
> is one of the most unhelpful definitions imaginable. Who is a user, and
> who can tell what he or she might choose to think of as a "character"?

That's your English bias. ;-) In Polish, we have a graph 'dotted z' and
a graph 'crossed z' (I'm sure Prof. Bień can quote their established
standardized names) which for 'users' (bah, take any plausible
definition) represent a single grapheme. Similarly with the grapheme
realised phonetically as [w], in writing (it can have a tilde across 'l'
or over 'l'). Or the sound [t] in Russian, spelled with <m> or a 'little
(capital) T' in writing. This is free variation (context-independent,
across the population, but I guess you can also see it in the production
of the same individual). In English, 'plain z' and 'crossed z' can be
found in this sort of free variation, sometimes.

You can also have complementary (context-dependent) distribution of
graphs representing the same grapheme, think of word-final vs.
non-word-final Greek sigma. (In English, in some contexts, word-initial
<g> and non-word-initial <dg> could qualify though I'm not sure I would
use this example in a 101-course, due to the diachronic variables that
have to be taken into consideration here.)
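
The Greek sigma case can be observed directly in software. The sketch below uses only built-in Python string methods and is offered as an illustration of context-dependent graphs of one grapheme, not as a statement about how any TEI tool behaves; the sample word is chosen for this example.

```python
# Greek sigma has two lowercase graphs: σ (non-final) and ς (word-final).
word = "ΟΔΟΣ"  # "street", written in capitals

lowered = word.lower()       # lowercasing picks the word-final graph ς
folded = lowered.casefold()  # case folding maps ς back to σ

print(lowered)
print(folded)
print("ς".casefold() == "σ".casefold())
```

Lowercasing chooses the graph by position, while case folding treats both graphs as the same unit for comparison, which mirrors the complementary distribution described above.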

Now, two remarks regarding Emmanuelle's original posting:
1. I am still not sure how Emmanuelle wants to define 'grapheme'; I
think that a bunch of examples might help to solve this particular
issue, without having to delve too deep into philosophy of language
and/or semiotics.

2. <fs> is "feature structure" in TEI-speak. It can represent anything,
including phonemes. But it definitely is _not_ the "default encoding"
for phonemes.

Best regards,

   Piotr

>
> Cheers,
> Martin
>
> On 2016-09-27 11:21 AM, Janusz S. Bień wrote:
>> On Tue, Sep 27 2016 at 19:17 CEST, [hidden email] writes:
>>> Hello,
>>>
>>> On 27/09/2016 18:32, Paul Schaffner wrote:
>>>> But I am probably misunderstanding. In my experience, all discussions
>>>> of characters, glyphs, graphs, and symbols end in a metaphysical
>>>> muddle.
>>>
>>> Unicode has a definition for all these terms.
>>
>> Really? What is the Unicode definition of graphs? Is there a Unicode
>> definition of symbols?
>>
>>> In the case of grapheme,
>>
>> In the case of grapheme the only Unicode definition I know of is that
>> from the glossary of Unicode terms (http://www.unicode.org/glossary/),
>> which I quote below.
>>
>>> the key definition is that of _grapheme cluster_.
>>
>> Exactly. There is even a definition of extended grapheme cluster. The
>> term "grapheme" doesn't occur in the standard alone.
>>
>>> Please see
>>> <http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
>>> especially the part about Hangul and Devanagari. Reasoning over
>>> non-Latin scripts makes that discussion always more scientific and less
>>> metaphysical. :)
>>
>> The title of the quoted Unicode® Standard Annex #29 is "UNICODE TEXT
>> SEGMENTATION" and grapheme clusters are just fragments of some
>> specific texts. It's unclear to me whether they are clusters of
>> (undefined) graphemes or just graphemic clusters in some vague relation
>> to the non-Unicode meaning of the term "grapheme".
>>
>> Let me quote my recent posting to the Unicode mailing list in the thread
>>
>> http://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0068.html
>>
>> entitled "graphemes":
>>
>> On Wed, Sep 21 2016 at  7:09 CEST, [hidden email] writes:
>>
>> [...]
>>
>>> Let me recall the issues which started the thread:
>>>
>>>
>>> On Sun, Sep 18 2016 at 12:26 CEST, [hidden email] writes:
>>>> Quote/Cytat - Christoph Päper <[hidden email]> (pią, 16
>>>> wrz 2016, 23:51:38):
>>>>
>>>>> Janusz S. Bień <[hidden email]>:
>>>>>>
>>>>>> 1. Graphemes, if I understand correctly, are language dependent, …
>>>>>
>>>>> That’s true in linguistic terminology – well, at least within the
>>>>> more popular schools of thought –, but not in technical (i.e.
>>>>> Unicode) jargon.
>>>
>>> And what is "grapheme" in "technical (i.e. Unicode) jargon"?
>>>
>>>>
>>>> From the Unicode glossary:
>>>>
>>>> Grapheme. (1) A minimally distinctive unit of writing in the context
>>>> of a particular writing system.[...] (2) What a user thinks of as a
>>>> character.
>>>>
>>>> As for (2), cf.
>>>>
>>>> User-Perceived Character. What everyone thinks of as a character in
>>>> their script.
>>>>
>>>> So we have "a user" versus "everyone...in their script" - is the
>>>> difference intentional? Probably not. Anyway the definitions are
>>>> language/locale dependent.
>>>
>>> Does 'Grapheme' (2) make sense with "a (single?) user"?
>>>
>>> BTW, it is rather well known that the term "phoneme" was proposed first
>>> by a Polish linguist Jan Niecisław Ignacy Baudouin de Courtenay (13
>>> March 1845 – 3 November 1929), cf. e.g.
>>> https://en.wikipedia.org/wiki/Jan_Baudouin_de_Courtenay.  It is much
>>> less known that he also proposed the term "grapheme". Let me quote
>>> Alexander Berg's "English Historical Linguistics vol. I" page 230 from
>>> Google Books:
>>>
>>>        Since the introduction of the term grapheme by Baudouin de
>>>        Courtenay in 1901 (Ruszkiewicz 1976:24-37, 1981 [1978], 20-34),
>>>        it has been defined in various ways:
>>>
>>>        [...]
>>>
>>>        As can be seen from these quotations, the available definitions
>>>        can be divided into two groups, corresponding to two main senses,
>>>        and reflecting "conflicting linguistics views of the status of
>>>        writing" (Henderson 1985:142):
>>>
>>>        1. a letter or cluster of letters referring to or corresponding
>>>        with a single phoneme;
>>>
>>>        2. the minimal distinctive unit of a writing system.
>>>
>>> For me the first meaning (not mentioned at all in English Wikipedia) is
>>> the primary, i.e. more useful, meaning, as it has some practical
>>> applications, e.g. for describing Polish hyphenation rules.
>>
>> Best regards
>>
>> Janusz
>>
>

Re: TEI element for a grapheme?

Martin Holmes
On 2016-09-27 04:03 PM, Piotr Bański wrote:

> On 27/09/16 21:02, Martin Holmes wrote:
>> This is a really interesting discussion, but I must confess that this:
>>
>> "Graphemes are sequences of one or more encoded characters that
>> correspond to what users think of as characters."
>>
>> is one of the most unhelpful definitions imaginable. Who is a user, and
>> who can tell what he or she might choose to think of as a "character"?
>
> That's your English bias. ;-) In Polish, we have a graph 'dotted z' and
> a graph 'crossed z' (I'm sure Prof. Bień can quote their established
> standardized names) which for 'users' (bah, take any plausible
> definition) represent a single grapheme.

Yes, this works nicely for a single language group using a single
script. But there are so many cases which are not so simple. Consider this:

:-)

I typed three "characters" to create a combination sign. In my email
client, and in my mind, they are distinct characters (as if I'd typed
"Yay!"). In your client, the software might substitute U+1F600 GRINNING
FACE, which is a single glyph, and which you might perceive as a single
character; someone else might receive exactly what I sent, three glyphs,
and yet still "think of" it as a "character" because they use emojis all
the time.
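
The difference between the two encodings is easy to show in code. This is
just a minimal Python sketch using the standard unicodedata module; the
helper name code_points is made up for illustration. It only counts
encoded characters and says nothing about what any given reader perceives.

```python
import unicodedata

def code_points(text):
    """Hypothetical helper: list the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

emoticon = ":-)"          # what was typed: three encoded characters
emoji = "\U0001F600"      # what a client might substitute: one encoded character

typed = code_points(emoticon)
substituted = code_points(emoji)
emoji_name = unicodedata.name(emoji)
```

The typed sequence yields three code points and the substituted emoji a
single one, so any count of "characters" depends on which of the two
strings a program actually receives.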

It just seems to me that any attempt at a definition which depends on
the perception of a non-specific, presumably non-expert user is hardly a
definition at all.

Cheers,
Martin


Re: TEI element for a grapheme?

Paul Schaffner
I used to cite this made-up sentence (which nevertheless
is quite possible in my world) to illustrate the well-known
problems with 'ff'.

"Whilst suffering in ffraunce, he thought to himselff of
the proverb 'Pan darffo treiglo pob tre / Da yw edrych tuag
adre.'"

'suFFering' and 'himselFF' use the doubled f to
indicate a voiceless fricative. The former is the
common spelling, the latter an uncommon one.
But to the everyday user, 'ff' is not a character:
'f' is the character in question, and is the same
'f' as in 'oF' (where it represents a voiced sound).
These are the same everyday users who think
of 't' and 'h' as characters but 'th' as a combination
of characters, not as a single character composed
of graphemes.

'ffraunce' contains an old-fashioned use of 'ff'
to represent an upper-case (capitalized) variant
of "f". Are upper-case and lower-case letters the
same character or not? They are not glyph variants
in the usual sense. And if they *are* different
characters, then is "ff" simply a glyph variant of "F"?
The usual answer is 'yes.'  /F/ is a character that
may be represented either as grapheme F or as
grapheme cluster ff.

And of course the Welsh 'ff' is usually thought of
(by its users!) as a character in its own right,
quite distinct from 'f'. If we believe these users,
we should regard 'f' as a grapheme, which when
it appears singly represents the character /f/
and when it is doubled represents the character
/ff/.

In actual transcription, it is rare indeed to find
someone who distinguishes character from
glyph in any kind of consistent way: most, I think,
fall back on the Latin alphabet and its extensions
as the default character inventory. (The 'capital'
use of 'ff' is an exception, since I think most
transcribers would capture it as "F" -- not all,
however.)

pfs


--
Paul Schaffner  Digital Library Production Service
[hidden email] | http://www.umich.edu/~pfs/

Re: TEI element for a grapheme?

Piotr Bański-2
In reply to this post by Martin Holmes
Hi Martin,

(briefly)

On 28/09/16 01:24, Martin Holmes wrote:
[..]
> It just seems to me that any attempt at a definition which depends on
> the perception of a non-specific, presumably non-expert user is hardly a
> definition at all.

Sometimes it's the best thing you can get on your way to the concept,
and perhaps this is what we're looking at, here. A "mentalistic"
definition of the phoneme, coming from the school of thought represented
by de Courtenay quoted earlier ("Kazan school" [1]), but this time given
by his student Mikołaj Kruszewski, says that it's "the speaker's
intention, at which he aims, but does not arrive for a variety of
reasons". Not a definition at all, according to your (respectable)
criteria, and at the same time the most beautiful 'definition' of the
phoneme that I have encountered :-)

I'm not arguing that the other definition (or characterisation, etc.) is
super-precise, I just wanted to illustrate how it could be made sense
of. By no means would I like to see it in the Guidelines as _the_
definition of an element.

Cheers,

   Piotr

[1] https://en.wikipedia.org/wiki/Kazan_school




--
Piotr Bański, Ph.D.
Senior Researcher,
Institut für Deutsche Sprache,
R5 6-13
68-161 Mannheim, Germany

Re: TEI element for a grapheme?

Gioele Barabucci-2
In reply to this post by Martin Holmes
Am 28.09.2016 um 01:24 schrieb Martin Holmes:

> Yes, this works nicely for a single language group using a single
> script. But there are so many cases which are not so simple. Consider this:
>
> :-)
>
> I typed three "characters" to create a combination sign. In my email
> client, and in my mind, they are distinct characters (as if I'd typed
> "Yay!"). In your client, the software might substitute U+1F600 GRINNING
> FACE, which is a single glyph, and which you might perceive as a single
> character; someone else might receive exactly what I sent, three glyphs,
> and yet still "think of" it as a "character" because they use emojis all
> the time.
>
> It just seems to me that any attempt at a definition which depends on
> the perception of a non-specific, presumably non-expert user is hardly a
> definition at all.

"Eppur si fa", i.e. despite all these possible complications, every text
editor makes a decision about what is a "grapheme" (although not all of
them may agree on a single definition).

A grapheme is what is skipped when the user presses the back-arrow
key. (A stricter version uses the backspace key instead of the
back arrow.)

On my computer, configured with my locale, using the composition editor
of my mail client, "à" is a grapheme, even though I typed it as COMPOSE +
"a" + "`".

If I write "sarà" (typing "s", "a", "r", COMPOSE, "a", "`") and the
cursor is at the end of that word ("sarà|"), pressing the back arrow
will move the cursor over one grapheme, leading to this situation:
"sar|à". Pressing the back arrow will not move the cursor between "a"
and "`": I will not find myself in this situation: "sara|`". No Italian
speaker would want that behaviour (and my locale is it_IT).

(To see the difference between the definition with the back arrow and
that with the back space one has to resort to Hangul, but the discussion
is already complex enough with the Latin alphabet.)

This example is not meant to show that the editor's definition is the
definitive definition of the term "grapheme". But it is an
_implementable_ definition, although locale/context dependent. It is
even implemented in Perl [1]. In the case of a TEI transcription that is
going to be processed through computer programs, I think that one should
limit the discussion to the definitions that can be described in
machine-implementable steps. Definitions that take into account
individual preferences are hardly machine-implementable.
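
As an illustration of what "machine-implementable" can mean here, below is
a deliberately simplified Python sketch of the back-arrow behaviour. It is
not the full segmentation algorithm of Unicode Standard Annex #29: it only
attaches combining marks (general category M*) to the preceding base
character and ignores Hangul, emoji sequences and locale tailoring. The
function names clusters and cursor_left are made up for this example.

```python
import unicodedata

def clusters(text):
    """Split text into base characters plus any combining marks that follow.

    Simplifying assumption: a cluster is one character followed by zero or
    more characters whose general category starts with "M" (Mn, Mc, Me).
    """
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch      # attach the mark to the previous cluster
        else:
            out.append(ch)     # start a new cluster
    return out

def cursor_left(text, pos):
    """Return the cursor offset one cluster to the left of offset pos."""
    offset = 0
    for cluster in clusters(text):
        if offset + len(cluster) >= pos:
            return offset
        offset += len(cluster)
    return offset

# "sarà" stored in decomposed form: "a" followed by U+0300 COMBINING GRAVE ACCENT.
decomposed = "sara\u0300"
units = clusters(decomposed)
after_back_arrow = cursor_left(decomposed, len(decomposed))
```

With the cursor at the end of the five-code-point string, one press of the
back arrow lands at offset 3, i.e. "sar|à", and never between "a" and the
accent.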

Regards,

[1]
http://www.perl.com/pub/2012/05/perlunicook-string-length-in-graphemes.html

--
Gioele Barabucci <[hidden email]>

Re: TEI element for a grapheme?

Kalvesmaki, Joel
In reply to this post by Emmanuelle Morlock
On the topic of graphemes, there is a stimulating discussion now underway on the public Unicode listserv:
http://unicode.org/pipermail/unicode/2016-September/thread.html

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
202 339 6435


On 9/27/16, 5:27 AM, "TEI (Text Encoding Initiative) public discussion list on behalf of Emmanuelle Morlock" <[hidden email] on behalf of [hidden email]> wrote:

    Which tei element would you use to represent a grapheme (as opposed to a
    phoneme)?