Robin C. Cover
(part 1 of 2)
February 21, 1990                         3909 Swiss Avenue
Robin C. Cover                            Dallas, TX    75204
Draft, TEI-REP Working Paper              (214) 296-1783/841-3657
For TEI-TRR4, sections 6.9 - 6.10         BITNET: zrcc1001@smuvm1
                                          INTERNET: [hidden email]
*Supersedes draft "February 18, 1990"     UUCP: attctc!utafll!robin
distributed on TEI-REP                    UUCP: attctc!texbell!txsil.robin



     In this paper I discuss some proposals and partial solutions to the
problems of encoding textual parallels, textual variation and related
features.  Encoding for textual parallels and textual variation will
employ SGML-based (and perhaps non-SGML) constructs developed for other
encoding problems, both within the domain of TEI-REP and within the TEI
ANA group.  The two inter-related topics specifically assigned to me
under the current TEI-REP work agenda ("textual parallels" and "textual
variants") are dependent upon and closely related to the decisions
reached on "Reference Systems" (TEI TRR4 6.7; Mylonas) and "Cross-References
and Text-Links" (TEI TRR4 6.11; DeRose).  Likewise, Johansson's work on
"Normalization Practices and Documenting Them" (TEI TRR4 7.2.1) bears
similarity to the text-critical issue of machine collation and machine
analysis of texts in different languages, scripts, transcription systems,
orthographic-strata, etc.  Accordingly, I plan to revise the following
sections for TEI TRR2 6.9-6.10 in light of other decisions and
recommendations reached in the Oxford TEI-REP meetings.

I. TEXTUAL PARALLELS  [revised material to be inserted at TEI TRR2 6.10]


     I use the term "(textual) parallels" to refer broadly to any two
documents, or regions of text within the same document, or sets of
documents which literary/linguistic analysts would like to compare in
some way because they share a significant degree of structure and/or
content, or perhaps pedigree.  The comparison may be quantitative (e.g.,
searching indexes of term frequencies or co-occurrences) or simply
visual (e.g., parallel displays which support synchronous scrolling of two
or more documents).  Provisionally, I think of "textual variants" as a
special case of textual parallelism (though requiring much more
annotation) in that they are parallel textual objects competing for
antiquity, typological priority, authenticity and so forth.  The issue of
document versioning is tangential to textual criticism in that both involve
attempts to reconstruct or record exact evolutionary stages and processes
in the composition and transmission of texts.  In the more general case,
"textually" variant readings are simply parallel texts which contribute
information to our understanding of the evolutionary history of an
idealized (fictional) literary entity we think of as a literary "text."

Here are some examples of (sub-) documents I would call "parallel" texts:

*a literary allusion and its assumed/purported origin in the
    classic; a quotation and its contextual source

*a literary text and its translation (into another language, or
    into multiple languages)

*the same text viewed simultaneously in different scripts, within
    different orthographic strata or under different transcriptional
    systems [here the document views are envisioned as being generated
    from separate documents or parts of a database rather than through
    document filtering]

*texts which stem from a common source (e.g., the three New Testament
    synoptics, sharing a common origin in the hypothetical document Q)

*multiple (serial or synchronous) recensions of the same text, which
    nevertheless maintain a common identity as a certain
    literary document (e.g., the eight versions of Sargon's
    annalistic reports; multiple recensions of the flood epic in
    several Mesopotamian languages at several periods; the long
    and one-seventh-shorter versions of biblical Jeremiah)

*instances of oral formulaic poetry (line, stanza) which appear
    in different epic or liturgical compositions

*a literary text and running commentary on that text, including
    dedicated region(s) of textual commentary on the text
    or textual apparatus

*texts which bear some unclear textual relationship, but share
    a significant amount of content and/or structure (e.g.,
    sections of biblical Samuel/Kings/Isa/Chron)

*sections of texts "reused" in other contexts (long sections of
    certain Psalms in prose texts narrating a liturgy)

*sections of texts excerpted wholesale and thereafter acquiring
    an independent transmission history (e.g., the biblical Odes)

*paraphrases or synopses (e.g., Josephus' <cit>Antiquities of the
    Jews</>, which sometimes tracks very closely with the
    biblical (Greek) Septuagint text)

*a lineated/versified text and its associated (sets of) cross-references
    to other texts or locations within the same text

*a "clean" text and alternate views of that text based upon its
    literary or linguistic annotations;


    In order to provide encoding for such parallel texts or regions of
texts, the encoder must begin by: (a) determining the level of
granularity at which the text(s) are to be tagged; (b) designating
names for those textual objects which will be matched as the parallels;
and (c) devising a referencing scheme which allows one to point
unambiguously to regions of "parallel" text delimited by the tagging (so
that reference may be made to these units from external sources).  The referencing
scheme is critical because the parallels may not involve matched textual
objects (e.g., "verses" in one edition with "verses" in another edition).
The parallels may involve mappings between points and spans, or between
spans of text very unlike in some characteristics.
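By way of illustration only, the three steps above might be sketched as follows; the dotted id syntax (book.chapter.verse.word) is an invented assumption, not a TEI-REP proposal, and Python serves here merely as executable pseudocode.

```python
# Hypothetical sketch of a hierarchical reference scheme: granularity
# levels become id components, named objects get dotted ids, and the
# ids permit unambiguous pointing from external sources.

def make_id(*parts):
    """Build a dotted reference id, e.g. make_id('S2', 22, 43) -> 'S2.22.43'."""
    return ".".join(str(p) for p in parts)

def is_within(unit_id, region_id):
    """True if unit_id falls inside the region named by region_id."""
    return unit_id == region_id or unit_id.startswith(region_id + ".")

# A "parallel" is then simply a declared pairing of two such ids,
# whether or not the paired units are objects of the same kind:
parallels = [(make_id("S2", 22, 43), make_id("Ps", 18, 43))]
```

Because the ids are hierarchical, a reference to a verse automatically delimits the words and morphemes tagged within it, which is what allows mappings between points and spans of unlike texts.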

    The levels of granularity chosen of course depend upon the
applications.  I believe that for representing "parallel" structures (and
"textual" variants), character-level encoding and everything higher must
be supported.  Certainly this will be required by linguists in the TEI
ANA group, but we may imagine multiple uses for character-level encoding
for "textual variation" and "parallels" as well.  In text criticism,
paleographic annotations at character level are often given by various
typographic conventions.  For example, in Qumran publications there are
4-5 special sigla attached to characters (usually supralinear attachment)
for indicating whether the characters are of such-and-such varying
degrees of legibility, or suspended characters in a word, etc.  In other
cases, uncertain characters are interpreted in terms of a closed set of
alternatives:  "b/m" (written in Hebrew script) would mean "the character
beth or mem -- we know it is one or the other, because both make
sense and are paleographically possible, but we don't know which letter
this is for sure."  Or suppose we wish to study parallel displays in
which adjacent orthographic strata are involved, and we want to see
"level n+1" orthographic innovations in a color-coded scheme (e.g., in
historic periods when quiescent consonants or semi-vowels are being
introduced as vocalization markers).  Or a literary critic may wish to
selectively tag consonance or assonance using character-level or
syllable-level annotation.

    At a higher (sub-word) level, morpheme-level annotations are
necessary both for text criticism and for parallel analysis.  Consider the
case where a German student may be assisted in Greek language instruction
by viewing parallel texts, say a Greek tragedy and a German translation, or
an English student learning Hebrew uses synchronized parallel displays
for Hebrew and English.   In either instance, the student might tab
through a sentence in either window and see equivalent (parallel) terms
displayed in reverse video in both windows.  But morpheme-level tagging would be
required: the Hebrew "and/from/his/hellish abode" is written as one
"word" (yet four or more "morphemes," as counted by Hebrew linguists) but
as four words in English, so the color display of the single Hebrew word
will show four color-coded regions matching the similarly color-coded
four separate words in English.  And German separable prepositions will
map at morpheme level to single Greek terms.  For text-critical work,
variants often involve just differences in person, number, gender, mood,
state (whatever) which are resolved in inflectional morphemes; since
these kinds of variations need to be annotated for cataloging, it's clear
that morpheme-level annotations (morpheme-level tagging) are necessary.
Syllable-level marking and morpheme-level marking are necessary for
making text-critical comparisons of cuneiform texts, for example, because
of mixed orthographies in various scribal traditions, geographic regions,
etc. (e.g., the neo-Assyrian scribes use far more logographic (ideographic)
spellings with phonetic complements for certain words than Babylonian
scribes copying the same texts with syllabic writing); for these languages,
text-critical analysis will require not only orthographic normalization,
but in some cases syllable-level and morpheme-level tags.  It is easy to
recognize, of course, that linguistic analysis
requires morpheme-level annotations, and probably multiple alternate
morphemic representations based upon varying linguistic theories.
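A minimal sketch of the alignment data such a synchronized bilingual display might consult follows; the ids, the morpheme glosses, and the one-table binary mapping are all hypothetical, Python serving only as executable pseudocode.

```python
# Illustrative sketch (not an encoding proposal): a binary alignment
# table at morpheme level.  One Hebrew "word" of four morphemes is
# aligned to four separate English words, so that selecting any
# morpheme highlights its counterpart(s) in the other window.

alignment = {
    "heb.1.1": ["eng.1"],   # conjunction:  "and"
    "heb.1.2": ["eng.2"],   # preposition:  "from"
    "heb.1.3": ["eng.3"],   # suffix:       "his"
    "heb.1.4": ["eng.4"],   # noun:         "hellish abode"
}

def counterparts(morph_id):
    """Return the target-side units to highlight when morph_id is selected."""
    return alignment.get(morph_id, [])
```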

    I have started with two examples of sub-word granularity for
parallelism because it is not entirely clear that an SGML encoding should
be used for this level of tagging/markup/linking.  I have no personal
opinion or preference on this matter.  On the authority of two SGML
experts, I am cautioned about two potential concerns: (a) the amount of
overhead involved in SGML tagging at character- and morpheme-levels is
extraordinary -- we can easily get more meta-data than data, which might
be a problem; (b) character-level tagging using SGML has (apparently) not
yet been adequately tested.  Here follows an example of an encoding of
a single Hebrew word.  I know it's monstrous: it evoked a gasp of
"oy, veh" from a more qualified member of TEI-REP (I suppose not just
because it's so ugly and redundant -- it fails to use legal minimizations --
but also because there is no DATA here yet except the marking of morphemes
with reference id's).  I have delimited only one of the 13 characters; I
could have also added a "syllableid," which would be useful in text
criticism for cuneiform texts.

========= I wrote: ========
"Suppose I mark up a Hebrew text as follows, where in my particular DTD
"verseid," "wordid" and "morphid" are required, but "charid" is optional
(used mostly for citing Qumran readings, or in manuscript publications
themselves).  In this sample, I put a charid tag on the first char by way
of example because I know there's a textual variant I'll want to talk
about.  But I use no minimizations yet.  Suppose I do something like this in
setting up markers for the first word... (the "text", if you're looking
for it, is the first word of II Samuel 22:43 (assuming Michigan-Claremont
uses IRV chars which arrive intact at your end): W:)E$:XFAQ"73M  where "73" is the
accent number from the standard Michigan-Claremont encoding scheme, and
the word means "(thus) I beat them flat")

<verse id=S2.22.43><word id=S2.22.43.1><morph id=S2.22.43.1.1>
<char id=S2.22.43.1.1.1>w</char>:</morph><morph id=S2.22.43.1.2>
)E</morph><morph id=S2.22.43.1.3>$:XFQ</morph><morph id=S2.22.43.1.4>
"73M</morph></word>....  (This example would be unbearably monstrous
if charid's were assigned to each of the thirteen characters in the
word.)


    One obvious criticism of this approach is that much of the encoding
can be generated (e.g., charid's) by a program; so why introduce this
stuff into the text?  Why not just let the application handle it, and use
some non-SGML system for referencing?  On the hypothesis that such
encoding is at least legal, and would help satisfy our need for
character- and morpheme-level annotations, it remains unclear to me
whether this encoding would actually be useful in solving the problem of
"parallel" and "textually variant" texts.  If character- and morpheme
level units tagged with an SGML-based tagging system cannot be addressed
(referenced) with SGML mechanisms (IDREF, IDREFS) for purposes of
synchronization, then perhaps they are of lesser value; perhaps a
non-SGML (applications-level) solution should be proposed.  The issue of
SGML-based referencing will be discussed more below.

    Besides the fact that this markup is ugly, redundant, and can in part
be programmatically-generated, there are other criticisms: (a) current
processors will gag and choke; (b) the file now defies the canon of
"human-readable" SGML files.  Possible answers to these objections: (a)
"don't let processors (except batch processors working as filters) look
at the monstrosity," or "processors are getting faster all the time" and;
(b) "so what? (give up the myth of 'human-readable' SGML)."

    A non-SGML solution which would be well-suited (?) to humans and
computers alike would be to use a standard referencing system down to
verse or line level, then some kind of regular-expression style "pattern
match" to get the offsets.  Of course, this is the method used by human
editors and authors in traditional scholarship.  We say, "as reported by
Josephus in Antiquities XVI.187, '(preceding_context) relevant_text
(following_context)' (Loeb/Marcus edition) blah blah..."  For use of this
convention in encoding, however, the rules would need to be much more
rigorously defined and enforced (how to qualify second-occurrences of
substrings; how to designate omission of characters/words with ellipsis).
The advantage of this system is that it's instantly democratized because
it's based upon unambiguous rules and applicable to paper editions as
well as electronic editions.  A scholar in a remote part of the world
publishing a new text, or encoding an extant text could cross-reference
any other "parallel" text at any level of granularity without knowing how
others (unbekannt) were encoding those "parallel" texts.  From the
machine's point of view, the system is unambiguous and efficient because
it uses character offset in the file.  The downside would be that
character-offsets (in machine languages) are vulnerable when files are
changed in minor ways, and methods for updating offsets/indices or
redundancy checks would place another burden on the application.
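The convention might be sketched as follows, under the (assumed) simplest rule set: contexts are matched verbatim and the first occurrence wins; a real convention would, as noted, need rigorous rules for second occurrences and ellipsis.  The citation line is invented for illustration.

```python
# Rough sketch of '(preceding_context) relevant_text (following_context)'
# referencing: resolve the quoted contexts against the cited line or
# verse, returning character offsets into the file.

def resolve(text, preceding, relevant, following):
    """Return (start, end) offsets of relevant within text, or None."""
    pattern = preceding + relevant + following
    pos = text.find(pattern)          # first occurrence wins (assumed rule)
    if pos == -1:
        return None
    start = pos + len(preceding)
    return (start, start + len(relevant))

line = "as reported by Josephus in the passage blah blah"
span = resolve(line, "in the ", "passage", " blah")
# the offsets remain valid only while the file is unchanged -- the
# vulnerability noted above
```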


More will be said below about the kinds of annotations which need to be
attached at character- morpheme- word- arbitrary_string- and other
(linguistic) levels for encoding textual variation.  Since parallel texts
are just declared to be "matched" (but not HOW), the situation is easier.
For the purpose of synchronizing textual regions for parallel display or
analysis, the primary concern is whether/how the SGML-based encoding can
be used, in connection with other factors, to help drive these
applications. Some of the "other factors" will now be discussed.  Most of
these issues were discussed at length in my earlier TEI-REP paper (TEI
TRW2, available in 2 parts on the UICVM listserver as TRW2PT1 SCRIPT and
TRW2PT2 SCRIPT), and need not be rehearsed in detail here.

*referencing discontinuous segments (morphemes, words) belonging to the
   same object (SGML id)

*referencing individual sub-elements, and sequencing of those sub-
   elements of such discontinuous segments

*referencing arbitrary strings, word-substrings (e.g., where word
   boundaries are in dispute, or have been misunderstood); we may think
   of arbitrary character-offset as one example, though for texts
   with syllabic, ideographic or mixed writing systems, other levels
   of arbitrary spans are possible

*multiple, overlapping, alternative hierarchies (without multiple DTD's
   and burdensome CONCUR overhead)

*synchronizing or normalizing incompatible external referencing
   schemes which APPEAR TO point to identical content, but do not;
   normalizing existing canonical referencing schemes (see especially
   "FACTOR 4" in TEI-TRW2)

*referencing textual elements within parallel streams of an interlinear
   text, /Partitur-Umschrift/ ("score") edition, or similar document, where
   the text streams *could* be conceived as or generated from independent
   documents, but in which the specific text-geography (presentation) of the
   printed edition is an essential aspect of the document's personality
   (see my earlier description of variations, TEI TRW2 "Factor 7")
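The first two items in the list above might be sketched as follows; the object and segment ids are invented, and Python serves only as executable pseudocode.

```python
# Hypothetical sketch: a discontinuous textual object modeled as one id
# owning an ordered list of segment ids, so that both the object and
# the sequence of its scattered sub-elements can be referenced.

discontinuous = {
    "phrase.7": ["w.12", "w.15", "w.16"],   # one object, three scattered words
}

def segments(obj_id):
    """Ordered sub-elements of a (possibly discontinuous) object."""
    return discontinuous.get(obj_id, [])

def sequence_of(obj_id, seg_id):
    """1-based position of seg_id within the object, or 0 if absent."""
    segs = segments(obj_id)
    return segs.index(seg_id) + 1 if seg_id in segs else 0
```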

     Two general approaches for synchronization are obvious; in our
CDWord hypertext project, we have used the first method: (1) one may
place structure markers from text A as empty tags into all other
documents which need to be set in synchrony with text A (multiple binary
mappings), and, (2) one may place artificially-devised "canonical"
reference markers in all texts, with the help of ancillary tables which
help resolve anomalous situations relative to the canonical standard
text.  Synchronizations may then be made between the artificial scheme
and the individual referencing schemes used in traditional scholarship
(incongruent systems being resolved at this level).  Whether either of
these methods is ideal, and how either would best be implemented with an
SGML-based language I will defer to others.  Perhaps there are far better
systems.  I have some skepticism about both methods, based on
difficulties we have encountered.  Against the first method: (a) the
amount of overhead in meta-data will become burdensome for texts which
are heavily cross-referenced, and (with CONCUR?) seems to imply multiple
or overfull DTD's; (b) it fails to exploit the usefulness of currently
existing canonical referencing schemes; (c) it may (?? -- not sure) not
support the need to see different sections of the same document in
parallel.  However, in favor of the first method (multiple binary
mappings) is that some synchronizations, especially those at a very low
level of granularity, appear to be required at binary level.  Imagine,
for example, trying to map a dozen versions or translations of the Bible
onto a common "canonical" scheme, based upon unique (SGML) id's for each
morpheme in each version, with an ultimate goal of supporting color-coded
"parallels" at morpheme level between any two versions or all versions.
Binary mappings at this level appear feasible (though labor intensive),
but multiple synchronizations via a single canonical scheme seems hard.
Maybe it would work, but I have some doubts (general linguists: please
speak up).  Multiple binary mappings sounds like a lot of work, but
currently the CATSS project at the University of Pennsylvania/Hebrew
University, and the Fils d'Abraham Project (Maredsous) are making binary
mappings between Bible versions, apparently with satisfactory results.
At a higher level of granularity (e.g. biblical "verse" level), using an
idealized general referencing system may be feasible; it works tolerably
well in our CDWord project for the purpose of visually aligning
(synchronous scrolling) biblical texts, translations, commentaries and
other "canonically-structured" documents in parallel displays.
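The second method might be sketched as follows; the witness names, the Psalm-numbering anomaly, and the table format are all invented for illustration, not drawn from any project mentioned above.

```python
# Sketch of approach (2): an artificial canonical scheme plus an
# ancillary table resolving anomalous loci witness by witness.

canonical_anomalies = {
    # (witness, witness_ref) -> canonical_ref, where numbering diverges
    ("LXX", "Ps.9.22"): "Ps.10.1",
}

def to_canonical(witness, ref):
    """Map a witness's own reference onto the canonical scheme."""
    return canonical_anomalies.get((witness, ref), ref)

def synchronize(witness_a, ref_a, witness_b):
    """Find witness_b's locus parallel to ref_a: both meet at the canonical ref."""
    canon = to_canonical(witness_a, ref_a)
    # invert the table for witness_b; identity where no anomaly is recorded
    for (w, wref), cref in canonical_anomalies.items():
        if w == witness_b and cref == canon:
            return wref
    return canon
```

The point of the sketch is that every witness carries only one mapping (to the canonical scheme), so n witnesses need n tables rather than the n*(n-1)/2 tables of exhaustive binary mapping.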

     I do not know whether there may be more optimal SGML or non-SGML
solutions to the referencing problem which avoid cluttering the text with
unique id markers.  I feel confident that other members on the TEI-REP
subcommittee (those with professional training in formal language theory)
will have valuable judgments about these markup issues; I can offer no
more than the general options I see on the surface.  In response to a
query, Steve DeRose briefly suggested using an open-ended canonical
scheme with integers or whole numbers; presumably Steve (section 6.11,
"Cross-References and Text-Links"), Elli Mylonas (section 6.7,
"Reference Systems"), and David Durand will offer better presentations of
the possibilities.
I do feel it would be wise to communicate with the TEI-ANA group about
their solution(s) to the problem of character- and morpheme-level
annotations.  The encoding of text-linguistic features with character
level and morpheme-level annotations is inevitable: how does TEI-ANA
propose to deal with id-markers?   Another concern parallel to that of
TEI-ANA is that in some textual-critical arenas (as with
linguistic/literary annotations), the volume of text-critical annotation
will become immense: where shall these annotations be located in the SGML
file?  It may become increasingly distasteful (to some??) to envision
text-critical encoding (kilobytes per "word" in some cases)
interlineated with "the text."  It may be that "the text" should be kept free
both of id markers and text-critical annotations (placing this
information elsewhere in the document) so that our processors can
(directly) read "the text" in real-time.  Others might argue that the
encoding is not meant to be used by applications directly (but only as
data/document interchange), so never mind if the "words" of "the text"
are separated by 100,000 characters of meta-data and annotation-data in
the flat file.  I have no opinion, except that I would like to purchase
affordable SGML software for my 386-class machine and not have it choke
on my data.

II.  TEXTUAL VARIANTS [revised material to be inserted at TEI TRR2 6.9]


    I suggested above that "textual variants" may be viewed as special
cases of "textual parallels," though more complicated parallels.  Such a
view is the more reasonable if we believe that talking about a textual
problem in a textual commentary (referring to the core data of lemma,
variants and witnesses) or from any EXTERNAL locus is just as important
as viewing textual variation from within a given document (that
document's "lemma" over against the readings of other witnesses cited only
by their sigla).  I think the former is the proper (or optimal) way to
conceptualize "textual variation" anyway, but incline even more strongly
to it for pragmatic reasons.  Textual parallels may be thought of as
immediate candidates for hypertext even if the taxonomy of links (link
types) is underdeveloped.  We simply declare that loci A and B and C
(where A, B, C may be points, spans, or discontinuous spans) are
equivalent in some way, and let the application handle the expression of
those links.  For encoding textual variation, the taxonomy of links must
(usually) be much richer: texts A, B, and C are not just formally
equivalent, but they are related in very specific ways.  The complex
network of inter-dependencies between parallel objects is one important
factor in making the encoding of textual variation more demanding.  I
would be interested in the judgments of others about the relevance of
hypertext (link types) to textual criticism; the paper of Steve DeRose
("Expanding the Notion of Links," Hypertext 1989 Conference) is
suggestive, but I have been unable to talk to him about this in detail.

    Based upon exchanges with other scholars interested in textual
variation, I sense that each one's formulation of a model for encoding
textual variation will strongly reflect the particular field of textual
studies (modern literature? epigraphic texts? ancient texts?), the
adequacy of printed critical editions in the field, and the particular
text-critical theories each one embraces.  These biases probably cannot
be escaped, nor are they necessarily bad.  I would suggest that for TEI
encoding "standards" (recommendations) to be accepted in humanities
areas, it will be necessary to create user manuals highly specific to
sub-specialty areas.  Scholarly needs and goals may be very different in
specific areas, and the domain-experts in each literary field should be
assisted in the refinement of prioritized goals in which encoding of
text-critically-relevant information will play an important role.  In
some fields, standard editions may contain evidence from manuscripts that
are badly and incompletely collated, so that encoding through
programmatic re-collation would be an optimal effort.  In other fields,
the most fruitful historical results may come from intense study of
scribal ductus and manuscript paleography which has hitherto been
inadequately investigated.  Other fields may be blessed with superb
critical editions in which the encoding of critical apparatuses alone may
yield a rich, complete text-critical database.  The "manuals" for
"encoding textual variants" should therefore reflect the variable
situations and priorities in our respective literary fields.  At the same
time, I feel it would be wise to develop as general a model as possible
for the encoding of information germane to the text-critical enterprise.
We may accept this on the strength of the obvious assumption that for
standards-purposes, general solutions are better than dedicated solutions
(e.g., solutions which are matched to current applications, or to the lack
of them).

    We are fortunate to have on the TEI-REP subcommittee and immediate
orbit several authorities on critical editions and machine-collation:

*Wilhelm Ott (Director of the Abteilung für literarische und
    dokumentarische Datenverarbeitung, Zentrum für Datenverarbeitung,
    University of Tübingen; TUSTEP programs for text collation and
    textual data processing)
*Susan Hockey and Lou Burnard (most recently with Peter Robinson in
    development of the COLLATE programs for Old Norse; cf.
    <cit>LLC</> 4/2, 99-105 and 4/3, 174-181)
*Manfred Thaller (whose representations for text-critical data I have
    not seen, unfortunately)
*Michael Sperberg-McQueen (superb control of the standard, combined with
    background in letters; I have learned far less than I should have in
    several email exchanges; SGML-izing of EDMACS)
*Dominik Wujastyk (EDMACS, = modification of John Lavagnino's TEXTED.TEX
    macros, for formatting critical apparatuses)
*Robert Kraft and Emanuel Tov (CATSS project at the University of
    Pennsylvania/Hebrew University, Jerusalem)
*R.-Ferdinand Poswick (Secretary of AIBI and director of the
    multi-lingual biblical databank project Fils d'Abraham at the Centre
    Informatique et Bible, Maredsous)
*Claus Huitfeldt (Norwegian Wittgenstein Project)

I look forward to the contributions of these individuals on encoding
text-critical data as TEI moves into its second cycle.


    My assignment in TEI TRR4 (6.9) is specified as providing assistance
on encoding of the "Critical Apparatus."  Textual apparatuses have been
used in printed books for several centuries, so the critical apparatus is
undeniably an issue of concern for TEI-REP.  In my earlier TEI-REP paper
(TEI TRW2, "Factor 2"), I outlined several reasons why I (nevertheless)
felt that a focus upon the "Critical Apparatus" was not the best approach
to modeling the encoding of textual variation and textual transmission.

    If the goal of the TEI-REP encoding recommendations is to provide
mechanisms for the encoding of critical apparatuses <emp>exactly as they
appear in printed manuscripts and editions</>, then we are faced with a
broad task of surveying all kinds of critical apparatuses used in world
literature (something I have been unable to do).  But volumes I have
access to employ significantly different conventions in the critical
apparatus (some of which, like the "score" or /Partitur-Umschrift/, could
possibly be handled as textual parallels; see TEI TRW2 "Factor 7," final
paragraph).  If the goal is to provide mechanisms for encoding
text-critical information in critical apparatuses within a new standard
sub-document type (which represents the "best" traditions of critical
apparatuses, e.g., in the Oxford critical editions), then the task is
easier.  If the goal is to provide general mechanisms for recording
knowledge/information germane to textual composition, textual evolution,
recension and transmission, then we must determine whether and how
optimally the encoding of the critical apparatus contributes to that
process, and how to encode analysis <emp>not</> contained within the
traditional critical apparatus.  In any case, the goal involves more than
simply the "markup" of a single document: it involves the encoding of
complex relationships between elements of many documents.  The analysis
involved in the taxonomy of relationships appears to me tangential to the
forms of literary and linguistic analysis being developed in the TEI-ANA
group.

    I briefly outline my recommendation here. I believe it is more
important to focus on encoding KNOWLEDGE about textual readings and
textual evolution (e.g., information which is traditionally contained in
separate volumes or sections of textual commentary), with a view to the
creation of a text-critical database.  Much of this knowledge is not,
traditionally, information coded in critical apparatuses: the fact that
it <emp>could be</> is less relevant than the fact that it <emp>is not</>
for discernible reasons.  My perspective is that data/knowledge about
textual variants and textual evolution is parallel to the matter of
(encoded) linguistic annotations, which contribute to lexicons, grammars,
corpus-linguistics databases and other linguistic tools.  Thus, many of
the critical editions I own have separate sections or associated
companion volumes containing textual commentary: additional tables and
lists of examples of orthographic practices, dialectical features,
scribal proclivities, tell-tale recensional data, etc.  Such data would
be far more valuable if included as part of the text-critical database
(enriched even more because <emp>exhaustive</> lists and tables, based
upon annotations, could then be generated).  In cases where text-critical,
philological, or other literary-critical data are mixed in the same
commentary, it may be adequate or preferable to link (via our "textual
parallel" mechanism) the text or textual apparatus to the commentary.  On
balance, I still judge that much of the information in question is most
useful in a database, and that "critical apparatus" and "textual
commentary" should not simply be encoded as separate subdocuments.

    In light of several exchanges with Michael Sperberg-McQueen, I am
prepared to believe that my perspective arises from experience with
ancient oriental (especially, biblical) texts, and that it may be
unrepresentative of the scholarly interests and goals in a majority of
literary fields. For this reason, I relegate to the Appendix some
additional arguments and illustrative data on this point, but I do not
ask that everyone read them.  In attempting to represent the interests of
SBL/AOS/ASOR, I do feel it is necessary to document these concerns (which
may be more germane to TEI cycle two).


    Here follows a (preliminary) list of features which I recommend be
considered as standard (where applicable) for an enriched encoding scheme
-- encoding which would be destined for a database, from which improved
critical apparatuses may be printed, including expression-driven,
user-specified critical apparatuses.  For this purpose, I accept that we may
differentiate "lemma" and "alternative readings," though it is unclear
from a database point of view why a "lemma" deserves a privileged place
or why its features would be any different from those of alternative
readings.  The distinction is useful in that traditional critical
apparatuses often do represent "lemma" and "alternative readings" in
different ways.  Obviously, many of these features will pertain only to
textual arenas where multiple languages and long traditions are involved
(e.g., most sacred texts and other religious literature).  This first
list is a prosaic description of the most important desiderata, but it is
followed by a list of features in more formal terms.

(1) exact identification of witness(es) offered as variands
    -- Variant readings may usefully be grouped together "in support" of
a certain textual alternative reading, and especially, with implied or
explicit top-level normalization for readings in various languages.
However, this grouping should not come at the expense of other important
underlying information, including the listing of <emp>actual</> witnesses,
and should not obscure other details.  Readings should not be grouped
together unless every other relevant attribute (other than witness-id and
textual_locus) is identical, <emp>or</> unless access to the differences
in detail (language, orthography, completeness -- complete versus
restored or partial readings -- etc.) is provided for.  If groupings of
dissimilar witnesses are needed for convenience (as in the Goettingen
LXX, for example), then a mechanism for viewing the underlying individual
readings should be supported.
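
In SGML terms, this requirement might be sketched as follows; the
element and attribute names (rdg.grp, rdg, wit, state, etc.) and the
readings themselves are invented here for illustration, and are not
proposed TEI names:

```sgml
<!-- A sketch only: names and data are hypothetical, not proposed TEI -->
<rdg.grp norm="normalized-form">
  <rdg wit="MS.A"   lang="grc" locus="3:16">reading-a</rdg>
  <rdg wit="MS.B"   lang="grc" locus="3:16" state="restored">reading-a</rdg>
  <rdg wit="Vers.S" lang="syr" locus="3:16">reading-b</rdg>
</rdg.grp>
```

A formatter could collapse such a group into a conventional apparatus
entry, while a database application could still retrieve each witness's
language, locus, and state of preservation individually.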

(2) declaration of the NAME of the canonical referencing scheme
    -- The DTD of the (app crit's) containing document will presumably
include some identification of the traditional referencing scheme used
(most relevant when similar/identical referencing schemes actually point
to different content).  In order to ensure proper synchronization, the
names of the referencing schemes of the alternative witnesses (which may
not be available in electronic format) should be declared; such
information will also be useful to machines when the texts of the
alternative readings are machine-readable.

(3) exact identification of textual locus for each cited witness
    -- Just as the full identification of witnesses and "irrelevant"
details about their readings are often suppressed, the exact locations
are often not given.  The notation of a siglum (viz., the <emp>presence</>
of the siglum) usually implies "go look THERE; you will find it at the
appropriate point."  I consider this unsatisfactory, because machines
will not be able to retrieve the context of the alternative reading
merely from a siglum which implies "go look for it."  The referencing
system(s) for textual_locus are to be determined by TEI-REP, and should
be included as properties of the alternative readings.

(4) the language in which the alternative readings occur

(5) the script, and/or orthographic level and/or transcriptional system
    --  Applications will have difficulty comparing alternative readings
which occur in different languages, scripts, orthographic strata or
transcriptional schemes.  While the (app crit's) containing document will
presumably contain these declarations in the DTD, relevant information
about the alternative readings must be supplied in the encoding of the
variants.  This requirement applies equally to scholarly emendations
offered as part of textual reconstructions or restorations, or as
proposed primitive readings.

(6) the exact reading of a witness, not just a notation THAT it exists
    -- When alternative readings all occur in a single language (the same
language as the lemma) this requirement may be obvious or unnecessary.
The feature is more important (non-optional) in cases where
alternate_readings occur in various languages, and the
language/script/orthography of the alternative_reading is different from
that of the lemma.
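
Taken together, requirements (4) through (6) suggest that each
alternative reading carry its exact text as element content, with
language, script, and transcriptional system declared as attributes.
Again a hypothetical sketch, with invented names and values:

```sgml
<!-- Hypothetical attribute names; values for illustration only -->
<rdg wit="Vers.S" lang="syr" script="estrangela"
     translit="none" ortho="standard" locus="3:16">exact-reading-here</rdg>
```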

(7) encoding of linguistic/literary attributes of characters, morphemes,
words, phrases, clauses (etc.) in the lemma and in the alternative
readings
    -- Readings frequently vary in predictable (but text-critically
important) ways: grammatical concord, scribal hyper-correction at other
linguistic levels, and so forth.  The linguistic annotations on the
readings (lemma and variands) will bear on their interpretation and
frequently be useful in classification of the reading for
machine-collation and quantitative analysis.

(8) encoding of paleographic and similar information
    -- Such details will sometimes not be known, but encoding should be
supported for publication of new texts and text fragments, and in cases
where manuscript collation is used to verify readings.  These notations
would register information about restored readings, character-level
annotations about degree of legibility, alternative paleographic
interpretations, erasures, corrections, marginal, supralinear, or
sublinear (interlinear) readings, etc.
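
Character-level paleographic annotation might be sketched with embedded
elements of this (hypothetical) kind; the element names (restored,
unclear, corr) and the data are invented for illustration:

```sgml
<!-- Invented element names; schematic data only -->
<rdg wit="MS.A" lang="grc" locus="3:16">
  <restored cert="probable">ku</restored>rio<unclear reason="faded">s</unclear>
  <corr hand="second" place="supralinear">alternate-form</corr>
</rdg>
```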

(9) encoding of known or assumed inter-dependencies between the
alternative reading and the lemma, or between various alternative
readings
    -- Such annotations are more appropriate when the readings can
obviously be recognized as derivative from typologically-antecedent
readings, or when some readings are translations into derivative
languages, and so forth.  More subtle evidence of genetic/stemmatic
relationships may be shown by collation programs, of course; I envision
here some standard kinds of dependencies (variously applicable to
different textual arenas).

(10) a top-level normalized rendering (retroversion, transcription
conversion, orthographic-stratum conversion) of the "readings"
    -- Some kind of normalization for lemma and alternative reading is
usually implied in a standard critical apparatus, but it may be necessary
or desirable to make this representation explicit, to allow machine
collation of all the readings together, or to allow linguists to make use
of the text-critical database.  (Perhaps this could always be inferred
from elements and the associated attribute lists, but I'm not sure.)
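
An explicit normalization might be encoded alongside the exact reading,
rather than left implicit.  A hypothetical sketch (the names are
invented, and the kurios/yhwh retroversion serves merely as an
illustration of a Greek reading paired with a Hebrew normalization):

```sgml
<!-- Invented names; exact reading plus an explicit retroversion -->
<rdg wit="MS.A" lang="grc" locus="3:16">
  <orig>kurios</orig>
  <norm type="retroversion" lang="heb">yhwh</norm>
</rdg>
```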

(11) translation of the lemma and alternate readings into modern
languages
    -- This may be viewed as a concession, and even an inappropriate
concession, to non-specialists.  I feel it can be justified, and should
be encouraged, to assist non-specialists, including inter-disciplinary
scholarship, in surveying the specialists' field.  Similarly, it seems
unnecessary to use Latin as a single "standard" language, when other
international languages would make the information more accessible to
persons who have legitimate interest in the data (personal opinion).

(12) evaluation of the typological placement of each alternate reading
(and the lemma) within its own "inner dimension" (language group,
geographical region, time period, etc.) and within the larger scope of
textual evolution
    -- This information would be germane in textual arenas which have
long traditions, or multiple phases of textual evolution.

(13) evaluation/explanation of the alternate readings (and lemma)
in terms of standard text-critical comments
    -- Examples (will vary across fields): the reading represents
expansion, euphemistic rendering, paleographic confusion, dittography,
haplography, aural error, rendering based upon homographic root
(polysemy), (hyper-) etymological translation, midrashic paraphrase,
misplaced marginal/supralinear gloss, modernizing, secondary grammatical
adjustment, other conflation, etc.   I confess that some standard sigla
make sense only on the convention that each witness (as a published,
encoded document) has its own "lemma" as over against the rest of the
universe's "alternative readings"  (e.g., "textual plus," "textual
minus," "different word order"), and these would have to be converted or
interpreted when moving the annotations to a database.
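
Such evaluations could be encoded as attribute values drawn from a
controlled, field-specific vocabulary, so that a database application
can select or suppress readings by text-critical class.  A hypothetical
sketch, with invented attribute name and values:

```sgml
<!-- Invented attribute (eval) and vocabulary; values vary by field -->
<rdg wit="MS.B" lang="grc" locus="3:16" eval="dittography">reading-b</rdg>
<rdg wit="MS.C" lang="grc" locus="3:16" eval="gloss.misplaced">reading-c</rdg>
```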