February 21, 1990                        3909 Swiss Avenue
Robin C. Cover                           Dallas, TX 75204
Draft, TEI-REP Working Paper             (214) 296-1783/841-3657
For TEI-TRR4, sections 6.9 - 6.10        BITNET: zrcc1001@smuvm1
                                         INTERNET: [hidden email]
*Supersedes draft "February 18, 1990"    UUCP: attctc!utafll!robin
 distributed on TEI-REP                  UUCP: attctc!texbell!txsil.robin


ENCODING FOR TEXTUAL PARALLELS AND CRITICAL APPARATUS

INTRODUCTION

In this paper I discuss some proposals and partial solutions to the problems of encoding textual parallels, textual variation and related features. Encoding for textual parallels and textual variation will employ SGML-based (and perhaps non-SGML) constructs developed for other encoding problems, both within the domain of TEI-REP and within the TEI ANA group. The two inter-related topics specifically assigned to me under the current TEI-REP work agenda ("textual parallels" and "textual variants") are dependent upon and closely related to the decisions reached on "Reference Systems" (6.7; Mylonas) and "Cross-References and Text-Links" (TEI TRR4 6.11; DeRose). Likewise, Johansson's work on "Normalization Practices and Documenting Them" (TEI TRR4 7.2.1) bears similarity to the text-critical issue of machine collation and machine analysis of texts in different languages, scripts, transcription systems, orthographic strata, etc. Accordingly, I plan to revise the following sections for TEI TRR2 6.9-6.10 in light of other decisions and recommendations reached in the Oxford TEI-REP meetings.

I. TEXTUAL PARALLELS [revised material to be inserted at TEI TRR2 6.10]

A. DEFINITIONS

I use the term "(textual) parallels" to refer broadly to any two documents, regions of text within the same document, or sets of documents which literary/linguistic analysts would like to compare in some way because they share a significant degree of structure and/or content, or perhaps pedigree. The comparison may be quantitative (e.g., searching for an index of term-frequencies or co-occurrences) or simply visual (e.g., parallel displays which support synchronous scrolling of two or more documents). Provisionally, I think of "textual variants" as a special case of textual parallelism (though requiring much more annotation) in that they are parallel textual objects competing for antiquity, typological priority, authenticity and so forth. The issue of document versioning is tangent to textual criticism in that both involve attempts to reconstruct or record exact evolutionary stages and processes in the composition and transmission of texts. In the more general case, "textual" variant readings are simply parallel texts which contribute information to our understanding of the evolutionary history of an idealized (fictional) literary entity we think of as a literary "text."
Here are some examples of (sub-)documents I would call "parallel" texts:

*a literary allusion and its assumed/purported origin in the classic; a quotation and its contextual source

*a literary text and its translation (into another language, or into multiple languages)

*the same text viewed simultaneously in different scripts, within different orthographic strata or under different transcriptional systems [here the document views are envisioned as being generated from separate documents or parts of a database rather than through document filtering]

*texts which stem from a common source (e.g., the three New Testament synoptics, sharing a common origin in the hypothetical document Q)

*multiple (serial or synchronous) recensions of the same text, which nevertheless maintain a common identity as a certain literary document (e.g., the eight versions of Sargon's annalistic reports; multiple recensions of the flood epic in several Mesopotamian languages at several periods; the long and one-seventh-shorter versions of biblical Jeremiah)

*instances of oral formulaic poetry (line, stanza) which appear in different epic or liturgical compositions

*a literary text and running commentary on that text, including dedicated region(s) of textual commentary on the text or textual apparatus

*texts which bear some unclear textual relationship, but share a significant amount of content and/or structure (e.g., sections of biblical Samuel/Kings/Isa/Chron)

*sections of texts "reused" in other contexts (long sections of certain Psalms in prose texts narrating a liturgy)

*sections of texts excerpted wholesale and thereafter acquiring an independent transmission history (e.g., the biblical Odes)

*paraphrases or synopses (e.g., Josephus' <cit>Antiquities of the Jews</>, which sometimes tracks very closely with the biblical (Greek) Septuagint text)

*a lineated/versified text and its associated (sets of) cross-references to other texts or locations within the same text

*a "clean" text and alternate views of that text based upon its literary or linguistic annotations

B. LEVELS OF GRANULARITY

In order to provide encoding for such parallel texts or regions of texts, the encoder must begin by: (a) determining the level of granularity at which the text(s) are to be tagged; (b) designating names for those textual objects which will be matched as the parallels; and (c) devising a referencing scheme which allows one to point unambiguously to regions of "parallel" text delimited by the tagging (so that reference may be made to these units from external sources). The referencing scheme is critical because the parallels may not involve matched textual objects (e.g., "verses" in one edition with "verses" in another edition). The parallels may involve mappings between points and spans, or between spans of text very unlike in some characteristics.

The levels of granularity chosen of course depend upon the applications. I believe that for representing "parallel" structures and "textual variants," character-level encoding and everything higher must be supported. Certainly this will be required by linguists in the TEI ANA group, but we may imagine multiple uses for character-level encoding for "textual variation" and "parallels" as well. In text criticism, paleographic annotations at character level are often given by various typographic conventions.
For example, in Qumran publications there are 4-5 special sigla attached to characters (usually supralinear attachment) for indicating that the characters are of such-and-such varying degrees of legibility, or are suspended characters in a word, etc. In other cases, uncertain characters are interpreted in terms of a closed set of alternatives: "b/m" (written in Hebrew script) would mean "the character beth or mem -- we know it is one or the other, because both make sense and are paleographically possible, but we don't know which letter this is for sure." Or suppose we wish to study parallel displays in which adjacent orthographic strata are involved, and we want to see "level n+1" orthographic innovations in a color-coded scheme (e.g., in historic periods when quiescent consonants or semi-vowels are being introduced as vocalization markers). Or a literary critic may wish to selectively tag consonance or assonance using character-level or syllable-level annotation.

At a higher (though still sub-word) level, morpheme-level annotations are also necessary both for text criticism and for parallel analysis. Consider the case where a German student may be assisted in Greek language instruction by viewing parallel texts, say a Greek tragedy and a German translation, or an English student learning Hebrew uses synchronized parallel displays for Hebrew and English. In either instance, the student might tab through a sentence in either window and see equivalent (parallel) terms displayed in reverse video in both windows. But morpheme-level tagging would be required: the Hebrew "and/from/his/hellish abode" is written as one "word" (yet four or more "morphemes," as counted by Hebrew linguists) but as four words in English, so the color display of the single Hebrew word will show four color-coded regions matching the similarly color-coded four separate words in English. And German separable prepositions will map at morpheme level to single Greek terms. For text-critical work, variants often involve just differences in person, number, gender, mood, state (whatever) which are resolved in inflectional morphemes; since these kinds of variations need to be annotated for cataloging, it's clear that morpheme-level annotations (morpheme-level tagging) are necessary. Syllable-level marking and morpheme-level marking are necessary for making text-critical comparisons of cuneiform texts, for example, because of mixed orthographies in various scribal traditions, geographic regions, etc. (e.g., the neo-Assyrian scribes use far more logographic (ideographic) spellings with phonetic complements for certain words than Babylonian scribes copying the same texts with syllabic writing); for these languages, text-critical analysis will require not only orthographic normalization, but in some cases syllable- and morpheme-level tags. It is easy to recognize, of course, that linguistic analysis requires morpheme-level annotations, and probably multiple alternate morphemic representations based upon varying linguistic theories.

I have started with two examples of sub-word granularity for parallelism because it is not entirely clear that an SGML encoding should be used for this level of tagging/markup/linking. I have no personal opinion or preference on this matter.
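By way of illustration only -- and using element and attribute names ("char," "cert," "alt") which are entirely hypothetical, not a TEI-REP proposal -- a character-level annotation of the "b/m" case above might be sketched in SGML as follows:

   <!-- hypothetical markup: one character legible only as beth or mem -->
   <!-- "cert" (certainty) and "alt" (alternatives) are invented attributes -->
   <word id=W1><char id=W1.1 cert=low alt="b m">b</char>...</word>

An application could use such attributes to drive the color-coded displays or paleographic sigla described above; whether SGML is the right vehicle for tagging at this level is precisely the question raised next.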
On the authority of two SGML experts, I am cautioned about two potential concerns: (a) the amount of overhead involved in SGML tagging at character- and morpheme-levels is extraordinary -- we can easily get more meta-data than data, which might be a problem; (b) character-level tagging using SGML has (apparently) not yet been adequately tested.

Here follows an example of an encoding of one single Hebrew word. I know it's monstrous: it evoked a gasp of "oy, veh" from a more qualified member of TEI-REP (I suppose not just because it's so ugly and redundant -- it fails to use legal minimizations -- but also because there is no DATA here yet except the marking of morphemes with reference id's). I have delimited only one of the 13 characters; I could also have added a "syllableid," which would be useful in text criticism for cuneiform texts.

========= I wrote: ========

"Suppose I mark up a Hebrew text as follows, where (in my particular DTD) "verseid," "wordid" and "morphid" are required, but "charid" is optional (used mostly for citing Qumran readings, or in manuscript publications themselves). In this sample, I put a charid tag on the first char by way of example because I know there's a textual variant I'll want to talk about. But I use no minimizations yet. Suppose I do something like the following in setting up markers for the first word... (the "text", if you're looking for it, is the first word of II Samuel 22:43 (assuming the Michigan-Claremont IRV characters arrive intact at your end): W:)E$:XFAQ"73M, where "73" is the accent number from the standard Michigan-Claremont encoding scheme, and the word means "(thus) I beat them flat")

<verse id=S2.22.43><word id=S2.22.43.1><morph id=S2.22.43.1.1><char
id=S2.22.43.1.1.1>w</char>:</morph><morph id=S2.22.43.1.2>)E</morph><morph
id=S2.22.43.1.3>$:XFQ</morph><morph id=S2.22.43.1.4>"73M</morph></word>....

(This example would be unbearably monstrous if charid's were assigned to each of the thirteen characters in the word.)

================

One obvious criticism of this approach is that much of the encoding (e.g., the charid's) can be generated by a program; so why introduce this stuff into the text? Why not just let the application handle it, and use some non-SGML system for referencing? On the hypothesis that such encoding is at least legal, and would help satisfy our need for character- and morpheme-level annotations, it remains unclear to me whether this encoding would actually be useful in solving the problem of "parallel" and "textually variant" texts. If character- and morpheme-level units tagged with an SGML-based tagging system cannot be addressed (referenced) with SGML mechanisms (IDREF, IDREFS) for purposes of synchronization, then perhaps they are of lesser value; perhaps a non-SGML (applications-level) solution should be proposed. The issue of SGML-based referencing will be discussed more below.

Besides the fact that this markup is ugly, redundant, and can in part be programmatically generated, there are other criticisms: (a) current processors will gag and choke; (b) the file now defies the canon of "human-readable" SGML files. Possible answers to these objections: (a) "don't let processors (except batch processors working as filters) look at the monstrosity," or "processors are getting faster all the time"; and (b) "so what? (give up the myth of 'human-readable' SGML)." A non-SGML solution which would be well-suited (?)
to humans and computers alike would be to use a standard referencing system down to verse or line level, then some kind of regular-expression style "pattern match" to get the offsets. Of course, this is the method used by human editors and authors in traditional scholarship. We say, "as reported by Josephus in Antiquities XVI.187, '(preceding_context) relevant_text (following_context)' (Loeb/Marcus edition) blah blah..." For use of this convention in encoding, however, the rules would need to be much more rigorously defined and enforced (how to qualify second occurrences of substrings; how to designate omission of characters/words with ellipsis). The advantage of this system is that it is instantly democratized, because it is based upon unambiguous rules and is applicable to paper editions as well as electronic editions. A scholar in a remote part of the world, publishing a new text or encoding an extant text, could cross-reference any other "parallel" text at any level of granularity without knowing how others (unknown to him) were encoding those "parallel" texts. From the machine's point of view, the system is unambiguous and efficient because it uses character offsets in the file. The downside would be that character offsets (in machine terms) are vulnerable when files are changed in minor ways, and methods for updating offsets/indices or redundancy checks would place another burden on the application.

C. SGML-BASED REFERENCING SYSTEM

More will be said below about the kinds of annotations which need to be attached at character-, morpheme-, word-, arbitrary_string- and other (linguistic) levels for encoding textual variation. Since parallel texts are just declared to be "matched" (but not HOW), the situation is easier. For the purpose of synchronizing textual regions for parallel display or analysis, the primary concern is whether/how the SGML-based encoding can be used, in connection with other factors, to help drive these applications. Some of the "other factors" will now be discussed. Most of these issues were discussed at length in my earlier TEI-REP paper (TEI TRW2, available in 2 parts on the UICVM listserver as TRW2PT1 SCRIPT and TRW2PT2 SCRIPT), and need not be rehearsed in detail here.
*referencing discontinuous segments (morphemes, words) belonging to the same object (SGML id)

*referencing individual sub-elements, and the sequencing of those sub-elements, of such discontinuous segments

*referencing arbitrary strings, word-substrings (e.g., where word boundaries are in dispute, or have been misunderstood); we may think of arbitrary character-offset as one example, though for texts with syllabic, ideographic or mixed writing systems, other levels of arbitrary spans are possible

*multiple, overlapping, alternative hierarchies (without multiple DTD's and burdensome CONCUR overhead)

*synchronizing or normalizing incompatible external referencing schemes which APPEAR TO point to identical content, but do not; normalizing existing canonical referencing schemes (see especially "FACTOR 4" in TEI-TRW2)

*referencing textual elements within parallel streams of an interlinear text, a /Partitur-Umschrift/ ("score") edition, or similar document, where the text streams *could* be conceived as or generated from independent documents, but in which the specific text-geography (presentation) of the printed edition is an essential aspect of the document's personality (see my earlier description of variations, TEI TRW2 "Factor 7")

Two general approaches for synchronization are obvious; in our CDWord hypertext project, we have used the first method: (1) one may place structure markers from text A as empty tags into all other documents which need to be set in synchrony with text A (multiple binary mappings); and (2) one may place artificially-devised "canonical" reference markers in all texts, with the help of ancillary tables which help resolve anomalous situations relative to the canonical standard text. Synchronizations may then be made between the artificial scheme and the individual referencing schemes used in traditional scholarship (incongruent systems being resolved at this level). Whether either of these methods is ideal, and how either would best be implemented with an SGML-based language, I will defer to others. Perhaps there are far better systems. I have some skepticism about both methods, based on difficulties we have encountered.

Against the first method: (a) the amount of overhead in meta-data will become burdensome for texts which are heavily cross-referenced, and (with CONCUR?) seems to imply multiple or overfull DTD's; (b) it fails to exploit the usefulness of currently existing canonical referencing schemes; (c) it may (?? -- not sure) not support the need to see different sections of the same document in parallel. However, in favor of the first method (multiple binary mappings) is the fact that some synchronizations, especially those at a very low level of granularity, appear to be required at the binary level. Imagine, for example, trying to map a dozen versions or translations of the Bible onto a common "canonical" scheme, based upon unique (SGML) id's for each morpheme in each version, with an ultimate goal of supporting color-coded "parallels" at morpheme level between any two versions or all versions. Binary mappings at this level appear feasible (though labor intensive), but multiple synchronizations via a single canonical scheme seem hard. Maybe it would work, but I have some doubts (general linguists: please speak up). Multiple binary mappings sound like a lot of work, but currently the CATSS project at the University of Pennsylvania/Hebrew University, and the Fils d'Abraham Project (Maredsous), are making binary mappings between Bible versions, apparently with satisfactory results.
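To make the first method more concrete, here is a minimal sketch with entirely hypothetical tag names (the empty "sync" element and its "ref" attribute are invented for illustration). Each marker names a word id in text A (the Hebrew example above); note that a true IDREF cannot point outside its own document instance, so these values would have to be resolved against text A by the application:

   <!-- text B, e.g. an English translation; "sync" is a hypothetical EMPTY element -->
   <verse id=ENG.2SAM.22.43>
   <sync ref=S2.22.43.1>and I beat them flat,
   <sync ref=S2.22.43.2>...
   </verse>

Multiplied across a dozen versions, each binary mapping requires its own set of such markers, which illustrates the overhead objection registered above.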
At a higher level of granularity (e.g., biblical "verse" level), using an idealized general referencing system may be feasible; it works tolerably well in our CDWord project for the purpose of visually aligning (synchronous scrolling) biblical texts, translations, commentaries and other "canonically-structured" documents in parallel displays. I do not know whether there may be more optimal SGML or non-SGML solutions to the referencing problem which avoid cluttering the text with unique id markers. I feel confident that other members of the TEI-REP subcommittee (those with professional training in formal language theory) will have valuable judgments about these markup issues; I can offer nothing beyond the general options I see on the surface. In response to a query, Steve DeRose briefly suggested using an open-ended canonical scheme with integers or whole numbers; presumably Steve (section 6.11, "Cross-References and Text-Links"), Elli Mylonas (section 6.7, "Reference Systems") and David Durand will offer better presentations of the options.

I do feel it would be wise to communicate with the TEI-ANA group about their solution(s) to the problem of character- and morpheme-level annotations. The encoding of text-linguistic features with character-level and morpheme-level annotations is inevitable: how does TEI-ANA propose to deal with id-markers? Another concern parallel to that of TEI-ANA is that in some text-critical arenas (as with linguistic/literary annotations), the volume of text-critical annotation will become immense: where shall these annotations be located in the SGML file? It may become increasingly distasteful (to some??) to envision text-critical encoding (kilobytes per "word" in some cases) interlineated with "the text." It may be that "the text" should be kept free both of id markers and of text-critical annotations (placing this information elsewhere in the document) so that our processors can (directly) read "the text" in real-time. Others might argue that the encoding is not meant to be used by applications directly (but only as data/document interchange), so never mind if the "words" of "the text" are separated by 100,000 characters of meta-data and annotation-data in the flat file. I have no opinion, except that I would like to purchase affordable SGML software for my 386-class machine and not have it choke on my data.

II. TEXTUAL VARIANTS [revised material to be inserted at TEI TRR2 6.9]

A. PRELIMINARIES

I suggested above that "textual variants" may be viewed as special cases of "textual parallels," though more complicated parallels. Such a view is the more reasonable if we believe that talking about a textual problem in a textual commentary (referring to the core data of lemma, variants and witnesses) or from any EXTERNAL locus is just as important as viewing textual variation from within a given document (that document's "lemma" as over against the readings of other witnesses cited only by their sigla). I think the former is the proper (or optimal) way to conceptualize "textual variation" anyway, but I incline even more strongly to it for pragmatic reasons. Textual parallels may be thought of as immediate candidates for hypertext even if the taxonomy of links (link types) is underdeveloped. We simply declare that loci A and B and C (where A, B, C may be points, spans, or discontinuous spans) are equivalent in some way, and let the application handle the expression of those links.
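A bare declaration of that sort might be sketched as follows. The element name "parallel," its "type" values, and the "loci" attribute are all hypothetical; and if the loci live in separate documents, the values cannot be true SGML IDREFS but must be resolved by the application:

   <!-- hypothetical: declare that three loci are parallel, without saying how -->
   <!-- conceived as an EMPTY element; A.3.17 etc. stand for ids in other documents -->
   <parallel id=P1 type=allusion loci="A.3.17 B.12.4 C.88.2">

The richer taxonomy needed for textual variation, discussed next, would have to add to such a bare declaration the specific relations holding among A, B and C.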
For encoding textual variation, the taxonomy of links must (usually) be much richer: texts A, B, and C are not just formally equivalent, but are related in very specific ways. The complex network of inter-dependencies between parallel objects is one important factor in making the encoding of textual variation more demanding. I would be interested in the judgments of others about the relevance of hypertext (link types) to textual criticism; the paper of Steve DeRose ("Expanding the Notion of Links," Hypertext 1989 Conference) is suggestive, but I have been unable to talk to him about this in detail.

Based upon exchanges with other scholars interested in textual variation, I sense that each one's formulation of a model for encoding textual variation will strongly reflect the particular field of textual studies (modern literature? epigraphic texts? ancient texts?), the adequacy of printed critical editions in the field, and the particular text-critical theories each one embraces. These biases probably cannot be escaped, nor are they necessarily bad. I would suggest that for TEI encoding "standards" (recommendations) to be accepted in humanities areas, it will be necessary to create user manuals highly specific to sub-specialty areas. Scholarly needs and goals may be very different in specific areas, and the domain-experts in each literary field should be assisted in the refinement of prioritized goals in which encoding of text-critically-relevant information will play an important role. In some fields, standard editions may contain evidence from manuscripts that are badly and incompletely collated, so that encoding through programmatic re-collation would be an optimal effort. In other fields, the most fruitful historical results may come from intense study of scribal ductus and manuscript paleography which has hitherto been inadequately investigated. Other fields may be blessed with superb critical editions in which the encoding of critical apparatuses alone may yield a rich, complete text-critical database. The "manuals" for "encoding textual variants" should therefore reflect the variable situations and priorities in our respective literary fields.

At the same time, I feel it would be wise to develop as general a model as possible for the encoding of information germane to the text-critical enterprise. We may accept this on the strength of the obvious assumption that for standards purposes, general solutions are better than dedicated solutions (e.g., solutions which are matched to current applications, or lack of applications). We are fortunate to have on the TEI-REP subcommittee and in its immediate orbit several authorities on critical editions and machine-collation:

*Wilhelm Ott (Director of the Abteilung für literarische und dokumentarische Datenverarbeitung, Zentrum für Datenverarbeitung, University of Tübingen; TUSTEP programs for text collation and textual editing)

*Susan Hockey and Lou Burnard (most recently with Peter Robinson in the development of the COLLATE programs for Old Norse; cf.
<cit>LLC</> 4/2, 99-105 and 4/3, 174-181)

*Manfred Thaller (whose representations for text-critical data I have not seen, unfortunately)

*Michael Sperberg-McQueen (superb control of the standard, combined with a background in letters; I have learned far less than I should have in several email exchanges; SGML-izing of EDMACS)

*Dominik Wujastyk (EDMACS, a modification of John Lavagnino's TEXTED.TEX macros, for formatting critical apparatuses)

*Robert Kraft and Emanuel Tov (CATSS project at the University of Pennsylvania/Hebrew University, Jerusalem)

*R.-Ferdinand Poswick (Secretary of AIBI and director of the multi-lingual biblical databank project Fils d'Abraham at the Centre Informatique et Bible, Maredsous)

*Claus Huitfeldt (Norwegian Wittgenstein Project)

I look forward to the contributions of these individuals on encoding text-critical data as TEI moves into its second cycle.

B. THE GOALS OF ENCODING TEXT-CRITICAL DATA

My assignment in TEI TRR4 (6.9) is specified as providing assistance on encoding of the "Critical Apparatus." Textual apparatuses have been used in printed books for several centuries, so the critical apparatus is undeniably an issue of concern for TEI-REP. In my earlier TEI-REP paper (TEI TRW2, "Factor 2"), I outlined several reasons why I (nevertheless) felt that a focus upon the "Critical Apparatus" was not the best approach to modeling the encoding of textual variation and textual transmission.

If the goal of the TEI-REP encoding recommendations is to provide mechanisms for the encoding of critical apparatuses <emp>exactly as they appear in printed manuscripts and editions</>, then we are faced with a broad task of surveying all kinds of critical apparatuses used in world literature (something I have been unable to do). But the volumes I have access to employ significantly different conventions in the critical apparatus (some of which, like the "score" or /Partitur-Umschrift/, could possibly be handled as textual parallels; see TEI TRW2 "Factor 7," final paragraph). If the goal is to provide mechanisms for encoding text-critical information in critical apparatuses within a new standard sub-document type (which represents the "best" traditions of critical apparatuses, e.g., in the Oxford critical editions), then the task is easier. If the goal is to provide general mechanisms for recording knowledge/information germane to textual composition, textual evolution, recension and transmission, then we must determine whether and how optimally the encoding of the critical apparatus contributes to that process, and how to encode analysis <emp>not</> contained within the traditional critical apparatus. In any case, the goal involves more than simply the "markup" of a single document: it involves the encoding of complex relationships between elements of many documents. The analysis involved in the taxonomy of relationships appears to me tangent to the forms of literary and linguistic analysis being developed in the TEI-ANA group.

I briefly outline my recommendation here. I believe it is more important to focus on encoding KNOWLEDGE about textual readings and textual evolution (e.g., information which is traditionally contained in separate volumes or sections of textual commentary), with a view to the creation of a text-critical database. Much of this knowledge is not, traditionally, information coded in critical apparatuses: the fact that it <emp>could be</> is less relevant than the fact that it <emp>is not</>, for discernible reasons.
My perspective is that data/knowledge about textual variants and textual evolution is parallel to the matter of (encoded) linguistic annotations, which contribute to lexicons, grammars, corpus-linguistics databases and other linguistic tools. Thus, many of the critical editions I own have separate sections or associated companion volumes containing textual commentary: additional tables and lists of examples of orthographic practices, dialectal features, scribal proclivities, tell-tale recensional data, etc. Such data would be far more valuable if included as part of the text-critical database (enriched even more because <emp>exhaustive</> lists and tables, based upon annotations, could then be generated). In cases where text-critical, philological and other literary-critical data are mixed in the same commentary, it may be adequate or preferable to link (via our "textual parallel" mechanism) the text or textual apparatus to the commentary. On balance, I still judge that much of the information in question is most useful in a database, and that "critical apparatus" and "textual commentary" should not simply be encoded as separate subdocuments.

In light of several exchanges with Michael Sperberg-McQueen, I am prepared to believe that my perspective arises from experience with ancient oriental (especially biblical) texts, and that it may be unrepresentative of the scholarly interests and goals in a majority of literary fields. For this reason, I relegate to the Appendix some additional arguments and illustrative data on this point, but I do not ask that everyone read them. In attempting to represent the interests of SBL/AOS/ASOR, I do feel it is necessary to document these concerns (which may be more germane to TEI cycle two).

C. ENRICHED ENCODING SCHEME

Here follows a (preliminary) list of features which I recommend be considered as standard (where applicable) for an enriched encoding scheme -- encoding which would be destined for a database, from which improved critical apparatuses may be printed, including expression-driven, user-specified critical apparatuses. For this purpose, I accept that we may differentiate "lemma" and "alternative readings," though it is unclear from a database point of view why a "lemma" deserves a privileged place or why its features would be any different from those of alternative readings. The distinction is useful in that traditional critical apparatuses often do represent "lemma" and "alternative readings" in different ways. Obviously, many of these features will pertain only to textual arenas where multiple languages and long traditions are involved (e.g., most sacred texts and other religious literature). This first list is a prosaic description of the most important desiderata, but it is followed by a list of features in more formal terms. (A markup sketch suggesting how several of these features might cohere in a single encoding is given at the end of this section.)

(1) exact identification of witness(es) offered as variands -- Variant readings may usefully be grouped together "in support" of a certain textual alternative reading, especially with implied or explicit top-level normalization for readings in various languages. However, this grouping should not be at the expense of other important underlying information, including the listing of <emp>actual</> witnesses, or at the expense of obscuring other details.
They should not be grouped together unless every other relevant attribute (other than witness-id and textual_locus) is identical, <emp>or</> unless access to the differences in detail is provided for (language, identical orthography, complete (not restored or partial) readings, etc.). If groupings of dissimilar witnesses are needed for convenience (as in the Goettingen LXX, for example), then a mechanism for viewing the underlying individual readings should be supported.

(2) declaration of the NAME of the canonical referencing scheme -- The DTD of the (app crit's) containing document will presumably include some identification of the traditional referencing scheme used (most relevant when similar/identical referencing schemes actually point to different content). In order to provide proper synchronization, the names of the referencing schemes of the alternative witnesses (which may not be available in electronic format) should be provided; such information will also be useful for machines when the texts of the alternative readings are machine-readable.

(3) exact identification of the textual locus for each cited witness -- Just as the full identification of witnesses and "irrelevant" details about their readings are often suppressed, the exact locations are often not given. The notation in a siglum (viz., the <emp>presence</> of the siglum) usually implies: "go look THERE, you'll find it at the appropriate point." I consider this unsatisfactory, because machines will not be able to retrieve the context of the alternative reading merely from a siglum which implies "go look for it." Referencing system(s) for textual_locus are to be determined by TEI-REP, and included as properties of the alternative readings.

(4) the language in which the alternative readings occur

(5) the script, and/or orthographic level and/or transcriptional system -- Applications will have difficulty comparing alternative readings which occur in different languages, scripts, orthographic strata or transcriptional schemes. While the (app crit's) containing document will presumably contain these declarations in the DTD, relevant information about the alternative readings must be supplied in the encoding of the variants. This requirement also applies equally to scholarly emendations offered as part of textual reconstructions or restorations, or offered as primitive readings.

(6) the exact reading of a witness, not just a notation THAT it exists -- When alternative readings all occur in a single language (the same language as the lemma), this requirement may be obvious or unnecessary. The feature is more important (non-optional) in cases where alternate readings occur in various languages, and the language/script/orthography of the alternative reading is different from that of the lemma.

(7) encoding of linguistic/literary attributes of characters, morphemes, words, phrases, clauses (etc.) in the lemma and in the alternative readings -- Readings frequently vary in predictable (but text-critically important) ways: grammatical concord, scribal hyper-correction at other linguistic levels. The linguistic annotations on the readings (lemma and variands) will bear on their interpretation and will frequently be useful in classifying the reading for machine-collation and quantitative analysis.

(8) encoding of paleographic and similar information -- Such details will sometimes not be known, but encoding should be supported for publication of new texts and text fragments, and in cases where manuscript collation is used to verify readings.
These notations would register information about restored readings, character-level annotations about degree of legibility, alternative paleographic interpretations, erasures, corrections, marginal, supralinear and sublinear (interlinear) readings, etc.

(9) encoding of known or assumed inter-dependencies between the alternative reading and the lemma, or between various alternative readings -- Such annotations are more appropriate when the readings can obviously be recognized as derivative from typologically-antecedent readings, or when some readings are translations in derivative languages, and so forth. More subtle evidence of genetic/stemmatic relationships may be shown by collation programs, of course; I envision here some standard kinds of dependencies (variously applicable to different textual arenas).

(10) a top-level normalized rendering (retroversion, transcription conversion, orthographic-stratum conversion) of the "readings" -- Some kind of normalization for lemma and alternative readings is usually implied in a standard critical apparatus, but it may be necessary or desirable to make this representation explicit, to allow machine collation of all the readings together or to allow linguists to make use of the text-critical database. (Perhaps this could always be inferred from the elements and the associated attribute lists, but I'm not sure.)

(11) translation of the lemma and alternate readings into modern language(s) -- This may be viewed as a concession, and even an inappropriate concession, to non-specialists. I feel it can be justified, and should be encouraged, to assist non-specialists, including inter-disciplinary scholarship, in surveying the specialists' field. Similarly, it seems unnecessary to use Latin as a single "standard" language when other international languages would make the information more accessible to persons who have a legitimate interest in the data (personal opinion).

(12) evaluation of the typological placement of each alternate reading (and the lemma) within its own "inner dimension" (language group, geographical region, time period, etc.) and within the larger scope of textual evolution -- This information would be germane in textual arenas which have long traditions, or multiple phases of textual evolution.

(13) evaluation/explanation of the alternate readings (and lemma) in terms of standard text-critical comments -- Examples (which will vary across fields): the reading represents expansion, euphemistic rendering, paleographic confusion, dittography, haplography, aural error, rendering based upon a homographic root (polysemy), (hyper-)etymological translation, midrashic paraphrase, misplaced marginal/supralinear gloss, modernizing, secondary grammatical adjustment, other conflation, etc.

I confess that some standard sigla make sense only on the convention that each witness (as a published, encoded document) has its own "lemma" as over against the rest of the universe's "alternative readings" (e.g., "textual plus," "textual minus," "different word order"), and these would have to be converted or interpreted when moving the annotations to a database.
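To suggest how several of the thirteen features above might cohere in a single encoding, here is a sketch of one variation unit. It is offered with heavy qualification: every element and attribute name ("vunit," "lemma," "rdg," "wit," "refscheme," "depend," "norm," "gloss," "class," "status") is invented for illustration and is not a proposal; the lemma is simply the Hebrew word cited earlier in this paper; the alternative readings, loci and classifications are left as placeholders rather than real collation data.

   <!-- hypothetical sketch of one variation unit; no real collation data      -->
   <!-- "refscheme" names the canonical referencing scheme (feature 2);        -->
   <!-- "loc" gives the exact locus (3); "wit," "lang," "script" identify      -->
   <!-- witness, language and script (1, 4, 5); element content holds the      -->
   <!-- exact reading (6); "depend," "norm," "gloss," "class" would carry      -->
   <!-- features 9, 10, 11 and 13; "status=partial" gestures at feature 8.     -->
   <vunit loc=S2.22.43.1 refscheme=BHS>
   <lemma wit=MT lang=heb script=square>W:)E$:XFAQ"73M</lemma>
   <rdg wit=LXX lang=grc loc="..." depend=lemma norm="..." gloss="..." class="...">...</rdg>
   <rdg wit="..." lang=heb script=square status=partial>...</rdg>
   </vunit>

From a database of such units, traditional printed apparatuses -- and user-specified variations on them -- could in principle be generated, which is the direction recommended above.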