Formal Methods and Text Processing

In this section, we focus on the relevance of formal methods to the processing of textual information, and in particular on emergent theories concerned with the nature of text encoding and its central relevance to the humanities in general and textual scholarship in particular.

Formal methods involve the definition of data structures and algorithms capable of representing both the materials of textual scholarship and the processes typically carried out on them. Since the publication of a seminal article by Coombs, Renear et al. in 1987 (`Markup Systems and the Future of Scholarly Text Processing', Communications of the ACM 30.11: 933-947), and more particularly since the publication of the Text Encoding Initiative's influential Guidelines for Electronic Text Encoding and Interchange in 1994, a clear consensus appears to have emerged within the scholarly community in favour of descriptive markup languages as a means of representing textual data. With it has come a corresponding focus on the effectiveness of such languages as formal representations for the data structures and algorithms relevant to those disciplines.

The fact that such descriptive languages and their associated theoretical assumptions have also come to dominate the world of commercial data processing, and in particular the Internet, does not of course imply that such languages and such methods are necessarily the best for academic purposes: only that they are those most likely to be encountered, and most likely to be well understood or supported by non-academic or non-humanities oriented data processing professionals.

What grounds are there for assuming that the methods appropriate for electronic commerce and the worldwide distribution of soft pornography are also methods appropriate to humanities scholarship and textual research? To what extent do the formal methods underlying the world wide web facilitate the development of a better formal understanding of the business of textual scholarship? This section will argue that there is in fact a surprisingly close overlap, and that, far from being peripheral or in opposition to the humanistic endeavour, text encoding and markup are central to it.

We start with the observation that textual material is prepared and used in digital form for a very wide variety of purposes, and by users from quite different academic disciplines. Over-simplifying, we may identify at least the following broad groups of academics likely to have an interest in formal methods for textual scholarship:

The relevance of textual encoding theory to the first broadly defined group seems self-evident: it addresses directly the central concerns of that community relating to the identification of a text independent of its carrier and has much to say about textuality itself, as further discussed below. For the second group, also, text encoding theory provides more than simply a useful ragbag of tried and tested techniques or a low-cost way of integrating multiple resources (though neither of these advantages is to be underestimated); it also directly facilitates the kind of polyvalent, multipurpose, analyses against a common corpus of observed material which typifies the field of applied linguistics. The various models and abstract schemata which typify computational linguistics have also informed the development of descriptive markup languages; such languages can now also be used in those contexts.

Text encoding theory, by providing a language for the representation of arbitrarily complex formal systems, can also offer something to those in the third group above. The historian interested in the interplay between his or her conception of the social system made concrete by a set of documents has often had to choose between a focus on the primary source itself on the one hand, or on an abstraction (such as a relational database) derived from one reading of it, on the other. Descriptive markup languages offer a bridge between these two perceptually different worlds: the annotation or markup relating to the text can coexist with annotation relating to its referents.
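The bridge between source and abstraction can be sketched in a few lines of Python. The fragment below is entirely invented for illustration: the element names (record, entry, name, sum) and the attributes (n, ref, pence) are assumptions, not drawn from any published scheme. The same encoded document yields both a transcription of the source as written and a set of database-like rows about its referents.

```python
import xml.etree.ElementTree as ET

# Invented account-book fragment. The element content transcribes the source;
# the attributes record one reading of what the source refers to.
DOC = """<record>
  <entry n="1"><name ref="p1">Jn. Smyth</name> paid <sum pence="30">ii s. vi d.</sum></entry>
  <entry n="2"><name ref="p1">John Smith</name> paid <sum pence="12">xii d.</sum></entry>
</record>"""

def transcription(xml_text):
    """Return the text of each entry as written, with markup removed."""
    root = ET.fromstring(xml_text)
    return [" ".join("".join(e.itertext()).split()) for e in root.iter("entry")]

def rows(xml_text):
    """Return (entry number, person id, name as written, pence) tuples."""
    root = ET.fromstring(xml_text)
    out = []
    for entry in root.iter("entry"):
        name = entry.find("name")
        amount = entry.find("sum")
        out.append((entry.get("n"), name.get("ref"), name.text,
                    int(amount.get("pence"))))
    return out

def total_by_person(xml_text):
    """Aggregate payments per referent, regardless of spelling in the source."""
    totals = {}
    for _, person, _, pence in rows(xml_text):
        totals[person] = totals.get(person, 0) + pence
    return totals
```

Here the two spellings of the name survive in the transcription, while the shared ref value records the editor's judgement that both denote the same person.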

As to our final group of potential users, the system independence implicit in the uncoupling of process and data which typifies descriptive markup languages surely is the best hope for longevity in that most evanescent of cultural artefacts: the digital document.

It also seems clear that although originally developed for quite other purposes, declarative markup languages have also made interesting and important contributions to the evolution of textual scholarship itself, not just by providing us with hard cases and useful tools, but also by transforming the way we perceive text (see e.g. Renear, Fiormonte, Monist). Declarative markup languages are typically used to assert properties of the parts of a given document: for example, that this segment of a text is an instance of some generic type of textual segment (e.g. this is a title), together with the properties of that specific instance (e.g. this title is in English). The formal separation of assertions about the ontological status of document components from assertions about how they are to be processed is one of the key differences between descriptive markup languages and their predecessors. The separation has a number of pragmatic benefits (reusability of content, multiplicity of application, simplification of processing etc.) but also marks a significant assertion about what text really is: in quite a traditional (non-post-modernist) way, textuality is now grounded in something exterior to the physical text. Just as the librarian distinguishes work and copy, so the textual scholar distinguishes text and reading. With the availability of a markup language, the textual scholar now has a tool to make the reading explicit, i.e. processable, within the text.
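A minimal sketch may make this separation concrete. The fragment and its vocabulary (section, title, p, and a lang attribute) are hypothetical; the point is only that the document asserts what its parts are, while two unrelated processes decide independently what to do with them.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the markup says what each part is (a title, in
# English; a paragraph) and nothing about how it should be displayed.
SAMPLE = """<section>
  <title lang="en">The Scope of Markup</title>
  <p>The term markup covers a range of interpretive acts.</p>
</section>"""

def describe(xml_text):
    """List (generic type, instance properties, content) for every element."""
    root = ET.fromstring(xml_text)
    return [(el.tag, dict(el.attrib), (el.text or "").strip())
            for el in root.iter()]

def render_plain(xml_text):
    """One possible process: plain text with titles in capitals."""
    root = ET.fromstring(xml_text)
    lines = []
    for el in root:
        text = (el.text or "").strip()
        lines.append(text.upper() if el.tag == "title" else text)
    return "\n".join(lines)

def list_titles(xml_text, lang):
    """Another process over the same data: titles in a given language."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("title") if el.get("lang") == lang]
```

Neither function required any change to the encoded text, which is the pragmatic benefit of reusability in miniature.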

Descriptive markup languages make feasible the definition of textual grammars, that is the definition of meta-statements specifying how element types can meaningfully co-occur in documents, and in particular what dependency or other relations exist between them. A DTD thus defines not just that there is such a thing as a title, but also that titles should appear at the start of sections rather than in the middle of them, and that a title contains the same kind of other objects as a paragraph. An SGML application can permit the definition of non-hierarchic relationships such as that between a heading and its entry in a table of contents.
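The notion of a textual grammar can be illustrated with a deliberately simplified sketch. What follows is not DTD validation (the Python standard library does not provide it); it is a toy content-model checker under assumed rules, in which each hypothetical element type is paired with a regular expression over the names of its children. It expresses the two constraints mentioned above: a title must come first in a section, and titles share their content model with paragraphs.

```python
import re
import xml.etree.ElementTree as ET

# Assumed content shared by titles and paragraphs: any number of <emph>.
INLINE = r"(emph,)*"

# Hypothetical grammar: element type -> regular expression over the
# comma-terminated sequence of child element names.
CONTENT_MODELS = {
    "section": r"title,(p,|section,)+",  # title first, then paragraphs or subsections
    "title": INLINE,
    "p": INLINE,
    "emph": r"",                         # no child elements allowed
}

def violations(xml_text):
    """Return a description of every element whose children break its model."""
    root = ET.fromstring(xml_text)
    problems = []
    for el in root.iter():
        model = CONTENT_MODELS.get(el.tag)
        children = "".join(child.tag + "," for child in el)
        if model is None:
            problems.append(f"undeclared element type: {el.tag}")
        elif not re.fullmatch(model, children):
            problems.append(f"{el.tag}: children [{children}] do not match {model}")
    return problems

GOOD = "<section><title>A <emph>b</emph></title><p>x</p></section>"
BAD = "<section><p>x</p><title>A</title></section>"  # title in the wrong place
```

Calling violations(GOOD) returns an empty list, while violations(BAD) reports the misplaced title. A real DTD or schema expresses such constraints declaratively and is checked by a validating parser rather than by hand-written code.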

We noted above that declarative markup languages have been very valuable in the computational analysis of linguistic materials. Their versatility enables scholars to represent in a uniform way such very different aspects of a text as its formal organization (as paragraphs, headings etc.), the paratextual aspects associated with that or other organizations of it, analytic information concerning its interpretation, its linguistic or rhetorical structure and so forth, a point to which we return below. They also provide a formalism at least as powerful as any other for the representation of the complex abstractions typifying much work in computational linguistics and artificial intelligence.

Aspects of the text

It seems self-evident that a text has at least three major axes along which we may attempt to analyse it, and that its analysis thus implies the application of at least three interlocking semiotic systems. A text is simultaneously an image (which may be transferred from one physical instance to another, by various imaging techniques), a linguistic construct (which may equally be encoded using different modalities, as when a written text is performed), and an information structure (it has an important quality of "aboutness" and bears semantic content relating to a perception of the world at large). It may be noteworthy that these three dimensions seem also to be reflected in three different kinds of software: word processing software focussing on the appearance of text, text retrieval software focussing on its linguistic components, and database systems focussing on its `meaning'.

Texts and their meanings are not however to be constrained by the capabilities of software. They remain defiantly both linguistic and physical objects; their formal organization may seem to be linear but is generally not, being characterized by multiple hierarchic structures and interlinked components. Moreover, as cultural objects, they are at once products of and defined by specific contexts. By context I mean here not simply a consideration of the agency carrying intellectual responsibility for a text, but also its intended, presumed, or actual audience, its intended or assumed function, and so forth. And in a highly textualized society such as ours, no text is an island: an important aspect of any text is thus the properties it shares with other texts, the reference it makes to itself and to others, its inter-textuality. And the same is true of the readings of texts.

The scope and variety of the encoding systems we need to envisage in developing a unified account of textual hermeneutics may thus seem very large indeed. The claim of this paper is however that a unified approach remains feasible.

The scope of markup

The term markup covers a range of interpretive acts. Like other semiotic systems, markup has its own lexis and its own syntax. The former determines which features are available for marking, the latter how those features co-exist; we focus here on the former. It seems clear that no violence is done to the term markup if we give it a rather wide ranging scope. We may use it to describe the process by which individual components of a writing or other scheme are represented, and for the simple reduction to linear form which digital recording requires. We can also use it for the more obvious acts of representing structure and appearance, whether original or intended. And markup is also able to represent characterizations such as analysis, interpretation, the affect of a text, or the contexts in which it was or is to be articulated -- the metadata associated with it. Since the range of such features is now more or less co-extensive with the range of interesting things one might want to say, the term is probably in need of some subcategorization. We therefore propose here three broad classes for the myriad textual features which text markup may make explicit: compositional, contextual, and interpretive.

Some typical compositional features include the formal structure of a text -- its constituent sections, chapters, headings etc., as well as its linguistic structure -- its constituent sentences, clauses, words, morphemes etc. From a different perspective, we might identify as compositional features the components of a text's discourse structure -- its exchanges, moves, acts, etc. A third view concerns itself more with the ontological status of a text's composition: its constituent revisions, deletions, additions etc., or its history as a shifting nexus of discrete fragments.

Some typical contextual features include a consideration of the agencies by which a text came into being or is identified as such (its author, title, publisher...) and of the situation in which it is experienced (the intended or actual audience, the mode of performance itself, the predefined category of text to which it explicitly or implicitly belongs...). Some may be identifiable only externally (its subject, text-type, mode), while others are internal (size, encoding, revision status).

Some typical interpretive features include linguistic properties such as morpho-syntactic classifications, lemmatization, sense-disambiguation, identification of particular semantic or discourse features, and in general all kinds of annotation and commentary, for example associating passages in one text with passages in another, or citing instances of a more abstract knowledge structure.

Despite the convenience of this kind of triage, it has to be stressed that at bottom all markup is interpretive. In most encoded texts, features of all three kinds typically co-occur, often at the same point. When, for example, transcribing a word that is hard to understand or uniquely attested within a manuscript (or for that matter an audio tape), the encoder may find it necessary to call on both morphological and semantic information to justify an emendation, while at the same time being obliged to record the fact that an emendation has been made.
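The emendation case can be sketched as follows. The encoding is loosely modelled on the TEI practice of pairing an original reading with its correction (choice, sic, corr), but the sample line and the reason attribute are invented for illustration. The single encoded line records what the source says, what the editor proposes, and on what grounds, and either reading can be recovered on demand.

```python
import xml.etree.ElementTree as ET

# Invented line: the source reads "quik"; the encoder emends to "quick" and
# records a (hypothetical) justification alongside the emendation itself.
LINE = ('<l>The <choice><sic>quik</sic>'
        '<corr reason="spelling">quick</corr></choice> brown fox</l>')

def reading(xml_text, prefer):
    """Flatten the line, taking the 'sic' or 'corr' branch of each choice.

    Only direct children of the root are handled, which suffices here.
    """
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "choice":
            chosen = child.find(prefer)
            parts.append((chosen.text or "") if chosen is not None else "")
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)

def emendations(xml_text):
    """List (source reading, corrected reading, stated reason) triples."""
    root = ET.fromstring(xml_text)
    return [(c.findtext("sic"), c.findtext("corr"), c.find("corr").get("reason"))
            for c in root.iter("choice")]
```

With this sample, reading(LINE, "sic") gives the text as attested and reading(LINE, "corr") the text as emended, while emendations(LINE) exposes the interpretive act itself.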

Two criticisms are routinely made of the view that all encoding is a form of interpretation. One (often associated with Ian Lancashire) asserts that encoding can and should be made independent of any theoretical assumptions: that it is possible to make a neutral transcription without interpretation. Another (often associated with Mark Olsen) asserts that by facilitating many possible interpretations, encoding makes impossible the effective integration and practical use of bodies of encoded material.

We find neither criticism helpful, though both are defensible. It is certainly true that there is a group of textual features which (almost) every scholar will agree to be (almost always) objectively detectable in a given text; and it is also true that the less intelligence an encoding contains, the less intelligence is needed to decode it. But, in the first case, the position of almost any textual feature with respect to the boundary between "obviously objectively detectable" and "requires an interpretive act" can always be moved by at least one case for at least one scholar, while in the second case, there seems little purpose in using the tools offered by digitization unless to the full. Interestingly, both criticisms seem to derive from the same reductionist view of how computers "ought" to be used, arising perhaps from the same unwarranted fear that computationally gained insights might replace human intelligence rather than enhance it.

It now should be apparent why the availability of a single encoding scheme, a unified semiotic system, is of such importance to the emerging discipline of digital transcription. By using a single formalism we reduce the complexity inherent in representing the interconnectedness of all aspects of our hermeneutic analysis, and thus facilitate a polyvalent analysis.

Markup has however another function, in some ways a more critical one. By making explicit a theory about some aspect of a document, markup maps a (human) interpretation of the text into a set of codes on which computer processing can be performed. It thus enables us to record human interpretations in a mechanically shareable way. The availability of large language corpora enables us to improve on impressionistic intuition about the behaviour of language users with reference to something larger than individual experience. In rather the same way, the availability of encoded textual interpretations can make explicit, and thus shareable, a critical consensus about the status of any of the textual features discussed in the previous section for a given text or set of texts. It provides an interlingua for the sharing of interpretations, an accessible hermetic code.

If we see digitized and encoded texts as nothing less than the vehicle by which the scholarly tradition is to be maintained, questions of digital preservation take on a more than esoteric technical interest. And even here, in the world of archival stores and long term digital archiving, a consideration of hermeneutic theory is necessary. The continuity of comprehension on which scholarship depends implies, necessitates indeed, a continuity in the availability of digitally stored information. Digital media, however, are notoriously short lived, as anyone who has ever tried to rescue last year's floppy disk knows. To ensure that data stored on such media remains usable, it must be periodically `refreshed', that is, transferred from one medium to another. If this copying is done bit for bit, that is, with no intervening interpretation, the new copy will be indistinguishable from the original, and thus as usable as the original.
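Refreshment in this sense is easy to sketch and to verify. The example below assumes SHA-256 as the fixity check, which is our choice for illustration rather than anything the argument depends on: the file is copied without interpretation, and the digests of source and copy are compared.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def digest(path):
    """Return the SHA-256 digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def refresh(source, target):
    """Copy source to target bit for bit and report whether the digests agree."""
    shutil.copyfile(source, target)
    return digest(source) == digest(target)

if __name__ == "__main__":
    # Demonstration with a throwaway file standing in for the old medium.
    with tempfile.TemporaryDirectory() as tmp:
        old = Path(tmp) / "old_medium.bin"
        new = Path(tmp) / "new_medium.bin"
        old.write_bytes(b"\x00\x01 some stored text \xff")
        print("copy verified:", refresh(old, new))
```

The check says nothing about whether the bytes can still be understood, only that none of them has changed.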

In that last phrase, however, there lurks a catch. Digital media suffer not only from physical decay, but also from technical obsolescence. The bits on a disk may have been preserved perfectly, but if a computer environment (software and hardware) no longer exists capable of processing them, they are so much noise. Computer environments have changed out of all recognition during the last few years, and show no sign of stabilizing at any point in the future. To ensure that digital data remains comprehensible therefore, simple refreshment of its media is not enough. Instead the data must periodically be `migrated' from one computer environment to another. Migration, in this context, is exactly analogous to the processes of decoding and encoding carried out by a human being when copying from one stored form of a text to another: there is a potential for information loss or transformation in both decoding and encoding stages.
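Character encodings offer a small, concrete instance of migration as decoding followed by encoding. In the sketch below the function names are our own, and the choice of Latin-1, UTF-8 and ASCII is purely illustrative. Stored bytes are decoded under the conventions of the old environment and re-encoded under those of the new; a round trip back to the old form shows whether anything was lost on the way.

```python
def migrate(data: bytes, old: str, new: str) -> bytes:
    """Decode bytes under the old convention, re-encode under the new one."""
    text = data.decode(old)                      # decoding stage
    return text.encode(new, errors="replace")    # encoding stage (may lose information)

def is_lossless(data: bytes, old: str, new: str) -> bool:
    """Check whether migrating and then migrating back recovers the original."""
    migrated = migrate(data, old, new)
    return migrate(migrated, new, old) == data

# A word stored under Latin-1, where the final letter occupies one byte.
STORED = "café".encode("latin-1")
```

With this sample, is_lossless(STORED, "latin-1", "utf-8") is true, because the target can represent everything in the source, whereas is_lossless(STORED, "latin-1", "ascii") is false: the accented letter is replaced by a question mark at the encoding stage and cannot be recovered.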

Where digital encoding techniques may perhaps have an advantage over other forms of encoding information is in their clear separation of markup and content. The markup of a printed or written text may be expressed using a whole range of conventions and expectations, often not even physically explicit (and therefore not preservable) in it. By contrast, the markup of an electronic text may be carried out using a single semiotic system in which any aspect of its interpretation can be made explicit, and therefore preservable. If moreover this markup uses as metalanguage some scheme which is independent of any particular machine environment (for example international standards such as SGML, XML, or ASN.1), the migration problem is reduced to preservation only of the metalanguage used to describe the markup rather than of all its possible applications.

Conclusions

Text encoding provides us with a single semiotic system for expressing the huge variety of scholarly knowledge now at our disposal, through which, by means of which, and in spite of which, our cultural tradition persists. Text markup is currently the best tool at our disposal for ensuring that the hermeneutic circle continues to turn, that our cultural tradition endures.