Special Problems of Input, Encoding, Display, and Retrieval,
by Joseph N. Bell

3.1c. As more and more material becomes accessible electronically, scholars in the humanities are being confronted with an exponential expansion in the quantity of written sources available, a dramatic improvement in the searchability of this material, and the steady emergence of previously inconceivable methods of processing it. Not all the methods of the past will become obsolete, but the drift towards computational studies that characterizes, for example, linguistics is becoming, or will become, increasingly evident in every branch of the humanities.

As we have in the past trained university students in the use of traditional paper sources for the writing of reports and articles, it is now necessary to give them experience in the use of electronic sources as well. This is the case not only for students who are preparing for advanced research but also for those who are planning careers that will involve the production of general reports, advertising copy, or the like. Standards for citation and preservation of cited material (given the ephemerality of much electronic material today) must also be developed, and students will have to be exposed to them.

Knowledge is stored electronically primarily in the form of text, images, or sound (the latter two in both static and dynamic forms), or as combinations of these media. Each medium presents particular problems for storage and retrieval which must be solved for the medium itself and for the integration of material in that medium into a useable corpus or database.

Of course a great number of eclectic, specialist, and encyclopedic corpora already exist in electronic format, especially in English, and there is a considerable variety of search engines for accessing desired information in them. The Internet is a meta-collection of such corpora, and many beginning university students can today be expected to be at least as familiar with its use, particularly its multimedia aspects, as their teachers. The job of instructors at universities or other advanced institutions of learning will therefore for the most part consist of introducing students to the use of specialized, discipline specific corpora and databases (whether text only or multimedia).

In this connection we can distinguish three fundamental levels of corpora, according to the extent of human (or human-assisted) encoding or tagging of the material (and thus the extent of risk of error):

1. corpora with minimal tagging (sentences, paragraphs, and the like), such as newspaper and journal articles

2. corpora which tag some specific elements in the material in order to disambiguate them for the sake of retrieval or statistical analysis, such as text editions

3. corpora which for similar reasons tag most or all elements in the material, such as many linguistic corpora and collections of patient medical journals

The sophistication of the search engines required will generally be inversely related to the amount of human or human-assisted machine tagging involved. The fewer the tags, the more powerful the search engine must be to give useful results. Corpora of the third type mentioned here are in essence not very different from databases, which constrain the material as it is entered under a given classification.
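The relation between tagging and search effort can be illustrated with a toy example. The sketch below is not drawn from any existing corpus: the two sentences, the tag name "place", and the two helper functions are assumptions made purely for illustration.

```python
# Level 1 corpus: minimal tagging, only sentence boundaries are marked.
level1 = [
    "Reading is a town in Berkshire.",
    "Reading is a useful habit.",
]

# Level 2 corpus: the same sentences, with selected words carrying a
# hypothetical disambiguating tag (None means "untagged").
level2 = [
    [("Reading", "place"), ("is", None), ("a", None), ("town", None),
     ("in", None), ("Berkshire", "place")],
    [("Reading", None), ("is", None), ("a", None), ("useful", None),
     ("habit", None)],
]


def find_untagged(word, sentences):
    """Return indices of sentences that contain the string `word`."""
    return [i for i, s in enumerate(sentences) if word in s]


def find_tagged(word, tag, sentences):
    """Return indices of sentences in which `word` carries `tag`."""
    return [i for i, s in enumerate(sentences) if (word, tag) in s]


print(find_untagged("Reading", level1))         # [0, 1]
print(find_tagged("Reading", "place", level2))  # [0]
```

With no tags, a plain string search cannot tell the town from the activity, and the search engine itself would have to supply the disambiguation. With even one tag, a trivial lookup is sufficient.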

Model corpora and databases

For fields for which there does not exist a sufficient number of corpora or databases to cover most of the problems students will need to confront, model corpora or model databases must be created. These do not have to be large or even cover any problem in its entirety. Where they must be comprehensive is in the range of discipline-specific technical problems they involve, because it is through training with these prototypical collections of material that students will learn to exploit or help to create the electronic sources of the near future in their fields.

One of the most obvious problems in the European context is the need to deal with multilingual, and indeed, multiscript corpora and databases. This presents many as yet unsolved or only partially solved problems in the areas of (1) character encoding and display (the standardization of display font input encoding to make possible automated conversion of massive amounts of text to the Unicode standard), (2) conceptual searching, and (3) textual input techniques (optical character recognition or OCR and mnemonic keyboards to facilitate the manual input of large numbers of different characters and symbols).

1. Character encoding and display

When only one value is used to encode a letter, 7-bit ASCII encoding (128 places) will only accommodate modern English (and other languages like standard Latin which use the same number of characters or fewer). Even with the addition of the upper-ASCII places allowed by 8-bit encoding (total 256 places), common on most computers today, not all European languages using the Latin alphabet can be accommodated at the same time using a single font. That European languages like Greek, Russian, Bulgarian, and Serbian do not use the Latin alphabet further complicates the situation, as does the ever increasing need to create and disseminate multilingual documents containing non-European languages such as Japanese, Arabic, Persian, and Hebrew, to name only a few of the most obvious. An almost limitless number of special characters used in scientific notation and commercial conventions also adds to the problem.
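The limits of 7-bit and 8-bit encodings can be checked directly. The following is a minimal sketch, assuming a present-day Python interpreter and its standard codecs; the sample words and the helper name `fits` are chosen here for illustration only. ISO 8859-1 and ISO 8859-2 stand in for two typical 256-place encodings.

```python
# One sample word per language; the choice of words is arbitrary.
samples = {
    "English": "encoding",
    "French": "déjà",
    "Polish": "łódź",
    "Greek": "λόγος",
    "Russian": "слово",
}


def fits(text, codec):
    """Return True if every character of `text` has a place in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for language, word in samples.items():
    print(
        language,
        "ascii:", fits(word, "ascii"),
        "iso-8859-1:", fits(word, "iso-8859-1"),
        "iso-8859-2:", fits(word, "iso-8859-2"),
    )
```

Only the English word fits in 7-bit ASCII. The French word fits in ISO 8859-1 but not in ISO 8859-2, the Polish word fits in ISO 8859-2 but not in ISO 8859-1, and the Greek and Russian words fit in neither, so no single one of these 256-place tables serves all five languages at once.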

Here, however, it is necessary to distinguish between entering (or encoding) a character or symbol and displaying it on a computer screen or a paper print-out. In the not too distant future, adherence to a 16-bit encoding standard like Unicode (with over 65,000 places) will solve both these problems by the use of a single encoding value and a single character that corresponds to it. At the present time we are regularly obliged to switch fonts to produce characters and symbols that our normal font does not have. This has the disadvantage of giving the new character or symbol the same ASCII value, and for most search engines the same searching value, as some other character we are using. Moreover, switching fonts burdens the underlying computer file with a great deal of additional information and increases its size considerably. To limit the number of fonts we have had recourse to multi-value codes and, often but not always, to corresponding composite characters. An "a" with a grave accent ("à") may be a single image in a given font, or it may be a combination of two. In principle there is nothing wrong with a multi-value encoding system using composite characters as long as one is consistent. But consistency is precisely what has been lacking, both from one language tradition to another and from one computer platform to another.
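Both difficulties described here can be reproduced in a few lines. The sketch below assumes a present-day Python interpreter; three standard 8-bit code pages stand in for the "fonts" of the discussion, and Unicode's combining grave accent stands in for a two-part composite sign. The variable names are illustrative only.

```python
import unicodedata

# 1. One stored value, several characters: the same byte is a different
#    letter depending on which 8-bit table is used to interpret it.
value = bytes([0xE0])
as_latin = value.decode("iso-8859-1")   # Latin small a with grave
as_greek = value.decode("iso-8859-7")   # Greek upsilon with dialytika and tonos
as_cyrillic = value.decode("cp1251")    # Cyrillic small a
print(as_latin, as_greek, as_cyrillic)

# 2. One character, several encodings: "à" as a single value or as
#    a base letter followed by a combining accent.
single = "voil\u00e0"        # à as one value
composite = "voila\u0300"    # a + combining grave accent
print(single == composite)            # False: a naive search misses it
print(single in composite)            # False

# Consistency is restored by normalizing both sides to one form.
normalized = unicodedata.normalize("NFC", composite)
print(single == normalized)           # True
```

The first part shows why font switching confuses search engines: the stored value is identical in all three cases. The second part shows why an inconsistent mix of single and composite encodings defeats retrieval until one form is enforced throughout.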

Many mistakes have been made. For HTML (standard Internet) files, the cross-platform special character features of Internet browsers like Netscape and Internet Explorer would seem to be a step in the right direction for languages using the Latin alphabet with diacritics. But in fact they complicate the rendering across platforms of symbols the programs have not included. Adobe Acrobat (PDF) files, also common on the Internet, can reproduce a wide range of non-Roman scripts, Latin diacritics, and special symbols, but the "Find" function in the Acrobat Reader can seldom retrieve them. In essence this means that the work done converting scholarly journals with scientific notation and files in languages other than English into PDF format has to be done over again.

We do not advocate here, in the few years that remain before almost everyone will be using Unicode or a similar encoding standard based on single, discrete values for many thousands of characters, that any attempt should be made to standardize present encoding schemes. We do argue, however, that automated conversion to Unicode or a similar standard should be an important consideration in the adoption of encoding systems for the model corpora and databases on which we train students, as should ready conversion to a display font, since this is the easiest form to work with visually. The number of programs for processing Unicode data is increasing, so we are already in fact talking of a three-part system, proceeding from (1) 16-bit Unicode to (2) 7-bit or 8-bit character encoding for use on most of today's personal computers and over the Internet to (3) a high quality composite screen/print-out display font. Automated conversion should be possible in both directions. An example of such a font is JAIS1, a version of Times. The single font can represent almost all European languages as well as many specialist diacritics. It functions on both the PC and Macintosh platforms, and also over the Internet. The font may be downloaded free from www.uib.no/jais.
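Conversion in both directions is straightforward when each place in the 8-bit scheme corresponds to exactly one Unicode character. The following is a minimal sketch under that assumption. The three byte values in the table are hypothetical and do not reproduce the actual layout of JAIS1 or of any other font; the function names are likewise illustrative.

```python
import unicodedata

# Hypothetical upper-range slots of an 8-bit display-font encoding.
LEGACY_TO_UNICODE = {
    0x80: "\u1e6d",  # t with dot below
    0x81: "\u1e25",  # h with dot below
    0x82: "\u0101",  # a with macron
}
UNICODE_TO_LEGACY = {char: slot for slot, char in LEGACY_TO_UNICODE.items()}


def legacy_to_unicode(data: bytes) -> str:
    """Convert 8-bit legacy data to Unicode; unmapped slots raise KeyError."""
    chars = []
    for slot in data:
        chars.append(chr(slot) if slot < 0x80 else LEGACY_TO_UNICODE[slot])
    return "".join(chars)


def unicode_to_legacy(text: str) -> bytes:
    """Convert Unicode text back to the 8-bit legacy encoding."""
    out = bytearray()
    # Normalize first so that composite input (letter + combining mark)
    # is mapped to the same slot as the single precomposed character.
    for char in unicodedata.normalize("NFC", text):
        out.append(ord(char) if ord(char) < 0x80 else UNICODE_TO_LEGACY[char])
    return bytes(out)


legacy = b"\x80\x82'"
text = legacy_to_unicode(legacy)
print(text)
print(unicode_to_legacy(text) == legacy)   # True: the round trip is lossless
```

The round trip is lossless only because the table is one-to-one. If two slots rendered what is visually the same sign, the reverse mapping would be ambiguous, which is the practical argument for keeping convertibility in mind when an encoding is adopted.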

2. Conceptual searching

A successful information retrieval system for a given field will be the result of collaboration between computer specialists, linguists, and domain experts. The domain expert theoretically has little need to be familiar with computer language processing, but at this relatively early stage scholars in the humanities will have to attempt to find out what kinds of software are likely to give the most useful results and they will have to make their needs known to programming specialists. Attempts should be made by scholars training students in the use of real or model corpora to select one or more searching or information retrieval systems and to cooperate with the producers in fine tuning them for use with their material, the aim being to maximize the sophistication of the search engine so as to minimize the need for later indexing or manual tagging. (A similar fine tuning can be undertaken with databases, although in this case the software environment is selected in advance.)

In this connection it is essential to recall that a great many European text corpora will, almost by definition, be multilingual. This poses special problems in the selection of search engines. Search engines that make heavy use of syntactic analysis, that in a sense can understand a text, will presumably produce less "noise," that is unwanted hits, than those that do not use syntactic analysis but only a minimum of morphology. But they may take much longer to produce, and they may be considerably less cost-effective in a situation where many different languages are involved.
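What "noise" means in practice can be shown with a toy comparison. The sketch below does not implement syntactic analysis and does not represent any particular retrieval product; it only contrasts exact word matching with simple truncation, a crude stand-in for a search that uses a minimum of morphology. The four sample texts, the hand-assigned relevance judgments, and the function names are assumptions made for illustration.

```python
import re

documents = [
    "The art of the Renaissance",
    "Medieval arts and crafts",
    "An article on arterial disease",
    "Artists of the Weimar period",
]
# Hand-assigned judgment: which texts a reader looking for "art" wants.
relevant = {0, 1, 3}


def tokens(text):
    """Split a text into lower-case word tokens."""
    return re.findall(r"\w+", text.lower())


def exact_search(word, docs):
    """Indices of documents containing `word` as a whole token."""
    return {i for i, d in enumerate(docs) if word in tokens(d)}


def truncation_search(stem, docs):
    """Indices of documents containing any token that begins with `stem`."""
    return {i for i, d in enumerate(docs)
            if any(t.startswith(stem) for t in tokens(d))}


def precision(hits):
    """Share of the retrieved documents that are actually wanted."""
    return len(hits & relevant) / len(hits)


def recall(hits):
    """Share of the wanted documents that were actually retrieved."""
    return len(hits & relevant) / len(relevant)


for name, hits in [("exact", exact_search("art", documents)),
                   ("truncation", truncation_search("art", documents))]:
    print(name, sorted(hits), precision(hits), recall(hits))
```

Exact matching returns no noise but misses two of the three wanted texts. Truncation finds all three but also returns the medical article, and it is this kind of unwanted hit that deeper linguistic analysis is meant to reduce.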

It is important that students gain experience with both types of systems, however, and one should investigate examples of retrieval systems that make use of syntactic analysis as well as of those that do not. The purpose should not be to set one type of system up against the other, but to determine which type of system functions more efficiently for given types of investigations in the humanities and to provide feedback to the producers that will be useful for adapting their products to these specific needs. Examples of the two approaches are the (EURO)SPIDER system, which was developed for use in the multilingual Swiss context, and which is not dependent upon syntactic analysis, and the package of tools dependent on linguistic analysis produced by the Multi-Lingual Theory and Technology team at Xerox in Grenoble.

3. Textual input techniques

We leave aside the rapidly expanding field of speech recognition, which will have enormous consequences for textual input, and particularly for the building of corpora representing natural or everyday speech. The subject has been touched on above (p. XXXX), and in any event the implications of speech recognition technology are too vast for this section of the present chapter. The discussion here will be limited to optical character recognition (OCR) and the development of mnemonic keyboards to simplify the manual input of hundreds or even thousands of individual characters and symbols.

OCR. There are many OCR programs on the market today. Some of them are meant primarily for light office tasks performed on simple material with few problems or irregularities. Others, which in general are of more use in the humanities, especially in a multilingual context, have a very wide range of characters and variants which they can recognize, and they can be trained to recognize a great many more. Some can scan from right to left as well as from left to right, and some again can recognize joined letters and an array of special ligatures, making them able to read scripts like that of Arabic, and, with development, handwriting. One of the best all-round OCR programs available is certainly the multilingual Automatic Reader produced by the Sakhr Company in Egypt (www.sakhr.com) for Arabic Windows. It is based on an earlier DOS program produced in Russia. OCR programs can potentially be trained to recognize almost any image, not simply letters, and the image can then be assigned a code which can be used to reproduce it, either by including the image in a font or by setting as the code a hypertext link to the image. This fact became apparent when an early version of Sakhr's Automatic Reader was used to scan Egyptian hieroglyphs. The program has an English language interface, and even students not interested in Arabic should be encouraged to experiment with the possibilities it offers.

Mnemonic keyboards. One problem arising from the use of a great many composite signs in a display font is ensuring uniform encoding of these signs. For esthetic reasons, to cite one example, the dot one places under a slender "t" to represent ".ta'" in Arabic is not the same as the dot under a broader "h," and consequently it does not have the same encoding value. Carelessness, or disagreement about which of the available dots looks best under a given letter, could lead to a composite sign being encoded in more than one way, which would interfere with searchability and convertibility. To avoid this, a keyboard mapper such as Keys (2.x and above), created by Peter Szaszvari for the Windows platform, may be used to write keyboard macros which ensure that given composite signs are produced in only one way. In the keyboard macros produced for the JAIS1 font mentioned above, in order to make it easy to remember how to produce a given combination, all diacritics are produced typologically by means of a mnemonic letter (for example "b" for breve, "d" for dieresis, "k" for acute). If, when Unicode or a similar standard comes into general use, all the composite characters which can be produced by the font are available as single images, the mnemonic technique will still be of use for accessing these characters, and, indeed, many more. Today's keyboard macro programs are thus useful tools for training students to deal with the character resources 16-bit encoding will make available.
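The principle of the mnemonic macro, one keystroke combination yielding one and only one encoding, can be sketched as follows. This is not the Keys macro language or the actual JAIS1 macro set. The letters "b", "d", and "k" are those cited above; the letter "u" for the dot below, the function name `compose`, and the use of Unicode combining marks as the target are assumptions made for illustration.

```python
import unicodedata

# Mnemonic letter -> Unicode combining mark.
MNEMONICS = {
    "b": "\u0306",  # breve
    "d": "\u0308",  # dieresis
    "k": "\u0301",  # acute
    "u": "\u0323",  # dot below (hypothetical key, not taken from the text)
}


def compose(base, mnemonic):
    """Return the one canonical encoding of `base` plus the named diacritic."""
    mark = MNEMONICS[mnemonic]
    # NFC normalization yields the single precomposed character where
    # Unicode defines one, so the same sign is never stored in two ways.
    return unicodedata.normalize("NFC", base + mark)


for base, key in [("a", "b"), ("u", "d"), ("e", "k"), ("t", "u"), ("h", "u")]:
    sign = compose(base, key)
    print(base, key, sign, [hex(ord(c)) for c in sign])
```

The typist asks for "a dot below" without choosing among visually different dots. Whether the dot sits under a slender "t" or a broad "h", the selection of the stored value is made by the table, not by eye, and the result is a single predictable code for each sign.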