Information and image processing
by Manfred Thaller

During the roughly forty years for which Humanities Computing has existed, computing equipment has progressed from punched cards to colourful images on the WWW. During that time, the products of Humanities Computing have changed along with the technology: from the concordances and cross-tabulations, which had almost the character of archetypes in the first two decades, to interactive multimedia presentations.

Still, much of that change is superficial. Whether one uses a printed concordance or a full-text database with a mouse-driven interface does not change the significance of the word patterns found. We ask for the reader's patience, therefore, when we start by defining what information means in the Humanities, instead of describing the advances in the interfaces used to analyse it.

While significant variation exists between individual disciplines within the Humanities, there is, broadly speaking, one major difference between them as a whole and other fields of study, particularly the hard sciences: the Humanities in general have very little influence on the creation of the information they process. The strength of a magnetic field is measured directly in units which can be analysed by computational equipment. The style of a painting is a property which can be ascribed by a trained observer with some degree of intersubjective consensus among similarly trained individuals. However, the assumptions going into the assignment of that description are infinitely further removed from any meaningful way of processing the resulting keyword than the concept of a continuous field strength is from the way in which floating point numbers are handled.

Systematically, we can speak of three types of information, for which we will use the following terms in this section:

Raw information is derived from an original by a purely mechanical process. Typical examples are digitised sound and images.

Transcribed information is produced by a process which tries to differentiate between those properties of the original that are deemed significant and those that are not; the process of transcription does not intentionally change the content to be derived. A clear example is a transcription of a spoken interview or a handwritten source, where the transcriber filters out background noise or visual properties of the writing. Let it be noted, to prepare a later argument, that the introduction of the concept of "significance" makes this kind of information much more specific to a research environment than raw information. While a digitised interview will be meaningful for all sorts of language studies as well as for oral historians, the decision to remove background noises like laughter from a transcription will significantly reduce its usefulness for some, but not all, research paradigms.

Coded information, finally, is information in which the content of the original is transferred into another set of symbols: descriptions of paintings by Iconclass codes come immediately to mind, or statistical data sets containing collapsible numeric codes.

Computing methods are used in the Humanities on all three of these levels. They are also used to transfer information between them. OCR turns printed text from raw information into transcribed information, and computer-supported content analysis (though not as popular today as it was some years ago) is more or less a systematised attempt at converting transcribed text into coded information.
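The three levels and the transfers between them can be illustrated with a small sketch. It is a toy model only, not a description of any actual OCR or content analysis system: the class names, the two functions and the sample codebook are all hypothetical, and the "raw" input is simply a byte string standing in for a digitised page.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Raw:
    """Raw information: derived mechanically from the original."""
    data: bytes


@dataclass(frozen=True)
class Transcribed:
    """Transcribed information: only properties deemed significant remain."""
    text: str


@dataclass(frozen=True)
class Coded:
    """Coded information: content expressed in another set of symbols."""
    codes: dict


def toy_transcribe(raw: Raw) -> Transcribed:
    # Hypothetical stand-in for OCR. Line breaks and spacing are treated
    # as insignificant visual properties and are filtered out.
    text = raw.data.decode("utf-8")
    return Transcribed(" ".join(text.split()))


def toy_content_analysis(source: Transcribed, codebook: dict) -> Coded:
    # Hypothetical stand-in for content analysis: each word listed in the
    # codebook is replaced by its category, and categories are counted.
    words = [w.strip(".,;:!?").lower() for w in source.text.split()]
    counts = Counter(codebook[w] for w in words if w in codebook)
    return Coded(dict(counts))


page = Raw(b"The  king\n and the QUEEN met the bishop.")
transcript = toy_transcribe(page)
codebook = {"king": "royalty", "queen": "royalty", "bishop": "clergy"}
coded = toy_content_analysis(transcript, codebook)
```

Each step discards something: the transcription loses the layout of the page, and the coding loses the wording. Which losses are acceptable depends on the research question, which is the point made above about "significance".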

While we have described these categories in order of increasing distance from the original material on which the analysis is based, historically Humanities Computing has developed in exactly the opposite direction. That is, while in earlier years the emphasis was on the use of computing to analyse the relationships and dependencies between coded properties of the objects of analysis, we are now moving more and more towards attempts at analysing the raw information the Humanities have to deal with.

How we evaluate the significance of that development depends very much on the general methodological approach a researcher follows. One position is that the methodological quality of a scientific argument is influenced very centrally, though not exclusively, by two factors: (a) the ability to explain the largest possible amount of evidence and (b) the intersubjectivity of the chain of argumentation.

While not always explicit, these assumptions have been with us since the earliest days of Humanities Computing. In history, for example, the major argument for the introduction of computer usage was the ability to use "mass sources", where the information contained in huge numbers of individual events, meaningless by themselves, could be sensibly integrated into statistical arguments. And much of the opposition to it arose from a discussion of whether statistical argumentation actually increased intersubjectivity, as all the assumptions had to be made explicit, or whether on the contrary it damaged it, as statistical training was now needed to understand the argumentation.

These two methodological assumptions are always a useful starting point for a discussion of the significance of the processing of information within the Humanities, all the more so as the trend to move from an analysis of coded information towards raw information has taken major steps forward in recent years.

As a case in point, the ability to handle images digitally has many important effects. In line with the arguments given above, leading practitioners in art history are currently moving towards formalisations of concepts like "style" or "colour usage" which are based on a direct analysis of the image material.
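To give a rough idea of what a formalisation of "colour usage" based directly on image material might look like, the following sketch computes a coarse colour histogram and compares two images by it. This is one simple, generic technique chosen for illustration, not the method of any particular practitioner; the function names and the representation of an image as a list of (r, g, b) tuples with values from 0 to 255 are assumptions.

```python
from collections import Counter


def colour_histogram(pixels, bins_per_channel=4):
    """Share of pixels falling into each coarse RGB bin.

    Assumes bins_per_channel divides 256 evenly (e.g. 2, 4, 8, 16).
    """
    if not pixels:
        return {}
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_id: n / total for bin_id, n in counts.items()}


def histogram_distance(h1, h2):
    """L1 distance between two histograms.

    0.0 means identical colour usage; 2.0 means no bin is shared.
    """
    bins = set(h1) | set(h2)
    return sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in bins)


# Two tiny hypothetical "images": mostly red with some blue, and pure green.
image_a = [(255, 0, 0)] * 3 + [(0, 0, 255)]
image_b = [(0, 255, 0)] * 2

hist_a = colour_histogram(image_a)
hist_b = colour_histogram(image_b)
difference = histogram_distance(hist_a, hist_b)
```

A measure of this kind works on raw information and needs no prior keyword assigned by an observer. Whether it captures anything an art historian would recognise as "colour usage" remains a matter for the discipline to judge.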

From a more general perspective, the arrival of image handling capabilities has changed very basic assumptions about the use of computers in the Humanities. A few years ago, it was obvious that computing in the Humanities meant first and foremost the application of computers within research. The explosion of visually attractive presentation tools has changed this quite fundamentally. In many cases, nowadays, the use of computers in the Humanities seems to be focused more on the didactically well-formed presentation of results than on their generation.

This new emphasis on visuality may have some surprising effects in the near future. Notice, for example, that much of the current discussion of markup schemes started from the fundamental assumption that the visual attributes of a text were just arbitrary indications of a conceptual dimension, while their visual representation was irrelevant. It will be interesting to see whether this assumption survives the state of technological development which originally favoured this notion, if it did not introduce it.

In other fields, too, the introduction of image processing has had unexpected results. Until very recently the use of digital resources in the world of images, raw information in our terminology, was centred on art history and art historical objects, while manuscripts were rather a side area. At the moment it almost looks as if that is changing: one of the fastest growing sectors in digital resources for the Humanities is currently the digitised collections of books and manuscripts created by libraries and archives.

It is somewhat alarming that these resources are mainly created outside of Humanities research and produced by institutions which traditionally have focused on the accessibility of material, not on its production. This is all the more so because it may be the background of one of the more fundamental changes which the information technologies are currently creating, though the Humanities may not be as aware of it as they should be.

One of the constants in all considerations of how to make sources available in the Humanities subjects has always been that their visual reproduction was very costly, specifically much more costly than the publication of their transcriptions or descriptions. All the Humanities disciplines have therefore focused on rules for selecting relatively small numbers of sources which were sufficiently important or canonical to merit reproduction by transcription or description. These, though cheaper than photographic reproductions, were still very expensive. Many Humanities disciplines and sub-disciplines are based, therefore, on a very intensive and detailed discussion of relatively small numbers of canonical texts or corpora.

The tacit assumption behind that strategy no longer holds. It is already clear that the systematic reproduction of huge amounts of source material in digital form is possible today at very low cost, making accessible for discussion, in principle, corpora of sources which are several orders of magnitude larger than those available so far.