
Title: A proposal for a Humanities Text Processing Protocol
Keywords: text analysis, architecture, tools
Author: Manfred Thaller
Affiliation: University of Bergen, Norway
Email: manfred.thaller@uib.no
Contact address: HIT, University of Bergen,
Fax: +44 (0)171 873-5081
Phone: +44 (0)171 873-2680

A proposal for a Humanities Text Processing Protocol

After almost a decade during which there seemed to be rather little interest in the creation of Humanities software, several groups are now actively interested in it. When we discuss a "Humanities Software" project today, the stress should be on the possibilities of co-operation. Co-operation is most easily achieved if people can work in their own environments and ignore their partners - while still being sure that the results of their labours will ultimately fit together with the work of others.

To achieve the design of the "Humanities Software for the Future" we need to make three sets of decisions: about the functionality the software shall provide, about the tools and languages with which it is built, and about the infrastructure through which its components communicate.

The first of these decisions - functionality - will obviously be much influenced by the Humanities background of the different partners: what is interesting for a literary scholar may be remote from the interests of a historian, and vice versa. The second - the choice of tools to be used - may almost bring us to religious conflicts. There are many valid reasons to use Delphi for the creation of applications, for example, although some people will refuse even to look at it. There are many reasons why C or C++ are highly attractive; and many people will consider it offensive to be asked to go down to that level of technical detail.

The infrastructure decision is so important precisely because different people will tend to make different choices in the first two: if we reach agreement there, all the other questions can have multiple answers and still result in software that can flourish side by side. This third area, unfortunately, raises questions which are rather "hard core" in a technical sense. Only when we agree completely, and in great detail, on what our programmes can expect from each other can we expect them to communicate.

To design software in such a way (that components produced by independent parties can easily fit together) seems at first glance to be a very difficult task -- particularly if individual parties are to have total freedom in choosing the programming languages in which the individual components are written.

Once we acknowledge, however, that it is necessary to cope with the task on a rather concrete technical level, then there are fairly straightforward ways to achieve such interoperability. We propose here the definition of such a system - usually called a protocol - within which software components shall be able to communicate with each other. A written draft for such a protocol will be available at the presentation of this paper; the paper itself will focus on the major design decisions involved.

The basic characteristics proposed are:

  1. For communication between modules implementing different functionality, text shall be represented as an ordered sequence of tokens, where each token can carry an arbitrarily complex set of attributes. Such an ordered sequence of tokens is called a tokenset.
  2. To support text transformations, where one token can be converted into a set of logically equivalent ones (e.g. cases where lemmatizers produce more than one possible solution), we assume that a token is a non-ordered set of strings (though most frequently a set with exactly one member).
  3. Tokensets are recursive, i.e. a token of a tokenset can itself be a tokenset in turn.
  4. Strings are implemented in such a way that all primitive functions for their handling (comparison, concatenation etc.) are transparent with respect to the constituent parts of a string. Such constituent parts can have different character sizes (1 byte (= ASCII), 2 bytes (= Unicode), 4 bytes (= Chinese encoding schemes)); they may also contain more complex models representing textual properties, handled at a lower level of processing.
  5. The protocol provides tools for tokenisation. Tokenisation is defined as the process of converting a marked-up text into the internal representation discussed above. Tokenisation functions accept sets of tokenisation rules together with input strings, which are then converted into tokensets. (Note: Tokenisation obviously describes the process of parsing, e.g. putting an SGML text into a form where a browser can decide how to display the tokens found. The proposed protocol does not define parsers or browsers as such. It provides all the tools necessary to build one, however.)
  6. The protocol provides tools for navigation. A navigation function selects, out of a set of tokensets, those tokensets that either contain specified strings or specified combinations of descriptive arguments.
  7. The protocol provides tools for transformation. A transformation function usually converts one tokenset into another. This means that word-based transformations usually have the context of the word available when applying a specific transformation.
  8. The protocol provides tools for indexing. Indexing includes support for all character sizes described above. It provides mechanisms for "fuzzified" indexing, i.e. for working with keys which match only approximately, where the rules for the degree to which two keys have to match in order to be considered identical are administered at run time. Put another way: an index, once created, can be accessed with different degrees of precision, without having to be rebuilt.
  9. The protocol provides tools for pattern matching, including regular expressions, as well as a pattern librarian for the administration of more complex patterns, which can operate on the string level as well as on the level of tokensets.
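The tokenset model of items 1-3 can be illustrated with a minimal sketch. The class and field names below are hypothetical illustrations, not part of the proposed protocol: a token is an unordered set of alternative strings (usually one) plus attributes, and a tokenset is an ordered sequence whose elements may themselves be tokensets.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    # alternative surface forms; usually exactly one, but a lemmatizer
    # may produce several logically equivalent solutions (item 2)
    forms: frozenset
    attributes: dict = field(default_factory=dict)

@dataclass
class TokenSet:
    # ordered elements; each one is a Token or a nested TokenSet (item 3)
    elements: list
    attributes: dict = field(default_factory=dict)

# a lemmatizer offering two possible solutions for one word:
ambiguous = Token(frozenset({"lead", "led"}), {"pos": "verb"})
sentence = TokenSet([Token(frozenset({"She"})), ambiguous])
nested = TokenSet([sentence])   # a tokenset can be a member of another
```

The unordered set of forms keeps ambiguity local to the token, so later modules can resolve it without restructuring the surrounding tokenset.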
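Item 5 (tokenisation) can be sketched as a function that accepts a set of rules together with an input string. Everything here is a hypothetical illustration under simplifying assumptions: the rules map a tag name to an attribute to record, and a tokenset is flattened to a list of (word, attributes) pairs.

```python
import re

def tokenise(rules, text):
    """Convert a lightly marked-up text into a flat list of
    (word, attributes) pairs, guided by a rule set."""
    tokens, active = [], {}
    # split on tags but keep them, so markup and content alternate
    for piece in re.split(r"(<[^>]+>)", text):
        if piece.startswith("</"):
            # closing tag: drop the attribute it had switched on
            active.pop(rules.get(piece[2:-1], None), None)
        elif piece.startswith("<"):
            tag = piece[1:-1]
            if tag in rules:
                active[rules[tag]] = True
        else:
            for word in piece.split():
                tokens.append((word, dict(active)))
    return tokens

rules = {"quote": "quoted"}
result = tokenise(rules, "He said <quote>good morning</quote> twice")
# words inside <quote>...</quote> carry the attribute quoted=True
```

This is exactly the division of labour the note in item 5 describes: the protocol supplies the tokenisation machinery, while the rule set (and anything like a full SGML parser built on top of it) remains the application's business.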
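A navigation function in the sense of item 6 might look as follows. This is a sketch under assumed names: tokensets are represented as lists of (word, attributes) pairs, and the function selects those tokensets containing a given string or a given combination of descriptive attributes.

```python
def navigate(tokensets, string=None, attributes=None):
    """Select the tokensets containing a token that matches the
    given string and/or the given attribute combination."""
    selected = []
    for ts in tokensets:
        for word, attrs in ts:
            if string is not None and word != string:
                continue
            if attributes and any(attrs.get(k) != v
                                  for k, v in attributes.items()):
                continue
            selected.append(ts)   # one matching token suffices
            break
    return selected

sentences = [
    [("veni", {"lang": "la"}), ("vidi", {"lang": "la"})],
    [("came", {"lang": "en"}), ("saw", {"lang": "en"})],
]
latin = navigate(sentences, attributes={"lang": "la"})
hits = navigate(sentences, string="saw")
```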
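The "fuzzified" indexing of item 8 hinges on one idea: the index is built once over exact keys, and the matching rule is supplied at query time. The sketch below is a hypothetical illustration of that separation; the prefix rule at the end stands in for whatever run-time precision rules the protocol would administer.

```python
def build_index(words):
    """Build the index once, over exact keys."""
    index = {}
    for pos, word in enumerate(words):
        index.setdefault(word, []).append(pos)
    return index

def lookup(index, key, match=None):
    """Query the same index with a run-time matching rule;
    match=None means exact matching."""
    if match is None:
        return sorted(index.get(key, []))
    hits = []
    for stored, positions in index.items():
        if match(key, stored):
            hits.extend(positions)
    return sorted(hits)

words = ["colour", "color", "colors", "column"]
idx = build_index(words)
exact = lookup(idx, "color")
# a looser run-time rule: keys agreeing on the first four characters
fuzzy = lookup(idx, "color", lambda a, b: b.startswith(a[:4]))
```

Because the degree of precision lives in the `match` argument rather than in the stored keys, the same index serves exact and approximate access without ever being rebuilt, which is the property item 8 asks for.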