Title: A proposal for a Humanities Text Processing Protocol
Keywords: text analysis, architecture, tools
Author: Manfred Thaller
Affiliation: University of Bergen, Norway
Email: manfred.thaller@uib.no
Contact address: HIT, University of Bergen,
Fax: +44 (0)171 873-5081
Phone: +44 (0)171 873-2680
A proposal for a Humanities Text Processing Protocol
After almost a decade during which there seemed to be rather
little interest in the creation of Humanities software, there are
now groups of people who are actively interested in it. When
we discuss a "Humanities Software" project today the stress
should be on the possibilities of co-operation. Co-operation is
most easily achieved if people can work in their own environments
and ignore their partners - while still being sure that the
result of their labours will ultimately fit together
with the work of others.
To achieve the design of the "Humanities Software for the Future"
we need to make three sets of decisions:
- Decisions about the functionality to be implemented.
- Decisions about the software tools to be used.
- Decisions about an infrastructure to enable different
solutions to communicate with each other.
The first of these decisions - functionality - will obviously be
much influenced by the Humanities background of different
partners: what is interesting for a literary scholar may be
remote from the interests of a historian, and vice versa. Indeed,
the choice of tools can come close to provoking religious
conflict. There are many valid reasons to use Delphi
for the creation of applications, for example, although some
people will refuse even to look at it. There are many reasons why
C or C++ are highly attractive, and many people will consider it
offensive to be asked to go down to that level of technical
detail.
The infrastructure decision is so important exactly because
different people will tend to make different decisions in the
first two areas: if we reach agreement there, all the other
questions can have multiple answers and still result in software
that can flourish side by side. This third area, unfortunately,
raises questions which are rather "hard core" in a technical
sense. Only when we can agree completely and in great detail on
what our programmes can expect from each other can we expect them
to communicate.
To design software in such a way (that components produced by
independent parties can easily fit together) seems at first
glance to be a very difficult task -- particularly if there
is to be total freedom for individual parties to choose in which
programming languages the individual components can be
programmed.
Once we acknowledge, however, that it is necessary to cope with
the task on a rather concrete technical level, then there are
fairly straightforward ways to achieve such interoperability. We
propose here the definition of such a system - usually called a
protocol - within which software components shall be able to
communicate with each other. A written draft for such a protocol
will be available at the presentation of this paper; the paper as
presented will focus on the major design decisions involved.
The basic characteristics proposed are:
- For communication between modules implementing different
functionality, text shall be represented as an ordered series of
tokens, where each token can carry an arbitrarily complex set of
attributes. Such an ordered series of tokens is called a tokenset
(see the data model sketch after this list).
- To support text transformations, where one token can be
converted into a set of logically equivalent ones (e.g. cases
where lemmatizers produce more than one possible solution), we
assume that a token is a non-ordered set of strings (though most
frequently a set with exactly one member).
- Tokensets are recursive, i.e. a token of a tokenset can itself
be a tokenset.
- Strings are implemented in such a way that all primitive
functions for their handling (comparison, concatenation etc.) are
transparent with respect to the constituent parts of a string.
Such constituent parts can have different character sizes (1 byte
(= ASCII), 2 byte (= Unicode), 4 byte (= Chinese encoding
schemes)); they may also contain more complex models representing
textual properties, handled at a lower level of the processing
capabilities (see the string sketch below).
- The protocol provides tools for tokenisation. Tokenisation is
defined as the process of converting a marked up text into the
internal representation discussed above. Tokenisation functions
accept sets of tokenisation rules together with input strings,
which are then converted into tokensets (see the tokenisation
sketch below). (Note: Tokenisation obviously describes the
process of parsing, e.g. putting an SGML text into a form where a
browser can decide how to display the tokens found. The proposed
protocol does not define parsers or browsers as such. It provides
all the tools necessary to build one, however.)
- The protocol provides tools for navigation. A navigation
function selects, out of a set of tokensets, those tokensets that
either contain specified strings or specified combinations of
descriptive arguments (see the navigation and transformation
sketch below).
- The protocol provides tools for transformation. A
transformation function usually converts one tokenset into
another one. This means that word-based transformations usually
have the context of the word available when applying a specific
transformation.
- The protocol provides tools for indexing. Indexing includes
support for all character modes described above. It provides
mechanisms for "fuzzified" indexing, i.e. for working with keys
which match only approximately, where the rules for the required
degree to which two keys have to match in order to be considered
identical are administered at run time. To put it another way: an
index, once created, can be accessed with different degrees of
precision applied, without having to rebuild the index (see the
indexing sketch below).
- The protocol provides tools for pattern matching, including
regular expressions, as well as a pattern librarian for the
administration of more complex patterns, which can operate on the
string level as well as on the level of tokensets (see the
pattern librarian sketch below).
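
The data model of the first three points can be made concrete
with a minimal sketch. The following C++ fragment is an
illustration only: the type and member names (Token, TokenSet,
readings, attributes) are assumptions of this sketch, not
identifiers defined by the draft protocol.

    // Minimal sketch of the token / tokenset data model. All names are
    // illustrative assumptions, not identifiers from the draft protocol.
    #include <map>
    #include <memory>
    #include <set>
    #include <string>
    #include <vector>

    struct TokenSet;   // forward declaration: tokensets are recursive

    struct Token {
        // A token is a non-ordered set of logically equivalent strings,
        // e.g. several lemmata proposed by a lemmatizer for one word form.
        std::set<std::string> readings;
        // Arbitrarily complex descriptive attributes attached to the token.
        std::map<std::string, std::string> attributes;
        // A token may itself stand for a whole tokenset (one way of
        // modelling the recursion required by the third point above).
        std::shared_ptr<TokenSet> embedded;
    };

    struct TokenSet {
        std::vector<Token> tokens;   // an ordered series of tokens
    };

A text would then appear, for instance, as one tokenset per
sentence, each token carrying its surface form among its readings
and, say, a part-of-speech tag among its attributes.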
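
The string requirement asks for primitives that are transparent
with respect to the size of the underlying characters. One way to
read that requirement is sketched below, assuming an abstract
interface; the class names HtpString and WideString are
inventions of this sketch.

    // Sketch of string handling that is transparent with respect to the
    // width of the constituent characters. Class names are illustrative only.
    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <utility>
    #include <vector>

    class HtpString {
    public:
        virtual ~HtpString() = default;
        virtual std::size_t length() const = 0;                  // characters, not bytes
        virtual int compare(const HtpString& other) const = 0;   // primitive comparison
        virtual std::unique_ptr<HtpString> concat(const HtpString& other) const = 0;
    };

    // One concrete representation: 4-byte code points, wide enough to hold
    // 1-byte (ASCII) and 2-byte (Unicode) data after widening. More complex
    // models of textual properties would live behind the same interface.
    class WideString : public HtpString {
    public:
        explicit WideString(std::vector<std::uint32_t> cps) : cps_(std::move(cps)) {}
        std::size_t length() const override { return cps_.size(); }
        int compare(const HtpString& other) const override {
            // Simplified ordering; real collation rules would belong to a
            // lower level of the processing capabilities.
            const auto* o = dynamic_cast<const WideString*>(&other);
            if (!o) return 1;
            if (cps_ == o->cps_) return 0;
            return cps_ < o->cps_ ? -1 : 1;
        }
        std::unique_ptr<HtpString> concat(const HtpString& other) const override {
            std::vector<std::uint32_t> joined = cps_;
            if (const auto* o = dynamic_cast<const WideString*>(&other))
                joined.insert(joined.end(), o->cps_.begin(), o->cps_.end());
            return std::make_unique<WideString>(std::move(joined));
        }
    private:
        std::vector<std::uint32_t> cps_;
    };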
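
Tokenisation can be pictured as a function from a rule set and an
input string to a tokenset. The deliberately naive version below
reuses the Token and TokenSet types from the data model sketch
above; the rule representation (a mere list of separator
characters) is a placeholder for the much richer rules the
protocol would have to accept.

    // A deliberately naive tokenisation function: "rules" are reduced to a
    // set of separator characters. Reuses Token / TokenSet from the data
    // model sketch above; all names are illustrative assumptions.
    #include <string>

    struct TokenisationRules {
        std::string separators = " \t\n";   // placeholder for real rules (markup handling etc.)
    };

    TokenSet tokenise(const TokenisationRules& rules, const std::string& input) {
        TokenSet result;
        std::string current;
        auto flush = [&]() {
            if (current.empty()) return;
            Token t;
            t.readings.insert(current);       // one reading per surface form
            result.tokens.push_back(t);
            current.clear();
        };
        for (char c : input) {
            if (rules.separators.find(c) != std::string::npos)
                flush();                      // separator ends the current token
            else
                current.push_back(c);
        }
        flush();                              // last token, if any
        return result;
    }

Applied to "to be or not to be", this toy version would yield a
tokenset of six one-reading tokens.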
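
Navigation and transformation, again assuming the Token and
TokenSet types sketched above, can be illustrated as two ordinary
functions; note how the transformation callback receives the
whole tokenset, so a word-based transformation sees the context
of the word it is working on.

    // Illustrative navigation and transformation functions, reusing the
    // Token / TokenSet types from the data model sketch above.
    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Navigation: select, out of a set of tokensets, those which contain the
    // wanted string among the readings of any of their tokens. (Selection by
    // descriptive attributes would follow the same pattern.)
    std::vector<TokenSet> navigate(const std::vector<TokenSet>& corpus,
                                   const std::string& wanted) {
        std::vector<TokenSet> hits;
        for (const TokenSet& ts : corpus) {
            bool found = std::any_of(ts.tokens.begin(), ts.tokens.end(),
                [&](const Token& t) { return t.readings.count(wanted) > 0; });
            if (found) hits.push_back(ts);
        }
        return hits;
    }

    // Transformation: convert one tokenset into another. The callback is
    // given the whole tokenset and the position of the token to transform,
    // so the context of the word is available.
    TokenSet transform(const TokenSet& in,
                       const std::function<Token(const TokenSet&, std::size_t)>& rule) {
        TokenSet out;
        for (std::size_t i = 0; i < in.tokens.size(); ++i)
            out.tokens.push_back(rule(in, i));
        return out;
    }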
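
The indexing requirement (an index built once, but queried with a
degree of precision chosen at run time) can be illustrated with a
crude stand-in in which "precision" is simply the number of
leading characters that must agree; the matching rules of the
actual protocol would be far more elaborate, and the class name
FuzzyIndex is an invention of this sketch.

    // Crude stand-in for a "fuzzified" index: keys are stored once, in full,
    // and the required degree of agreement is supplied at query time, so the
    // index never has to be rebuilt for a different precision.
    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    class FuzzyIndex {
    public:
        // Record that a key occurs at some position in the text.
        void add(const std::string& key, std::size_t position) {
            entries_.emplace(key, position);
        }
        // precision = number of leading characters that must agree;
        // 0 returns every entry, key.size() demands an exact prefix match.
        std::vector<std::size_t> lookup(const std::string& key,
                                        std::size_t precision) const {
            std::vector<std::size_t> hits;
            const std::size_t n = std::min(precision, key.size());
            for (const auto& entry : entries_) {
                if (entry.first.compare(0, n, key, 0, n) == 0)
                    hits.push_back(entry.second);
            }
            return hits;
        }
    private:
        std::multimap<std::string, std::size_t> entries_;
    };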
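
Finally, the pattern librarian can be pictured as a small
registry of named patterns. In the sketch below ordinary
ECMAScript regular expressions stand in for the richer pattern
language of the proposal; applying a pattern on the level of
tokensets would iterate over the readings of the tokens in the
same way. The class name PatternLibrarian is, again, an
assumption of this sketch.

    // Sketch of a pattern librarian: named patterns are registered once and
    // can then be applied by name. Plain regular expressions stand in for
    // the richer patterns of the proposal; all names are illustrative.
    #include <map>
    #include <regex>
    #include <string>

    class PatternLibrarian {
    public:
        void define(const std::string& name, const std::string& pattern) {
            patterns_[name] = std::regex(pattern);
        }
        bool matches(const std::string& name, const std::string& text) const {
            auto it = patterns_.find(name);
            return it != patterns_.end() && std::regex_search(text, it->second);
        }
    private:
        std::map<std::string, std::regex> patterns_;
    };

After define("roman_numeral", "^[ivxlcdm]+$"), a call to
matches("roman_numeral", "xvii") would return true.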