
Title A New Computing Architecture to Support Text Analysis
Author Bradley, John
Affiliation King's College London
Email john.bradley@kcl.ac.uk
Contact address Centre for Computing in the Humanities, King's College London, Strand, London WC2R 2LS U.K.
Fax +44 (0)171 873-5081
Phone +44 (0)171 873-2680
Author Horton, Tom
Affiliation Florida Atlantic University
Email tom@cse.fau.edu
Phone (561) 297-2674
Fax (561) 297-2800
URL http://pigeon.cc.kcl.ac.uk/ta-dev/notes/design.htm

A New Computing Architecture to Support Text Analysis

Over the past year a small group of researchers and developers have been designing a general architecture into which a wide range of text analysis tools could fit. This proposed architecture is designed to respond both to recent developments in technology and to developments in textual encoding of the kind represented by the TEI. Furthermore, an important element of the architecture lies in its ability to allow independent developers to add or modify existing elements to suit their own needs, including developing modules to support kinds of processing as yet unknown.

Such a text analysis support system needs to exhibit certain characteristics if it is to be suitable for use in such a potentially wide range of different applications.

First, it needs to be modular and open. Textual research in the Humanities spans a very wide set of interests and techniques, and one of the challenges for the designers of a new TA system is to avoid restricting its design to favour one type of application over another. By "modular" we mean that elements of the system (at a number of different levels) need to be cleanly defined so that each component, as far as is possible, stands alone and can be replaced by other equivalent modules if desired. For example, modularity would permit components designed for languages written in alphabetic scripts, such as French or Hebrew, to be replaced by modules that understand the needs of languages written in logographic scripts, such as Chinese. It would permit users who do not need what others consider key components (such as word indexing) still to make effective use of the parts of the system they do need. New modules could be added that would fit with existing elements. Interfaces between modules need to be clearly defined so that output from one module can be passed to a variety of alternative modules for continuing processing. By "open" we mean that a complete specification of the system needs to be accessible to all possible developers so that anyone with the appropriate technical skills may add new components to the system.
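As a rough illustration of what a replaceable module might look like, the following Python sketch treats a tokeniser as any function from a string to a list of tokens, so that one tokeniser can be swapped for another without changing the code that uses it. The names `Tokeniser`, `alphabetic_tokeniser`, `character_tokeniser` and `count_tokens` are invented for this example and are not part of the proposed architecture; in particular, treating each character as a token is only a naive stand-in for genuine Chinese word segmentation.

```python
import re
from typing import Callable, List

# Assumption for this sketch: a "tokeniser module" is any callable
# that maps a text string to a list of token strings.
Tokeniser = Callable[[str], List[str]]


def alphabetic_tokeniser(text: str) -> List[str]:
    """Return runs of letters; suited to scripts that separate words with spaces."""
    return re.findall(r"[^\W\d_]+", text)


def character_tokeniser(text: str) -> List[str]:
    """Return each letter-like character as its own token (naive stand-in
    for real segmentation of a logographic script)."""
    return [ch for ch in text if ch.isalpha()]


def count_tokens(text: str, tokenise: Tokeniser) -> int:
    """A downstream component that works with whichever tokeniser it is given."""
    return len(tokenise(text))


french_count = count_tokens("Le chat dort.", alphabetic_tokeniser)
chinese_count = count_tokens("我爱你。", character_tokeniser)
```

The point of the sketch is only that `count_tokens` depends on the agreed interface and not on any particular tokeniser.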

Second, it needs to be word-oriented. Not all applications of this system will focus on the words of the text (and indeed, as we have already said, the system should be useful to those who turn out not to need its "word-oriented" features), but for much textual research the words are a key component. Users like systems that can show them word lists, that can provide word occurrence counts, and that understand the rich set of word-related structures implied by such notions as "part of speech" or "lemmatisation". Thus, although the system should not be dependent on word-oriented tools, it is natural that many of the tools and structures should be designed to support a word-oriented view of the text, and that the design of the system should be able to accommodate the need to represent these objects.
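The difference between counting word forms and counting lemmas can be shown with a very small Python sketch. The `LEMMAS` table here is a hypothetical, hand-made mapping used purely for illustration; a real system would obtain lemmas from a lemmatiser module or from the markup.

```python
import re
from collections import Counter

# Hypothetical lemma table, assumed for this example only.
LEMMAS = {"roses": "rose", "loved": "love", "loves": "love"}


def word_forms(text):
    """Return the lower-cased word forms found in the text."""
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]


def lemma_counts(text):
    """Count occurrences per lemma, falling back to the word form itself."""
    return Counter(LEMMAS.get(w, w) for w in word_forms(text))


text = "She loved roses; he loves a rose."
form_counts = Counter(word_forms(text))   # counts by surface form
counts = lemma_counts(text)               # counts by lemma
```

Here `form_counts` keeps "loved" and "loves" apart, while `counts` groups them under "love".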

Third, it needs to be element-oriented -- to support highly structured texts of the kind now available to us as a result of the TEI. We have heard many papers at ALLC/ACH conferences past and present from people preparing electronic editions who have taken advantage of different parts of the TEI to identify a varying array of complex textual objects within a text. There are far too few tools that can do anything interesting with these objects once they have been identified. Furthermore, we are beginning to see some potential interest in new research which can be made possible by the simultaneous handling of both much more complex word and element structures in a single environment. An architecture is needed that can bring together both element-oriented and word-oriented tools in a single system.

Fourth, we believe that the full power of any new architecture must rest in its ability to assist not only in "word searching" or "text indexing", but also in the many other types of data transformation that are key to supporting analysis of data -- whether or not the base data is word-oriented. Tools such as TuStep have shown the potential power of a set of basic abstract tools, none of which is specifically "word-oriented", but which can be combined in many different ways to perform a wide range of different functions.

Over the past year we have designed an architecture which is based on the above assumptions and contains four parts:

(1) It contains a collection of textual objects that represents, for a particular user, the text or set of texts under study. These texts might well contain a rich markup of the kind possible with the TEI markup scheme. Current activities in the XML community have resulted in a series of standards which are close to meeting this requirement -- in particular the development of the Document Object Model (DOM), which provides an object-oriented representation of an XML document. The DOM provides a logical model of the document that contains all relationships between its markup and text data. Thus software components that use a DOM representation of a text should be able to manipulate elements within a text, determine reference identification information based on mark-up, use the mark-up to distinguish tokens that have identical token-strings in the file, etc.
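To make the last point concrete, here is a minimal sketch using the DOM implementation in Python's standard library (`xml.dom.minidom`). It walks a small TEI-like fragment and records, for every word, the chain of elements that contains it, so that occurrences of the same token-string can be told apart by their markup context. The sample text and the helper names `element_path` and `tokens_with_context` are assumptions made for this example.

```python
import re
from xml.dom import minidom
from xml.dom.minidom import Node

# A small TEI-like fragment, invented for this illustration.
SAMPLE = (
    "<text><head>Rose garden</head>"
    "<p>A <name>Rose</name> walked past the rose.</p></text>"
)


def element_path(node):
    """Return the names of the elements enclosing `node`, root first."""
    names = []
    current = node.parentNode
    while current is not None and current.nodeType == Node.ELEMENT_NODE:
        names.append(current.tagName)
        current = current.parentNode
    return "/".join(reversed(names))


def tokens_with_context(node):
    """Yield (lower-cased token, element path) pairs in document order."""
    if node.nodeType == Node.TEXT_NODE:
        for word in re.findall(r"\w+", node.data):
            yield word.lower(), element_path(node)
    for child in node.childNodes:
        yield from tokens_with_context(child)


doc = minidom.parseString(SAMPLE)
occurrences = list(tokens_with_context(doc.documentElement))

# The three occurrences of "rose" share a token-string but differ in context.
rose_contexts = [path for token, path in occurrences if token == "rose"]
```

In this sketch the occurrence inside `name` can be separated from the one in ordinary running text purely on the basis of the DOM structure.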

(2) It must also contain a collection of personal textual enhancement objects which can store someone's personal materials about the text. The architecture is designed so that this set of objects could be "overlaid" on top of the base text, allowing the user to modify or extend the objects found in the DOM for the base text, as well as adding entirely new elements to it. This separation of personal textual enhancements from the base texts, plus the ability to present a "combined view" when useful, offers additional advantages -- in particular, it allows the base textual objects and the personal materials to exist on different machines. Furthermore, this model represents a kind of "dynamic representation" that grows and changes each time the results of the application of the analysis tools are stored either as new data attached to existing nodes, or as entirely new nodes and node structures.
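One simple way to picture the overlay idea is to keep the personal material in a separate structure keyed by element identifiers and to merge it with the base text only when a combined view is requested. The Python sketch below does this for attributes alone; the `id` attributes, the `personal_layer` dictionary and the `combined_view` function are all assumptions made for illustration, and the sketch does not cover adding entirely new nodes.

```python
from xml.dom import minidom

# Base text: treated as read-only in this sketch.
BASE = '<text><p><w id="w1">loves</w> <w id="w2">roses</w></p></text>'

# Personal layer: stored apart from the base text (it could live on another
# machine), keyed by the id of the base element it enhances.
personal_layer = {
    "w1": {"lemma": "love", "pos": "verb"},
    "w2": {"lemma": "rose", "pos": "noun"},
}


def combined_view(base_xml, overlay):
    """Build a fresh DOM in which overlay attributes are merged onto
    the matching base elements; the base string itself is not changed."""
    doc = minidom.parseString(base_xml)
    for element in doc.getElementsByTagName("*"):
        extra = overlay.get(element.getAttribute("id"), {})
        for name, value in extra.items():
            element.setAttribute(name, value)
    return doc


view = combined_view(BASE, personal_layer)
first_word = view.getElementsByTagName("w")[0]
```

Because the merge happens on a freshly parsed copy, discarding the view leaves both the base text and the personal layer exactly as they were.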

(3) A further part of the environment will be the collection of tools that work on this combined DOM representation. As mentioned earlier, it is important not to think only of text-searching elements as the "tools". Analysis also includes the manipulation and presentation of data in a number of different ways so that patterns or structures become more visible. Thus, a TA architecture should support transformation tools that can accomplish this task. Many existing programs offer one or two transformation/presentation tools (a KWIC display is an example), but few offer a rich enough set of tools to allow users to design their own display. We believe that it is important to modularise the process -- to break apart the selection, transformation, and presentation elements so that users can "mix and match" components, perhaps from different software suppliers, to meet their particular needs. In order to understand better the potential of modularisation we are examining a number of the transformational tools in inherently modular systems such as TuStep and LT XML, but expect to derive tools also from work currently being done in XML style sheets, extended pointers, query languages and transformation objects.
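The separation of selection, transformation and presentation can be sketched with a KWIC display built from three independent functions. This is only one possible decomposition; the function names `select`, `to_context` and `present_kwic` and their parameters are invented for this example and do not describe TuStep, LT XML or any other existing tool.

```python
import re


def select(tokens, keyword):
    """Selection: yield the position of each occurrence of the keyword."""
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            yield i


def to_context(tokens, positions, width=2):
    """Transformation: turn positions into (left, keyword, right) triples,
    with up to `width` tokens of context on each side."""
    for i in positions:
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        yield left, tokens[i], right


def present_kwic(contexts, left_width=20):
    """Presentation: right-align the left context so keywords line up."""
    return [f"{left:>{left_width}} | {kw} | {right}" for left, kw, right in contexts]


tokens = re.findall(r"\w+", "The rose is a rose and not a tulip")
lines = present_kwic(to_context(tokens, select(tokens, "rose")), left_width=8)
```

Any one of the three stages could be replaced, for instance by a presentation function that produces a frequency table in place of aligned lines, without touching the other two.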

(4) Finally, the environment must contain a framework in which the user could combine the tools. This framework would provide an operating environment in which the user could see the texts, his/her local and remote toolset, and the results of the application of the tools on the texts.

To us it seems obvious that a model of the DOM that is both persistent and dynamic represents a key element of this proposal. We will examine how recent developments and tools in the XML community can be incorporated to support this model of an architecture. In particular, several software implementations of the XML Document Object Model (DOM) will be discussed. We will describe a prototype Java application that uses one of these to provide users with the ability to browse the structure of an XML document and the words it contains in a tightly integrated manner.