Wolff, Andreev, Olsen

ATE: ARTFL Text Encoding

Mark Wolff
Electronic Text Services
University of Chicago
5424 South Cornell #203
Chicago, IL 60615
mbw3@stone.lib.uchicago.edu

Leonid Andreev
Department of Mathematics
Harvard University
2 Holyoke St. #21
Cambridge, MA 02138
leonid@math.harvard.edu

Mark Olsen
ARTFL Project
University of Chicago
1050 E. 59th Street
Chicago, IL 60637
mark@barkov.uchicago.edu
http://humanities.uchicago.edu/ARTFL/projects/ATE/

The ATE (ARTFL Text Encoding) specification is the encoding scheme for the current generation of textual databases running under PhiloLogic, a full-text search engine developed by ARTFL. The advantage of using ATE is that electronic text centers (such as the University of Chicago Library's Electronic Text Services) can provide sophisticated access to full-text databases without the costs and overhead of encoding document structure that current software cannot easily process. For those who wish to continue using full SGML (such as the TEI and the EAD), PhiloLogic can provide access to their electronic texts right now through data conversion.

PhiloLogic lets users define a corpus of documents by multiple criteria (such as author, title, and date) and search that corpus for words, phrases, and UNIX regular expressions. Search results are displayed in concordance and KWIC (keyword-in-context) formats as well as in frequency-by-title listings. From the concordance and KWIC displays, users can expand the context surrounding their search terms by viewing the page or content object (paragraph, poem, chapter, etc.) containing the terms. Users can also navigate a document from a table of contents that lists its major document objects.
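As a rough illustration of the KWIC format, the following Python sketch searches a plain string with a regular expression and lists each hit with a fixed amount of context on either side. It is not PhiloLogic code: the function name `kwic` and the `width` parameter are assumptions made for this example, and a real system would search an index rather than scan the text.

```python
import re

def kwic(text, pattern, width=20):
    """Return one keyword-in-context line per regex match (illustrative only)."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = "Le chat dort. Le chien court. Les chats jouent."
for line in kwic(sample, r"\bchats?\b", width=12):
    print(line)
```

Padding the left context to a fixed width is what gives a KWIC report its characteristic aligned keyword column.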

SGML in general and the TEI in particular make reporting search results over the web difficult for commercial full-text software because most browsers cannot handle anything other than HTML. XML promises a way for future browsers to handle document structure, but that functionality has yet to become a reality. It will require not only a host of new or modified DTDs but also new browser software to process XML-encoded documents, and all of this will demand specialized skills and staff that may be beyond the resources of most humanities computing organizations.^[1] We agree with one of the authors of the TEI who observes that HTML is currently the most effective delivery vehicle for SGML documents to web browsers.^[2] As he notes, SGML-to-HTML conversion on the fly is complex, costly, and CPU-intensive. Commercial applications such as DynaWeb provide a means to search SGML-encoded documents and send results over the web, but these applications are often cumbersome and require extensive postprocessing of search reports.^[3]

Searching ATE-encoded documents with PhiloLogic over the web avoids many of the problems of commercial SGML search engines because the source files are already in a format web browsers can handle. ATE uses Dublin Core headers for metadata as well as a few optional extensions to HTML. The following header from our general specification is in use both for ARTFL data creation projects, most notably our French Women Writers project, and for loading databases from other sources with SGML DTDs.

<head>

</head>

ATE uses HTML to represent text as an ordered hierarchy of content objects.^[4] The top level of the hierarchy is the document itself, as described in the DC header. Lower text levels are encoded with the older HTML tags <H1>, <H2>, etc., instead of nested <DIV>s:

<body>

<H1>Preface</H1>

... some text and tags ...

<H3>Chapter 1</H3>

... some text and tags ...

<H3>Chapter 2</H3>

... some text and tags ...

</body>

A document need not contain all these divisions. The three-level hierarchy can reflect any number of possible structures, such as Book-Chapter-Verse, or Act-Scene (leaving <H3> unused).
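To make the heading-based hierarchy concrete, here is a minimal Python sketch that reads the <H1> to <H3> tags of a document and returns a flat list of divisions with their levels, from which a table of contents can be printed. The class name `DivisionCollector` and the function `table_of_contents` are hypothetical names chosen for this example; the sketch uses Python's standard `html.parser` module and says nothing about how PhiloLogic itself indexes divisions.

```python
from html.parser import HTMLParser

class DivisionCollector(HTMLParser):
    """Collect (level, title) pairs from <H1>..<H3> headings."""
    LEVELS = {"h1": 1, "h2": 2, "h3": 3}  # HTMLParser lowercases tag names

    def __init__(self):
        super().__init__()
        self.divisions = []   # list of (level, title)
        self._level = None    # level of the heading currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self._level = self.LEVELS[tag]
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self._level is not None:
            self.divisions.append((self._level, "".join(self._buf).strip()))
            self._level = None

def table_of_contents(html):
    parser = DivisionCollector()
    parser.feed(html)
    parser.close()
    return parser.divisions

doc = "<body><H1>Preface</H1>text<H3>Chapter 1</H3>text<H3>Chapter 2</H3>text</body>"
for level, title in table_of_contents(doc):
    print("  " * (level - 1) + title)
```

Because the levels come from flat heading tags rather than nested elements, a division simply extends until the next heading of the same or a higher level.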

Further object levels which descend from the lowest division level are as follows:

<p> = paragraph/stanza

sentence = defined by a set of rules or by optional tags (<sent>)

word = delimited by white space and punctuation
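The lower object levels can be sketched in Python as follows. The specification summarized here does not spell out its sentence rules, so the rule used below (a sentence ends at a period, exclamation mark, or question mark followed by white space) is only an assumption for illustration, as are the function names `split_sentences` and `split_words`.

```python
import re

def split_sentences(paragraph):
    # Assumed rule: a sentence ends at ., ! or ? followed by white space.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def split_words(sentence):
    # \w+ keeps runs of letters, digits and underscores, so white space
    # and punctuation act as delimiters.
    return re.findall(r"\w+", sentence)

para = "Candide was driven out. Where did he go? Nobody knew!"
for n, sent in enumerate(split_sentences(para), start=1):
    print(n, split_words(sent))
```

A rule this simple misfires on abbreviations such as "M." or "etc.", which is one reason an explicit sentence tag is useful as an override.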

Pages constitute optional objects. They do not fit into the structural hierarchy outlined above, but they form a parallel structure for display purposes:

<page n="[ANY STRING]"> where [ANY STRING] is a page object identifier (e.g. page number) from the source edition (page tags occur at the beginning of pages).
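Because pages run in parallel to the division hierarchy, one simple way to model them is as a sorted list of offsets at which each page begins. The Python sketch below, which is an illustration and not PhiloLogic's actual indexing scheme, records where each page tag occurs and then looks up the page that contains a given character offset. The names `page_index` and `page_at` are hypothetical.

```python
import re
from bisect import bisect_right

PAGE_TAG = re.compile(r'<page n="([^"]*)">', re.IGNORECASE)

def page_index(source):
    """Return (offsets, labels): where each page starts and its identifier."""
    offsets, labels = [], []
    for m in PAGE_TAG.finditer(source):
        offsets.append(m.start())
        labels.append(m.group(1))
    return offsets, labels

def page_at(offset, offsets, labels):
    """Identifier of the page containing a character offset, or None."""
    i = bisect_right(offsets, offset) - 1
    return labels[i] if i >= 0 else None

src = '<page n="iv"><p>First paragraph runs <page n="v">over a page break.</p>'
offsets, labels = page_index(src)
hit = src.index("break")
print(page_at(hit, offsets, labels))
```

Note that the page tag in the sample falls inside a paragraph, which is exactly why pages cannot be nested within the structural hierarchy.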

ATE is therefore a specific implementation of HTML 3.2 with a few optional tags. PhiloLogic completely ignores SGML encoding that it does not recognize and simply passes such encoding through the system, to be delivered to the browser or modified on output by a database-specific formatting module. Not only does ATE facilitate the use of well-known systems (such as HTML editors, WWW browsers, etc.) for creating and modifying documents for PhiloLogic databases, but it also provides an experimental infrastructure for intelligent indexing of WWW document spaces.

To import SGML-encoded data into a PhiloLogic database, the SGML must be flattened to ATE. Any SGML data set can be converted to ATE using James Clark's nsgmls parser and David Megginson's Perl module SGMLS.pm. We have found that despite the inconsistencies in text markup from one etext shop to another and even within a single etext shop (such as Chadwyck-Healey, arguably the largest and most consistent producer of encoded text), the ATE specification can be applied to any set of SGML data through parsing and conversion. We have already loaded many databases based on a variety of SGML DTDs, including many Chadwyck-Healey products (Patrologia Latina, Voltaire électronique, Goethes Werke, English Poetry, and many others), TEI documents (such as the Oxford Text Archive's First Folio Shakespeare and documents from Indiana University's Victorian Women Writers Project), and sample Encoded Archival Descriptions (EAD).
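The flattening step can be sketched as follows. The conversion described above is done in Perl with SGMLS.pm; the Python below is a stand-in written for illustration only, and it is not the authors' script. It consumes the line-oriented ESIS output that nsgmls emits, in which a line beginning with "(" opens an element, ")" closes one, and "-" carries character data. The TEI-style element names (DIV1 to DIV3, HEAD, P) and the mapping in `DIV_LEVEL` are assumptions; a real conversion would need one mapping per DTD.

```python
import re

# Hypothetical mapping from source elements to ATE heading levels.
DIV_LEVEL = {"DIV1": 1, "DIV2": 2, "DIV3": 3}

def unescape(data):
    # Handle only the \n (record end) and \\ escapes of ESIS data lines.
    return re.sub(r"\\(n|\\)", lambda m: "\n" if m.group(1) == "n" else "\\", data)

def flatten(esis_lines):
    """Convert ESIS events to flat HTML-like markup (illustrative only)."""
    out, stack = [], []
    for line in esis_lines:
        kind, rest = line[:1], line[1:]
        if kind == "(":
            parent = stack[-1] if stack else None
            stack.append(rest)
            if rest == "HEAD" and parent in DIV_LEVEL:
                out.append(f"<H{DIV_LEVEL[parent]}>")
            elif rest == "P":
                out.append("<p>")
        elif kind == ")":
            stack.pop()
            parent = stack[-1] if stack else None
            if rest == "HEAD" and parent in DIV_LEVEL:
                out.append(f"</H{DIV_LEVEL[parent]}>\n")
            elif rest == "P":
                out.append("</p>\n")
        elif kind == "-":
            out.append(unescape(rest))
        # Attribute (A...), conformance (C) and other lines are ignored here.
    return "".join(out)

events = ["(TEXT", "(DIV1", "(HEAD", "-Preface", ")HEAD",
          "(P", "-Some text.", ")P", ")DIV1", ")TEXT"]
print(flatten(events))
```

The sketch drops the tags of elements it does not map while keeping their text, and it does not escape characters such as "&" or "<" in the output; both points would need attention in production use.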

We plan to add extensions to PhiloLogic, but we are not developing encoding elements for which we do not have an application or planned application. Planned extensions to the tag set and to PhiloLogic functionality include database-internal cross-references, of which notes are an important subclass, and full Unicode support. We are particularly interested in hearing from users and text encoding experts about critical deficiencies in functionality or in the tag set.

We believe that we have built an effective search engine with a general data encoding specification that provides sophisticated capabilities at low cost. PhiloLogic is in full production at ARTFL and the University of Chicago Library. We are currently exploring several avenues for release of PhiloLogic and are beginning to work with some collaborating institutions to build a release version of PhiloLogic.

Notes

[1] Jon Bosak and Tim Bray, "XML and the Second-Generation Web," Scientific American (May 1999). <http://www.sciam.com/1999/0599issue/0599bosak.html>
[2] Lou Burnard, "SGML on the Web: Too Little Too Soon, or Too Much Too Late," Computers & Texts, 15 (August 1997). <http://info.ox.ac.uk/ctitext/publish/comtxt/ct15/burnard.html>
[3] Toby Burrows, "Using DynaWeb to Deliver Large Full-Text Databases in the Humanities," Computers & Texts, 13 (December 1996). <http://info.ox.ac.uk/ctitext/publish/comtxt/ct13/burrows.html>
[4] See Steven J. DeRose, David Durand, Elli Mylonas, and Allen H. Renear, "What is Text, Really?", Journal of Computing in Higher Education, 1:2 (1990), 3-26.