BNC: the World Edition, and the BNC Sampler

Lou Burnard
Humanities Computing Unit
Oxford University
13 Banbury Rd, Oxford
OX2 6NN
lou.burnard@oucs.ox.ac.uk
http://info.ox.ac.uk/bnc

What's the plural of "corpus"? In what social situations is "wicked" a term of approval? Why does it "sound wrong" to say "The good weather set in on Thursday" although "The bad weather set in on Thursday" is perfectly acceptable? If I can say "I live a stone's throw away from here", can I also say "I'm going a stone's throw away from here"?

Large language corpora can help provide answers for these kinds of questions -- if only because they encourage linguists, lexicographers, and all who work with language to ask them. The purpose of a language corpus is to provide language workers with evidence of how language is really used, evidence that can then be used to inform and substantiate individual theories about what words might or should mean. Traditional grammars and dictionaries tell us what a word "ought to mean", but only experience can tell us what a word is used to mean. This is why dictionary publishers, grammar writers, language teachers, and developers of natural language processing software alike have been turning to corpus evidence as a means of extending and organizing that experience.

The British National Corpus (BNC) is a collection of over 4000 different text samples, of all kinds, both written and spoken, containing in all six and a quarter million sentences, and over 100 million words of current British English. Work on building the corpus began in 1991, and was completed in 1994. In 1997, work on a major revision of the corpus was completed, and in 1998 the British Government agreed to allow distribution of this revised version worldwide.

The BNC World Edition is now available for sale, without restriction, as a set of CD-ROMs containing the full SGML text of the corpus, together with the software and indexes needed to search it. It can also be accessed via the BNC Online service provided by the British Library and managed by the OUCS. In addition, and perhaps of most general interest, a special-purpose "sampler" is now available, containing two million words selected from the whole corpus, half from spoken and half from written texts.

In addition to SARA (the SGML Aware Retrieval Application developed at Oxford to work with the BNC), the Sampler includes the following state-of-the-art corpus analysis software tools:

WordSmith, the tool of choice for many corpus linguists: developed at Liverpool University by Mike Scott, and distributed by OUP (Windows 95 only);
XKwic, the tool of choice for corpus linguists in the Unix environment: developed at the University of Stuttgart by Uli Heid;
CUE, a new XML-based corpus utility developed at Birmingham University by Oliver Mason.

The Sampler also comes complete with detailed documentation in HTML format and some additional data files (notably a selection of treebanked data prepared at Lancaster University, and some sample digitized audio files).

This presentation will introduce the Sampler and its contents, demonstrating how the tools provided can be used to cater for a variety of needs, whether of language teachers and learners or of researchers in general. It will focus on uses made of the SARA system by advanced language learners, and the pedagogic implications of the learning styles this system encourages.