The Spoken Dutch Corpus Project

Nelleke Oostdijk
Faculty of Arts
Nijmegen University
NL-6500 HD Nijmegen
The Netherlands

As one of the smaller languages in Europe, Dutch is under serious threat of gradually disappearing as it loses ground to English. In the European setting, English is used ever more widely, not just in international communication but also in business, economics and other areas. The availability of the necessary resources has placed English-based language and speech technology in the leading position it holds today, and has thus further strengthened the position of English in general. For Dutch, by contrast, few relevant language resources are available to date, which seriously hampers the advancement of Dutch language and speech technology. To address this, the Spoken Dutch Corpus Project was started in June 1998.[1]

The project aims to compile and annotate a 10-million-word corpus of contemporary standard Dutch as spoken in the Netherlands and Flanders. It is funded jointly by the Flemish and Dutch governments and the Netherlands Organization for Scientific Research (NWO), with a budget of some 4.5 million euros. The entire corpus will be orthographically transcribed and annotated with part-of-speech information. A selection of one million words will be analysed in depth, including phonological, phonetic and prosodic transcription, and syntactic analysis. The corpus, together with the recordings, will be made available for scientific research. It will be distributed on CD-ROM by the Dutch Language Union.
  1. This publication was supported by the Netherlands Organization for Scientific Research (NWO) under grant number 014-17-510.