Renovating a wordclass tagset: from WOTAN to WOTAN-2

Hans van Halteren
Dept. of Language and Speech
University of Nijmegen
P.O. Box 9103
Nijmegen, NL-6500 HD
The Netherlands
hvh@let.kun.nl

In 1994, a new wordclass tagset for Dutch was designed (WOTAN; Berghmans, 1994) for use in the upgrade of a tagged corpus of more than a million words (including the Eindhoven corpus; uit den Boogaart, 1975) and the subsequent derivation of an automatic tagger. WOTAN was based on the most popular descriptive grammar of Dutch (ANS; Geerts et al., 1984), from which the encoded distinctions were selected using two criteria: a) importance to potential users, as estimated from interviews, and b) feasibility of (semi-)automatic derivation from the existing tagging, given the lack of time for extensive manual changes. WOTAN was judged to be a good compromise and has since been used in several tagging projects and experiments in the Netherlands and Belgium.

Yet WOTAN had its shortcomings, which led to the creation of a successor. WOTAN-2 adds some important distinctions that were originally left out because they required manual intervention, and aims for compatibility with the EAGLES guidelines, the (extensively) revised version of the ANS (Haeseryn et al., 1997), the CELEX database and the AMAZON syntactic parser. Another, more uncertain, influence is the tagset to be used for the Spoken Dutch Corpus, which is presently under construction.

The poster will present:

the differences between WOTAN and WOTAN-2
the influence of the (sometimes contradictory) compatibility issues on the tagset
additions to (or deviations from) the EAGLES proposal necessitated by decisions for WOTAN-2
the upgrade of the WOTAN-tagged Eindhoven corpus to a WOTAN-2 version

References

Berghmans, J. (1994) WOTAN, een automatische grammatikale tagger voor het Nederlands. Dept. of Language and Speech, University of Nijmegen.

Uit den Boogaart (1975) Woordfrequenties in geschreven en gesproken Nederlands. Oosthoek, Scheltema & Holkema, Utrecht.

Geerts, G., Haeseryn, W., de Rooij, J., and van der Toorn, M. (1984) Algemene Nederlandse Spraakkunst (ANS). Wolters-Noordhoff, Groningen and Wolters, Leuven.

Haeseryn, W., Romijn, K., Geerts, G., de Rooij, J., and van der Toorn, M. (1997) Algemene Nederlandse Spraakkunst (ANS). Martinus Nijhoff, Groningen and Wolters Plantyn, Deurne.