Constructing a Parsed Corpus of Historical Portuguese[1]

Helena Britto
Marcelo Finger
Inst. Estudos da Linguagem (IEL)
Univ. Estadual Campinas (UNICAMP)
Caixa Postal 6153 - Cid. Univ.Zeffer
Campinas, São Paulo CEP 13082-970
Brasil
molina@server.nib.unicamp.br

The Tycho Brahe Parsed Corpus of Historical Portuguese <http://www.ime.usp.br/~tycho> is an electronically annotated corpus of Portuguese texts whose authors were native speakers of European Portuguese born between 1550 and 1850. Its construction follows the model of the Penn-Helsinki Parsed Corpus of Middle English <http://www.ling.upenn.edu/mideng>. Only texts from editions revised by the authors themselves or from autograph manuscripts are included in the corpus. Each text contains at least fifty thousand (50,000) words and is presented electronically in three forms: orthographically transcribed, morphologically tagged, and syntactically annotated.

The Tycho Brahe annotation system is split into three levels: coding of extra-linguistic material, morphological tagging, and syntactic annotation. The extra-linguistic coding system records information such as the text edition, the editor's or researcher's comments, and the original page numbers of the texts.

The tag set that makes up the morphological annotation system is the result of detailed research on the morphosyntactic properties of Portuguese (Britto et al. 1999). In this system, tags have internal structure and are formed from the following components: a part-of-speech component, inflectional components, and diacritics. Proposed by Finger (1998), this structuring of tags into a part-of-speech base plus inflectional components captures the morphological richness of Portuguese without increasing the number of tags involved.
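The idea of a structured tag can be illustrated with a minimal Python sketch. The tag names (N, ADJ, F, P), the "-" separator, and the class StructuredTag are hypothetical choices made for illustration only; they are not taken from the Tycho Brahe tag set, and diacritics are left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StructuredTag:
    """A tag split into a part-of-speech base and inflectional components.

    Hypothetical representation: names and separator are assumptions.
    """
    base: str
    inflections: Tuple[str, ...] = ()

    @classmethod
    def parse(cls, text: str, sep: str = "-") -> "StructuredTag":
        # The first field is the part-of-speech base; the rest are inflections.
        parts = text.split(sep)
        return cls(base=parts[0], inflections=tuple(parts[1:]))

    def __str__(self) -> str:
        return "-".join((self.base, *self.inflections))


# Five full tags (hypothetical), but only two distinct part-of-speech bases.
full_tags = ["N", "N-P", "ADJ-F", "ADJ-F-P", "ADJ-P"]
parsed = [StructuredTag.parse(t) for t in full_tags]
base_tags = sorted({t.base for t in parsed})
```

In this sketch the inflectional detail is carried by the extra components, so the set of bases stays small even as the number of full tag strings grows.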

Keeping the number of basic POS tags low has proven crucial for reducing the computational complexity of training the automated morphological tagger for Portuguese, which was developed along the lines of Brill's (1995) tagging method. A tagging editor, TAT (Tagging Aid Tool), has also been implemented to support the manual tagging of a set of Portuguese texts (Augusto et al. 1998), which is needed to train the tagger. Both the tagger and TAT run under Windows (95/98/NT) with 16MB RAM; the tagger also runs under Unix.
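A toy Python sketch of transformation-based learning in the spirit of Brill (1995) may help show why the size of the tag set matters for training. It is not the project's tagger: it assumes a single rule template ("change tag A to B when the previous tag is C"), a hypothetical four-tag set, and a hypothetical lexicon. With this template the learner scores on the order of |tagset|^3 candidate rules per round, so a smaller tag set means fewer candidates.

```python
from itertools import product


def initial_tags(words, lexicon, default="N"):
    """Initial state: give each word its lexicon tag (assumed most frequent)."""
    return [lexicon.get(w, default) for w in words]


def apply_rule(tags, rule):
    """Change from_tag to to_tag wherever the previous tag is prev_tag.

    Contexts are read from the input list, so changes made by this rule
    do not trigger further changes within the same pass.
    """
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def errors(tags, gold):
    """Number of positions where the current tagging disagrees with gold."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(current, gold, tagset, max_rules=10):
    """Greedily pick the rule that most reduces errors, until none helps."""
    current = list(current)
    rules = []
    for _ in range(max_rules):
        best_rule, best_err = None, errors(current, gold)
        # Candidate space: every (from_tag, to_tag, prev_tag) triple.
        for rule in product(tagset, repeat=3):
            if rule[0] == rule[1]:
                continue
            err = errors(apply_rule(current, rule), gold)
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None:
            break
        current = apply_rule(current, best_rule)
        rules.append(best_rule)
    return rules, current


# Hypothetical data: "canto" is a noun after "o" and a verb after "eu".
tagset = ["D", "N", "PRO", "VB"]
lexicon = {"o": "D", "eu": "PRO", "canto": "N"}
words = ["o", "canto", "eu", "canto"]
gold = ["D", "N", "PRO", "VB"]

start = initial_tags(words, lexicon)
rules, final = learn_rules(start, gold, tagset)
```

Here the manually tagged sequence plays the role of the training material that a tool such as TAT is meant to help produce; the learner compares its current tagging against it and keeps the rules that reduce the disagreement.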

Note

  1. This research was developed with the support of FAPESP (grants #98/3382-1 and #98/12075-3).

References

AUGUSTO, M. et al. (1998). "Morphological tagging for different periods of Portuguese prose". ms. Unicamp, Campinas, Brasil.

BRILL, E. (1995). "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging". Computational Linguistics, 21(4):543-565.

BRITTO, H. et al. (1999). "Morphological Annotation System for Automatic Tagging of Electronic Textual Corpora: from English to Romance Languages". Proceedings of the 6th International Symposium of Social Communication. Santiago, Cuba, pp. 582-589.

FINGER, M. (1998). "Tagging a Morphologically Rich Language". In Proceedings of the First Workshop on Text, Speech and Dialogue (TSD'98). Brno, Czech Republic, pp. 39-44.