When two distributions are
better than one: Mixture models and word frequency distributions.
Fiona J. Tweedie
Department of Statistics
Mathematics Building
University Gardens
GLASGOW, Scotland G12 8QW
fiona@stats.gla.ac.uk
Harald Baayen
University of Nijmegan
Max Planck Institute for Psycholinguistics
Wundtlaan 1
PB 310, 6500 AH NIJMEGEN
The Netherlands
baayen@mpi.nl
Summary
Models for word frequency distributions are relevant for a wide range of
domains of inquiry, including authorship studies, statistical language
engineering, theoretical linguistics, and linguistic synergetics. For
inferences based on such models to be useful, they should provide accurate
descriptions of the data to which they are fitted. This paper shows that
improved fits may sometimes be obtained by analysing word frequency
distributions as mixtures of two or more distinct component distributions, with
the gain in accuracy outweighing the increased number of model parameters.
Introduction
Currently, there are three models for word frequency distributions
available that take the dynamics of the development of spectral
characteristics as a function of sample size into account: the
lognormal model, the extended generalized Zipf's model, and the
generalized inverse Gauss-Poisson model (GIGP), see Chitashvili and
Baayen (1993), for a review of these LNRE models. Although many
empirical word frequency distributions are well-described by one or
more of these models, there are also word frequency distributions for
which no adequate fit is available. Baayen and Tweedie (1998) discuss
informally a data set concerning the frequencies of use of Dutch words
with the suffix -heid (cf. -ness in English) which
illustrates this point.
The word frequency distribution of -heid is problematic because
the medium frequency ranges of the spectrum are more densely populated
than expected by the LNRE models. This suggests that we might be
dealing with a mixture of two, or more, distributions, rather than
with a single homogeneous distribution. The question we have set
ourselves is: Is it possible to find two component LNRE models that
jointly provide an improved fit to the observed frequency spectrum of
-HEID?
Mixture Models
Mixture models describe distributions where the data can be drawn from
one or more sources. Our starting point is a word frequency
distribution spectrum without any indication of how it is to be
decomposed into its two components. In general, when we model a word
frequency spectrum we are interested in finding expected values of the
elements V(m,N), the number of words occurring m times in a text of
length N. The parameters of LNRE models are then chosen to make the
expected value of the spectrum elements, E[V(m,N)] as close to the
observed V(m,N) as possible. When a single distribution is not enough
to deal with the observed data, we can consider the use of a mixture
distribution, where the expected values are made up as follows:
- E[V(m,N)] = E_1[V(m,pN)] + E_2[V(m,(1-p)N)],
where p is the proportion of the data coming from the first
distribution, usually called the mixing parameter, and (1-p) the
proportion which comes from a second distribution. E_1 and E_2
indicate the expected values under the different distributions.
It can be shown for each of the LNRE models that
- E[V(m,pN)|Z,...] = p E[V(m,N)|Z/p,...]
with Z the LNRE parameter of the distribution. This general relation,
which expresses a form of self-similarity in word frequency
distributions, allows us to show that limiting properties of the
mixture, such as its estimated population number of types, is the sum
of its mixture components. Similarly, expressions of variances and
covariances of the spectrum elements can be derived, so that the
mixture model itself is again a complete LNRE model.
-HEID as a Mixture Distribution
Figure 1 plots the number of types V(m,N) with frequency m in a sample
of size N as a function of m, for m = 2, ..., 15 in the left panel,
and for m=15, ..., 100 in the right panel, using dots (N=167353). The
dashed line represents the GIGP fit to the data (\hat{Z} = 41.5554,
\hat{b} = 0.00765648, \hat{\gamma} = -0.446889), which overestimates
for low m and underestimates for larger m. Other LNRE models provide
even worse fits to the data. The solid line represents the mixture
model for a Lognormal component (\hat{Z} = 200, \hat{\sigma} = 2.05)
and a GIGP component (\hat{b} = 0.000000002093, \hat{Z} = 82.9848,
\hat{\gamma} = -0.565). The mixing parameter p equals 0.96. The MSE
(mean squared error) for the GIGP fit is 3390.6, and X^{2}(13) =
1734.7, p < .1*10^-18. For the mixture model, the MSE is reduced to
97.1, and with X^{2}(10) = 19.58, p=0.0334 we have no reason to reject
the model. We have obtained similar improvements in goodness-of-fit
for other word frequency distributions that thusfar resisted adequate
modeling. At the conference, we will present further examples of the
advantages of using mixture models where `pure' models fail, and we
will demonstrate the software that we have been developing to fit
mixture LNRE models to empirical data.
References
- Baayen, R. H. and Tweedie, F.J.: 1998, "Mixture models and word frequency
distributions." Abstracts of the ALLC/ACH Conference, Debrecen, July 1998.
- Chitashvili, R. J. and Baayen, R. H.: 1993, "Word Frequency Distributions,"
In: G. Altmann and L. Hrebicek, L. (Eds.), Quantitative Text Analysis,
Wissenschaftlicher Verlag Trier, Trier, pp. 54-135.