Home Basic Theory Applications Research directions Bibliography Software
| |
Current research directions
There are a number of areas in this problem that need
further investigation. We are currently working on some of these, but
others are virtually untouched.
- Statistical inference.
The "conditional MLE" (conditional on the number of observed
species), which is our principal point estimator, has been examined from a
number of angles and is known to perform well (if the model is a reasonable
approximation to reality). It is strongly consistent; it is an
empirical modification of an unbiased estimator; it is the Horvitz-Thompson
estimator in this case, etc. Hence we believe its performance to be
well-understood and reliable (up to, as mentioned above, the suitability of
the parametric model). However, variance estimation is less firmly
established. We
currently have three formulas for standard errors for the parametric maximum
likelihood-based estimate of the number of classes: a lower bound for the
standard error, and two slightly different formulas based on different
approaches to asymptotic (large-sample) theory. The lower bound is
straightforward to compute, but given its value the user does not know how
much greater the true standard error may be. The other formulas
involve significant computation.
There are two main strategies for variance estimation: direct estimates
based on asymptotic formulae; and resampling, specifically the bootstrap.
(This holds for both the parametric and the coverage-based nonparametric
methods; the nonparametric maximum likelihood methods appear to use only
resampling. No exact small-sample variance results are known for any version
of this problem.) Sanathanan (1977) provided an asymptotic variance
formula for the conditional and unconditional MLE's. This is the basis
of our standard errors. However, the precision of this formula is unknown
for small samples, and remains to be investigated, whether analytically or
by simulation. In some cases one obtains a standard error such that
the normal-theory 95% confidence interval (point estimate) +/- 1.96*SE has a
lower endpoint below the necessary lower bound, which is the number of
actually observed species or classes. This may indicate imprecision in the
variance estimate; it certainly indicates inappropriateness of the standard
normal-theory confidence interval (see below).
As an alternative, variance estimation by the bootstrap has been carried out
by a number of authors in this problem (Bohning and Schon, 2005).
Certain modifications to the bootstrap are required by the structure of this
problem, and these have been applied with increasing sophistication in
recent literature. However, the bootstrap may run afoul of the same
problem noted above: the apparent range of variation of the estimate may
actually hit the necessary lower bound (the observed number of classes). It
is not presently clear whether this fact means that (a) bootstrap variance
estimation is incorrect in this problem; (b) it is correct but only after
careful modifications of the procedure; or (c) it is correct as is but
requires a different interpretation. This is an open problem in mathematical
statistics; we have not yet undertaken this study but our preliminary
reading of the theoretical literature indicates that the problem is subtle
and that the answer is not clear a priori.
We know that our estimator has an asymptotic normal distribution, so that
for large samples the normal-theory confidence intervals are appropriate.
However, the situation for small samples is less clear (even less than for
variance estimation). At present mathematical statistics does not have
an answer for this problem. Some solutions have been proposed. For example,
the coverage-based nonparametric methods suffer from this same problem, so
Chao (1987) proposed a transformed interval that avoids the lower bound of
the observed number of species and could be adapted for use here; however,
the properties of this interval have not been assessed theoretically (and
again it relies on asymptotics). One could also compute direct resampling
(bootstrap) -based intervals, but here again we are not certain of the
application of the bootstrap in this problem. Another possibility is to use
profile confidence intervals; but again the requisite theoretical
development has not yet been carried out, and such intervals may also be
affected by the lower bound problem out (although recent work of such as
that of Lang may be of value here (Lang, J. (2005), Profile confidence
intervals for contingency table parameters, Univ. of Iowa Tech. Report
#351)). Thus we do not presently have a procedure that can guarantee the
nominal coverage (say, 95%) in small samples.
The underlying issue is that in this problem, the operative parameter space
(the range of possible values of the estimand, i.e., the number of species)
is bounded by the random observables (in particular, the observed number of
species). This fact violates a regularity condition that is required for a
substantial proportion of classical statistical results, and consequently
analysis of this problem lies outside of the scope of classical theory. It
may be that only slight modifications to that theory are needed, or it may
be that significantly deep analysis will be necessary.
-
Models. We seek to expand the scope of parametric models, i.e.,
to find new classes of mixing distributions. "New" here means
new to this problem: in the past only a few mixing distributions have been
considered in this context (exponential, gamma, inverse Gaussian,
lognormal). The main goal is to find a class of mixing distributions that would
(ideally) make it unnecessary to truncate the data at some upper frequency
in order to obtain a good fit. Some possibilities include (but are not
limited to):
The generalized inverse Gaussian family, first
proposed in this application in by H.S. Sichel (see the Bibliography).
This is a three-parameter family (the inverse Gaussian is a two-parameter
special case). While promising, the family is mathematically
complicated and has proven computationally intractable to date.
The general class of mixed-exponential
distributions. We can currently fit low-dimensional finite mixtures of
exponentials, but the general family has not been explored in this
problem. There is some recent research on the family, especially on
computing, that may be helpful.
The "generalized gamma convolutions" set
forth by Bondesson in his book of that title. There is some overlap
with other families (and this group includes the gamma and exponential), but
the group is relatively unexplored.
Heavy-tailed distributions such as the log-stable
or the log-Cauchy. (The lognormal is a special case.) We are
experimenting with a location/scale version of the log-Cauchy but we do not
yet have a usable model-fitting algorithm for it.
Extensions of the Pareto, which seems to work well
for high-diversity data. We are not sure at this point how to embed
the Pareto in a larger family suitable for this application.
In fact, in a recent paper Karlis and Xekalaki (2005) gave an extended list
of potential Poisson mixture models (along with considerable analysis and
references to the literature), including the following possible mixing
distributions in addition to those mentioned above; as far as we know, none of these has been applied in the
species problem.
|
linear exponential family
|
|
Lindley
|
|
confluent hypergeometric series
|
|
inverse gamma
|
|
truncated normal
|
|
gamma product ratio
|
|
exponential ^ beta
|
|
generalized Pareto
|
|
beta types I & II
|
|
uniform
|
|
truncated gamma
|
|
generalized gamma
|
|
modified Bessel of the 3rd kind
|
|
Pearson's family
|
|
lomax
|
|
power variance family
|
|
other discrete distributions
|
-
Model selection. Currently we analyze the
performance of all parametric models on all possible right-truncated subsets
of the data, and choose the best performer on the basis of (i) two
goodness-of-fit tests, (ii) ability to produce a reasonably small standard
error, (iii) use of the maximum amount of the empirical frequency data
(largest right truncation point), and (iv) various heuristic
considerations. The concepts of model selection set forth in (for
example) Burnham and Anderson (1998) (Model selection and inference: A
practical information-theoretic approach, New York: Springer-Verlag)
provide a different perspective on the competing models. Indeed our original
idea was to rank the models according to one or more information-theoretic
statistics such as the AIC (Akaike information criterion), and to use this
to select our final analysis. We did attempt this, but we decided that, at
this stage of our knowledge, the problem of model selection in this problem
is still too multidimensional to permit such a reduction - in short, expert
judgment still plays a role.
The information-theoretic criteria do take the number of model parameters
into account. However, the numbers of parameters in our current models are:
1, for the (unmixed) Poisson, which assumes equal species abundances and
hence never fits real data; 2, for the negative binomial, inverse Gaussian,
lognormal and Pareto; and 3 for the 2-mixed exponential. Thus for realistic
models we have 2 parameters in four cases and 3 in one case, so the question
of different numbers of parameters does not play as large a role as it does
in, say, a high-dimensional regression problem. (Once we implement more
complicated finite mixtures, such as a 3-mixed exponential or a mixture of 2
lognormals, the number of parameters will rise significantly, and this issue
will become important.)
Because the different models permit different right truncation points, the
effective datasets (in particular the sample sizes) are not the same for all
models. Such differences can be accommodated to some extent by the
information-theoretic criteria, but we do not have enough empirical
experience with this issue in this problem to apply these criteria with
confidence under these circumstances. (The coverage-based nonparametric
methods get around this by simply taking the right truncation point to be 10
by default.) Our ultimate goal, of course, is to find a family of parametric
models (which will probably turn out to be finite mixtures of known
distributions) that will fit complete empirical frequency datasets, in which
case the problem of the right truncation point will disappear.
We are not fully satisfied with likelihood, the Pearson chi-square, or other
goodness-of-fit statistics at this point, partly because in this problem we
may wish to differentially weight the frequency curve, seeking better fit
toward the left (rare species) and less stringent fit toward the right. At
present we do not have a method for this. In short, at present we
prefer to compare simple and well-understood statistics, and to take
data-analytic and heuristic considerations into account, in selecting a
final analysis. However, we envision incorporating the
information-theoretic criteria into a more refined model-choice procedure,
once the issues above (and others) have been resolved.
-
Computing. We have implemented a new numeric
algorithm to compute the maximum likelihood estimates (and associated
statistics) for the parametric models. This is an accelerated version of the
EM (expectation-maximization) algorithm. In fact the EM algorithm for this
problem was applied more or less simultaneously by us, in the parametric
version, and by Mao, Bohning and others in the nonparametric maximum
likelihood version. The Bohning/Schon (2005) paper has an analysis of the
convergence of this algorithm. Our acceleration appears to be a version of
an acceleration due to Aitkin which has never been applied in this problem (Bohning,
personal communication). This algorithm allows us to obtain precise
estimates within a given model, and estimates that are comparable across
models from a numerical point of view. However, our
current code is not fast, or not as fast as it could be, and would admit a
number of incremental improvements. Larger-scale improvements would
entail moving to a different computing platform (software and/or hardware);
incorporating recursions for complicated formulas where possible, improved
convergence-checking, etc. We also note that some
of the parametric models present specific computational difficulties beyond
the general accelerated EM algorithm mentioned above.
In particular the lognormal and the Pareto require locally adaptive
approximations to the probability mass function (the former by Riemann sums,
the latter by another finite series representation). Such
approximations can always be improved and made more efficient.
-
Nonparametrics. One natural extension of the parametric model-fitting
approach is to allow a nonparametric mixing distribution, i.e., one that is
not a member of a family specified by a finite number of parameters.
There is some current and promising work in this regard (due to Mao, Bohning,
and others), but the computation
involved is still difficult, and the interpretation of the results will need
refinement. The general mixed-exponential approach mentioned above
constitutes one possible "semi-parametric" compromise.
Following up on this point, there is a need for reconciliation of currently
known coverage-based nonparametric methods with
parametric (and eventually nonparametric) maximum likelihood model-based
methods. This is a potentially fruitful area of research, because it
would tell us the conditions under which the various estimators are
likely to be effective, and how they relate to the various parametric
models. Apart from some simulation comparisons, however, the only work
in this direction (as far as we know) is the 2002 paper of Chao & Bunge.
-
Fully Bayesian methods. In a fully Bayesian approach
one assigns a prior distribution (informative or noninformative) to the
number of classes, and possibly "hyper-priors" to the parameters
of the mixing distribution (or to a family of mixing distributions) as
well. There is by now a considerable literature on these
approaches, but they have not been unified or (generally) compared to the
frequentist methods (parametric or nonparametric) discussed above. This is an excellent area for
research but is beyond the scope of our current projects.
-
Several samples. In some situations we may have a
number of distinct samples from the same population, and it may not be
scientifically logical to simply pool them and regard the result as a single
sample. Is there some way to exploit the division between the
samples? While some of the extensive work in multiple-recapture theory
may bear upon this problem, it is not clear at present how to use a mixture
model (that is, a probability distribution for the sampling intensities of
the various species) in this context. This is an open, and pressing,
topic for research.
|