Software
Updated 01/29/2007
Please note that these programs are continually being revised. The latest versions of all programs are available below; please
check to see that you are using the latest version.
Downloadable code
Click on a link below to download the corresponding program code. The metadata program contains global (user-changeable) defaults and file
specifications; the other programs compute specific parametric models.
| Program | Description | Version |
|---|---|---|
| species.bat | Batch file that runs metadata program | 2.1 |
| metadata.txt | Maple program that sets input/output file definitions and global program options, and runs desired model-fitting program | 2.1 |
| poisson.txt | Fits Poisson/equal class sizes model | 2.1 |
| negbin.txt | Fits gamma mixed-Poisson/negative binomial model | 2.1 |
| invgauss.txt | Fits inverse Gaussian-mixed Poisson model | 2.1 |
| lognormal.txt | Fits lognormal mixed-Poisson model | 2.1 |
| pareto.txt | Fits Pareto mixed-Poisson model | 2.1 |
| mixed_expl.txt | Fits mixture of 2 exponentials mixed-Poisson model | 2.1 |
Operational overview
Our computer programs are written in Maple.
The basic structure of our algorithm is the same across
different parametric models (although mathematical differences between the
models require somewhat different specific computational strategies in each
case). This allows precise comparison of different models fitted to the
same dataset.
We have found that, while the graphical user interface (GUI) of
Maple is convenient, the numerically-intensive nature of these computations
requires maximum speed. We have therefore elected to bypass the GUI in
favor of a batch-processing system. Under this system, to run an analysis
you simply submit the batch (.bat) file to a "command prompt window"
or to the "run" window under Windows. Upon completion the output
is written to the location specified in the "metadata" file (see below
for details).
Getting started
The following resources are
required.
- A reasonably fast computer with a substantial amount of RAM. These programs perform a
sequence of numerical
searches, which can be time-consuming (high number of iterations), and they loop
through (what can be) a large number of subsets of the data. We recommend
at least a 1GHz processor with 512MB RAM (we are currently using a 3.4GHz
processor with 4GB RAM).
- The current version of Maple.
Many universities use the program for a variety of purposes, especially for
teaching calculus in
the mathematics department, so check to see if your institution has a site
license.
- All of the programs in the table above (under "Downloadable
code"). For convenience, copy all of these files to a single
directory (folder).
- Your dataset in text (.txt) file format, structured exactly as described
below.
To run an analysis, simply run the species.bat file (after specifying input
and output files, and other program options, as described below). You can
do this either in a "run" window, or at a command prompt in a DOS
window.
The main program files
- The species.bat file. This is a batch program that runs under
DOS. It is very simple; in fact here it is in its entirety:
REM batch program for local machine with one processor
REM set echo on
echo on
REM set path for command-line Maple (cmaple) home directory
path C:\Program Files\Maple 8\bin.win
REM run metadata program
cmaple8 "C:\Documents and Settings\John Smith\My Documents\species\metadata.txt"
In this version of species.bat it is assumed that the command-line Maple
program, cmaple, is in the directory C:\Program Files\Maple 8\bin.win.
To find this directory on your computer, search for the filename "cmaple,"
and edit the pathname in the species.bat file accordingly if
necessary, using a text editor such as Notepad. It is also assumed that the downloaded program files (from this website, above)
are in a folder named "species" under the "my documents"
folder of user "John Smith." Again, you must edit this line
in species.bat to reflect your installation.
- The metadata file. This
is a Maple program that sets the filenames for the input data file (the
dataset to be analyzed), the desired model-fitting program, and the two
output files. The structure of this file must be maintained exactly as it is given here,
since it is a Maple program. It is in large part self-explanatory.
You must edit the following options in metadata.txt, using a text editor
such as Notepad:
- Model-fitting programs. Each program fits a specific parametric model to the observed
frequency data, via the method of maximum likelihood. The basic goals of
the program are to
- Compute maximum likelihood estimates (MLE's) of the
distribution parameters;
- Compute the "conditional maximum likelihood
estimate" of the number of unobserved classes and of the total number of classes;
- Compute the standard error of these estimates;
- Compute the p-value of the classical chi-squared
goodness-of-fit (GOF) statistic; and
- Output text files with (i) fitted values, and (ii) all relevant statistics and program
error diagnostics.
These computations are done to a level of precision specified by the user (16 significant digits by
default). The complete analysis is run on each of a sequence of subsets of the
data: each subset consists of the frequency data from 1 up to a given right
truncation point t, where t ranges from a minimum frequency specified
by the user (5 by default) up to a maximum set by the user (by default, the maximum
frequency encountered in the data). Each row or line in the output file
contains the complete analysis at a given right truncation point t.
Thus the user can compare
analyses at different right truncation points; typically the fit will vary with t.
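The way the data are subset by right truncation point can be sketched in a few lines of Python. This is only an illustration and not part of the Maple programs: the function name, the dictionary representation, and the example counts are assumptions made here, and the text does not say whether the programs visit every integer t or only frequencies that actually occur in the data (this sketch visits every integer).

```python
def truncated_subsets(freq_counts, t_min=5, t_max=None):
    """Yield (t, subset) for each right truncation point t.

    freq_counts maps a frequency j to the number of classes observed
    exactly j times (the "frequency of frequencies").
    """
    if t_max is None:
        t_max = max(freq_counts)  # default: largest frequency in the data
    for t in range(t_min, t_max + 1):
        # Keep only the frequency data from 1 up to t.
        subset = {j: c for j, c in freq_counts.items() if 1 <= j <= t}
        yield t, subset


# Made-up illustrative counts, not taken from any real dataset.
example = {1: 10, 2: 6, 3: 3, 4: 2, 6: 1, 9: 1}
for t, subset in truncated_subsets(example):
    print(t, sum(subset.values()))  # t and the number of classes retained
```

Each (t, subset) pair corresponds to one row of the analysis output file.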
The general architecture is as follows:
- First, given a fixed set of starting
values, the program attempts to find method-of-moments estimates of the
unknown parameters. If this fails (unusual), the program abandons the current
right truncation point and continues to the next t.
- The GOF is computed based on the moment-method
estimates. If the GOF p-value falls below a user-specified threshold (default:
p < 10^(-6)), the program abandons the current right truncation point and
continues to the next t.
- Using the moment-method estimates as starting values, the
program searches for the MLE's. This process continues through a
number of steps, and yields values for the MLE's that are as precise as the
program is able to compute (ideally exactly correct).
- The GOF is then computed based on the MLE's. If the
GOF p-value falls below a user-specified threshold (default: p < 10^(-3)),
the program abandons the current right truncation point and continues to the next t.
- The standard error is computed using the MLE's.
Once all computations are complete at all right truncation
points t, the
output is formatted and written to a user-specified text file, which can then be
read into Excel or any other package for editing and display.
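For the simplest model, the unmixed Poisson, the estimation steps above can be illustrated with a short Python sketch. It uses standard textbook formulas for the zero-truncated Poisson and is not a transcription of poisson.txt. Several simplifications are assumptions made here: the supplied frequencies are treated as the complete data (no right truncation in the likelihood), the goodness-of-fit checks are omitted, and a plain bisection replaces whatever search the Maple code uses. For this model the moment equation and the likelihood equation coincide, so the two-stage search collapses to one.

```python
import math


def fit_poisson(freq_counts, iterations=200):
    """Fit the unmixed Poisson model to frequency-of-frequency counts.

    Returns (lam, p0, s0, total, se). Requires a mean observed frequency
    above 1; otherwise the likelihood equation has no positive solution.
    """
    n = sum(freq_counts.values())                       # observed classes
    mean = sum(j * c for j, c in freq_counts.items()) / n
    if mean <= 1.0:
        raise ValueError("mean observed frequency must exceed 1")

    # The MLE solves lam / (1 - exp(-lam)) = mean. The left side increases
    # in lam and exceeds mean at lam = mean, so bisect on (0, mean).
    lo, hi = 0.0, mean
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid / (-math.expm1(-mid)) < mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2

    p0 = math.exp(-lam)        # non-coverage: chance a class goes unobserved
    total = n / (1 - p0)       # estimated total number of classes
    s0 = total - n             # estimated number of unobserved classes
    # Asymptotic standard error of the total under the Poisson model.
    se = math.sqrt(total / (math.exp(lam) - 1 - lam))
    return lam, p0, s0, total, se


# Made-up illustrative counts, not taken from any real dataset.
lam, p0, s0, total, se = fit_poisson({1: 10, 2: 6, 3: 3, 4: 2})
print(f"lambda={lam:.4f} p0={p0:.4f} s0={s0:.2f} total={total:.2f} se={se:.2f}")
```

The mixed-Poisson models follow the same pattern but need a multi-parameter numerical search, which is where most of the computing time goes.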
Currently we have six parametric models. The output from
each program is structured the same way. They are all mixed-Poisson models
(see Basic Theory), with different mixing
distributions:
- Poisson, with a point-mass mixing distribution, that is, the
ordinary unmixed Poisson. Under this model the sampling intensity is
constant or identical for all classes in the population.
- Negative binomial, or gamma-mixed Poisson. The mixing
distribution, or the distribution of the sampling intensities, or the
stochastic abundance distribution, is the gamma.
- Inverse Gaussian-mixed Poisson. The mixing
distribution is the inverse Gaussian.
- Lognormal-mixed Poisson. The mixing distribution is
the lognormal.
- Pareto-mixed Poisson. The mixing distribution is the
Pareto.
- 2-mixed-exponential-mixed Poisson. The mixing
distribution is a mixture of 2 exponentials.
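To make the idea of a mixing distribution concrete, the following Python sketch checks numerically that a Poisson whose intensity follows a gamma distribution gives the negative binomial. The shape/scale parametrization, the function names, and the integration settings are assumptions chosen for this illustration; negbin.txt may parametrize the model differently. The crude midpoint rule used here is only suitable for shape values of at least 1.

```python
import math


def negbin_pmf(x, shape, scale):
    """Closed form of P(X = x) for X | lam ~ Poisson(lam), lam ~ Gamma(shape, scale)."""
    log_p = (math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
             + x * math.log(scale / (1 + scale))
             - shape * math.log(1 + scale))
    return math.exp(log_p)


def mixed_poisson_pmf(x, shape, scale, steps=20_000, upper=100.0):
    """Same probability, by midpoint-rule integration over the gamma density."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        log_gamma_pdf = ((shape - 1) * math.log(lam) - lam / scale
                         - math.lgamma(shape) - shape * math.log(scale))
        log_poisson = x * math.log(lam) - lam - math.lgamma(x + 1)
        total += math.exp(log_gamma_pdf + log_poisson) * h
    return total


# Illustrative parameter values only.
for x in range(4):
    print(x, negbin_pmf(x, 2.0, 1.5), mixed_poisson_pmf(x, 2.0, 1.5))
```

In this parametrization the non-coverage is negbin_pmf(0, shape, scale), which equals (1 + scale)^(-shape).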
We are continually searching for more families of mixing
distributions that (i) have the potential to fit a wide variety of count data,
particularly with high diversity and (some) large abundances; (ii) can be shown
to satisfy the technical conditions required for the general theory (in
particular asymptotic variances, i.e., standard errors) to be valid; and (iii)
are feasibly computable. We will add programs for these as they become
available.
The input and output files
- Your dataset file. This must be a text (ASCII) file, with two columns, tab delimited, with a
carriage return/new paragraph mark at the end of each line (note that there must
not be an extra return after the last line). The first column contains the
frequencies, the second, the frequencies of frequencies. Here is a sample
dataset, the same one discussed under Basic Theory.
A file with this structure can be readily created using, e.g., Microsoft
Excel.
- The fitted values file. The structure of this file is as follows. The first,
left-most column contains the integers from 1 up to the maximum frequency in the
data, i.e., all (potentially) observed frequencies. The second column
contains the actual observed frequency-of-frequency counts for each integer
(some of these may be zero). Subsequent columns contain the values fitted
by the model to the given frequency, from 1 to t; each column contains
the fitted values for a given right truncation point t.
- The analysis output file. Each row or line in the analysis output file contains the results of a
complete analysis at a given right truncation point t. For a description of the
analysis results see Basic Theory. From left to right, the
statistics are:
- the right truncation point t;
- the MLE's of the parameters of the distribution;
- the MLE of the "non-coverage," i.e., p0;
- the estimated number of unobserved species, i.e., s0;
- the estimated number of species based only on the
data up to the right truncation point, that is, excluding the species with observed
frequencies greater than the right truncation point;
- the estimated total number of species, that is, including
the species with observed frequencies greater than the right truncation
point;
- the standard error of the estimate of the number of species
(the standard error for the estimate based on the subset and for the
estimated total is the same);
- a lower bound for the standard error (an empirical version
of the simple binomial SE; see Chao and Bunge (2002));
- the "naïve" p-value of the chi-squared goodness-of-fit test
for the model, using all cells;
- the p-value for an asymptotically correct chi-squared
goodness-of-fit test based on concatenating adjacent cells so that all
expected cell counts are at least 5, to conform with asymptotic theory;
- the "program error report," which is actually a
numeric code indicating the state of the program when it terminated (not
necessarily an error).
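The cell concatenation behind the second goodness-of-fit p-value can be pictured with the Python sketch below. The left-to-right pooling order and the handling of a small leftover tail are assumptions made for illustration; the text does not specify how the Maple programs group cells. Only the Pearson statistic is computed here, since turning it into a p-value needs a chi-squared distribution function and a degrees-of-freedom count that depends on the model.

```python
def pool_cells(observed, expected, minimum=5.0):
    """Merge adjacent cells left to right until each expected count >= minimum."""
    pooled_obs, pooled_exp = [], []
    run_obs = run_exp = 0.0
    for obs, exp in zip(observed, expected):
        run_obs += obs
        run_exp += exp
        if run_exp >= minimum:
            pooled_obs.append(run_obs)
            pooled_exp.append(run_exp)
            run_obs = run_exp = 0.0
    if run_exp > 0:
        if pooled_exp:
            # Fold a too-small tail into the last pooled cell.
            pooled_obs[-1] += run_obs
            pooled_exp[-1] += run_exp
        else:
            # Everything together is still below the minimum: one cell.
            pooled_obs.append(run_obs)
            pooled_exp.append(run_exp)
    return pooled_obs, pooled_exp


def chi_squared(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up observed counts and fitted values for frequencies 1..6.
obs, exp = pool_cells([10, 6, 3, 2, 1, 1], [9.5, 6.2, 3.1, 1.4, 0.6, 0.2])
print(obs, exp, chi_squared(obs, exp))
```

The "naïve" version corresponds to calling chi_squared on the unpooled cells.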
Here is a Microsoft Excel template for the output file.
To use it, open the template, and from within Excel, open the analysis output text file,
and paste the results into the template under the header row. (If you have
output files from several models, paste each output into the template in a
vertical array (one below the other), labeling each with its model name in
column A.)
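As an alternative to preparing the dataset file in Excel, the input format described above (two tab-delimited columns, frequencies and frequencies of frequencies) can be produced with a short Python sketch. The function name and arguments are hypothetical. The wording about the final return admits two readings, so the `terminate_last_line` switch is an assumption that covers both: by default every line, including the last, ends with a carriage return/line feed and nothing follows it; set it to False to omit the final one. Frequencies that never occur are left out, which is also an assumption.

```python
from collections import Counter


def write_dataset(abundances, path, terminate_last_line=True):
    """Write frequency / frequency-of-frequency pairs as a tab-delimited file.

    abundances holds one observed count per class; zeros are ignored.
    """
    freq_counts = Counter(a for a in abundances if a > 0)
    lines = [f"{j}\t{freq_counts[j]}" for j in sorted(freq_counts)]
    text = "\r\n".join(lines)
    if terminate_last_line:
        text += "\r\n"
    # newline="" stops Python from translating the explicit CR/LF pairs.
    with open(path, "w", encoding="ascii", newline="") as handle:
        handle.write(text)
    return freq_counts


# Seven classes observed 1, 1, 2, 3, 3, 3 and 1 times (made-up numbers).
write_dataset([1, 1, 2, 3, 3, 3, 1], "example_dataset.txt")
```

The resulting file has one row per observed frequency, sorted in increasing order.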