Sample-based Estimation of Regional Forest Tree Species Richness

Sample-based estimates of species richness for large heterogeneous regions may be affected by interactions between the sampling design and sub-regional heterogeneity in species richness and diversity. To investigate this issue a Monte Carlo simulation study was conducted with actual forest inventory tree species incidence data from two large regions, four estimators of species richness, three sampling designs, two sampling units (fixed area plots and tree nearest to plot center), and five sample sizes. The regions represent forested areas in central and eastern Canada and three states in the USA. Design effects in estimates of regional species richness were generally weak, of little practical import, and limited to settings with the smallest sample sizes. The study shows that accurate regional estimates of richness can be obtained by pooling sub-regional species incidence data. Sampling with fixed area plots was, with few estimator-dependent exceptions, considerably more efficient than sampling with one tree per sample location. A proposed ratio-based procedure for combining sub-regional estimates of richness broadens the options for regional estimation.


Introduction
Existing regional and national forest inventories can provide reasonable estimates of the number (viz.richness) of distinct forest tree species in a region or a nation (Magnussen, 2011).Large-scale forest inventory sampling with fixed area sample units (plots, quadrants, clusters) are typically characterized by sampling only a small proportion of the population of interest (Magnussen et al., 2007;Winter et al., 2008).Under these circumstances, few species richness estimators have performed well in Monte Carlo simulations with actual forest inventory data (Magnussen, 2011;Magnussen et al., 2010).Heretofore popular estimators of richness like the Jackknife, the bootstrap, and Chao's coverage-based estimators generally disappoint because they were not designed for sampling with multi-tree quadrants (Hellmann & Fowler, 1999;Hwang & Shen, 2010;Lam & Kleinn, 2008;Palmer, 1991;Schreuder et al., 2000;Schreuder et al., 1999).
Species richness in a large heterogeneous region can be decomposed into a within-and among-ecosystem diversity viz.alpha-and beta-diversity (Barnett & Stohlgren, 2003;Engen et al., 2008;Jobe, 2008;O'Dea et al., 2006;Rosenzweig & Clark, 1994;Rosenzweig et al., 2003).Alpha diversity is the biodiversity within a particular area, community or ecosystem, and is usually expressed as the species richness of the area.Beta diversity is also a measure of biodiversity which works by comparing the species diversity between ecosystems or along environmental gradients.Non-parametric and model-based estimators of species richness implicitly assume a spatially homogenous structure of species diversity (Chazdon et al., 1998;Colwell et al., 2004).Hence they target estimation of alpha-diversity (Gimaret-Carpentier et al., 1998;Rosenzweig, et al., 2003).For the estimation of both alpha-and beta-diversity new sampling strategies, plot designs, and estimators that explicitly consider species turnover rates and distance decay of compositional similarity have been proposed (Harte et al., 1999;Krishnamani et al., 2004;Palmer et al., 2002;Ricotta et al., 2002;Tackaberry et al., 1997;Ugland et al., 2003).Others argue for sampling stratified by habitat type (Waldhardt et al., 2004).Practical applications of these new designs (Hortal et al., 2006;Jobe, 2008;Nagaike et al., 2010;O'Dea, et al., 2006) have realized that a diverse origin, structure, and history of species mosaics and patterns prevent definitive conclusions (Whittaker & Field, 2000).
Existing regional and national forest inventories are designed for a multitude of purposes (Köhl et al., 2006, p 3;McRoberts et al., 2010) and are therefore unlikely to adopt a new sampling paradigm for the sake of improved species richness estimation.Most regional and national inventories employ a sampling design with sample locations on a regular or semi-regular grid that allows local variation in sampling intensity.Designs of this kind can only map environmental gradients at the scales of inter-plot distances.In principle, only a design suited for variogram estimation (Di Zio et al., 2004;Lark, 2002;Zhao & Wall, 2004) can quantify ecological gradients in species composition (turnover rates).In areas with managed forests and a significant anthropogenic influence on plant communities, these gradients may be discontinuous and irregular, making attempts to account for beta-diversity difficult.
Simulation studies by Brose et al. (2003) and Wagner and Wildi (2003) found no support for a major effect of either ecological gradients or spatial heterogeneity in species habitats and occurrence on the performance of non-parametric species richness estimators.Ecological gradients, habitat patches, and edges did increase Monte-Carlo estimates of sampling variance, but without any obvious effect on bias.Nor did a stratified sampling approach with estimation of richness from samples pooled across strata produce a clear improvement over simple random sampling in a species-rich wet tropical forest in the Western Ghats, India (Gimaret-Carpentier, et al., 1998).Sample size was again the single most influential factor.
Since the spatial distribution of forest tree species within the forested area of a large region is never uniform, the question remains whether stratified sampling (STR) offers an advantage over simple random sampling (SRS), and possibly over sampling with a constant sample size in each stratum (EQ).The answer may depend on the employed estimator of richness and choice of sampling unit (trees inside a fixed area plot, trees intersecting a transect line, a fixed number of trees nearest to a sample location).
To explore this question a Monte Carlo simulation study with i) two regional finite area populations of forest tree species, ii) four model-based estimators of species richness, iii) three sampling designs, iv) two sampling units, and v) five sample sizes was conducted.Estimators of species richness were limited to four with a demonstrated good performance in sampling with fixed-area plots (Magnussen, 2011;Magnussen, et al., 2010).In addition, to broaden options for regional estimation of species richness, a new procedure for combining sub-regional estimates of species richness into a single regional estimate is proposed and evaluated.

Forest species incidence data
Two collections of forest inventory tree species plot incidence data act as two finite area regions (populations) for which an estimate of the number (richness) of distinct forest tree species (S) is desired.Each region was composed of N approximately equal-area primary sampling units (PSU) and stratified to three geographic sub-regions (strata).The raw data for each PSU were the names of all distinct tree species contained in a PSU.
The N lists of names of species in each PSU were then converted into a N S  binary incidence matrix.
The first region is composed of forest inventory species incidence data from five Canadian provinces and is henceforth referred to as PROV.PROV is a spatially balanced collection of N = 21 091 permanent forest inventory plots from eastern Canada (Magnussen & Boudewyn, 2008).The plots represent approximately 108 million ha of forested land in the Canadian provinces of Ontario, Quebec, New Brunswick, Nova Scotia, and Newfoundland and Labrador.All trees with a stem diameter larger than the provincial lower limit (range: 2.5 to 9.0 cm) were identified to species.The median size of a sample plot is 404 m 2 with an average of 139 trees (sdev.53) and 8 species (sdev.5).There were a total of S = 84 forest tree species in the data.Three geographic sub-regions (strata) were formed to represent the northern (NO), the central (CE), and the maritime (MA) parts of PROV.The number of plots by sub-region were 2938 (NO), 11 963 (CE), and 6190 (MA).Species richness in the three sub-regions were 33 (NO), 70 (CE), and 71 (MA).
The second region is composed of forest inventory species incidence data from 13 181 US Forest Service (USFS) national forest inventory sample plots (Bechtold et al., 2005) collected between 2005 and 2008 in three states (sub-regions): Georgia (GA, 3213 plots), Michigan (MI, 5851 plots), and Minnesota (MN, 4117 plots).The sampling frame and design has been described by Bechthold and Patterson (2005).All trees in a plot with a stem diameter of 12.7 cm or larger at a reference level of 1.37 m above ground were counted and identified to species.Unknown species were excluded.They were assumed to be non-forest tree species.A sample plot contained, on average, six species (sdev.3).The region is henceforth called FIA.Species richness in FIA was 187, 143 in GA, 102 in MI, and 66 in MN.
The main ecological differences between PROV and FIA are i) the boreal nature of the forests in NO and ii) the warm temperate forests in GA dominated by deciduous tree species.

Sampling designs
For the purpose of estimating S in PROV and FIA a simulated (Monte Carlo) random (probability) sample of n fixed-area inventory plots (without replacement, wor) of species incidence data was taken from each region.Three designs determined selections of plots: 1) equal sub-regional sample sizes (EQ), 2) sub-regional sample sizes proportional to the number of inventory plots in a sub-region (STR), and 3) simple random selection within entire region (SRS).Five sample sizes, representing sample fractions of approximately 0.5%, 1%, 2%, 3%, and 5%, were used for each design.Data from each sample plot were either i) a list of species names of all trees recorded in the selected plot (CLU), ii) the species name of the tree located closest to the center of a selected plot (T1).Ties were resolved with a toss of a fair coin.All simulations and computations were carried out in the MATHEMATICA programming language (Wolfram, 1999).The advantage of working with actual inventory data is that actual species associations at the plot-level are captured.However, it also limits the scope of possible conclusions that can be drawn.
The observed species richness (SO) and four model-based estimates of S ( ˆest S , see next) were computed from each sample.The effects of design and sample size on these estimates were assessed from 400 Monte Carlo replications of the 15 design settings.With this number of replications, the estimated standard error of the mean number of observed species SO in a given design was always less than 0.3 (Koehler et al., 2009).

Estimators of species richness
Four estimators with a demonstrated good performance (in terms of bias and accuracy) in low-intensity sampling for forest tree species richness with fixed-area plots (CLU) were considered (Magnussen, 2011;Magnussen, et al., 2010).Estimators of species richness may differ in sensitivity to the sampling design due to differential reliance on the number and relative incidence of species observed in a sample (Haas & Stokes, 1998;Mao & Lindsay, 2007;Walther & Morand, 1998); thus the four estimators can only provide an example of how the sample design and choice of sampling unit can affect estimators of richness.On the other hand, their appeal to forest tree species richness estimation will increase if they also demonstrate insensitivity to the design for sample collection (EQ, STR, SRS) and choice of sampling unit (CLU, T1).
The four estimators of species richness used in this study are detailed in the Appendix.A brief outline of each estimator is given next.The first estimator is a Pòlya-urn type estimator   urn S which, through replicated chains of iterative urn-drawings, generates predictions of new species discoveries in the N-n non-sampled PSUs, and from that a posterior distribution of predictions of S. The estimate ˆurn S is the mean of the M posterior predictions and
The second estimator is a Horwitz-Thompson (HT) type (model-based) estimator   ht S that estimates the number of unseen species from sample-based estimates of species inclusion probabilities (Magnussen, 2011).
The HT-type formulation of the estimator means that it adapts to any sample design for which the design-based HT-estimator applies (Thompson, 1992, p 49).
The third estimator   ˆsh S was developed by Shen and He for fixed-area plot wor sampling from a finite area population (Shen & He, 2008).A zero-truncated binomial distribution is assumed for the number of plots containing a given species, and a modified beta-distribution for the distribution of the probability of the plot-level presence/absence of a species.After fitting the modified beta-distribution to the data, an explicit solution to the probability that a species occurred in x sample quadrats was obtained (EQ 4 p 2053) and used in a multinomial likelihood (Sanathanan, 1972) to estimate the number of unseen species.A maximum likelihood variance estimator is provided (Shen & He, 2008).The fourth estimator   ˆhw S is a generalized jackknife estimator of richness (Hwang & Shen, 2010) expressed as the sum of SO and a prediction of the number of unseen species.The latter was obtained from an exponential regression model that predicts a scaled ratio of where i f is the number of species seen in i sample plots.The regression model was then used to predict 0 f .

Estimation of regional forest tree species richness
Estimation of regional (FIA, PROV) forest tree species richness under the three sampling designs (EQ, STR, and SRS), five sample fractions, and two types of sampling unit (CLU, T1), proceeded by i) a direct estimation of regional species richness from the pooled incidence data (i.e.ignoring the sampling design), and ii) combining sub-regional estimates of richness to a regional estimate   comb S  via a novel procedure detailed below.The direct estimation of regional richness followed the procedures detailed in the source document of each estimator.
In some situations sub-regional estimates of species richness may already exist, and it could be advantageous to take advantage of these estimates in the pursuit of a regional estimate.Sub-regional estimates are, of course, non-additive when one or more species can occur in more than one sub-region within the region (Guralnick & Van Cleve, 2005).Thus the issue of sub-regional overlap in species composition must be addressed.The simplest approach is to multiply the sum of sub-regional estimates of richness by a "uniqueness" ratio   0 1 s R   of the regional species richness.Of course, s R is unknown and must somehow be estimated from available data or knowledge.Expert knowledge, tree atlases, museum data (Beck & Kitching, 2007), or previous studies may provide sufficient information for an estimation of s R .If sub-regional species incidence data (or equivalent) are available, a possible approximation to s R can be obtained as ( 4) where region SO is the total number of species observed in the pooled sub-regional samples, and h SO is the number of species observed in sub-region h.The multiplier 1 C   is a correction factor compensating for a positive bias that arises from the facts that i) observed species observed in a relatively small sample tend to be more common than species not captured by the sample (Chao, 1981;Reichert et al., 2010;Starr, 1979) and ii) common species are more likely to be shared among sub-regions than less-common species (Cao et al., 2004;Jaccard, 1901;Ugland et al., 2005).After extensive trials it was decided to use Chao's estimator of sample coverage as the correction factor (C = one minus the relative frequency of species found in just one plot) (Bunge & Fitzpatrick, 1993) (Chao & Lee, 1992).Our simulations (not shown) demonstrated that s R  is asymptotically unbiased as h n   .The ratio s R  attains a maximum of 1.0 when no species occur in more than one stratum, and a minimum of 1 H  when species compositions are identical across strata.Accordingly, the estimator for the regional richness becomes where est stands for the estimator used to estimate sub-regional richness.Note it is not necessarily that the same estimator is used in every sub-region.SO SO were obtained via a modified (plot-level) bootstrap procedure (Magnussen &   McRoberts, 2011).The variance of , ˆh est S was estimated as detailed in the source document for each estimator (c.f.references given in section on estimators).Finally, a variance approximation for C  was obtained from a bootstrap estimate of the variance of the number of species that occurred in just one plot (assuming the overall species yield is fixed by design).

Shannon-Weaver indicator of diversity
The Shannon-Weaver index is a widely used indicator of diversity (Magnussen & Boyle, 1995).To help readers appreciate the data used here, we also report on the Shannon-Weaver indices of diversity for the two regions (FIA and PROV) as well as for their sub-regions (Figures 1 and 2).The Shannon-Weaver index is computed as follow (Krebs, 1989, p 444) (5) where i  is the plot-level relative incidence of species i.

Regional and sub-regional species richness
Model-based estimates of species richness depends strongly on the distribution of relative plot-level incidence of species (Chao, 2005;Mao & Colwell, 2005).To obtain a sample of SO species in a community with many rare and a few common species requires a larger sample than in a community with a more even distribution of relative incidence.As expected, the relative plot-level incidence distribution for GA, MI, MN, and combined resembled an inverse J-shaped distribution (Colwell, et al., 2004) characterized by many rare species and fewer common species (Figure 1).Despite large differences in richness (66 in MN to 147 in GA) the sub-regional distributions share several common features: About half the species occur in less than 1% of the plots and only 10% reach a relative incidence of 10% or higher.A large number of species are shared among sub-regions   , , , 62, 38, and 56 Relative incidence distributions in sub-regions of PROV also resembled an inverse J-shaped distribution but less so than in FIA (Figure 2).Differences among sub-regional richness (33 to 71) were smaller.As for FIA, about half the species occur in less than 1% of the plots; a larger proportion (30-40%) can be regarded as fairly

Design effects in the number of observed species
The observed species count is, on its own, an estimator of richness and an important component of any estimator of S (Chao, 2005;Mao & Lindsay, 2007).Observed species counts in FIA under EQ, STR and SRS designs were similar (Figure 3 top).In the combination FIA×EQ the samples contained approximately three species more than with STR and SRS designs.Under an EQ design the sample fraction in GA is higher than in MN and MI, and with more species in GA, one should, a priori, expect that . It is therefore somewhat surprising, that the design effects are minor and do not seem to increase with sample size.Similarities among sub-regions in distributions of relative species plot-incidence are likely a main cause for the modest design effects.Design effects for CLU and T1 are similar once the sample fraction is 1% or higher.
In PROV the species count with an EQ design was lower than with STR and SRS designs (Figure 3 bottom).In an EQ design the sample fraction is highest in species-poor NO and lowest in CE.Hence, an increase in the NO sample fraction at the expense of CE and ME triggers a decline in the number of species encountered per sampling unit.As expected, the decline is more pronounced for T1 than for CLU sampling.
Sampling errors of observed richness in FIA and PROV were, for a given sample size and sampling unit, similar for EQ, STR, and SRS designs (Figure 3).With fixed-area plot (CLU) sampling units in FIA the sampling error declined with an increase in sample fraction from 5.3 (0.5% sample) to 3.8 (5% sample).Yet it increased (slightly) from 2.9 to 3.8 when only the centermost tree in each plot is selected (T1) due to larger increases in SO with increasing sample sizes.

Design effects in the number of observed singleton species
Species observed in just one sample unit (singletons, SI) exert a disproportionate effect on estimates of richness (Mao & Lindsay, 2007;Reichert, et al., 2010).In both FIA and PROV the number of observed SI-species was similar across the EQ, STR, and SRS designs.With fixed-area plot sampling (CLU) in FIA the number of singletons drops from approximately 25 with the lowest sample fraction to 20 with the largest.With selection of the centermost tree per plot the number of singletons at first increased and then decreased as the sample fraction is increased.In PROV one observed a steady decline in the number of SI-species with increasing sample size when only one tree is selected per plot (T1).
The impact of singletons on estimates of richness is relative to the number of observed species (SO) and hence sample size (Chao, 2005;Thrush et al., 2006).However, when examining the ratio of the number of observed singleton species to the number of observed species no important trend emerged across designs, sample size, and choice of sampling unit.

Direct estimation of regional tree species richness
A comprehensive comparison of the estimators under SRS and CLU sampling has been done elsewhere (Magnussen, 2011).In terms of root mean squared errors this study confirmed that the HT-type estimator ranks first for the three largest sample fractions while the urn estimator ranks first at the two lowest sample fractions (not shown).
In FIA the four studied estimators of richness were apparently insensitive to design effects (Figure 4).The only indication of an important design effect was in the combination of the Shen & He estimator, the two lowest sample fractions, and the T1 sampling unit (for which the estimator was not designed).Here, estimates under SRS designs were lower (15-30) than those under EQ and STR design.In PROV the ˆsh S and ˆhw S estimators indicated a sensitivity to the sampling design, whereas estimates with ˆurn S and ˆht S were indifferent (Figure 5).With ˆsh S and an EQ design higher (less biased) estimates of S were obtained than with STR and SRS designs.With the ˆhw S there appears to be complex design × sample size interactions.
Sampling with fixed-area plots (CLU) produced, as expected, less biased estimates of S than sampling with selection of a single central tree per plot (T1).The ˆurn S estimator was the most sensitive to the choice of sampling unit and hw S the least sensitive.
Bias in regional and sub-regional estimates of species richness were strongly correlated (> 0.8) and followed single linear trend across sample fractions and choice of sampling unit.The slope of the trend line for the ˆht S estimator was close to 1, otherwise the slope was in the range from 0.5 to 0.7.An absence of design effects suggest that the influence of sub-regional heterogeneity in species diversity and incidence (beta-diversity, Jobe, 2008) is outweighed by random variation introduced by low-intensity probability sampling.Interactions between estimators, design, and sampling unit were all minor and of no practical importance.Consequently, the performance of the four richness estimators is controlled almost exclusively by sample size and choice of sampling unit (Brose, et al., 2003;Chazdon, et al., 1998).Similar results, but limited to considerably higher sample fractions, were reached in simulations with artificial populations and non-parametric richness estimators (e.g.Jackknife, bootstrap, and Chao's estimators) (Brose, et al., 2003;Wagner & Wildi, 2002).Gimaret-Carpentier (1998) reported a slim advantage of stratifying samples by topography when estimating species richness in the Western Ghats of India.
The results also suggest that fairly accurate estimates of species richness in a large and heterogeneous region can be obtained by pooling sample data from sub-regional surveys.As well, the relative bias of a regional estimate may be lower than the average relative bias of sub-regional estimates.
Results from this study cannot be extended indiscriminately to regions with steep ecological gradients, and distance-dependent species' turn-over rates.When it is possible to verify, for example, a power law species-area relationship (Arrhenius, 1921), it should be exploited in pursuit of less biased and more efficient estimates of species richness than possible with estimators of the kind used here, since they ignore information about fundamental ecosystem processes (Chazdon, et al., 1998;Engen, et al., 2008;Harte, et al., 1999;Jobe, 2008;Krishnamani, et al., 2004;O'Dea, et al., 2006;Thrush, et al., 2006;Ugland & Gray, 2004;Ugland, et al., 2005).Nevertheless, the potential advantages of a ecological stratification may not materialize unless the difficult task of identifying an effective stratification criterion (metric) can be completed with confidence (Barnett & Stohlgren, 2003;Stohlgren et al., 1997).

Combining sub-regional estimates to a regional estimate of richness
The primary intent behind the proposed procedure for combining sub-regional estimates of species richness to a regional estimate is to give the analyst more options for combining and using existing information and data.Application in this study was limited to the case where the analyst has full access to all data gathered at the sub-regional level.As such, the example merely demonstrated that the direct and combined approach in this case generated very similar results.
As for the direct estimates of regional species richness, the design effects in the combined estimate of richness were unimportant (Figures 6 and 7).The performance of the combined estimate depends on the properties of the component estimates and the multiplier s R  .With the ˆsh S and ˆurn S estimators and CLU, the combination of sub-regional estimates of richness into a regional estimate produced a reduction in bias (1%-13%) in all designs with the three lowest sample fractions.A reduction linked to a positive bias in s R  that, as expected, declined rapidly with an increase in sample fraction.Otherwise, with ˆˆ, ,and ht sh urn S S S the performance of ˆcomb S was similar to that of a direct estimator (Figures 6 and 7).However, combining sub-regional CLU-based ˆhw S estimates reversed a negative bias of richness to a positive bias in both FIA and PROV.A noticeable among sub-regional heterogeneity in the estimates of the generalized jackknife regression function for the prediction of the number of unseen species is the most likely explanation for this anomaly.

Conclusions
Fairly accurate regional sample-based estimates of species richness can be obtained with any of four studied richness estimators by pooling species incidence data from low-intensity, sub-regional sampling with fixed-area plots.A new algorithm for combining sub-regional estimates of species richness affords added flexibility and more options for the task of regional estimation.EQ-T1, ■: STR-T1, ▲: SRS-T1.Trend lines for CLU (full) and T1 (dashed) are indicated

Appendix
The algorithm for the urn estimator by (Magnussen, et al., 2010) is as follows: Step Description 0 Place n balls -each representing a survey plot -in an urn.On each ball is written the number of species that was seen in only this plot   singleton species, , 1,..., The HT-type estimator of species richness (Magnussen, 2011) is computed from the number of observed species (SO), sample-based estimates of species inclusion probabilities ˆ, 1,..., i i SO   (Patil & Taillie, 2001), population size (N) and estimated size (number of plots) of species i = 1,…, SO.In short, (1) The estimator by Shen and He (Shen et al., 2003) (2) where  and  are maximum likelihood estimates of the parameters in a modified beta-distribution of the relative plot-incidence of observed species (see equations 3 and 4 on page 2053).Hwang and Shen (2010) proposed a generalized jackknife estimator of richness (3) where 0 f is an estimate of the 'unseen' species, which, in turn, is estimated from equal to the number of species occurring in 1 sample plot (singletons).the function is approximated by an exponential regression model and estimated from the data via constrained nonlinear least squares regression (see Hwang and Shen for details).

 
ˆ0 g is obtained by extrapolation.

Figure 1 .Figure 7 .
Figure 1.Relative plot incidence    of species in FIA sub-regions GA, MI, and MN.Species are ranked (1, …., S) in descending order of plot-level incidence.H = Shannon-Weaver index of diversity(Pla, 2004) counter equal to the observed species richness in the sample, S urn = SO.2Draw at random a ball from the urn with index, say, *, * 1,...., urn i i n  .Create a new ball, representing a new plot, with singleton value* i SI where  is a realization of a random Bernoulli variable (0,1).Where: the Gaussian error function, i.e. the integral of a standard normal variable (z) from zero to z.Return two copies of this ball to the urn (n urn = n+1).3Updatespecies counter: S urn =S urn +δ.4If n urn < N  is a single urn-replication prediction.Steps 0-5 are repeated M times to generate a resampling distribution of S urn