Statistical Tests: The Chebyshev UCL Proposal

(Draft)

The theory of confidence intervals

The problem concerns estimating an interval required to cover a distribution-dependent quantity with a minimum probability, such as 95%.  That probability is the confidence level of the interval.  The quantity typically is a mean, standard deviation, or percentile.

This interval is a pair of procedures that assign interval endpoints to any possible experimental outcome (set of observations).  The upper endpoint is the UCL and the lower is the LCL.  We write UCL(x) and LCL(x) to show the dependence on the data.

To determine UCL(x) and LCL(x), first find for each possible distribution a region where the quantity of interest has a 95% probability of occurring.  In the figure, this region--which is a set of outcomes and therefore describes a vertical extent--is shown with a dashed line.

We can find these regions because when we are given the distribution we can do any necessary probability computation.

The endpoints of these 95% intervals trace the wiggly curves in the figure.  The area between the wiggly curves encompasses all pairs (theta, x) where x falls within the 95% probability interval for theta.

Consider, now, a random outcome x.  The set of distributions whose regions contain x is shown by the solid horizontal arrow.  That is, (theta, x) lies along the horizontal arrow exactly when x is within the 95% probability interval for theta.  The set of these distributions (theta) gives a 95% confidence interval for x.  For instance, if the quantity of interest is the mean, then the 95% CI is the set of means of the distributions spanned by the horizontal arrow.

The demonstration is simple.  When the true distribution is theta, there is a 95% chance of the outcome lying within the extent of the vertical dashed arrow; that is, between x+ and x-.

In all such cases the confidence intervals (shown by the solid horizontal arrows) all include the true distribution (vertical dashed arrow) and therefore cover the true quantity of interest.  Thus there is at least a 95% chance that this procedure produces intervals covering the true value.  That is the definition of a confidence interval.

Of course you can replace "95%" with any value between 0% and 100% you like.  The analysis is the same.

One source for this material is chapter nine of Jack C. Kiefer, Introduction to Statistical Inference.  (See the Links page.) 

The "Chebyshev Method"

Recently, US EPA statisticians have proposed using the Chebyshev inequality as a confidence limit procedure for the mean.  It's a clever idea.  They write,

[The Chebyshev inequality] can be applied with the sample mean, x-bar, to obtain a conservative UCL for the population mean. ...  In general, if mu1 is an unknown mean, mu1-hat is an estimate and sigma-hat(mu1-hat) is an estimate of the standard error of mu1-hat, then the quantity UCL = mu1-hat + 4.47 sigma-hat(mu1-hat) will give 95% UCLs for mu1, which should tend to be conservative, but this is not assured.  [Singh, Singh, and Engelhardt, op. cit., page 12.]

The omitted part of this passage shows that the multiplier 4.47 is computed from the relationship 1/4.47² = 100% - 95% = 5% = 1/20.  Chebyshev's inequality states that if mu is the true mean and sigma the true standard deviation, then no more than 1/k² of the probability can be supported at values further than k sigmas from mu.  In this instance, therefore, no more than 5% of the probability can be supported at values further than 4.47 sigmas from mu.  Evidently, "conservative" is meant in the sense of "having at least 95% coverage."
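The arithmetic behind the multiplier is easy to verify.  Here is a minimal sketch (the function names are mine, not from the EPA paper):

```python
import math

def chebyshev_multiplier(confidence):
    """Multiplier k satisfying 1/k**2 = 1 - confidence (Chebyshev)."""
    return math.sqrt(1.0 / (1.0 - confidence))

def chebyshev_ucl(sample_mean, standard_error, confidence=0.95):
    """The proposed UCL: the mean estimate plus k standard errors."""
    return sample_mean + chebyshev_multiplier(confidence) * standard_error

k = chebyshev_multiplier(0.95)  # sqrt(20) = 4.4721...
```

For a 95% level this gives k = sqrt(20) = 4.47, matching the multiplier quoted above.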

This proposal is based on two mistakes.  The formula really gives a tolerance limit, but it is taken to be a confidence limit.  (This issue is explored in detail elsewhere on these pages, in the evaluation of Pennsylvania's "75%/10X" rule.)  The second error lies in confusing the roles of theta and x in the general confidence limit construction presented above.

Just because the motivation for a procedure is mistaken does not necessarily mean the procedure itself is bad.  The rest of this article therefore evaluates the performance of the Chebyshev UCL procedure.

I want to point out that the EPA paper otherwise contains a thoughtful and clear discussion of possible problems with using Lognormal models in environmental applications.  The Chebyshev proposal is a small and inconsequential part of the paper and the authors are wary about its possible applications.  Nevertheless, some EPA regions apparently have adopted this proposal as a preferred alternative to current guidance for computing UCLs, so evaluating the Chebyshev UCL procedure is useful and important.

The EPA statisticians clearly are aware that something is suspect when they write "but this is not assured."  The caveat is all-important in this situation.

The first example provided by the EPA statisticians is a mixture of two widely separated Normal distributions, N(100, 50) and N(1000, 100) with a small sample size of 15.  Evidently, then, they intend the Chebyshev formula to be used in such circumstances.

Small samples from mixture distributions actually provide examples of the worst possible performance of this approach.

Suppose, for instance, the true distribution is a mixture of 95% of a low-mean, small-sd distribution with 5% of a high-mean, moderate-sd distribution.  Let the high mean be more than 23 small-sds greater than the low mean.  Because the mixture mean exceeds the low mean by 5% of the separation, the mean of the mixture will then be at least 23/20 = 1.15 small-sds higher than the low mean.  This is comparable to the situation in the N(100, 50) + N(1000, 100) example, where (1000 - 100) = 18*50.

The probability that all 15 data in the sample come from the low-mean component is (0.95)^15 = 46%.  When this happens, the sample mean will with high probability be close to the low mean and the sample standard deviation close to the low standard deviation.  The Chebyshev prescription, (sample mean + 4.47 * sample standard error) = (sample mean + 4.47 * sample standard deviation / sqrt(15)), equals the sample mean plus 1.15 sample standard deviations.  By our assumption, which covers many realistic situations, this value is about the same as the true mean.  Therefore, just by chance, it has about a 50% probability of being less than the true mean.

The upshot is that about 50% * 46% = 23% of the time, the Chebyshev value will not cover the true mean.  A putative 95% confidence limit in this case is really a 77% limit.
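A quick Monte-Carlo check bears out this arithmetic.  The sketch below uses an illustrative mixture of my own choosing with a 23-small-sd separation--95% N(100, 50) plus 5% N(1250, 100)--so the true mean is 100 + 0.05*1150 = 157.5; the observed coverage comes out near 77%, far below the nominal 95%:

```python
import math
import random
import statistics

random.seed(1)
N, TRIALS = 15, 4000
K = math.sqrt(20)                     # the 4.47 Chebyshev multiplier
TRUE_MEAN = 0.95 * 100 + 0.05 * 1250  # 157.5

def draw():
    # 95% low-mean component, 5% high-mean component
    if random.random() < 0.95:
        return random.gauss(100, 50)
    return random.gauss(1250, 100)

covered = 0
for _ in range(TRIALS):
    x = [draw() for _ in range(N)]
    ucl = statistics.mean(x) + K * statistics.stdev(x) / math.sqrt(N)
    if ucl >= TRUE_MEAN:
        covered += 1

coverage = covered / TRIALS  # roughly 0.77 rather than 0.95
```

Whenever a high-component value does land in the sample, the sample standard deviation explodes and the Chebyshev UCL covers easily; the failures are concentrated in the 46% of samples that miss the high component entirely.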

For more extreme mixtures (wider separation between the means of the components, smaller amounts of the high-mean component, but relatively large true means in the mixture) the situation only worsens.  It gets much worse with smaller sample sizes.

Where such mixtures are possible in practice, this analysis can be used to demonstrate that the Chebyshev procedure has an arbitrarily low confidence level.  We need only exhibit mixtures whose mean is highly unlikely to be covered by the Chebyshev UCL.

This is not mere theoretical speculation.  Not only are such mixtures possible, they happen all the time.

In effect, the EPA procedure is computing the inner dashed line in the figure based on the outcomes x (through their sample mean and standard deviation).  No matter what intended level of confidence is used, the dashed line can never extend fully to the solid line that gives a true 95% confidence limit. 

For the state (theta) shown, the likely sample outcomes lie between x- and x+, as before.  The outcome (x) shown is one of the likely ones.  Its Chebyshev UCL is given by the inner dashed line.  It falls short of the correct UCL given by the solid line.

Therefore--at least when sufficiently complex mixture models are contemplated--it is always possible to find a state of nature, such as the one shown at theta in the figure, which (a) is fairly likely to produce an outcome x that (b) produces a Chebyshev UCL less than the true mean.

Investigation of soils at a contaminated site usually will turn up a large quantity of background or low-level samples plus a smaller quantity of samples with highly elevated concentrations.  In extreme cases, such as pesticides in soils near pesticide manufacturing or formulating facilities, the high concentrations can average four to five orders of magnitude (10,000 to 100,000) greater than the background concentrations.  These high concentrations may occur in localized areas.  A mixture model is highly appropriate in these cases.  The failure of the Chebyshev method is likely.

Direct evaluation of the Chebyshev UCL procedure for lognormal distributions

In some ways the preceding analysis is unfair.  No reasonable procedure exists to compute a 95% UCL of the mean for arbitrary mixtures of Normal distributions.  We can demonstrate this in the same way as above, by exhibiting a mixture whose upper component is unlikely to be detected by a small number of samples.  If the mean of that upper component is sufficiently great, it can cause the mean of the mixture to exceed (on average) any finite UCL.

However, the mixture situations described above approximate the behaviors of some lognormal distributions.  So how does the Chebyshev procedure compare to conventional UCL procedures when the distribution is lognormal?

Let's do the comparison for situations the Chebyshev procedure is designed to address: small to moderate sample sizes (making it difficult to identify the true underlying distribution) and moderate to large skewness.  As a basis for comparison we may take Land's UCL method, which is the one the EPA statisticians are criticizing.
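For reference, Land's UCL exponentiates the log-scale mean plus half the log-scale variance plus an H-statistic term.  A minimal sketch follows; the H-value must be supplied by the caller because it depends on the sample size and the standard deviation of the logarithms, and the numeric value used below is a placeholder for illustration, not taken from the tables:

```python
import math

def land_ucl(log_mean, log_sd, n, h):
    """Land's one-sided UCL for a lognormal mean.  h is the tabulated
    H-value (Land's tables, reproduced in Gilbert) for this n and log_sd."""
    return math.exp(log_mean + 0.5 * log_sd**2 + log_sd * h / math.sqrt(n - 1))

# Hypothetical example: n = 15, log-sd = 1, placeholder H-value of 2.3.
ucl = land_ucl(0.0, 1.0, 15, 2.3)
```

Note how the s²/2 term makes this UCL grow rapidly with the log-scale standard deviation, which is the heart of the comparison that follows.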

The following figure shows the anticipated ratio of the Chebyshev UCL to the Land UCL.  

For each value of N (sample size) and s (true standard deviation of the logarithms), the numerator was computed using the formulas for the MVUE of the mean and its variance, substituting s for the sample standard deviation.  The numerator thus underestimates the expected value of the Chebyshev UCL, but Monte-Carlo simulations show the underestimate generally is small.  The denominator was computed as the expected value of the Land UCL.  The computation was exact, not approximate.

This ratio is therefore a rough measure of the average ratio of the two UCLs that will be attained in random samples.

You can see that as the standard deviation increases, the ratios decrease very rapidly.  This means the Chebyshev UCL tends to be much lower than the Land UCL when computed for lognormal distributions, implying that the Chebyshev UCL achieves much less than 95% coverage when the underlying distribution is truly lognormal (or close to it).  It is the opposite of conservative.

This loss of coverage occurs in two ways.  First, if the sample size is small, there are unlikely to be enough high values in the sample, so the Chebyshev formula will underestimate the mean.  Second, if the skewness is extreme (the standard deviation of logarithms is large), then larger sample sizes are needed to detect the extremely high values that exist.

Evaluation of Three Recommendations for Computing UCLs of the Mean

The three procedures

Singh et al. (op. cit.) conclude their issue paper with a procedural recommendation:

"...the following steps for computing a UCL of the mean of the contaminant(s) of concern are recommended:

1)  Plot histograms of the observed contaminant concentrations and perform a statistical test of normal or lognormal distribution (e.g., the Shapiro-Wilks test). ...

2)  If a normal distribution provides an adequate fit to the data, then use the Student's t approach ... for calculating the UCL of the population mean.

3)  If a lognormal distribution provides an adequate fit to the data, then a) use the lognormal theory based formulas for computing the MVUE of the population mean and the standard deviation, b) either use these MVUEs with the jackknife or bootstrap methods to calculate a UCL of the mean, or use the Chebychev approach for calculating a UCL.  Do not use the UCL based on the H-statistic, especially if the number of samples is less than 30.

[Page 18, emphasis in the original.]

Standard US EPA guidance, in use for a decade and widely cited, recommends a slightly different procedure:

"EPA's experience shows that most large or 'complete' environmental contaminant data sets from soil sampling are lognormally distributed rather than normally distributed ...  In most cases, it is reasonable to assume that Superfund soil sampling data are lognormally distributed.  ...  However, in cases where there is a question about the distribution of the data set, a statistical test should be used to identify the best distributional assumption for the data set ... to determine if the data set is consistent with a normal or lognormal distribution."

[Supplemental Guidance to RAGS: Calculating the Concentration Term.  OSWER Publication 9285.7-081 May 1992, pp 3-4.]

"For exposure areas with limited amounts of data or extreme variability in measured or modeled data, the UCL can be greater than the highest measured or modeled concentration.  In these cases, ... the highest measured or modeled value could be used as the concentration term [i.e., the UCL]."

[Ibid., page 3.]

For computing lognormal UCLs, the EPA recommends Land's method; for normal UCLs, Student's method.  It does not address any other distributional assumption.

The EPA guidance is vague about exactly how the distributional assumption will be tested.  The language about "it is reasonable to assume ... data are lognormally distributed" strongly suggests the null hypothesis should be lognormality.  The remaining text then suggests rejection of this hypothesis should cause the risk assessor to use Student's method.

Many risk assessors in practice use a likelihood ratio test.  Rather than adopting a null hypothesis, this approach evaluates a test statistic--such as the Shapiro-Wilks statistic--for the original data and for their logarithms.  The UCL method is determined by whichever version of the data is more consistent with a Normal distribution.  Regardless of approach, risk assessors following the EPA guidance usually substitute the maximum value for the computed UCL whenever the latter would exceed the maximum.

The distinction is that in some cases a test of normality would reject both datasets, and in other cases a test would fail to reject either dataset.  In the latter case the Singh et al. recommendation is to use Student's method, whereas the EPA guidance is to use Land's method.  In the former case (neither dataset looks normal), the EPA guidance is (implicitly) to use Student's method, whereas Singh et al. recommend choosing an alternative method (such as their Chebyshev procedure).

The problem

The difficulty with all three recommendations is that they can no longer be expected to produce 95% UCLs.  The preliminary distribution tests they recommend change the distribution of data sets to which the Student, Land, or Chebyshev procedures are applied, so there is no assurance that the desired coverage of 95% is achieved.

Theoretical computation of the coverage is difficult.  That would require information about the coverage achieved conditional upon the results of the distribution test.  Assessing the coverage practically requires simulation.
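As a minimal illustration of what such a simulation looks like (not the full study described below), the following sketch estimates the coverage of the Student-t UCL alone when the data are really lognormal.  The one-sided 95% t critical value for 14 degrees of freedom (1.761) is hard-coded; the sample size and log-scale standard deviation are choices of mine:

```python
import math
import random
import statistics

random.seed(2)
N, TRIALS, SIGMA = 15, 20000, 1.0
T95 = 1.761                          # one-sided 95% Student-t point, 14 df
true_mean = math.exp(SIGMA**2 / 2)   # mean of lognormal(0, SIGMA)

covered = 0
for _ in range(TRIALS):
    x = [math.exp(random.gauss(0.0, SIGMA)) for _ in range(N)]
    ucl = statistics.mean(x) + T95 * statistics.stdev(x) / math.sqrt(N)
    covered += ucl >= true_mean

coverage = covered / TRIALS          # typically falls short of 0.95
```

Repeating this for each candidate procedure, with the distribution test inserted before the choice of UCL formula, gives the conditional coverages that are so hard to obtain theoretically.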

Evaluation methodology

To evaluate these three procedures, I simulated datasets derived from Lognormal, Normal, and many other distributions, most of them exhibiting positive skewness.  As to sample size, the EPA guidance states,

"... data sets with fewer than 10 samples per exposure area provide poor estimates of the mean concentration, ... while data sets with 10 to 20 samples per exposure area provide somewhat better estimates of the mean, and data sets with 20 to 30 samples provide fairly consistent estimates of the mean..."

[EPA, ibid., page 3.]

In practice, many screening risk assessments are performed with as few as five samples per exposure area (sampling region).  Therefore the simulated data sets contained between five and 31 samples each.  (To minimize interpolation, the data set sizes were chosen to coincide with sizes presented in Gilbert's tables of the Land H-values.)

For each simulated data set, I computed

The Anderson-Darling statistic for the data.
The Anderson-Darling statistic for the logarithms of the data.
The Land 95% UCL of the mean.
The Student 95% UCL of the mean.
The Chebyshev 95% UCL of the mean as described in Singh et al.

The Anderson-Darling statistic tests whether the data appear to have come from a Normal distribution.  For distinguishing Normal from Lognormal distributions it works essentially as well as the Shapiro-Wilks test [see M. A. Stephens, EDF Statistics for Goodness of Fit and Some Comparisons.  JASA 69 #347 (Sep 1974), pp 730-737].  It appears to be superior to the Shapiro-Wilks test for distinguishing Normal from heavy-tailed distributions such as the Cauchy (Student's t with one degree of freedom).

For the RAGS and Singh et al. procedures, I tested lognormality (respectively, normality) at the 95% level using the percentage points of the Anderson-Darling statistic tabulated in Stephens.

The version of the Singh et al. recommendation simulated was the following.  If the data appear normal (tested at the 95% level), then Student's method is used.  If they otherwise appear lognormal (also tested at the 95% level), then the Chebyshev method based on the MVUEs of mean and standard error of the mean is used.  Otherwise, without actually implementing it, I assumed some alternative method (such as bootstrapping) would be used that achieves exactly 95% coverage for any data not appearing normal or lognormal.  This was not a strong assumption, because in most simulations (where the underlying distribution was truly lognormal) only two to four percent of the datasets failed both distributional tests.

In the RAGS procedure, I used the maximum value in place of the UCL whenever the UCL was too high, as described above.

Each simulation consisted of at least 10,000 independent trials.  In each trial, the three recommended procedures (Singh et al., RAGS, and conventional "best fit") were applied to the data.  For comparison, the values of the Land UCL and Chebyshev UCL were also tracked.  The UCL produced by each one of these five procedures was compared to the true mean.  At the end of the simulation, the proportion of trials where the UCL covered (equaled or exceeded) the true mean was recorded.

Simulations were performed and later repeated for most combinations of underlying distribution and data set size.

Various quality control methods were employed to assure the computations were correctly performed.

The following figure shows typical results.  The underlying distribution is lognormal.  The data set size is 10.  The coverage rates depend on the standard deviation of logarithms as shown on the x axis.

This figure shows that when the underlying distribution really is lognormal,

The Land method performs well: its coverage averages 95%.
The RAGS recommendation works better than the best-fit approach.  This is because the Anderson-Darling test rarely rejects the hypothesis of lognormality, and in the 5% of cases when it does, sometimes the Student UCL covers the true mean anyway.  The best-fit approach more frequently selects the Student method, which typically does not cover the true mean, thereby reducing the coverage.
The Chebyshev approach is poor and grows dramatically poorer with increasing skewness.  This bears out the general results exhibited earlier: since the Chebyshev value tends to be far too small, it covers the true mean much less often than it should.
The Singh et al. approach is almost as poor as the Chebyshev UCL alone.  This is due partly to the use of the Chebyshev UCL whenever the hypothesis of Normality is rejected, which happens with greater frequency as sigma increases.  It is also due to the default assumption of normality, which is dramatically incorrect when the underlying distribution truly is lognormal and sigma is large.

Instead of using the Chebyshev procedure one may use some other procedure (like bootstrapping) in the Singh et al. approach.  This greatly improves the Singh et al. coverage, but it is still distinctly inferior to the other procedures for lognormal distributions because of its initial assumption of Normality.  It exhibits the poorest performance for smaller data set sizes (N < 15) and for moderate skewness (logarithmic standard deviation between 0.6 and 1.25).

Interestingly, the "best fit" procedure works well for a wide variety of underlying (usually skewed) distributions.  These include gamma, Normal, exponential, and Weibull.  In almost all cases the Singh et al. procedure achieves lower coverage than claimed.  It is evident that this behavior derives from the default assumption of an underlying Normal, rather than Lognormal, distribution.

Conclusions

In sum

The EPA's proposed Chebyshev UCL procedure can achieve an arbitrarily low confidence level--approaching zero percent--rather than its intended 95% level.

This is poor behavior.

Furthermore,

Of the three competing approaches to computing 95% UCLs of the mean--RAGS, "best fit," and the Singh et al. recommendation--the Singh et al. procedure is the worst, because it replaces the Land method (however poor it may be) with the much poorer Chebyshev method.  Avoid it wherever the nominal confidence level must actually be achieved.

Replacing the Chebyshev UCL by a better method in the Singh et al. approach improves the results, but the procedure still remains inferior to the others in most situations due to its default Normality assumption.

For environmental concentration data, the RAGS recommendation produces the best overall 95% UCL procedure when data are approximately lognormal.  The "best fit" procedure, which uses the Student or Land method according to whether the data look more Normal or lognormal, respectively, appears to be the most robust to variations in underlying distribution.

Links to web resources

http://www.alceon.com/ln&lnpp.pdf -- Tutorial on using lognormal distribution in risk assessment: Using lognormal distributions and lognormal probability plots in probabilistic risk assessments.  David E. Burmaster and Delores A. Hull.  HERA, 96-26.

http://www.alceon.com/17points.pdf -- Burmaster & Thompson on the Lognormal ("extruded Voronoi" technique).

http://www.epa.gov/crdlvweb/pdf/lognor.pdf -- The Singh, Singh, and Engelhardt paper from the EPA.  [9 May 2001: The EPA appears to have removed this paper from the web!  Searches turn up references to it, and even the URL listed here, but the paper itself is no longer at the EPA site.  An abstract of a related presentation is available at http://www.amstat.org/meetings/jsm/1999/jsm99prog/abstract_info.asp?aid=7418 .]

http://www.deq.state.la.us/technology/recap/eparags.htm -- US EPA RAGS ("calculating the concentration term"): poor photocopy

http://www.deq.state.la.us/technology/recap/LognormalA5.xls -- an attempt to compute 95% Lognormal UCLs.  Includes some H-statistics, but (incorrectly) uses linear interpolation.

http://www.clu-in.org/download/techdrct/tdsubsurf_proceed.pdf -- quotes a mis-use of a 95% UCL of the mean "to characterize background."  A non-pdf version is available at http://www.frtr.gov/optimization/optimize.html#Dataassessment.

Return to the Environmental Statistics home page

This page is copyright (c) 2001 Quantitative Decisions.  Please cite it as

This page was created 25 March 2001 and last updated 9 May 2001.