Notes on Chapter 5

Go to Notes on Chapter 4

These notes are intended to amplify the text, point out questions and areas for further thought, and identify and resolve ambiguities and (the rare) mistakes.  You definitely should be aware of the mistakes, so their pages are displayed in bold face.

Chapter.Page Comments
5.201 There are many other methods for estimating parameters.  The most important one omitted from this discussion is the Bayes method.
5.201 (Bottom)  The n observations are implicitly assumed to be independent of each other.  This means that the result of one observation will not change the probabilities governing any other observation.  Statistical theory can handle non-independent observations, but that theory is not introduced in this chapter.
5.202 For the theory in this book (or in almost any other statistics book) to apply, the function h() in equation (5.1) must be determined without reference to the sample results.  The best way is to define h() before the sample is ever observed.

This is a very important point, because in practice h() often does depend on the data.  For example, US EPA guidance for statistical tests of groundwater monitoring data (see the Links page) recommends evaluating the shape of the batch (x1, x2, ..., xn).  If the shape is consistent with a Lognormal distribution then, to estimate its mean, one might use formula (5.41) on page 211.  Otherwise, one might use the sample mean given by formula (5.2) on page 202.  It is a mistake to believe (without further theoretical investigation) that this complex procedure has any of the statistical properties claimed of (5.41) or (5.2).

5.209 We use calculus to maximize the likelihood function f(p) by differentiating it and setting the derivative equal to zero.  The derivative is [80/p - 20/(1-p)]*f(p), which for p strictly between 0 and 1 is zero only when p=0.80.  Because f() is differentiable for all p in the interval [0, 1], the maximum must occur either at the ends of the interval or at p=0.80.  Since f() is zero at the ends of the interval and positive everywhere else, the maximum occurs at p=0.80.
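The calculus can be checked numerically.  This is only a sketch: it assumes f(p) is proportional to p^80 * (1-p)^20, which is what the stated derivative implies, and the function names and the grid are made up for the illustration.

```python
import math

def log_likelihood(p, successes=80, failures=20):
    """Log of f(p) = p**successes * (1 - p)**failures, for 0 < p < 1."""
    return successes * math.log(p) + failures * math.log(1.0 - p)

def score(p, successes=80, failures=20):
    """Derivative of the log-likelihood: successes/p - failures/(1 - p)."""
    return successes / p - failures / (1.0 - p)

# Search a fine grid strictly inside (0, 1); at the endpoints f is zero.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

print(p_hat)         # grid point with the largest likelihood
print(score(p_hat))  # derivative there (zero at the maximum, up to rounding)
```

Working with the logarithm of f() does not move the maximum, because the logarithm is an increasing function.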
5.209 (Bottom line.)  Equation 5.36 says that the log-likelihoods add.  That is, each observation xi contributes a term log(f(theta | xi)) and the logarithm of the likelihood function is just the sum of all such terms.  This is a simple, easily remembered result.
5.210 It is not essential to maximize the log-likelihood function in formula (5.37) "first relative to mu, then relative to sigma^2."  We just have to find values of mu and sigma (there may be more than one of each!) that maximize the likelihood.  How you go about finding them is a matter of expedience.  In this particular case, it is expedient to let mu alone vary so we can establish the largest value of the likelihood for any given sigma.  We then find the value of sigma that gives the "maximum of the maxima."  The hardest part actually is establishing that tiny values of sigma (as sigma approaches zero in the limit) do not maximize the likelihood.
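Here is a small numerical illustration of both points: the log-likelihood is a sum of per-observation terms, and the order in which we maximize does not matter as long as we end up at the maximum.  It assumes a Normal model with parameters mu and sigma (formula (5.37) in the text may be written for a related model), and the data values and function names are invented for the example.

```python
import math

def normal_log_likelihood(data, mu, sigma):
    """Sum of the per-observation log-densities of Normal(mu, sigma)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2.0 * sigma ** 2)
        for x in data
    )

def normal_mle(data):
    """Closed-form maximizer: the mean and the divide-by-n standard deviation."""
    n = len(data)
    mu_hat = sum(data) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)
    return mu_hat, sigma_hat

data = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9]  # made-up values, for illustration only
mu_hat, sigma_hat = normal_mle(data)
best = normal_log_likelihood(data, mu_hat, sigma_hat)

# Nudging either parameter away from the MLE should lower the log-likelihood.
for d_mu, d_sigma in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert normal_log_likelihood(data, mu_hat + d_mu, sigma_hat + d_sigma) < best

print(mu_hat, sigma_hat, best)
```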
5.212 "Average value of the estimator" on this page means in the sense of statistical expectation.  After all, we only have a single value of the estimator itself, so there is nothing to average!  This is why the text reminds us that estimators are random variables.  To find the expectation, we need to know the probability law governing the estimator's possible values.
5.213 Although the text does not say it, it leaves a strong impression that unbiased estimators always exist.  This is not true.  A simple example is the problem of estimating the odds of a binomial variable, B(N, p).  The odds are the ratio of expected successes to expected failures: p/(1-p).  Kiefer (see Links), section 4.6, proves there is no unbiased estimator of the odds.
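One standard argument for this (a sketch, and not necessarily the argument Kiefer gives) goes as follows.  Let T be any candidate estimator, a hypothetical function that assigns a finite number T(k) to each possible count k of successes.  Its expectation is

```latex
E_p[T(X)] \;=\; \sum_{k=0}^{N} T(k) \binom{N}{k} p^{k} (1-p)^{N-k}
```

This is a polynomial in p of degree at most N, so it stays bounded as p approaches 1.  The odds p/(1-p) grow without bound as p approaches 1, so the two cannot be equal for every p between 0 and 1.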
5.214 Formula 5.57 is incorrect; the denominator is missing terms between (m+2) and (m+2i); the missing terms are (m+4), (m+6), ..., (m+2i-2).  The formula gn-1(t) is usually simplified to read
5.218 (Bottom)  We have to be very careful about what "the probability of observing a nondetect value at any of the six wells" really means.  This has to be interpreted in terms of a random model.  For example, we could write down each of the 36 data values on a slip of paper, put these slips into a box, shake the box, and draw one slip out.  There is about a 92% chance (33/36) that the slip we draw will be a nondetect.  For this to have an environmental meaning, you would have to suppose that the process of obtaining and analyzing groundwater samples from any of the wells is accurately modeled by drawings from the box.  This simple model cannot accommodate physical variation in well conditions or temporal (time) variation in any aspect of the physical system, sampling procedures, or analytical procedures.
5.223 Statisticians define "efficiency" in two almost equivalent ways.  The useful definition is in terms of sample sizes.  Suppose you want to achieve a given level of precision in an estimate.  If you need to pay for N observations using one estimator and for M observations using another, then the more efficient estimator is the one where you pay less.  The efficiency ratio is N/M.  In most cases, the variance (which measures precision) is inversely proportional to the number of observations, so the ratio of the variances of the estimators also computes efficiency (remembering that smaller variance means more efficient).  That is what equation (5.62) is doing.  It says that using the median with 100 observations or the mean with 64 observations would give about the same precision (for normal populations).
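The "100 medians versus 64 means" statement is easy to check by simulation.  The sketch below is one way to do it, assuming a standard Normal population; the helper name, the seed, and the number of repetitions are arbitrary choices, not anything from the text.

```python
import random
import statistics

def estimator_variance(estimator, n, reps, rng):
    """Variance of an estimator over `reps` standard Normal samples of size n."""
    values = [
        estimator([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.variance(values)

rng = random.Random(12345)  # fixed seed so the run is repeatable
var_mean_64 = estimator_variance(statistics.fmean, n=64, reps=4000, rng=rng)
var_median_100 = estimator_variance(statistics.median, n=100, reps=4000, rng=rng)

print(var_mean_64, var_median_100)   # the two should be of similar size
print(var_mean_64 / var_median_100)  # a ratio near 1 means similar precision
```

Because this is a simulation, the ratio will wobble around 1 by a few percent from run to run.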
5.225 The section, "Comparing Estimators Based on Confidence Intervals," is pithy and well put.  Make sure you understand these two paragraphs well.
5.226 "Berthoux" should read "Berthouex".
5.228 (Bottom)  To this list of assumptions we must add a third: that each observation comes from a single (usually unknown) distribution.
5.229 (Top)  This last assumption should read, "3.  The observations come from a specified family of probability distributions."  When the statistician writes "normal," for example, he or she usually means "any one of the doubly infinite family of normal distributions with some (unknown) mean and (unknown) standard deviation."  Indeed, that's the entire point: because we do not know beforehand exactly which distribution determines the outcomes, we have to guess at (estimate) its parameters.
5.229 (First paragraph.)  There exist a lot of probability distributions.  To specify a continuous distribution, for instance, all you have to do is create the graph of its PDF.  To do that, just draw any curve you like--it can even have discontinuities--that defines a unique height for each x-value.  There are only two simple restrictions: the heights cannot be negative and the total area beneath the curve must be finite and greater than zero.  Then, by uniformly rescaling all heights (dividing each one by the total area), we can always arrange to make the area equal to 1.  The curve is now the graph of a PDF.
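As a rough illustration of this recipe, the sketch below takes an arbitrary non-negative curve with a jump in it, measures its area numerically, and rescales it.  The particular curve, the interval [0, 3], and the midpoint-rule integrator are all invented for the example.

```python
import math

def curve(x):
    """An arbitrary non-negative curve with a jump at x = 1 (made up)."""
    if x < 0.0 or x > 3.0:
        return 0.0
    return 2.0 + math.sin(5.0 * x) if x < 1.0 else 0.5 * x

def area(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the area under f between lo and hi."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))

total = area(curve, 0.0, 3.0)

def pdf(x):
    """The same curve with every height divided by the total area."""
    return curve(x) / total

print(total)                # finite and positive, but not 1
print(area(pdf, 0.0, 3.0))  # approximately 1 after rescaling
```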

Many applications, however, focus on small families of distributions.  This is a powerful way of incorporating experience, theory, and judgment into a statistical problem.  Members of a small family can be named ("parameterized") in a natural way using a finite number of variables, or "parameters."  For example, the family of Normal distributions can be parameterized by mu and sigma.  As mu and sigma vary, the corresponding Normal distribution varies smoothly, too.  There are infinitely many Normal distributions, but they are completely described by only two varying numbers.

(This is analogous to using coordinates to describe points on a line, plane, or space.  Although the Euclidean plane contains infinitely many points, two coordinates suffice to name any point.  Coordinates that are numerically close designate points that are geometrically close: this is one sense in which the coordinates are "natural."  The set of all distributions forms an infinite-dimensional space: it takes infinitely many coordinates to name any arbitrary distribution in this space.  A parameterized set of distributions forms something like a one-dimensional line, a two-dimensional plane, or at least some subspace of finite dimension, whose points need only a few numbers for their coordinates.)

A "parametric" statistical method is one that assumes the possible underlying distributions are limited to a set that can be described by a finite number of coordinates: the parameters.  This definition is a little more general than the text's assertion that "you know what kind of distribution describes the population."  For instance, a problem that assumes the underlying distribution is either Normal or Lognormal is still parametric, even though you do not know what kind of distribution describes the population.  Indeed, a problem that assumes the underlying distribution comes from any of the distribution families described in the text is still parametric (albeit with a large number of parameters)!

Non-parametric methods, in contrast, do not make assumptions that can be described with a few parameters.  These methods usually make assumptions, though.  For example, a method that assumes the median of the distribution is zero is non-parametric, because this assumption still admits too many distributions to parameterize.

In practice, parametric methods tend to be based on techniques for estimating the parameters.  Non-parametric methods have no parameters to estimate, so their computations tend to be based on the relative sizes or ranks of the data.

5.229 The point of Figure 5.8 is that quite a few of the 100 simulations resulted in confidence intervals that did not cover the true value.  These would be represented by gray vertical bars (the intervals) that do not overlap the horizontal black line.  See how many you can find.  (It is a pity this illustration does not visually highlight the non-covering intervals.)  For an Excel spreadsheet that reproduces this simulation (and highlights the non-covering intervals), go to the Confidence interval simulation page.  You can directly control the size (alpha), mean, and standard deviation.  By copying formulas, you can modify the sample size (n) and the number of simulations.
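If you prefer code to a spreadsheet, a simulation of the same kind can be sketched in a few lines.  This is not the spreadsheet's method: to stay within the standard library it treats the standard deviation as known and uses a Normal (z) interval, and the function name, seed, and parameter values are assumptions made for the illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def simulate_intervals(alpha, mean, sd, n, sims, rng):
    """Return one (lower, upper, covers) tuple per simulated sample."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sd / math.sqrt(n)  # sd treated as known (simplification)
    results = []
    for _ in range(sims):
        xbar = fmean([rng.gauss(mean, sd) for _ in range(n)])
        lower, upper = xbar - half_width, xbar + half_width
        results.append((lower, upper, lower <= mean <= upper))
    return results

rng = random.Random(2024)  # fixed seed so the run is repeatable
intervals = simulate_intervals(alpha=0.05, mean=10.0, sd=2.0, n=20, sims=100, rng=rng)
misses = [(lo, hi) for lo, hi, covers in intervals if not covers]
print(len(misses), "of", len(intervals), "intervals miss the true mean")
```

With alpha = 0.05 you should expect roughly 5 misses in 100, but any single run of 100 can easily show a few more or a few less.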
5.230 There is a rational basis for establishing confidence levels: you need to specify a "Loss function."  Such a function balances the cost of failing to cover the correct parameter against the cost of creating intervals that are too wide on the average.
5.231 You can, and should, draw pictures (using a Normal PDF and a Normal CDF) of all the equations on this page.  They are all just different ways of writing down the value of two regions under the PDF.
5.232 The t-distribution with 1 degree of freedom is also illustrated on the Properties of distributions page.
5.235 The numerator (n-1)s^2 is equal to the sum of squares of residuals (with respect to the mean).
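A quick numerical check of this identity, assuming s^2 is the usual sample variance with divisor n-1 (the data values are made up):

```python
import statistics

data = [3.2, 4.7, 5.1, 2.9, 4.4]  # made-up values, for illustration only
n = len(data)
mean = statistics.fmean(data)

sum_sq_residuals = sum((x - mean) ** 2 for x in data)
numerator = (n - 1) * statistics.variance(data)  # (n-1) times s squared

print(sum_sq_residuals, numerator)  # equal up to floating-point rounding
```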
5.239 "Asymptotically" optimal means that this procedure may never actually be optimal!  It only means that it can get as close as you want to being optimal provided you have the luxury of obtaining (i.e., paying for) a really, really large data set.  In many cases "really, really large" may be as small as five, but in other cases it can be astronomically expensive.  Asymptotic results provide useful theoretical guidance, but the word "asymptotically" should always put you on the alert for possible discrepancies between theory and your data.
5.240 "Cox's" method, according to "Improved methods for calculating concentrations used in exposure assessments," computes an upper confidence limit of the mean in the form exp(m + s^2/2 + Z*sqrt(s^2/n + s^4/(2n+2))).  The variables in this expression are n (the number of sample values), m (the mean of the logarithms of the data), s^2 (the variance of the logarithms), and Z (a percentage point of the standard normal distribution).  This reference states that the method may be used "as long as m + s^2/2 is normally distributed."  This will occur when n is very large (depending on the CV of the underlying distribution) and could be tested for moderate n (say, 20 or greater) by bootstrapping from the empirical distribution.
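The expression for the upper confidence limit translates directly into code.  The sketch below follows the formula as quoted in this note; the function name, the choice of the n-1 divisor for the variance of the logarithms, the one-sided 95% default, and the concentration values are assumptions made for the illustration.

```python
import math
from statistics import NormalDist, fmean, variance

def cox_ucl(data, confidence=0.95):
    """Upper limit exp(m + v/2 + Z*sqrt(v/n + v*v/(2n+2))), v = variance of logs."""
    logs = [math.log(x) for x in data]  # data must be strictly positive
    n = len(logs)
    m = fmean(logs)      # mean of the logarithms
    v = variance(logs)   # n-1 divisor: an assumption about the reference
    z = NormalDist().inv_cdf(confidence)  # one-sided percentage point
    return math.exp(m + v / 2.0 + z * math.sqrt(v / n + v ** 2 / (2 * n + 2)))

concentrations = [1.2, 0.8, 3.5, 2.1, 0.6, 5.9, 1.7, 2.4]  # made-up values
print(cox_ucl(concentrations))
```

Raising the confidence level increases Z and therefore raises the limit.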
5.240 The tables of H-statistics in Gilbert's book are available on Penn State electronic reserve.  Log in at http://www.gv.psu.edu/library/.
5.247 Equations 5.93 and 5.94 ought to have inequalities in them, as in

(5.93)  Prob(X >= x | p >= LCL) >= alpha/2
(5.94)  Prob(X <= x | p <= UCL) >= alpha/2.

Note that (5.94) is expressed in terms of the CDF and that (5.93) can be rewritten in terms of the CDF as

(5.93)  Prob(X <= x-1 | p >= LCL) <= 1 - alpha/2.
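The limits themselves are the values of p at which these inequalities become equalities, and they can be found numerically without any F or beta tables.  The following is a minimal sketch using bisection on the exact binomial tail probabilities; the function names, the tolerance, and the example values x=3, n=20 are invented for the illustration.

```python
import math

def upper_tail(x, n, p):
    """Prob(X >= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(x, n + 1))

def lower_tail(x, n, p):
    """Prob(X <= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(0, x + 1))

def solve(func, target, increasing, tol=1e-12):
    """Bisection on [0, 1] for func(p) = target, with func monotone in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if (func(mid) < target) == increasing:
            lo = mid  # the root lies above mid
        else:
            hi = mid  # the root lies below mid
    return (lo + hi) / 2.0

def exact_limits(x, n, alpha=0.05):
    """Limits at which the two tail probabilities equal alpha/2."""
    lcl = 0.0 if x == 0 else solve(lambda p: upper_tail(x, n, p), alpha / 2, True)
    ucl = 1.0 if x == n else solve(lambda p: lower_tail(x, n, p), alpha / 2, False)
    return lcl, ucl

lcl, ucl = exact_limits(x=3, n=20)
print(lcl, ucl)
```

The bisection relies on Prob(X >= x) increasing with p and Prob(X <= x) decreasing with p, which is also why the statements for all p >= LCL and all p <= UCL come out as inequalities.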

5.248 Other formulas like (5.95) and (5.96) for the binomial confidence limits in terms of the F-distribution exist.  Relationships among different F distributions allow for this; all the formulas (if correctly typeset) will give the same results.

There are also formulas in terms of the beta distribution.  See Hahn & Meeker for details.

5.249 The "continuity correction" in (5.103) has a simple interpretation.  If you draw a regular histogram, then every bar has width 1/n.  The bars are centered at 0=0/n, 1/n, 2/n, ..., (n-1)/n, and n/n = 1.  The estimate p-hat is of the form x/n.  The confidence limits (5.101) are a certain distance to the left and right of this estimate.  The continuity correction moves the same distances, but marks them off from the left and right edges (respectively) of the bar centered at x/n.  This lowers the lower limit by half the width of a bar--1/(2n)--and raises the upper limit by the same amount.  This tends to give better answers, especially when n is small (the histogram is very "choppy"--not continuous--and is not as well approximated by the Normal curve).
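The effect of the correction is easy to see numerically.  The sketch below assumes that (5.101) is the usual Normal-approximation interval, p-hat plus or minus Z*sqrt(p-hat*(1 - p-hat)/n); the text's formula is not reproduced in these notes, so treat that form, the function name, and the example values as assumptions.

```python
import math
from statistics import NormalDist

def normal_approx_limits(x, n, alpha=0.05, continuity=False):
    """Normal-approximation limits for a binomial proportion x/n."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    distance = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    # With the correction, mark the distance off from the edges of the
    # histogram bar centered at x/n, which extends 1/(2n) to each side.
    edge = 1.0 / (2.0 * n) if continuity else 0.0
    return p_hat - edge - distance, p_hat + edge + distance

plain = normal_approx_limits(x=3, n=20)
corrected = normal_approx_limits(x=3, n=20, continuity=True)
print(plain)
print(corrected)  # each limit moves outward by 1/(2n) = 0.025
```

Note that this sketch does not truncate the limits to the interval [0, 1], so for small x the lower limit can come out negative.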
5.252 Equations (5.107) and (5.108) ought to have inequalities in them, as described above for page 5.247.

This page was created 26 February and last updated 7 May.