Notes on Chapters 1 and 3

Go to Notes on Chapter 4

These notes are intended to amplify the text, point out questions and areas for further thought, and identify and resolve ambiguities and (the rare) mistakes.  You definitely should be aware of the mistakes, so their pages are displayed in bold face.

Chapter.Page Comments
1.2-1.6 Here are some of the problems I have encountered in the past decade or so that can be considered in the domain of environmental statistics:
Policy analysis
RCRA groundwater monitoring
Insurance cost recovery estimation and negotiation
Superfund (CERCLA) investigations (mainly soil and groundwater, but sometimes surface water, too)
Controlling industrial processes
Air monitoring (ambient air and in the workplace)
Predicting future conditions (for monitoring, cleanup, and land cover change, for example)
Surface water (river) monitoring, permitting, and TMDL ("total maximum daily load") estimation
Monitoring land cover and land use, species richness, ecological diversity, and so on
Waste discharge control
Land re-use and development
Determining when cleanup is over
Human and ecological risk assessment
Determining whether "waste" is "hazardous"
Managing natural disasters (floods, for example)
PCB detection, assessment, and cleanup (as regulated by TSCA)
Workplace monitoring and safety assurance (OSHA)
1.7 The text asks, "what is the probability of picking an ace out of a deck of 52 well-shuffled standard playing cards?" Obviously it is 4/52 because there are four aces in the deck.  A variation of this question is to ask what is the probability of picking an ace at least once in twenty draws of a card from this deck (replacing each card in the deck at random after each draw). This kind of question can model situations such as the following: you (or a client) have monitored waste discharges to a river quarterly for 13 years (52 quarters).  Four times, a result exceeded the permitted limit.  The permit is about to be renewed for five years (20 quarters of monitoring).  If the waste process and the permit limits remain the same, what is the probability there will be no violation during these five years?  (The answer makes many assumptions about the process, but we will not go into that here.)   It is helpful to be able to estimate answers to questions like this quickly.  In this case, the probability of not drawing an ace on any single draw is 48/52 = 12/13 = 1 - 1/13.  Therefore the probability of not drawing an ace in all 20 draws is (1 - 1/13)^20.  If you are familiar with the approximation (1 - k/n)^n ~ exp(-k), you will recognize it is a good approximation in the present instance, and so will compute (1 - 1/13)^20 = ((1 - 1/13)^13)^(20/13) ~ exp(-1)^1.5 = exp(-1.5) = somewhere between 0.2 and 0.25 (using approximations we will learn).  Whence the probability of drawing an ace (or of violating the permit at least once) is around 75 to 80%.  With a little practice you can do this kind of computation in your head in a few seconds.
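The mental shortcut in the note on page 1.7 can be checked numerically.  The following Python sketch is not part of the text, and its variable names are only illustrative.  It compares the exact value of (1 - 1/13)^20 with the exp(-20/13) approximation, under the same assumption of independent draws with a constant chance of 1/13 on each draw.

```python
import math

p_single = 4 / 52   # chance of an ace (or of an exceedance) on one draw
n_draws = 20        # 20 quarters in the five-year permit period

# Exact probability of no ace in 20 independent draws with replacement.
p_none_exact = (1 - p_single) ** n_draws

# Shortcut: (1 - 1/13)^20 = ((1 - 1/13)^13)^(20/13) ~ exp(-20/13).
p_none_approx = math.exp(-n_draws * p_single)

print(f"exact  P(no ace in 20 draws)   = {p_none_exact:.3f}")
print(f"approx P(no ace in 20 draws)   = {p_none_approx:.3f}")
print(f"exact  P(at least one ace)     = {1 - p_none_exact:.3f}")
```

Both the exact and the approximate values fall between 0.2 and 0.25, which is all the head calculation needs.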
1.9 Make a copy of Figure 2.1.  Write down as many additional "sources of variability" as you can think of.  Keep updating your copy throughout the course.
3.53 The admonition to "thoroughly examine the data in as many ways as possible and relevant" is very good advice.  Consider this: one datum typically costs several to several hundred dollars.  (For example, a large soil sampling program will gather data on 10 to 20 relevant chemicals for around 1,000 soil samples: up to about 20,000 data.  It will cost hundreds of thousands of dollars, averaging maybe $25 per number.)  It is unusual to encounter any environmental dataset that does not yield useful information upon close examination.  Yet careful examination of the data--which might take only a few hours to a few days with a good set of computer software and therefore cost a few hundred to a few thousand dollars--is rare indeed.  This state of affairs should be incomprehensible to any observer from the outside looking in: why should we spend over 99% of the budget just gathering the numbers and less than 1% actually looking at them?  Ideally, the proportions should be reversed!
3.53 You can add a lot to any statistical analysis by understanding what you are analyzing.  Think about this: what properties of 1,2,3,4-tetrachlorobenzene might be useful to know?  Make a short list and look them up.  It will help you understand the rest of the chapter a little better.  (You may consult Aldrich Chemical or Fisher Scientific to get started.)
3.57 The "MAD" displayed in these summaries is not the MAD as defined on page 3.65 (formulas 3.14 and 3.15).  The values shown on page 3.57 are the MAD divided by approximately 0.67 (more precisely, 0.6745).  We will learn later why this has been done.
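To see the difference between the two versions of the MAD, here is a small Python sketch.  It assumes the MAD of formulas 3.14 and 3.15 is the median absolute deviation about the median, and it assumes the divisor is the 0.75 quantile of the standard normal distribution (about 0.6745), which is the usual convention but is not confirmed by the text.  The data values are made up for illustration.

```python
from statistics import NormalDist, median

def mad(values):
    """Unscaled MAD: median of absolute deviations about the median."""
    center = median(values)
    return median([abs(v - center) for v in values])

def scaled_mad(values):
    """MAD divided by the 0.75 quantile of the standard normal (assumed divisor)."""
    divisor = NormalDist().inv_cdf(0.75)   # about 0.6745
    return mad(values) / divisor

# Hypothetical concentrations, for illustration only.
data = [2.0, 3.5, 4.1, 5.0, 7.2, 9.8, 15.0]

print("unscaled MAD:", round(mad(data), 4))
print("scaled MAD:  ", round(scaled_mad(data), 4))
```

If the summaries on page 3.57 follow this convention, dividing their "MAD" by about 1.48 should recover the value given by the formulas on page 3.65.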
3.61 To test your understanding of the formulas, demonstrate mathematically that formula 3.2 for the trimmed mean gives formula 3.3 for the median when the trimming proportion (alpha) is 0.5.  If this causes you trouble, or you are not sure what to do, then ask in class (or privately by e-mail).
3.61 The remark "half of the observations lie below the median and half of them lie above the median" is not strictly true.  What are the exceptions?
3.62 Evidently, "log()" in this text means natural logarithm, not logarithm to base 10.  See page 175.
3.66 The unbiased estimator of skewness (formula 3.18) is the one used by most software, including Systat and Excel.
3.67 The formula is correct, but the text preceding the formula should say the 4th power is divided by the square of the variance (or the fourth power of the standard deviation).
3.67 There are also unbiased estimators of kurtosis.  See the Excel help for "Kurt" for one formula.
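For reference, the sample skewness and kurtosis formulas documented in the Excel help for "Skew" and "Kurt" can be sketched in Python as follows.  This is an illustration, not the text's code: the function names are made up, and you should compare the skewness function with formula 3.18 yourself before relying on it.  Note that Excel's Kurt returns an "excess" kurtosis, which is near zero (not three) for normal data.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Skewness as documented for Excel's SKEW (uses the n-1 standard deviation)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    cubes = sum(((v - m) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubes

def sample_excess_kurtosis(values):
    """Excess kurtosis as documented for Excel's KURT (needs at least 4 values)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourths = sum(((v - m) / s) ** 4 for v in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourths - correction

# Hypothetical, perfectly symmetric batch.
batch = [1, 2, 3, 4, 5]
print("skewness:       ", sample_skewness(batch))
print("excess kurtosis:", sample_excess_kurtosis(batch))
```

Working the same batch by hand or in a spreadsheet is a good check on your understanding of both formulas.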
3.69 Tufte provides a graphical redesign of the dot plot.  It is a shame Tufte's ideas, even after 17 years, have not been accepted by software designers: you will not see many of his improvements even in the S-Plus software.
3.81 Note that the vertical positions of the dots in the strip plots of figure 3.9 are randomly "jittered."  This minimizes overlaps of dots from tied or nearly-tied values.
3.84 [Hoaglin et al] provides some guidance on how to choose histogram interval widths.
3.85 A density plot is a kind of "one dimensional contour plot."  For details on how the interpolation might be done (in one and two dimensions), see the Map Algebra - Resampling pages, for instance.
3.88 The assertion that "you would expect to see at least one 'outside value' only about 0.7% of the time if you were always looking at data from a normal distribution" is misleading.  The 0.7% applies to each datum.  For example, if you are looking at a batch of 20 values drawn independently from a common normal distribution, then a simple calculation suggests the chance of seeing at least one outside value is around 20 * 0.7% = 14%.  The simple calculation is incorrect, however, as "Student" (W. Gosset) pointed out in his 1908 paper.  The number 0.7% can be as high as about 22% (for a batch of five) according to our calculations and as confirmed in [Hoaglin et al].
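One way to explore this is by simulation.  The Python sketch below draws many batches from a standard normal distribution and counts how often a batch contains at least one outside value.  It assumes the common boxplot rule (fences at 1.5 hinge spreads beyond the hinges, with hinges computed as medians of the lower and upper halves); the text's software may define quartiles slightly differently, so treat the result as indicative only.  The function names, the seed, and the number of trials are arbitrary choices.

```python
import random
from statistics import median

def has_outside_value(batch):
    """True if any value lies beyond the fences (1.5 hinge spreads past the hinges)."""
    data = sorted(batch)
    n = len(data)
    half = (n + 1) // 2                    # each half includes the median when n is odd
    lower_hinge = median(data[:half])
    upper_hinge = median(data[n - half:])
    spread = upper_hinge - lower_hinge
    low_fence = lower_hinge - 1.5 * spread
    high_fence = upper_hinge + 1.5 * spread
    return data[0] < low_fence or data[-1] > high_fence

def outside_rate(n, trials=20000, seed=1):
    """Fraction of simulated normal batches of size n with at least one outside value."""
    rng = random.Random(seed)
    hits = sum(
        has_outside_value([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    return hits / trials

batch_size = 20
simple = batch_size * 0.007                # the "simple calculation" from the note
simulated = outside_rate(batch_size)
print(f"simple calculation: {simple:.3f}")
print(f"simulated rate:     {simulated:.3f}")
```

Try other batch sizes (five, for instance) to see how strongly the rate depends on the size of the batch.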
3.96 "Bloom" in Table 3.7 should read "Blom".
3.98 You can reproduce the "Normal Quantile" column in Table 3.8 using Excel's NormSInv() function (apply it to the "Plotting position" values).
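The same calculation can be sketched outside a spreadsheet.  In the Python example below, NormalDist().inv_cdf() plays the role of Excel's NormSInv().  The plotting positions are computed with Blom's formula, (i - 3/8)/(n + 1/4), purely as an illustration: check Table 3.8 to see which plotting position formula it actually uses before comparing numbers.

```python
from statistics import NormalDist

def blom_positions(n):
    """Blom plotting positions (i - 3/8) / (n + 1/4) for i = 1, ..., n."""
    return [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]

def normal_quantiles(positions):
    """Standard normal quantiles of the plotting positions (like NormSInv)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(p) for p in positions]

positions = blom_positions(5)
quantiles = normal_quantiles(positions)
for p, q in zip(positions, quantiles):
    print(f"plotting position {p:.4f} -> normal quantile {q:+.4f}")
```

The quantiles are symmetric about zero because the plotting positions are symmetric about 0.5.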
3.102 Note the different scales used on the X and Y axes.  This graphic should be adjusted to display equal intervals on both the vertical and horizontal scales.
3.103 The Tukey m-d plot is a Q-Q plot with a change of coordinates.  On a Q-Q plot, let x be the horizontal coordinate and y the vertical coordinate.  Compute new coordinates X = (x+y)/2 and Y = y-x.  The m-d plot displays Y versus X.  Geometrically, X is proportional to the distance along the diagonal line (y=x) and Y is proportional to the distance (in the Q-Q plot) to the diagonal line in the perpendicular direction.  This latter property distinguishes the m-d plot from a plot of residuals (relative to the diagonal line) because the residuals are vertical distances to the diagonal line.  Equivalently, by changing scale slightly (multiply X by sqrt(2) and divide Y by sqrt(2)), the m-d plot is obtained by rotating the Q-Q plot 45 degrees clockwise.  For example, rotate Figure 3.20 clockwise 45 degrees and compare it to Figure 3.23.
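The change of coordinates is easy to verify numerically.  The Python sketch below (illustrative function names and a made-up Q-Q point) computes the m-d coordinates and confirms that, after rescaling by sqrt(2), they agree with a 45 degree clockwise rotation of the Q-Q plot.

```python
import math

def md_coordinates(x, y):
    """Tukey mean-difference coordinates (X, Y) for a Q-Q point (x, y)."""
    return (x + y) / 2, y - x

def rotate_clockwise_45(x, y):
    """Rotate the point (x, y) clockwise by 45 degrees about the origin."""
    c = math.cos(math.pi / 4)              # equals sin(pi/4) = 1/sqrt(2)
    return c * (x + y), c * (y - x)

# Hypothetical Q-Q point, for illustration only.
x, y = 1.0, 3.0
X, Y = md_coordinates(x, y)
rx, ry = rotate_clockwise_45(x, y)

print("m-d coordinates:      ", (X, Y))
print("rescaled m-d point:   ", (X * math.sqrt(2), Y / math.sqrt(2)))
print("rotated Q-Q point:    ", (rx, ry))
```

Points on the diagonal line y = x have Y = 0, so they fall on the horizontal axis of the m-d plot.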
3.134 A partial set of answers to exercise 3.1 is available.  You should be able to do all but the density plot by hand or with the aid of a spreadsheet; indeed, you should do several exercises with manual and spreadsheet calculations until you are comfortable with the formulas and techniques.  Then learning to use any statistical software will be easy.
3.136 It is difficult to believe that the arsenic concentrations shown in problem 3.8 are really "ppm" (milligrams per liter), because they are so spectacularly high.  Most likely these values either (a) are in ppb (micrograms per liter) or (b) have been made up by somebody.  The "ppm" is present in the original EPA guidance document.  The data appear on page 21 in my copy, not on page 6.

This page was created 26 February.