Text problem 4.6a: "Plot the cumulative distribution function of an N(20, 6) distribution..."
[Figure: cumulative distribution function of the N(20, 6) distribution.]
The accompanying Excel spreadsheet uses the NormSDist() function to compute the cdf.
![]()
Text problem 4.9: "Compute the value of the cdf at 23 for a N(20, 6) random variable, but use only a function that assumes a standard normal random variable."
Answer: Let F be the cdf for the standard normal distribution and let G be the cdf for the N(20, 6) distribution. Then G(23) = F((23-20)/6) = F(0.5) = 0.6915, approximately (use the spreadsheet or a table). Remember that all we're doing is converting the value 23 into a standard deviation scale relative to the mean. The formula (23-20)/6 expresses the arithmetic needed to compute that 23 is 0.5 standard deviations greater than 20.
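The standardization step can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for the spreadsheet's NormSDist(); the variable names are illustrative only.

```python
from statistics import NormalDist

mean, sd = 20.0, 6.0   # N(20, 6): the second parameter is the standard deviation
x = 23.0

# Convert 23 into standard-deviation units relative to the mean.
z = (x - mean) / sd

# F(z): cdf of the standard normal distribution, evaluated at z.
g_via_standard = NormalDist().cdf(z)

# G(23): cdf of N(20, 6) evaluated directly, for comparison only.
g_direct = NormalDist(mean, sd).cdf(x)

print(f"z = {z:.2f}, F(z) = {g_via_standard:.4f}, G(23) = {g_direct:.4f}")
```

Both routes give the same value, which is the point of the exercise: only the standard normal cdf is needed.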
![]()
Text problem 4.13: "Suppose the concentration of arsenic (in ppb) in groundwater sampled quarterly from a particular well can be modeled with a normal distribution with a mean of 5 and a standard deviation of 1."
a. "What is the distribution of the average of four observations taken in one year?"
Answer: It is impossible to know without making additional assumptions. The assumption you are expected to make is that the samples are independent. This is needed to model the results as if they were draws of tickets from a box (but it is not necessarily true of the groundwater monitoring results!). In this case, the average will be one-quarter of the sum. The sum of four independent normal variables is normal. Its mean will be the sum of the component means, which is 5+5+5+5 = 20. Its variance will be the sum of the component variances. Because the sd is 1, each variance is 1^2 = 1. Therefore the sum of variances is 1+1+1+1 = 4, corresponding to a standard deviation of sqrt(4) = 2.
The average, therefore, will have a mean of 20/4 = 5 and a standard deviation of 2/4 = 0.5. The distribution is N(5, 0.5).
b. "What is the probability that a single observation will be greater than 7 ppb?"
Answer: The probability that a single observation will be less than 7 ppb is given by the cdf of N(5, 1) at the value 7. This is the same as the cdf of N(0, 1) at the value (7-5)/1 = 2. Using a spreadsheet or table (or even our memory!) we find this to be 97.72%. Therefore the answer is 100% - 97.72% = 2.28%.
c. "What is the probability that the average of four observations will be greater than 7 ppb?"
Answer: This time the relevant distribution (from part a) is N(5, 0.5). The value 7 is now four sds greater than the mean: (7-5)/0.5 = 4. The cdf of N(0, 1) at 4 is 99.9968%, so the answer is 100% - 99.9968% = 0.0032%. The average has a much smaller chance of exceeding a fixed high value than a single observation does.
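Parts a through c can be reproduced in a few lines. This is a sketch under the same independence assumption made in part a, using the standard-library `statistics.NormalDist`; the variable names are not from the text.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 5.0, 1.0, 4   # quarterly arsenic model: N(5, 1), four samples per year

# Sum of n independent normals: means add, variances add.
sum_mean = n * mu
sum_sd = sqrt(n * sigma**2)

# The average is the sum divided by n, so mean and sd are both divided by n.
avg_mean = sum_mean / n
avg_sd = sum_sd / n

# Exceedance probabilities at 7 ppb for one observation and for the average.
p_single = 1 - NormalDist(mu, sigma).cdf(7.0)
p_average = 1 - NormalDist(avg_mean, avg_sd).cdf(7.0)

print(f"average ~ N({avg_mean}, {avg_sd})")
print(f"P(single > 7)  = {p_single:.4%}")
print(f"P(average > 7) = {p_average:.4%}")
```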
![]()
Text problem 4.14:
a. "Suppose the coin is not fair and that the probability of a head is only 10%. Plot the distribution of a B(10, 0.1) random variable [which models the number of heads observed in ten independent flips of the coin]."
The following figures were produced with an Excel spreadsheet. It uses the BinomDist() function, which computes values of both the pdf and cdf.
[Figure: plot of the B(10, 0.1) distribution.]
The ticks on the chart denote actual probability, not probability per unit value: this is a discrete distribution.
b. "Now plot the distribution of a B(20, 0.1) random variable."
[Figure: plot of the B(20, 0.1) distribution.]
The reason for using proportion of successes, rather than number of successes (heads), is to put these plots on a common x-scale. Otherwise, the x-scale would have a maximum ranging from 10 to 500 as we move through the figures.
c. "Now plot the distribution of a B(500, 0.1) random variable."
[Figure: plot of the B(500, 0.1) distribution.]
This shape should remind you strongly of the normal distribution, which it very closely approximates.
d. "Why can you use the Central Limit Theorem to explain the change in the shape of the distribution?"
Answer: As explained in the preface to the problem, Binom(N, p) is the sum of N independent Binom(1, p) variables. The CLT implies the cdf of the sum will closely approximate the shape of a Normal distribution provided N is sufficiently large. It appears that N=20 is not sufficiently large, but that N=500 is.
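One way to put a number on "sufficiently large" is to measure how far the binomial cdf sits from the cdf of the normal distribution with the same mean and variance. The sketch below does this for N = 10, 20, and 500. The gap measure (largest absolute cdf difference, with the normal cdf evaluated half a unit above each integer) and the helper names are choices made here for illustration, not something taken from the text.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binom(n, p), computed on the log scale for stability."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def max_cdf_gap(n, p):
    """Largest |binomial cdf - matching normal cdf| over k = 0..n."""
    normal = NormalDist(n * p, sqrt(n * p * (1 - p)))
    running, worst = 0.0, 0.0
    for k in range(n + 1):
        running += binom_pmf(k, n, p)          # binomial cdf at k
        worst = max(worst, abs(running - normal.cdf(k + 0.5)))
    return worst

gaps = {n: max_cdf_gap(n, 0.1) for n in (10, 20, 500)}
for n, gap in gaps.items():
    print(f"N = {n:3d}: largest cdf gap = {gap:.4f}")
```

The gap shrinks as N grows, which is the behavior the Central Limit Theorem leads us to expect.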
![]()
Text problem 4.15: "Suppose that the random variable Y has a B(n, p) distribution."
a. "Find the probability that Y is greater than 2 when n = 10 and p = 0.1."
Answer: You may use Excel's BinomDist() function (see below). This one is easy to compute, however. Let's first compute the probability that Y is less than or equal to 2. This will be the sum of three probabilities: that Y equals 0, that Y equals 1, and that Y equals 2. The probability that Y equals K is, by definition, Comb(N,K)*0.1^K*0.9^(N-K), with N = 10. We will use logarithms to estimate 0.1^K*0.9^(N-K). Compute ln(0.9) = -0.1 - 0.01/2 - 0.001/3 = -0.1053, approximately. The table lays out the rest of the computation. Alternatively, you can read these probabilities to two decimal places from the answer to 4.14(a).
| K | N-K | 0.1^K | (N-K)*ln(0.9) | 0.9^(N-K) = exp((N-K)*ln(0.9)) | Comb(N,K) | Probability |
| 0 | 10 | 1 | -1.053 | 0.35 | 1 | 0.35 |
| 1 | 9 | 0.1 | -0.948 | 0.39 | 10 | 0.39 |
| 2 | 8 | 0.01 | -0.842 | 0.43 | 10*9/2 = 45 | 0.19 |
The sum of the probabilities is 0.35 + 0.39 + 0.19 = 0.93. Therefore the probability that Y exceeds 2 is approximately 0.07, or 7 percent. As a check, the Excel expression =1-BINOMDIST(2,10, 0.1,TRUE) gives 7.02% as the answer.
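The table can also be checked directly from the definition of the binomial probabilities. A minimal sketch using `math.comb` from the standard library (variable names are illustrative):

```python
from math import comb

n, p = 10, 0.1

# P(Y = k) = Comb(n, k) * p^k * (1 - p)^(n - k) for k = 0, 1, 2.
probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(3)]

p_le_2 = sum(probs)      # P(Y <= 2)
p_gt_2 = 1 - p_le_2      # P(Y > 2)

for k, prob in enumerate(probs):
    print(f"P(Y = {k}) = {prob:.4f}")
print(f"P(Y > 2) = {p_gt_2:.4f}")
```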
c. "Write Y as a function of n independent B(1, p) random variables."
Answer: Y is the sum of n independent B(1, p) random variables.
d. "What are the mean and variance of each of the B(1, p) random variables?"
Answer: The mean is p and the variance is p*(1-p). (You need to memorize this.)
e. "...determine the mean and variance of Y."
Answer: Means and variances add. Therefore, the mean of Y is n*p and the variance of Y is n*p*(1-p). (This agrees with equation 4.40 in the text.)
g. "The Central Limit Theorem says that the distribution of Y can be modeled as approximately a normal distribution. What are the mean and variance of this distribution?"
Answer: The approximating normal distribution will have the same mean and variance as Y; namely, n*p and n*p*(1-p), respectively.
For an example of how this might be applied, consider problem (a) above. In this situation n=10 and p=0.1, so Y has mean 10*0.1 = 1 and variance 10*0.1*0.9 = 0.9. Its standard deviation is therefore sqrt(0.9) = 0.95, approximately. Because Y takes only whole-number values, the probability that Y is greater than 2 is the probability that Y is greater than any chosen value between 2 and 3. Let's use the midpoint, 2.5 (this is a so-called "continuity correction"), so we are asking: if we assume Y is N(1, 0.95), what is the probability that Y exceeds 2.5? Proceeding as in problem 4.9 (above), the Z-value of 2.5 is (2.5 - 1)/0.95 = 1.58, the standard normal cdf of 1.58 is 94.3%, so the answer using the normal approximation is 100% - 94.3% = 5.7%. That's a little low compared with the exact 7.02%, but it's the right size. As you can see from 4.14(a), Binom(10, 0.1) is highly skewed, so it's a little surprising the normal approximation works even this well. Note how easy all the computations are, especially compared to the computations in the table for problem (a) above. This means you can use your understanding of the normal distribution to reason about binomial probabilities, provided you have memorized the mean and variance formulas for B(1, p). This is very useful.
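The normal approximation with a continuity correction can be set beside the exact binomial tail. This sketch takes the approximating normal to have mean n*p and standard deviation sqrt(n*p*(1-p)), as stated in part g, and uses the standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 10, 0.1
mean = n * p                     # n*p
sd = sqrt(n * p * (1 - p))       # sqrt(n*p*(1-p))

# Continuity correction: "Y > 2" for an integer-valued Y becomes "Y > 2.5".
cutoff = 2.5
p_normal = 1 - NormalDist(mean, sd).cdf(cutoff)

# Exact binomial tail for comparison: 1 - P(Y <= 2).
p_exact = 1 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(3))

print(f"normal approximation: {p_normal:.3%}")
print(f"exact binomial:       {p_exact:.3%}")
```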
![]()
Text problem 4.20: "Suppose a regulation is promulgated specifying that an air quality standard cannot be exceeded more than once per year on average. Further, suppose that for a particular emission from a factory, the probability that the emission concentration will exceed the standard on any given day is 1/365."
a. "Consider a sample of one year of emissions data. Assume emission concentrations are independent from day to day. Using the binomial distribution to model these data, what is the expected number of exceedances?"
Answer: The question is asking for the mean of a Binom(365, 1/365) variable. The mean is 365 * 1/365 = 1.
b. "Given your answer in part a, does the distribution of emissions satisfy the regulation?"
Answer: Regulators require monitoring not only to characterize conditions, but also to detect changes. Therefore no regulation would base compliance on the parameters of an assumed probability distribution. Compliance would depend on the actual results. We are not told the actual results, so we cannot say whether the one-year sample satisfies the regulations. However, what the question appears to be driving at is that a very long-term sequence of data from a Binom(365, 1/365) variable will have an expected number of exceedances of one, suggesting that the facility generating these emissions complies with the intent of the regulation.
c. "What is the probability that the standard will be exceeded at least twice in any given year?"
Answer: It would be nice to use the normal approximation to Binom(365, 1/365), but--because this is extremely highly skewed--it will be a poor approximation. Instead, we must proceed as in problem 4.15(a). The computation is simpler now, because we need only two probabilities: the probability of zero exceedances and the probability of one exceedance. Here's the table, where now N=365 and p=1/365. The final column provides the Poisson approximation to the probabilities (exercise 4.21).
| K | N-K | (1/365)^K | (N-K)*ln(1-1/365) | (1 - 1/365)^(N-K) | Comb(N,K) | Probability | Poisson approximation |
| 0 | 365 | 1 | -1.00 | 0.37 | 1 | 0.37 | 1/e = 0.3679 |
| 1 | 364 | 1/365 | -1.00 | 0.37 | 365 | 0.37 | 1/e = 0.3679 |
| Total | | | | | | 0.74 | 0.7358 |
The remaining probability, 1 - 0.74 = 0.26 = 26%, is the probability that any year's sequence of results will contain two or more exceedances. The moral of this exercise is that the condition of the environment is only imperfectly reflected in our observations of it. Therefore, if we are going to control or regulate our activities, we must account for natural variations, lest we over- or under-regulate.
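The two columns of the table can be reproduced as follows. This is a sketch that evaluates the binomial probabilities directly and the Poisson probabilities with rate N*p = 1, using only the standard library; the variable names are illustrative.

```python
from math import comb, exp, factorial

n, p = 365, 1 / 365
lam = n * p                      # Poisson rate: expected exceedances per year

# Probabilities of exactly 0 and exactly 1 exceedance under each model.
binom = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in (0, 1)]
poisson = [exp(-lam) * lam**k / factorial(k) for k in (0, 1)]

# Probability of two or more exceedances in a year.
p_two_or_more = 1 - sum(binom)
poisson_two_or_more = 1 - sum(poisson)

print(f"binomial: P(0) = {binom[0]:.4f}, P(1) = {binom[1]:.4f}, "
      f"P(>= 2) = {p_two_or_more:.4f}")
print(f"Poisson:  P(0) = {poisson[0]:.4f}, P(1) = {poisson[1]:.4f}, "
      f"P(>= 2) = {poisson_two_or_more:.4f}")
```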
![]()
Homework exercise 2: "Compute by hand, without reference to a table, calculator, or computer, the natural logarithms of the integers 1, 2, 3, ..., 20 to two decimal places. Then check your answers. (Excel will compute them for you using its LN() function.) If you got any incorrect--even by one in the last digit--identify the reason why and redo the computation until you can get them all correct."
Answer: The table shows some strategies for the computation. It is helpful to compute the logs of key numbers (2, 3, 5) with a little more accuracy than needed, because they will appear many times as intermediate results. The values are shown in the order calculated.
| X | Strategy for computing Ln(X) |
| 1 | 0. |
| 2 | 0.69 (memorized). |
| 3 | 3 = 2 * (1 + 0.2) * (1 + 0.25), so ln(3) = 0.693 + 0.2 - 0.2^2/2 + 0.25 - 0.25^2/2 + 0.25^3/3 = 1.097 = 1.10. |
| 4 | 4 = 2^2, so ln(4) = 2*ln(2) = 1.386 = 1.39. |
| 5 | 5 = 10/2, so ln(5) = ln(10) - ln(2) = 2.302 - 0.693 (those were memorized) = 1.609 = 1.61. |
| 6 | 6 = 2*3, so ln(6) = ln(2) + ln(3) = 1.79. As a check, 6 = 5 * (1 + 0.2), so ln(6) = ln(5) + 0.2 - 0.2^2/2 = 1.79. |
| 8 | 8 = 2^3, so ln(8) = 3*ln(2) = 3*0.693 = 2.08. |
| 7 | 7 = 8*(1 - 1/8), so ln(7) = 2.08 - 1/8 - (1/8)^2/2 = 1.95. Also 7 = 21/3 = 20*(1 + 0.05)/3, so ln(7) = 3.00 + 0.05 - 1.10 = 1.95. |
| 9 | 9 = 3^2, so ln(9) = 2*ln(3) = 2.20. This is approximate, so we check more carefully: 9 = 10*(1 - 0.1), so ln(9) = 2.302 - 0.1 - 0.1^2/2 = 2.197 = 2.20. |
| 10 | 2.302 (memorized). |
| 11 | 11 = 10*(1 + 0.1), so ln(11) = 2.302 + 0.1 - 0.01/2 = 2.40. |
| 12 | 12 = 2*6, so ln(12) = 0.693 + 1.79 = 2.48. |
| 16 | 16 = 2^4, so ln(16) = 4*ln(2) = 2.772 = 2.77. |
| 15 | 15 = 5*3, so ln(15) = ln(5) + ln(3) = 1.609 + 1.097 = 2.71. Also 15 = 16 * (1 - 1/16), so ln(15) = 4*ln(2) - 1/16 - 1/512 = 2.708 (this is accurate) = 2.71. |
| 14 | 14 = 15 * (1 - 1/15), so ln(14) = ln(15) - 1/15 - 1/450 = 2.64. Also 14 = 2*7 so ln(14) = 0.693 + 1.95 = 2.64. |
| 13 | 13 = 12*(1 + 1/12) and 13 = 14*(1 - 1/14), giving ln(13) = 2.48 + 1/12 - 1/288 and ln(13) = 2.64 - 1/14 - 1/392 giving 2.56 and 2.57. One of these is too high and the other too low, so the correct answer must be around the middle, or 2.565. This means both 2.56 and 2.57 are correct to two decimal places. |
| 17 | 17 = 16*(1 + 1/16), so ln(17) = 2.772 + 1/16 - 1/512 = 2.83. |
| 18 | 18 = 9*2 giving ln(18) = 2.197 + 0.693 = 2.89. |
| 20 | 20 = 2*10 so ln(20) = 0.693 + 2.302 = 2.995 = 3.00. |
| 19 | 19 = 20*(1-1/20) so ln(19) = 2.995 - 1/20 -1/800 = a tiny bit less than 2.945 = 2.94. |
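The strategies in the table all rest on the series ln(1 + x) = x - x^2/2 + x^3/3 - ..., truncated after two or three terms. The sketch below replays three of the rows and compares them with `math.log`. The helper `ln1p_series` and the constant names are introduced here for illustration only.

```python
from math import log

LN2, LN10 = 0.693, 2.302   # the memorized values used in the table

def ln1p_series(x, terms):
    """Partial sum of ln(1 + x) = x - x^2/2 + x^3/3 - ... (valid for |x| < 1)."""
    return sum((-1) ** (k + 1) * x**k / k for k in range(1, terms + 1))

# 3 = 2 * (1 + 0.2) * (1 + 0.25)
ln3 = LN2 + ln1p_series(0.2, 2) + ln1p_series(0.25, 3)
# 7 = 8 * (1 - 1/8), with 8 = 2^3
ln7 = 3 * LN2 + ln1p_series(-1 / 8, 2)
# 19 = 20 * (1 - 1/20), with 20 = 2 * 10
ln19 = LN2 + LN10 + ln1p_series(-1 / 20, 2)

for x, estimate in ((3, ln3), (7, ln7), (19, ln19)):
    print(f"ln({x}): by hand {estimate:.3f}, math.log {log(x):.3f}")
```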
As a double check, we can estimate the changes in logarithm and compare these estimates with the expected rate of change (1/x). The agreement should be good as X gets large.
| X | Ln(X) | (Ln(X+1) - Ln(X-1))/2 | 1/X | Residual |
| 1 | 0 | | 1.000 | |
| 2 | 0.69 | 0.55 | 0.500 | +0.050 |
| 3 | 1.10 | 0.30 | 0.333 | -0.033 |
| 4 | 1.39 | 0.255 | 0.250 | +0.005 |
| 5 | 1.61 | 0.20 | 0.20 | 0 |
| 6 | 1.79 | 0.17 | 0.167 | +0.003 |
| 7 | 1.95 | 0.145 | 0.143 | +0.002 |
| 8 | 2.08 | 0.125 | 0.125 | 0 |
| 9 | 2.20 | 0.11 | 0.111 | -0.001 |
| 10 | 2.30 | 0.10 | 0.100 | 0 |
| 11 | 2.40 | 0.09 | 0.091 | -0.001 |
| 12 | 2.48 | 0.083 | 0.083 | 0 |
| 13 | 2.56(5) | 0.08 | 0.077 | +0.003 |
| 14 | 2.64 | 0.072 | 0.071 | +0.001 |
| 15 | 2.71 | 0.065 | 0.067 | -0.002 |
| 16 | 2.77 | 0.06 | 0.061 | -0.001 |
| 17 | 2.83 | 0.06 | 0.059 | +0.001 |
| 18 | 2.89 | 0.055 | 0.056 | -0.001 |
| 19 | 2.94 | 0.055 | 0.053 | +0.002 |
| 20 | 3.00 | | 0.050 | |
18 residuals, in units of 0.001:
Far out: -0.033 (for ln(3))
2 -2 |0
H 6 -1 |0000
M(4) 0 |0000
8 1 |00
H 6 2 |00
4 3 |00
4 |
3 5 |0 (for ln(4))
Far out: +0.050 (for ln(2))
The residuals are small and reasonably well balanced between positive (8 of them) and negative (6 of them). The H-spread is a tiny 0.003 for a step of 0.0045--almost exactly the maximum rounding error of 0.005. Any small mistake in the calculations would noticeably alter this pattern. For example, if we had computed ln(9) = 2.21, then the residuals for 8 and 10 would have been +0.005 and -0.005 instead of 0 and 0. The value of -0.005 would almost have been an outlier and the value of 0.005 would have been a little large. These would have directed our attention immediately to the small error in ln(9).
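The residual check is easy to automate. The sketch below recomputes the centered differences from the two-decimal logarithms and then repeats the calculation with the deliberate error ln(9) = 2.21 described above. Because the table rounds its intermediate columns, a residual computed this way may differ from the tabulated one in the last digit. The function name `residuals` is illustrative.

```python
# Two-decimal logarithms from the table (ln(13) entered as the midpoint 2.565).
logs = {
    1: 0.00, 2: 0.69, 3: 1.10, 4: 1.39, 5: 1.61, 6: 1.79, 7: 1.95,
    8: 2.08, 9: 2.20, 10: 2.30, 11: 2.40, 12: 2.48, 13: 2.565, 14: 2.64,
    15: 2.71, 16: 2.77, 17: 2.83, 18: 2.89, 19: 2.94, 20: 3.00,
}

def residuals(table):
    """Centered difference (ln(x+1) - ln(x-1))/2 minus the expected slope 1/x."""
    return {x: (table[x + 1] - table[x - 1]) / 2 - 1 / x for x in range(2, 20)}

base = residuals(logs)

# Introduce the small error discussed in the text and recompute.
wrong = dict(logs)
wrong[9] = 2.21
perturbed = residuals(wrong)

for x in (8, 10):
    print(f"X = {x:2d}: residual {base[x]:+.3f} -> {perturbed[x]:+.3f} with ln(9) = 2.21")
```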
![]()
Homework exercise 3: "Compute by hand, to two decimal places, the logarithms of 9.5, 9.6, 9.7, ..., 10.5. (1 minute.)"
Answer: These numbers are all of the form 10 * (1 - 0.05), 10 * (1 - 0.04), ..., 10 * (1 + 0.05). The logarithm of (1 + x) is x, to two decimal places, when |x| <= 0.05, and we remember ln(10) = 2.30, so the desired logarithms are 2.25, 2.26, ..., 2.35. The point of this exercise is to reinforce the idea that logarithms merely re-express ratios close to 1 as percentage changes. That is what ln(1+x) ~ x means.
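A quick check of the rule ln(1 + x) ~ x over this range, as a sketch using `math.log` (the variable names are illustrative):

```python
from math import log

LN10 = 2.30   # remembered value of ln(10), to two decimal places

rows = []
for tenths in range(95, 106):            # 9.5, 9.6, ..., 10.5
    value = tenths / 10
    x = (tenths - 100) / 100             # value = 10 * (1 + x)
    by_hand = round(LN10 + x, 2)         # ln(10) + x, using ln(1 + x) ~ x
    exact = round(log(value), 2)
    rows.append((value, by_hand, exact))

for value, by_hand, exact in rows:
    print(f"ln({value:4.1f}): by hand {by_hand:.2f}, math.log {exact:.2f}")
```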
![]()
Homework exercise 4: "* A risk assessor assumes ... that the dose is proportional to concentration * duration. The concentration is lognormally distributed with a mean of 37 ppm and SD of 24 ppm; the duration is lognormally distributed with a mean of 5 years and SD of 2 years. The distributions are assumed to be independent. (Why is this reasonable? Why should it be checked?)"
Answer: Independence means that probabilities computed for one distribution will not depend on the values assumed by the other distribution. Usually, the distribution of concentration is derived from the spatial distribution. Provided there is no tendency for the duration of exposure and the region of exposure to be associated, there should be no relationship at all between duration and mean concentration. If the distributions are not independent, the calculations will be wrong, which is why it is prudent to check independence.
Continuation: "What is the probability that the product of concentration and duration will exceed 37 * 5 = 185 ppm-years? 1000 ppm-years?"
Answer: The natural logarithm of the product is the sum of the logs. The log of concentration and the log duration are each normally distributed. We would like to know their parameters. Let mu = mean of logs and sigma = sd of logs. From the formulas theta = exp(mu + sigma^2/2) and tau = sqrt(exp(sigma^2) - 1) we get, by algebra (as provided in an Excel spreadsheet), the following results.
| Parameter | Concentration | Duration | Product |
| Theta (mean) | 37 | 5 | 185 |
| SD | 24 | 2 | 149 |
| Tau (CV) | 0.649 | 0.400 | 0.804 |
| Mu (mean log) | 3.435 | 1.535 | 4.970 (sum of means) |
| Sigma (sd log) | 0.593 | 0.385 | 0.706 (= sqrt(0.499)) |
| Variance (log) | 0.351 | 0.148 | 0.499 (sum of variances) |
The solution strategy is to compute Mu and Variance(log) for concentration and duration, add them to get Mu and Variance(log) for the product, and then compute Theta, Sigma, and SD for the product, because they will be needed for answering the remaining questions. (One can prove algebraically that Theta for the product is the product of Thetas for concentration and duration, but the value of 185 above was computed using the formulas above as a check of the arithmetic.)
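The algebra behind the table can be sketched in code. Solving theta = exp(mu + sigma^2/2) and tau = sqrt(exp(sigma^2) - 1) for the log-scale parameters gives sigma^2 = ln(1 + tau^2) and mu = ln(theta) - sigma^2/2, with tau = SD/theta. The function `log_params` and the variable names are introduced here for illustration; this is not the spreadsheet itself.

```python
from math import exp, log, sqrt

def log_params(mean, sd):
    """Return (mu, sigma^2) of ln(X) for a lognormal X with the given mean and SD."""
    variance_log = log(1 + (sd / mean) ** 2)   # sigma^2 = ln(1 + tau^2)
    mu = log(mean) - variance_log / 2          # mu = ln(theta) - sigma^2/2
    return mu, variance_log

mu_c, var_c = log_params(37, 24)   # concentration, ppm
mu_d, var_d = log_params(5, 2)     # duration, years

# Logs of independent factors add: means add and variances add.
mu_p, var_p = mu_c + mu_d, var_c + var_d

# Back-transform to the mean, CV, and SD of the product.
theta_p = exp(mu_p + var_p / 2)
tau_p = sqrt(exp(var_p) - 1)
sd_p = theta_p * tau_p

print(f"product: mu = {mu_p:.3f}, variance(log) = {var_p:.3f}, sigma = {sqrt(var_p):.3f}")
print(f"product: theta = {theta_p:.1f}, tau = {tau_p:.3f}, SD = {sd_p:.0f}")
```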
The natural log of the product, being the sum of two independent normally distributed variables, will also be normally distributed. From the table, its mean is 4.970 and its variance is 0.499, so its SD is 0.706. We are ready to answer the questions, which can be rephrased as
a. What is the probability that a N(4.970, 0.706) distribution will exceed ln(185) = 5.220?
Answer: Compute Z = (5.220 - 4.970)/0.706 = 0.354. The standard normal cdf of this value equals 63.8%, so the probability of exceeding it is 36.2%. In other words, about 5/8 of all individuals--more than half--will receive doses less than the product of the average concentration and average duration.
b. What is the probability that a N(4.970, 0.706) distribution will exceed ln(1000) = 6.908?
Answer: Compute Z = (6.908 - 4.970)/0.706 = 2.745. The standard normal cdf of this value equals 99.70%, so the probability of exceeding it is 0.30%. 1000 seems a lot higher than 185, but there is still some chance of observing a dose of 1000 or greater in the modeled population.
Continuation: "For what value is this 'exceedance' probability exactly equal to five percent?"
Answer: We need to find the value Z where the cdf of the standard normal distribution equals 95%. From tables or from Excel's NormSInv() function we get Z = 1.645. The corresponding value for a N(4.970, 0.706) distribution is therefore Sigma*Z + Mu = 0.706 * 1.645 + 4.970 = 6.131. Its "antilog" (to express it in the original units, rather than logarithms) is exp(6.131) = 460 ppm-years. About 5% of the modeled population is expected to receive a dose this great or greater.
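The three answers can be checked together. This sketch starts from the rounded log-scale parameters in the table (4.970 and 0.706) and uses `statistics.NormalDist`, whose `inv_cdf` method plays the role of the spreadsheet's NormSInv(); the variable names are illustrative.

```python
from math import exp, log
from statistics import NormalDist

# Distribution of ln(concentration * duration), with parameters from the table.
log_dose = NormalDist(4.970, 0.706)

# Exceedance probabilities for doses of 185 and 1000 ppm-years.
p_185 = 1 - log_dose.cdf(log(185))
p_1000 = 1 - log_dose.cdf(log(1000))

# Dose exceeded with probability 5%: 95th percentile on the log scale, then antilog.
dose_95 = exp(log_dose.inv_cdf(0.95))

print(f"P(dose > 185)  = {p_185:.1%}")
print(f"P(dose > 1000) = {p_1000:.2%}")
print(f"5% exceedance dose = {dose_95:.0f} ppm-years")
```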
![]()
This page is copyright (c) 2001 Quantitative Decisions. Please cite it as
This page was created 16 February and last updated 19 February 2001 (corrected numerical error in solution to homework exercise #4).