Solution to Practice Quiz 10

The full quiz is here.  The answers appear below.  Comments, which are not part of the answers, are italicized.

Time 20 minutes.  This quiz is open book, open notes.

On March 21, 2001, the Centers for Disease Control (CDC) issued the National Report on Human Exposure to Environmental Chemicals.  This report summarizes measurements of 27 environmental chemicals in the blood and urine of thousands of human test subjects.

The summary statistics are the number of subjects (N), the geometric mean (GM), and the 10th, 25th, 50th, 75th, and 90th percentiles.  All measurements are in micrograms per liter (ug/L).

The report also provides 95% confidence intervals for the GM and the percentiles.

1.    1024 subjects were tested for mono-ethyl phthalate.  The GM is 176 with a confidence interval from 132 to 220.  The natural logarithms of 132, 176, and 220 are 4.88, 5.17, and 5.39, respectively.  Assuming a Lognormal distribution, compute the standard deviation of the logarithms of the 1024 values.  (For such a large number of subjects, Student's t distribution agrees with the Normal distribution to two decimal places for all percentiles between 1 and 99.  Some values of the standard Normal CDF are shown below.)

The difference between the upper confidence limit (UCL, as a logarithm) and the lower confidence limit (LCL, as a logarithm) should be 2 * t * SD / sqrt(1024).  The factor of 2 occurs because we are looking at the difference between the two CLs rather than the difference between one CL and the mean logarithm.  Here t is the upper 97.5 percentage point of Student's t distribution with 1023 degrees of freedom, which the table indicates is about 1.96, and SD is the standard deviation of the 1024 logarithms.  This gives an equation with SD as the only unknown: 5.39 - 4.88 = 2 * 1.96 * SD / sqrt(1024).  It reduces readily to 0.51 = SD * 3.92/32, so SD = 32 * 0.51 / 3.92.  If we round 0.51 down to 0.50 and 3.92 up to 4, each rounding decreases the result by about 2%, for a total of about 4%, and gives 32 * 0.5 / 4 = 4.  Adding back the 4% gives the answer: about 4.16.
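The arithmetic above can be checked with a short script.  This is only a sketch of the calculation described in the answer; the variable names are illustrative, and 1.96 is the Normal approximation to the t value that the answer itself uses.

```python
import math

# Logarithms of the lower and upper confidence limits, from the question.
log_lcl, log_ucl = 4.88, 5.39
n = 1024   # number of subjects
t = 1.96   # upper 97.5% point (Normal approximation to t with 1023 df)

# Interval width on the log scale is 2 * t * SD / sqrt(n); solve for SD.
width = log_ucl - log_lcl
sd_log = width * math.sqrt(n) / (2 * t)

print(f"SD of the logarithms: {sd_log:.2f}")
```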

This is an extraordinary number.  It means, for example, that about 68% of the logarithms will lie within a range from 4.16 below the mean to 4.16 above it.  This total range of 8.32 represents a ratio of exp(8.32), or about 4100, when the values are re-expressed in concentration units.  In short, the upper 16% of the numbers in the batch will be more than 4,000 times greater than the lower 16% of the numbers.
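To see how large this ratio is, the conversion from the log scale back to concentration units can be sketched as follows.  The value 4.16 is the SD found in question 1; the names are illustrative.

```python
import math

sd_log = 4.16             # SD of the logarithms from question 1
log_range = 2 * sd_log    # from one SD below the mean log to one SD above it

# A difference of logarithms corresponds to a ratio of concentrations.
ratio = math.exp(log_range)

print(f"log range = {log_range:.2f}, concentration ratio = {ratio:.0f}")
```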

2.    The percentiles of the mono-ethyl phthalate results are reported as 27.7, 61.5, 171, 424, and 1160 ug/L.  Their logarithms are 3.32, 4.12, 5.14, 6.05, and 7.06.  Are these values consistent with your answer to question 1?

Definitely not.  From the Normal CDF table we see that the logarithms of the 25th and 75th percentiles should be approximately 0.67 SDs from the mean log (which is the log of the GM) and the 10th and 90th percentiles should be approximately 1.28 SDs from the mean log.  Let's tabulate these differences:

Percentile   Value (ug/L)   Log    Deviation from mean   T (expected deviation, in SDs)   Ratio
10           27.7           3.32   -1.85                 -1.28                            1.44
25           61.5           4.12   -1.05                 -0.67                            1.56
50           171            5.14   -0.03                  0.00
75           424            6.05    0.88                  0.67                            1.30
90           1160           7.06    1.89                  1.28                            1.47

Each ratio in the last column (the deviation from the mean divided by T) estimates the SD of the logarithms.  These estimates lie between 1.30 and 1.56, which is far from the value of 4.16 computed in question 1.
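The ratios can be reproduced with a few lines of code.  This sketch uses Python's statistics.NormalDist for the Normal percentage points in place of the printed CDF table, so the inputs are slightly more precise than 0.67 and 1.28; the names are illustrative.

```python
from statistics import NormalDist

mean_log = 5.17  # log of the GM, from question 1

# Logarithms of the reported percentiles (the 50th is omitted because
# its expected deviation is zero, so no ratio can be formed).
log_percentiles = {10: 3.32, 25: 4.12, 75: 6.05, 90: 7.06}

std_normal = NormalDist()  # standard Normal: mean 0, SD 1
implied_sd = {}
for percent, log_value in log_percentiles.items():
    z = std_normal.inv_cdf(percent / 100)   # expected deviation in SDs
    implied_sd[percent] = (log_value - mean_log) / z

for percent, sd in implied_sd.items():
    print(f"{percent}th percentile implies SD = {sd:.2f}")
```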

Another way to answer this question is to note that an SD of 4.16 implies, for example, that the logarithm of the 90th percentile is near 5.17 + 1.28 * 4.16, which is greater than 10.  But exp(10) is enormous (it's over 20,000), whereas the reported 90th percentile is only 1160.  Again there is a striking inconsistency.
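A quick numerical check of this second argument, as a sketch using the values quoted in the text (the names are illustrative):

```python
import math

mean_log = 5.17       # log of the GM
sd_log = 4.16         # SD of the logarithms from question 1
z90 = 1.28            # 90th percentile of the standard Normal distribution
reported_p90 = 1160   # reported 90th percentile, ug/L

# 90th percentile implied by an SD of 4.16, on the log and original scales.
implied_log_p90 = mean_log + z90 * sd_log
implied_p90 = math.exp(implied_log_p90)

print(f"implied log of 90th percentile: {implied_log_p90:.2f}")
print(f"implied 90th percentile: {implied_p90:.0f} ug/L "
      f"(reported: {reported_p90} ug/L)")
```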

3.    Extra credit.  Demonstrate that the data are reasonably consistent with the assumption of lognormality made in question 1.

The relatively constant sizes of the ratios computed in problem 2 suffice for this demonstration.  Because the ratios are practically constant, a Q-Q plot of the data logarithms against the percentiles of a standard Normal distribution will be close to linear, at least at the 10th, 25th, 50th, 75th, and 90th percentiles.  This implies the batch of 1024 logarithms and a Normal distribution have roughly the same shape within the middle 80 percent.
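One way to quantify "close to linear" without drawing the plot is to fit a least-squares line through the five Q-Q points and look at the correlation.  This is a sketch only; the least-squares fit is an added illustration rather than part of the quiz answer, and the names are illustrative.

```python
import math
from statistics import NormalDist

percents = [10, 25, 50, 75, 90]
log_values = [3.32, 4.12, 5.14, 6.05, 7.06]  # logs of the reported percentiles

# Horizontal coordinates of the Q-Q plot: standard Normal percentiles.
z = [NormalDist().inv_cdf(p / 100) for p in percents]

# Ordinary least-squares line through the five (z, log value) points.
z_bar = sum(z) / len(z)
y_bar = sum(log_values) / len(log_values)
s_zz = sum((zi - z_bar) ** 2 for zi in z)
s_yy = sum((yi - y_bar) ** 2 for yi in log_values)
s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, log_values))

slope = s_zy / s_zz                 # estimates the SD of the logarithms
intercept = y_bar - slope * z_bar   # estimates the mean logarithm
r = s_zy / math.sqrt(s_zz * s_yy)   # correlation; near 1 means near-linear

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.4f}")
```

A slope near the ratios found in question 2 and a correlation close to 1 would both support approximate lognormality within the middle 80 percent of the data.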

The analysis suggested by these questions leads to the conclusion that the reported confidence interval is about three times wider (on a logarithmic scale) than one would expect from lognormally distributed data.  Because a statistical error is unlikely in a report of such visibility from this organization, we must suspect there are some extraordinary values beyond the 10th and 90th percentiles causing this increase in uncertainty in the GM.
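The "about three times" figure can be checked by comparing the reported interval width with the width a lognormal batch would produce.  This sketch assumes an SD of 1.45 for the logarithms, a representative value chosen here from the range 1.30 to 1.56 found in question 2; it is not a number given in the report.

```python
import math

n = 1024
t = 1.96                        # Normal approximation to t with 1023 df
reported_width = 5.39 - 4.88    # reported CI width on the log scale

# Assumed SD of the logarithms, taken from the middle of the question 2 range.
assumed_sd = 1.45

# Width expected for the GM's confidence interval under lognormality.
expected_width = 2 * t * assumed_sd / math.sqrt(n)
width_ratio = reported_width / expected_width

print(f"expected width = {expected_width:.3f}, "
      f"reported width = {reported_width:.2f}, ratio = {width_ratio:.1f}")
```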

For more practice with these techniques, explore the other data in the National Report on Human Exposure to Environmental Chemicals.

Scoring: The passing score is 90.

This page is copyright (c) 2001 Quantitative Decisions.  Please cite it as

This page was created 22 March 2001.