Properties of distributions

Links to web resources

Normal Approximation Applet from West Virginia University

Another approximation applet from Berkeley

An illustrated discussion from Georgia Tech

Discussion of moments with a variance applet from the University of British Columbia

Terminology

Probability paper: Special graph paper for creating probability plots.  The most common is Normal probability paper, but you can create probability paper for any distribution.
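In software, the role of probability paper is played by a probability plot.  Here is a minimal sketch, assuming Python with scipy and matplotlib installed, using scipy.stats.probplot; the data are hypothetical.

```python
# A Normal probability plot: the software analogue of probability paper.
# Assumes scipy and matplotlib are installed; the data are made up.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=10, scale=5, size=100)  # hypothetical data

# probplot pairs the ordered data with Normal quantiles; points on a
# straight line indicate the data are consistent with a Normal distribution.
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Normal probability plot")
plt.show()
```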

Discussion

Memorize the red statements.

Ticket-in-a-box Models

Take the ticket tutorial!

The Normal approximation to Binomial probabilities

The figure shows the histogram for B(25, 0.4) as black bars.  Recall that B(25, 0.4) describes the probabilities of successes in 25 independent trials where each trial has probability 0.4 of succeeding.

The mean of B(25, 0.4) is 25 * 0.4 = 10.  The variance of B(25, 0.4) is 25 * 0.4 * (1-0.4) = 25 * 0.24 = 6.  The smooth blue curve shows the PDF for Normal(10, sqrt(6)): that is, this is the Normal distribution whose first two moments (mean and variance) match the first two moments of B(25, 0.4).  Evidently it is a good fit.  The Central Limit Theorem suggests as much.

The close correspondence of the bars and the smooth curve suggests how to compute binomial probabilities using Normal probabilities (and vice versa, although that is rarely done).  For example, the probability of exactly 10 successes is the area of the black bar centered at 10.  This will be very close to the area between 9.5 and 10.5 under the blue curve.

To find areas under the Normal curve, we do as always and standardize the values to their "Z scores".  The Z-score of a value just re-expresses it in terms of standard deviations away from the mean.  Thus, the Z-score of 9.5 is (9.5 - 10)/sqrt(6) = -0.204 and the Z-score of 10.5 is (10.5 - 10)/sqrt(6) = +0.204.  Therefore the area of the bar in question is CDF(0.204) - CDF(-0.204) = 58.1% - 41.9% = 16.2% (16.17%, actually).  The true value is 16.12%, so the Normal approximation is pretty good.
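As a check, here is a short Python sketch (assuming scipy is installed) that reproduces this calculation: the exact B(25, 0.4) probability of 10 successes, and the Normal approximation using the bar's endpoints at 9.5 and 10.5.

```python
# Compare the exact binomial probability P(X = 10) for B(25, 0.4)
# with the Normal approximation using the bar from 9.5 to 10.5.
from math import sqrt
from scipy.stats import binom, norm

n, p = 25, 0.4
mean = n * p                  # 10
sd = sqrt(n * p * (1 - p))    # sqrt(6)

exact = binom.pmf(10, n, p)
approx = norm.cdf(10.5, mean, sd) - norm.cdf(9.5, mean, sd)

print(f"exact  = {exact:.4f}")   # about 0.1612
print(f"approx = {approx:.4f}")  # about 0.1617
```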

This method works well because, as you can see, the Normal PDF in the vicinity of any given bar is higher than the bar about half the time and lower the other half, by about the same amount.  Therefore the area beneath the Normal PDF should be quite close to the area of the bar.

Note that the bar endpoints are halfway between the whole numbers 0 and 1, 1 and 2, ..., 24 and 25.

We can work the math in the other direction, too.  For example, let's find a likely range for the number of successes.  We can choose what we mean by "likely"; a common choice is that "likely" means "with a 95% chance".  For a Normal distribution we know that about 95.5% of the probability is located between -2 and +2 standard deviations from the mean; that is, between -2*sqrt(6) + 10 = 5.1 and +2*sqrt(6) + 10 = 14.9.  Looking at the figure, we see this range includes all the bars centered at 6, 7, 8, 9, 10, 11, 12, 13, and 14, plus 0.4 of the bar centered at 5 and 0.4 of the bar centered at 15.  We could, therefore, include the value 5 and not include the value 15, or include the value 15 and not include the value 5, and conclude that very close to 95.5% of the probability of a B(25, 0.4) variable is contained between 5 and 14 or between 6 and 15.  In fact, 95.6% of the B(25, 0.4) probability is between 5 and 14 and 95.7% is between 6 and 15.
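The same check can be run in code.  The sketch below (again assuming scipy) computes the exact probability content of both candidate ranges.

```python
# Exact B(25, 0.4) probability in the two ranges suggested by the
# two-standard-deviation Normal interval (5.1 to 14.9).
from scipy.stats import binom

n, p = 25, 0.4

# P(5 <= X <= 14) = CDF(14) - CDF(4); the text reports about 95.6%.
print(binom.cdf(14, n, p) - binom.cdf(4, n, p))

# P(6 <= X <= 15) = CDF(15) - CDF(5); the text reports about 95.7%.
print(binom.cdf(15, n, p) - binom.cdf(5, n, p))
```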

Moments of Distributions

Graphical multiplication

A number line, when all its values are multiplied by a positive value x, expands (stretches) by a factor of x when x exceeds 1 and contracts (shrinks) by a factor of x when x is less than 1.

The figure shows a number line in the middle.  The top is the result of multiplying all values by 1/2: the tic marked "2", for instance, is now located just above where the "1" used to be.  The bottom is the result of multiplying all values by 2: the tic marked "2", for instance, is now located just above where the "4" used to be.

The next figure shows the result of multiplying 3 by 1/2.  This can be viewed in two equivalent ways: multiplying by 3 stretches all values; it triples the distance between 1/2 and 0, as shown by the arrow at the left pointing from 1/2 to 3/2.  Multiplying by 1/2 shrinks all values; it halves the distance between 3 and 0, as shown by the arrow at the right pointing from 3 to 3/2.

Now we are going to multiply curves (graphs of functions).  Every point on a curve has a height, Y.  We will take two curves, y = f(x) and y = g(x).  At each point x we multiply the two y-values f(x) and g(x).  The answer, abstractly, is y = f(x) * g(x).

The geometric interpretation shows us that when f(x) exceeds 1, multiplying g(x) by it expands g(x): that is, it pulls the height of g further from the x-axis.  When f(x) is less than 1, multiplying g(x) by it moves the height of g closer to the x-axis.

Evidently the roles of f() and g() are symmetric: we could just as well interpret y = f(x) * g(x) in terms of the values of g expanding or shrinking the graph of f relative to the x-axis.

In the next figure, the red and blue curves function as y = f(x) and y = g(x).  Their product is the black curve.  Because the value (height) of 1 is so important in the interpretation, it is shown with a solid horizontal line.

For example, the red curve exceeds 1 whenever x is larger than 1.0 or smaller than -1.0.  Thus, multiplying increases the height of the blue curve when x is extreme and decreases the height of the blue curve when x is between -1.0 and 1.0.

The red curve is close to zero when x is close to zero.  Therefore, multiplying by the red values will shrink the height of any curve--such as the blue curve--dramatically.  That is why the black curve is close to zero when x is close to zero.

Reversing the roles, now, note that the blue curve has values all less than 1 (actually, less than 0.40).  Therefore, multiplication of any curve by the blue curve will shrink the heights.  That is why the product (black) curve lies entirely below the red curve.

Notice, too, that the blue curve decreases in value from 0.40 at x=0 to less than 0.20 when x = -1.5 or x = 1.5.  Thus, the blue curve accomplishes over twice as much shrinking of the red curve out towards the ends, compared to the middle.  That is why the product (black) curve stops rising so quickly near its endpoints. 

Graphical moments

The second moment about the mean, or the second central moment, is an area under a curve.  The curve is a product, just as shown above.  One of the curves is the PDF of the distribution.  Its values are always positive.  The other curve is the square of the difference between x and the mean of the PDF.

In the figure above, the blue curve is (part of) the PDF for the standard Normal distribution, N(0, 1).  This distribution has a mean of zero.  The red curve is the square of the difference between x and the mean; namely, (x - 0)^2 = x^2.  The black curve is their product.
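For readers following along in software, here is a minimal matplotlib sketch of the same construction: the N(0, 1) PDF (blue), y = x^2 (red), and their product (black).

```python
# Sketch of the graphical-moment construction: the blue PDF, the red
# squared-deviation curve, and their black product.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4, 4, 401)
pdf = norm.pdf(x)        # blue: N(0, 1) density, peak about 0.40
sq = x**2                # red: squared distance from the mean (0)
product = sq * pdf       # black: the curve whose area is the variance

plt.plot(x, pdf, "b", label="PDF of N(0, 1)")
plt.plot(x, sq, "r", label="y = x^2")
plt.plot(x, product, "k", label="product")
plt.axhline(1.0, color="gray")  # the important height y = 1
plt.ylim(0, 1.5)
plt.legend()
plt.show()
```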

The second central moment ("variance") of N(0, 1) is the area beneath the black curve.  The black curve actually continues infinitely far to the left and right.  Here is a more complete picture that focuses on the lower y-values:

Before seeing this, we might have worried whether the black curve even has a finite area.  After all, the curve extends infinitely far to the right and left.  This picture provides evidence that the product (black) curve squashes right down against the x-axis (height of zero), suggesting it may have finite area after all.

We can very roughly estimate the area from the figure.  Each of the black humps is almost triangular, with a base of about 3 1/3 and a height of 0.3.  Therefore each hump has an area of about 0.3 * (3 1/3) * 1/2 = 0.5, so the two of them together have an area of about 1.  Indeed, calculus can be used to find this area, which is exactly 1.
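If you prefer numerical integration to calculus, a quick check with scipy.integrate.quad (a sketch assuming scipy) confirms the area.

```python
# Numerically integrate x^2 * PDF over the whole line; the result
# should be 1, the variance of N(0, 1).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

area, err = quad(lambda x: x**2 * norm.pdf(x), -np.inf, np.inf)
print(area)  # 1.0 (up to numerical error)
```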

Recall that the expression for the PDF of the standard Normal distribution is exp(-x^2/2) * [some constant].  As x gets large (either positive or negative), x^2/2 gets much larger much faster.  The value of exp(-x^2/2) therefore diminishes extremely quickly.  This is why the red curve (or indeed, any curve of the form y = x^k for positive whole values of k) eventually gets squashed right against the x-axis when multiplied by a Normal PDF.

As you might imagine, not every distribution has a variance.  A common example is the Cauchy distribution, also known as "Student's T with 1 degree of freedom."  The equation of its PDF is y = 1/(1 + x^2) * [some constant].

Evidently, the product curve is leveling off to a non-zero value, so as we increase or decrease x without bound, the area beneath this curve becomes unbounded, too.  You can see this by forming the product, which is 1/(1 + x^2) * [some constant] * x^2 = x^2/(1 + x^2) * [some constant].  As x gets large, eventually x^2 gets so big that the "1 +" term in the denominator becomes inconsequential.  The product reduces to a constant, as we had suspected.  (The black product curve approaches a height of 1/Pi asymptotically.)
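A quick numerical experiment makes the divergence vivid: as the integration limits grow, the would-be variance integral of the Cauchy distribution grows without bound.  A sketch assuming scipy:

```python
# The 'variance' integral for the Cauchy distribution has no limit:
# the area under x^2 * PDF between -A and A keeps growing with A.
from scipy.integrate import quad
from scipy.stats import cauchy

for A in (10, 100, 1000, 10000):
    area, _ = quad(lambda x: x**2 * cauchy.pdf(x), -A, A)
    print(A, area)  # grows roughly like 2*A/pi, with no finite limit
```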

Higher central moments are formed from areas under other product curves of the form y = x^K * PDF.  K, which is usually 2, 3, 4, and so on, is the order of the moment.

Here is a plot showing y = x^K * PDF for the N(0, 1) distribution.  K ranges from 2 through 7.

Some patterns are:

The curves for even K (2, 4, 6, ...) are always positive.  They are symmetric about the Y axis.
The curves for odd K are positive for positive X and negative for negative X.  That is because odd powers of negative numbers are negative.  The areas to the right and left of the Y axis exactly balance out, so all odd central moments of N(0, 1) are zero.  This will be true of any distribution (not just a Normal one) that is symmetric about its mean, provided the moments exist.
All the curves have areas: N(0, 1) has moments of all orders.  The areas of the even-numbered curves evidently increase as K increases.
The peaks of these curves occur at +-sqrt(K) and they get higher with larger K.
Most of the areas under these curves occur in the regions around +-sqrt(K).

The last observation is insightful: it illustrates how the higher moments depend on the more extreme outcomes of a distribution.  Thus, low moments (K = 2, 3, 4) characterize the middle of a distribution, but as K increases, the higher moments depend more and more on the tails of the distribution.
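These patterns are easy to verify numerically.  The sketch below (assuming scipy) computes the area under x^K * PDF for K = 2 through 7, along with the location of each curve's positive peak.

```python
# Check the patterns: odd central moments of N(0, 1) vanish, even ones
# grow with K, and each curve x^K * PDF peaks at x = sqrt(K).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

for K in range(2, 8):
    area, _ = quad(lambda x: x**K * norm.pdf(x), -np.inf, np.inf)
    print(f"K={K}: moment = {area:7.3f}, peak at sqrt(K) = {np.sqrt(K):.3f}")
# Even K give 1, 3, 15 (growing with K); odd K give 0.
```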

Chebyshev's Theorem

EPA statisticians have recently proposed using Chebyshev's theorem to develop confidence intervals for the means of distributions.  Chebyshev's theorem is properly a statement about moments of a distribution, so this is the right place to discuss it.  A full discussion of the EPA proposal will have to wait until we study confidence intervals, which we will do shortly.

So far, we have been multiplying curves of the form y = (x-mean)^K by the PDF of a distribution.  Let's simplify.  Instead of using y = (x-mean)^K, we will use an expression that (a) is much, much simpler and (b) is guaranteed never to exceed (x-mean)^K.  By choosing smaller values, when we multiply by the PDF, we will get smaller heights.  Therefore the area we compute must be smaller.  We will come back to this thought: it will be our conclusion.

In the meantime, how simple can we get?  A constant value of zero is the simplest thing there is.  But that is of little interest; the area we get will be zero.  Therefore, let's do the next most complicated thing: we will set y equal to zero when x is close to the mean, but when x is larger than some distance from the mean (say Z), we will set y equal to some other constant.

We have many choices for that constant.  They don't matter, but a good choice simplifies the algebra later.  Consider the curve y = (x - mean)^2.  When x - mean equals +-Z, we get y = Z^2.  Therefore we will use Z^2 as our constant.  Specifically, we are about to consider the function y = f(x) defined by

f(x) = 0 whenever |x - mean| < Z

f(x) = Z^2 whenever |x - mean| >= Z

This is the gray curve, labeled "Chebyshev," in the next illustration.

The figure shows the situation for Z = 2.  The gray curve f(x) sits at zero until x reaches +-2.  The heights of the gray curve are all less than the heights of the red curve y = x^2, by design.

Let's compute the areas involved.  The product curve y = (x-mean)^2 * PDF, as we have seen, has an area equal to the variance of the distribution.  What about the area under the product of f(x) (gray curve) and the PDF (blue curve)?  This product is shown in the figure as a green curve.  In the figure below, which focuses on small heights, the area beneath the green curve is shaded.  The green curve equals Z^2 = 4 times the PDF wherever x is more than Z = 2 away from the mean.  Its area is therefore Z^2 times the area under the PDF for x >= Z or x <= -Z.

Here's the punch line: areas under the PDF are probabilities.  The area in question is Z^2 times the probability that X is more than Z from the mean.

Return now to the thought we were holding: the area under the green curve is less than the area under the black curve.  (That should be pretty obvious in the preceding figure.)  In the example, this says that four times the probability that X is more than two away from the mean (green area) is less than the variance (gray area).
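Here is the same area comparison done numerically for the standard Normal distribution (a sketch assuming scipy; the argument itself applies to any distribution with a variance).

```python
# Compare the green area (Z^2 times the tail probability) with the
# gray area (the variance) for N(0, 1) and Z = 2.
from scipy.stats import norm

Z = 2.0
tail_prob = 2 * (1 - norm.cdf(Z))   # P(|X - mean| >= Z)
green_area = Z**2 * tail_prob       # about 0.182
variance = 1.0                      # the variance of N(0, 1)
print(green_area, "<", variance)    # the area inequality holds
```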

In general, this area inequality says that

Z^2 times the probability that X is more than Z away from the mean is less than the variance.

That's Chebyshev's theorem.  Usually it is stated in terms of the probability and standard deviations, so let's do that.  Express Z in standard deviation units; say, Z is T standard deviations.  Thus Z^2 is T^2 (standard deviations)^2, which is T^2 variances.  Now divide both sides of Chebyshev's theorem by T^2 times the variance, giving

For any distribution and any T > 0, the probability that X is more than T sd's away from the mean is less than 1/T^2.

For example, suppose you know a box with tickets has a mean of 10 and a standard deviation of 5.  What can you say about the proportion of tickets with values greater than 30?

First answer, using the first statement of Chebyshev's theorem: Z = 30 - 10 = 20, so Z^2 = 400.  The variance is 5^2 = 25.  Therefore, "400 times the probability that X is more than 20 away from the mean is less than 25."  Dividing by 400 gives "the probability that X is more than 20 away from the mean is less than 25/400 = 1/16."  In particular, the probability that X is greater than 30 is less than the probability that X is more than 20 away from the mean (because the latter event also includes any tickets with values less than -10).  In the preceding figure, all this says is that the green area on the right is less than the total of all the green areas (left and right combined).

Second answer, using the usual statement of Chebyshev's theorem: 30 is four sd's away from the mean (30 = 10 + 4 * 5).  Chebyshev's theorem with T = 4 tells us that no more than 1/4^2 = 1/16 of the tickets can be more than four sd's away from the mean: that is, either less than -10 or greater than 30.  So clearly, no more than 1/16 of the tickets can have values greater than 30.
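In code, the usual statement of the theorem is a one-liner.  Here is a sketch of the ticket-box example; the bound function is a hypothetical helper, not from any library.

```python
# Chebyshev bound for the ticket-box example: mean 10, sd 5.
def chebyshev_bound(t: float) -> float:
    """Upper bound on P(X is more than t sd's away from the mean)."""
    return 1.0 / t**2

mean, sd = 10.0, 5.0
value = 30.0
t = (value - mean) / sd        # 4 standard deviations
print(chebyshev_bound(t))      # 1/16: at most 1/16 of the tickets can
                               # exceed 30 (or fall below -10)
```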

Comments

The picture shows much more than stated here.  Chebyshev's inequality is rather crude: in the figure you can see the gray area is much greater than the green area.  For what distributions are those areas approximately equal?  Evidently the gray area between mean-Z and mean+Z should be really small.  The gray area peeking out above the green area for |X - mean| > Z should be small, too.  Recall that the curve determining the gray area is the product of y = x^2, which never changes, and the PDF.  The only way to change the green or gray area is to change the PDF.  Therefore, to get the gray area close to the green area, almost all the probability must be concentrated on values of X just slightly greater than mean+Z and slightly less than mean-Z.

Similar reasoning makes it clear that Chebyshev's inequality can be good, in the sense of relating two approximately equal quantities, only for a very narrow range of Z's.  In other words, for any distribution and for most values of Z, there will be much less tail probability than estimated using Chebyshev's 1/Z^2 formula.

The inequality was proven here only for continuous distributions (those with PDFs), but it holds in general for any distribution.  One way to extend the proof is to use the ideas of measure theory that rigorously extend the idea of using area to represent probability and make it work for arbitrary probability distributions.  The proof requires no alteration in that case.

Rules and shortcuts, tips and tricks

Here are some things we learned:

The mean of a B(N, p) variable is N*p.  The variance of a B(N, p) variable is N*p*(1-p).  Memorize these.
To use the Normal approximation to B(N, p), sketch the histogram of B(N, p) and use it to guide you.  Remember that the bars are centered on the whole values, so the bar endpoints are located at half-integral values (the .5's).
The Normal approximation works by matching the first two moments (mean and variance) of the distributions.
The Normal approximation starts working well by the time the smaller of the expected number of success and expected number of failures exceeds 5, and works very well by the time this value exceeds 10.  (The expected number of successes is the mean, N*p.  The expected number of failures is N minus the expected number of successes, or N*(1-p).  In the example above, N*p = 25 * 0.4 = 10 and N*(1-p) = 25 - 10 = 15.)
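The rules above fit in a few lines of code.  Here is a sketch (the function name is my own, and scipy is assumed) that checks the rule of thumb and applies the moment-matched Normal approximation with half-integral bar endpoints.

```python
# A small helper applying the rules above: check the rule of thumb,
# then approximate P(a <= X <= b) for X ~ B(n, p) with the Normal
# curve matched to the binomial's mean and variance.
from math import sqrt
from scipy.stats import norm

def normal_approx_binomial(n: int, p: float, a: int, b: int) -> float:
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    if min(n * p, n * (1 - p)) < 5:
        print("warning: rule of thumb not met; approximation may be poor")
    # Bars are centered on whole numbers, so use half-integral endpoints.
    return norm.cdf(b + 0.5, mean, sd) - norm.cdf(a - 0.5, mean, sd)

print(normal_approx_binomial(25, 0.4, 10, 10))  # about 0.162
```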

 


This page is copyright (c) 2001 Quantitative Decisions.

This page was created 21 February and last updated 1 March 2001.