Probability plots

Links to web resources on probability plots

Engineering Statistics Handbook (NIST): Good general reference.  See the first chapter on EDA and find the discussions of probability plots and Q-Q plots.

Terminology

Common logarithm: Logarithm to the base 10: log(b) = ln(b) / ln(10).
Linear function: A relationship between variables y and x of the form y = m*x + b where m and b are constants.  m is the slope, b is the intercept.
Linear interpolation: Geometrically, finding points on a given line segment.  Numerically, the process of estimating values "between the lines" in a table.
"Shape" (of a batch): The characteristics of a batch after factoring in location and scale.  That is, two batches have identical shapes when they contain the same values after possibly changing the location and scale of one of the batches.

Discussion

Geometric means

Here is a link to a copy of the e-mail on geometric means, exp(), ln(), powers, roots, and so on.

Linear interpolation

Linear interpolation is a simple way of estimating values in a table that fall "between the lines."  Suppose the table has two columns, X and Y, and that you have a Y-value for which you want to find the corresponding X-value.  Find the rows that bracket the Y-value; that is, one row will have a smaller Y-value and the next will have a larger Y-value.  Here's a picture of the relevant part of the table:

X Y
X0 Y0
X1 Y1

The interpolated X value is a weighted average of X0 and X1.  The weights are determined by how far Y is from Y0 and Y1, as a proportion of the total distance from Y0 to Y1.  These proportions are therefore (Y - Y0)/(Y1 - Y0) and (Y1 - Y)/(Y1 - Y0).  The formula for X therefore is

X = X1 * (Y - Y0)/(Y1 - Y0)  +  X0 * (Y1 - Y)/(Y1 - Y0)

It's easy to get mixed up by switching the proportion that multiplies X0 and the proportion that multiplies X1, so we do a quick check.  If you plug Y = Y0 into the formula, the first term drops out (it is zero) and the second term is X0 times 1, which is X0.  Similarly, plugging Y = Y1 into the equation gives X1.  These are the values already in the table, so the formula works for them.  Finally, by looking at the formula you can see it's of the form X = something * Y + something else, which is a linear equation.  We know a line is determined by two distinct points and we have verified that the formula correctly reproduces two distinct points (assuming Y0 <> Y1, which the formula needs anyway to avoid dividing by zero), so it follows the formula is correct.
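The formula and its endpoint check can be sketched in a few lines of Python.  This is only an illustration: the function name interpolate_x and the sample numbers are made up for the example and are not part of the original notes.

```python
def interpolate_x(y, x0, y0, x1, y1):
    """Estimate X for a given Y by linear interpolation between two table rows."""
    if y1 == y0:
        raise ValueError("Y0 and Y1 must differ, or the weights are undefined")
    weight_x1 = (y - y0) / (y1 - y0)  # proportion of the way from Y0 to Y1
    weight_x0 = (y1 - y) / (y1 - y0)  # the complementary proportion
    return x1 * weight_x1 + x0 * weight_x0


# Hypothetical table rows: (X0, Y0) = (10, 20) and (X1, Y1) = (20, 60).
print(interpolate_x(20.0, 10.0, 20.0, 20.0, 60.0))  # Y = Y0 gives X0
print(interpolate_x(60.0, 10.0, 20.0, 20.0, 60.0))  # Y = Y1 gives X1
print(interpolate_x(40.0, 10.0, 20.0, 20.0, 60.0))  # halfway in Y gives halfway in X
```

The two weights always sum to 1, which is why the result is a weighted average of X0 and X1.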

Here is a picture of the situation in the X-Y plane:

Geometrically, we locate Y on the Y-axis, read across to the line segment between the tabulated points (heavy blue circles), then read down to find the corresponding X value.

Applications to percentile plots

Now suppose the X-values are values in a batch and the Y-values are their percentiles.  The "percentile plot" is created by drawing the points in the X-Y plane and connecting them with straight line segments.  To compute any percentile (between the lowest and highest assigned to the batch), we find the percentage on the Y-axis, read across to the percentile plot, and then look straight down to find the corresponding percentile, as shown in the next figure.

The formulas look ugly when written down, but they are simple in concept: just linearly interpolate along the relevant segment of the plot.
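As a rough sketch of that idea, the Python function below reads a percentile off a percentile plot.  It takes the percentages assigned to the batch values as an input, because this section does not say how they are assigned; the name value_at_percentage and the example percentages are assumptions made for illustration only.

```python
def value_at_percentage(values, percentages, p):
    """Read the value at percentage p off a percentile plot.

    values      -- batch values in increasing order (X-coordinates)
    percentages -- percentage assigned to each value, increasing (Y-coordinates)
    p           -- the percentage to look up
    """
    if not percentages[0] <= p <= percentages[-1]:
        raise ValueError("p lies outside the percentages assigned to the batch")
    for i in range(1, len(values)):
        y0, y1 = percentages[i - 1], percentages[i]
        if y0 <= p <= y1:  # found the segment that brackets p
            x0, x1 = values[i - 1], values[i]
            if y1 == y0:
                return x0
            return x1 * (p - y0) / (y1 - y0) + x0 * (y1 - p) / (y1 - y0)
    return values[-1]  # only reached for a one-value batch


# Hypothetical batch of four values with assumed percentages.
batch = [2.0, 4.0, 8.0, 16.0]
assigned = [12.5, 37.5, 62.5, 87.5]
print(value_at_percentage(batch, assigned, 50.0))  # halfway between 4.0 and 8.0
```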

Reading Q-Q plots

A Q-Q plot, or "quantile-quantile" plot, graphically compares two batches.  The Q-Q plot simply matches corresponding percentiles in each batch.  For example, if the 20th percentile in batch X is 347.1 and the 20th percentile in batch Y is -0.0095, then the point (347.1, -0.0095) lies on the Q-Q plot for Y versus X.  The principle of mathematical laziness suggests that we minimize the work involved by inspecting each value in the batch with the smaller number of values.  The corresponding percentile is easy to find.  Then we linearly interpolate, if necessary, between percentiles in the other batch.
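Here is a minimal Python sketch of that construction.  It assumes each sorted value number i (counting from 1) in a batch of n values is assigned the percentage 100*(i - 0.5)/n; that is one common convention, not necessarily the one used for the figures in these notes.  All function names are invented for the example.

```python
def plotting_positions(n):
    """ASSUMED convention: percentage 100*(i - 0.5)/n for i = 1..n."""
    return [100.0 * (i + 0.5) / n for i in range(n)]


def interpolate(p, percentages, values):
    """Value at percentage p, interpolating linearly between tabulated points."""
    if p <= percentages[0]:
        return values[0]   # safety clamp; not needed under the assumed convention
    if p >= percentages[-1]:
        return values[-1]
    for i in range(1, len(percentages)):
        if p <= percentages[i]:
            t = (p - percentages[i - 1]) / (percentages[i] - percentages[i - 1])
            return values[i - 1] + t * (values[i] - values[i - 1])


def qq_points(batch_x, batch_y):
    """Return (x, y) points of the Q-Q plot for Y versus X."""
    xs, ys = sorted(batch_x), sorted(batch_y)
    px, py = plotting_positions(len(xs)), plotting_positions(len(ys))
    if len(xs) <= len(ys):
        # Walk through the smaller batch (X); interpolate in the larger one (Y).
        return [(x, interpolate(p, py, ys)) for x, p in zip(xs, px)]
    return [(interpolate(p, px, xs), y) for y, p in zip(ys, py)]


# Hypothetical batches: two X values against four Y values.
print(qq_points([1, 3], [10, 20, 30, 40]))
```

Walking through the smaller batch means each of its values sits exactly at one of its own percentages, so interpolation is needed in only one batch.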

To interpret a Q-Q plot, we first fit a line to the portion of the plot in which we are interested.  This may be the entire plot, but often it's either (a) the largest values (when we are concerned about concentrations of an environmental contaminant, for instance) or (b) the middle values, because these characterize the "bulk" of the data.  The method of line fitting is usually unimportant: just lay a ruler over the paper or computer screen.
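If you would rather compute than use a ruler, ordinary least squares applied to the selected points is one option among many.  The sketch below is that substitute, not the method used for the figures here; the function name fit_line and the sample points are hypothetical.

```python
def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical Q-Q points; pass only the portion of the plot you care about,
# for example the middle values or the largest values.
selected = [(0.0, 1.0), (2.0, 2.0), (4.0, 3.0)]
slope, intercept = fit_line(selected)
print(slope, intercept)
```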

This Q-Q plot uses comparable scales on the X and Y axes to help you estimate the slope of the fitted line.  Evidently, the line closely approximates the plot, although we can see relatively large random "wiggles" for the larger values.  This good linear approximation justifies the conclusion that these batches have essentially the same "shapes".  The slope of the line is about 1/2, indicating the Y batch ("normal mixture") has only about half the spread of the X batch ("lognormal").  The two batches have comparable medians (13.0 for X, 10.9 for Y).

This conclusion of similar shapes is borne out by the histograms, provided we use appropriate bins.

This is Excel's histogram of the mixture batch using bins of width 10.  It looks almost symmetric, with just a bit of positive skewness.  (If, for example, the two highest values were slightly reduced, the rightmost two bins would be combined, producing an almost perfectly symmetrical histogram.  That represents tiny changes in just ten percent of the data.)

The data labels are strange: each label names the rightmost (highest) value in the bin.  Thus the first bin contains values from 0 through 10 and the last contains values between 60 and 70.

(Excel is a useful tool for making illustrations and running quick tests, but many aspects of its calculations and graphics are too imprecise or error-prone for serious statistical work.)

This is a histogram of the lognormal batch using the same bins of width 10.  This batch is heavy-tailed, strongly positively skewed, and not symmetric.

This is a histogram of the mixture batch using finer bins (their widths are five instead of ten).  Now the shape looks remarkably like that of the lognormal batch.

One of the merits of the Q-Q plot technique is that it directly compares the shapes of two batches without the necessity of fiddling with bin widths.  As we will later see, it is also useful for detecting and quantifying important differences between batches.

Technical note, for the record:  Each of these batches has 20 elements.  First, 20 random percentages were drawn independently from the range 0 to 100%.  Lognormal values were computed by exponentiating percentiles from a Normal(2.5, 1) distribution.  Normal values were computed from the same percentiles of a Normal(10, 3) and a Normal(25, 10) distribution.  The "normal mixture" values were obtained by selecting, for each percentage, the corresponding percentile of the Normal(10, 3) distribution with 75% probability or of the Normal(25, 10) distribution with 25% probability.
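The recipe in this note could be sketched in Python roughly as follows.  Two things here are assumptions: the second parameter of each Normal distribution is treated as a standard deviation (the note does not say whether it is a standard deviation or a variance), and the random seed is arbitrary, so the sketch will not reproduce the batches shown in the figures.  The function name make_batches is made up.

```python
import math
import random
from statistics import NormalDist


def make_batches(n=20, seed=0):
    """Return (lognormal, mixture) batches built from n shared random percentages."""
    rng = random.Random(seed)  # arbitrary seed, only to make runs repeatable

    # Random percentages as fractions, kept strictly inside (0, 1).
    fractions = []
    while len(fractions) < n:
        p = rng.random()
        if 0.0 < p < 1.0:
            fractions.append(p)

    # ASSUMPTION: Normal(m, s) means mean m and standard deviation s.
    base = NormalDist(2.5, 1)
    narrow = NormalDist(10, 3)
    wide = NormalDist(25, 10)

    # Lognormal values: exponentiate the Normal(2.5, 1) percentiles.
    lognormal = [math.exp(base.inv_cdf(p)) for p in fractions]

    # Mixture values: same percentages, choosing the narrow distribution
    # with 75% probability and the wide one with 25% probability.
    mixture = [
        (narrow if rng.random() < 0.75 else wide).inv_cdf(p)
        for p in fractions
    ]
    return lognormal, mixture


lognormal, mixture = make_batches()
print(len(lognormal), len(mixture))
```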

Often, however, a Q-Q plot is not even approximately linear, as in the next example.

Here we have fitted a line to some of the lowest values.  Evidently the largest six values in each batch wander away from the line.  You can interpret this in two ways.  You may mentally move the points vertically to make them coincide with the line.  A vertical motion changes the Y value only.  This means we are considering the unchanged X values as being a "reference distribution."  Evidently, we would have to substantially decrease the highest values from the Y batch to bring them down to the line.  Thus: the highest values of the lognormal batch are higher than one would expect from an examination of the remaining values.  Alternatively: the lognormal batch has a heavier right tail than the normal batch.

You may also mentally move the points horizontally to make them coincide with the line.  A horizontal motion changes the X value only.  Now the Y batch is the "reference distribution."  Evidently, we would have to substantially increase the highest values of the X batch to bring them over to the line (or rather, to its extension toward the right).

By the way, the "normal batch" in this figure consists of points drawn randomly from a Normal distribution (mean 25, variance 125) and the "lognormal batch" consists of points drawn randomly from a Normal distribution (mean 2.5, variance 1) and then exponentiated.

There are many other interpretations of this Q-Q plot, depending on our choice of reference distribution (X or Y), how we fit a line to the points, and even on how we express the data (for example, we could transform the axes using ln() or square roots or whatever we think might be helpful).  All these interpretations will be consistent with one another; they are just different aspects of the same thing.  Which interpretation we choose usually depends on why we are examining the data in the first place.

Rules and shortcuts, tips and tricks

Here are some things we learned:

A reasonable estimate of the number of stems to use in a stem-and-leaf diagram is to compute 10*log(N) (this is the common logarithm--log base 10).  For example, if N = 20, log(N) = 1.3, so try somewhere near 13 stems at first.
All logarithms are the same up to some constant multiple.  Using the definition and basic properties of logarithms, we derived the equation log_a(b) = ln(b) / ln(a).  In particular, for a = 10, ln(a) = 2.303 approximately, giving log_10(b) = ln(b) / 2.303 for all positive numbers b.
You can think of a density plot as a "melted histogram."  If we were to build an image of a histogram using candles of widths and heights corresponding to the histogram bars, and then partially melted the candles, their masses (representing frequencies) would be conserved (all quibbles aside :).  Thus the resulting outline would still represent the correct area, but would be a smoothed version of the original histogram.  Mathematically, density plots are constructed just like this, by "melting" each bar in a standard way.
The principle of mathematical laziness governs much good technical thinking.  We encountered it when considering how Q-Q plots are developed: it's much less work to start with the percentiles of values in the smaller-size batch, interpolating (if necessary) percentiles within the larger-size batch, than it is to begin with the larger-size batch.  The principle favors thinking ahead (and computing later) over computing now (and thinking later).
Sometimes you can do a lot of algebra with little computation.  For example, we verified the expression for linear interpolation by observing (1) it was linear in form (y = m*x + b) and (2) it achieved the correct values at two distinct points.
Many of the equations in the text have geometric interpretations.  It is helpful to draw pictures.  For example, formulas (3.24) through (3.29) on pages 97 through 101 merely state that percentiles are found by (a) connecting the dots in a percentile plot using line segments and then (b) finding the x-coordinates (the values) corresponding to given y-coordinates (the percentages).
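The two logarithm rules above, the stem-count estimate and the change-of-base equation, are easy to try out.  The short Python sketch below uses invented function names (suggested_stems, log_base) and rounds the stem count to a whole number, which is a choice made here rather than something stated in the rule.

```python
import math


def suggested_stems(n):
    """Starting guess for the number of stems: 10 * log10(N), rounded."""
    return round(10 * math.log10(n))


def log_base(a, b):
    """Logarithm of b to the base a, via natural logarithms: ln(b) / ln(a)."""
    return math.log(b) / math.log(a)


print(suggested_stems(20))   # compare with the N = 20 example above
print(log_base(10, 1000))    # very close to 3, up to floating-point rounding
print(math.log(10))          # the constant 2.303 (approximately)
```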

 


This page is copyright (c) 2001 Quantitative Decisions.  Please cite it as

This page was created 24 January and last updated 18 February 2003 to clarify the basis for characterizing a histogram as "almost symmetric."