Distributions

Links to web resources

http://www.alceon.com/FitMixfini.pdf is a paper (that appeared in Risk Analysis, Vol 20, #2) by Burmaster and Wilson on "Fitting Second-Order Finite Mixture Models to Data with Many Censored Values Using Maximum Likelihood Estimation."  It illustrates how mixture distributions can successfully be applied to understanding environmental data.

Terminology

continuous distribution A distribution whose CDF has no vertical leaps in value anywhere.
cumulative distribution function (CDF) Cumulative distribution function: the value of this function at a number x is the probability that the outcome of a random variable will be equal to or less than x.  (In terms of the ticket-in-a box model, the cdf answers the "how many" question: it gives the proportion of tickets in the box bearing values equal to or less than x.)  It is the integral of the pdf when the pdf exists.  A cdf always exists for any random variable whose outcomes are real numbers.
distribution A mathematical representation of probability for experiments whose outcomes are real numbers.
empirical distribution function (EDF) The distribution defined by a batch of numbers by writing each number on a ticket and putting those tickets into a box.
experiment An activity that yields one or more results we can write down--the observations.  An experiment requires an accurate description of what kind of preparation is made, what actions are performed, and what is observed.
interval The interval (a, b] is the set of all numbers x such that a < x and x <= b.  We allow a or b to be infinite.  For example, (-infinity, 0] is the set of all non-positive numbers; [0, 1] is the set of all numbers between 0 and 1 (including 0 and 1 themselves); [1, infinity) is the set of all numbers greater than or equal to 1.
mixture distribution A distribution formed by averaging the PDFs or CDFs of two or more distributions.  The averaging must be deterministic: that is, the relative amounts of the component distributions are not allowed to be random values.  (Some statisticians allow non-deterministic averaging of distributions in their definition of mixture distribution, but in this course we do not.)
outcome The set of observations made during an experiment.
PDF Probability density function: the values of this function describe the relative probabilities of outcomes of a random variable.  (In terms of the ticket-in-a box model, this function describes the relative proportion of tickets in the box having any given value.)  A PDF does not always exist.  When it does, it is the derivative of the CDF and the distribution is therefore (absolutely) continuous.
probability model A box with tickets labeled by potential experimental outcomes.  Repeated draws from this box (replacing the ticket each time and shaking the tickets thoroughly before each draw) are supposed to emulate the results of repeated applications of an experiment.
support The support of a random variable X consists of those values a where the probability that X is "close to" a is not zero, regardless of how "close" we really mean.  (Remember, for many random variables, the probability that X actually equals a is often zero.  That's why we have to work with "close to".)  In terms of the ticket-in-box model, think of the support as the set of all values on all tickets in the box.  This is not quite the same thing, but in any real application it is the same.
ticket-in-a box model We can think of the outcome of a random variable as equivalent to drawing a ticket from a box.  The box may have infinitely many tickets in it, so this is a conceptual model only, but even so, real boxes with very large numbers of tickets can closely approximate the behavior of any random variable.  The tickets are labeled with the possible values of the random variables, one value per ticket.  Several (or even infinitely many) tickets may share the same value.  This is the mechanism that causes some values to occur with more probability than other values.
Taylor series A series approximation to a function y = f(x) of the form
y = c0 + c1(x-a) + c2(x-a)^2 + ... + cN(x-a)^N + ...
that converges for values of x close to a constant a.  The values c0, c1, .., cN, ... do not depend on x.  When x is close to a, the first few terms of the Taylor series provide excellent, easily computed approximations to y.

Discussion

Memorize the red facts below.

PDFs, CDFs, and problems solved with probabilities

What is probability?

"Probability" can mean a lot of things: a degree of belief, the quality of a prediction, a "propensity," and a "long run frequency" have all been proposed.  (See Mario Bunge, Foundations of Physics, 1967, Springer-Verlag, NY, Chapter 2, for example.)  Most of these have little or no value either as definitions or as scientific concepts.  To avoid most of these problems, we will only use probability as a tool to think about and model environmental phenomena and decisions made about them.  In some cases we will indeed use probability to model beliefs, but in most cases we will want to apply it to physical, chemical, or biological systems.  To do so, we need two ideas: the experimental setting and the ticket-in-a box model.

Millard and Neerchal (our text) put it succinctly and well:

Probability distributions are idealized mathematical models...  A random variable is "the value of the next observation in an experiment" (Watts, 1991, as quoted by Berthouex and Brown, 1994, p. 7).

Key words such as "model," "observation," and "experiment" require definition and explanation, but the essence is there: probability is a model that applies when the data can be considered as a set of observations from experiments.

An experiment is a special kind of activity that yields one or more results we can write down--the observations.  What is most important about an experiment is its reproducibility.  To be reproducible, an experiment requires an accurate description of what kind of preparation is made, what actions are performed, and what is observed.

We tend to think of experiments as activities performed in laboratories, but they do not have to be.  Sampling soil at an industrial site is an example.  However, it's not good enough to say "we obtained the data by taking canisters of soil from the site and sending them to the laboratory for BTEX measurements."  This is not precise or accurate enough.  This description lacks important details, such as

Exactly which soil constitutes "the site"?  You need an accurate, detailed map or its equivalent.
How deep does the (potentially sampled) soil extend?
What procedure was used to select locations for sampling?  Were adjustments to the locations made in the field?  How, why, and by whom?
What physical procedures were used to obtain the canisters of soil?
How were the canisters of soil packaged, shipped, and handled?
How did the laboratory treat the samples when preparing them for measurement?  Did it mix them?  Take a piece from each one?  How did it select the piece if it did?
What kinds of chemical preparation and quantitative measurements did the laboratory perform?
How did the laboratory convert the instrument readings into concentrations?
What additional variables were observed during the process: temperature, humidity, weather, background air concentrations?

A sampling plan is supposed to describe these and related details.  Unless such a plan exists and is followed carefully, an investigator can never really know whether she is "reproducing" the experiment.  The sampling plan describes the experimental preparation: what steps must be followed in order to reproduce the experiment.

(It is my experience that many environmental sampling plans are incomplete and are not accurately followed.  Often, important aspects of the preparation are omitted, such as details of how the sample locations were selected and then found and adjusted in the field.  The experiment often is not conducted as planned.  Field personnel frequently write such phrases as "sampling was done in general conformance with accepted procedures" to indicate they paid little or no attention to the plan itself, but just did what they usually did--whatever that is.)

Similar comments apply to air sampling, water sampling, groundwater sampling, and ecological sampling.  (For an amusing description of what can go awry, see Farley Mowat's description in Never Cry Wolf of sampling flora in the Arctic.)

Every experiment has an outcome.  This is the set of observations for which the experiment was designed.  The simplest experiment has a single outcome.  It may be a single measurement, for example.  The outcomes of complex experiments may be entire databases of observations, both numerical (quantitative) and non-numerical (qualitative and narrative).

The interesting thing about experiments, from our present point of view, is that when they are reproduced, the outcomes are often different.  In some sense, almost every experiment has a variable outcome, because it is almost impossible to measure something with perfect accuracy, reliability, and infallibility.

Sometimes the variation in outcome is so small that it does not matter to the experimenter, or it is undetectable by the experimenter's instruments.  Carl Hempel describes an interesting historical example in his monograph Philosophy of Natural Science (1966, Prentice-Hall, NJ, pp 23-24).  Tycho Brahe rejected the Copernican theory that the planets revolve about the sun.  His reasoning was in part based on a parallax argument.  Brahe believed the stars were much closer to us than we now know them to be.  Therefore, Brahe reasoned, the apparent positions of those stars in the sky should change during the earth's annual course around the sun.  This change in position is the stellar parallax.  Brahe measured the parallaxes of stars and found them all to be zero.  His experiment had an unvarying outcome.  However, stellar  parallaxes do exist and we now have instruments accurate enough to measure them and accurately gauge the distances to the nearest stars.  Brahe's instruments just could not detect the true parallax.  If Brahe had been using a probability model for his parallax experiments, he would have treated his unvarying zeros as random variables, thereby leading him to understand better the limitations of his results.  (Probability models were not available in Brahe's time, so this is no criticism of his work.)  That is why it can be useful to model any experimental outcome as a random variable, even though the outcome does not actually vary when the experiment is repeated and even though theoretical hypotheses suggest no variation will occur.

This potential variation in outcomes is why we need something like probability to model and understand experiments.  However, it is not clear that probability is even appropriate for some experiments.  Consider this one.  Situate yourself on the southwest corner of Spruce and Broad streets at 4:00 pm any Thursday afternoon.  Repeatedly accost the nearest pedestrian who is headed east, until you can find someone to give a valid answer to your question, "who will win the next election?"  The outcome consists of the answer, which must be the name of a living person.  This experiment is reproducible (at least once any remaining ambiguities in its description are cleared up), but I am doubtful that a probability model is a good or useful way to understand the results.

As another non-example, consider the observed densities of the five planets known to the ancients (from F. Mosteller and R. Rourke, Sturdy Statistics, Addison-Wesley, 1971, p. 54, as quoted in Freedman, Pisani, Purves, and Adhikari, Statistics, Second Edition, W. W. Norton & Co., 1991, p. 508):

Mercury   Venus   Mars   Jupiter   Saturn
0.68      0.94    0.71   0.24      0.12

Evidently the three inner planets (Mercury, Venus, and Mars) have substantially larger densities than the two outer planets.  But what kind of probability model would apply to these observations?  The only variation that could occur in an experiment would be due to measurement and would not help us understand the variation in densities (unless the variation in repeated measurements was comparable in size to the variation among observed densities, for then one reasonable hypothesis is that all variation in observed density merely reflects "measurement error".  But the measurement process is much better than that.)

A probability model is worth considering whenever we have evidence that the outcomes behave like draws of a ticket from a box.

(People use probability models in other situations, too, such as to reason about uncertain beliefs.  Our discussion does not rule out the use of these models, but it clarifies the distinction between what such a subjective model refers to--a state of mind--and the objective, experimental applications we have in view in the preceding discussion.)

Describing the contents of a box

We have already considered boxes with a finite number of tickets; see Probability paradoxes and simulation.  The behavior of a box is fully determined by describing  what is written on the tickets and how many tickets of any given value are contained in the box.

The contents of a box are not uniquely determined.  For example, a box with two tickets--one with "0" and the other with "1", say--will behave exactly like a box with a million tickets--500,000 with "0" and 500,000 with "1"--because we always replace the ticket after drawing it so that the box's contents remain unchanged.  The proportion of tickets with "0" and the proportion of tickets with "1" are what matter.
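
The equivalence of the two boxes can be checked by computing proportions directly.  This is only an illustrative sketch: the helper name proportions is made up for this example and does not come from the course materials.

```python
from collections import Counter
from fractions import Fraction

def proportions(box):
    """Return the proportion of tickets bearing each value in a finite box."""
    counts = Counter(box)
    total = len(box)
    return {value: Fraction(n, total) for value, n in counts.items()}

small_box = [0, 1]                           # two tickets
large_box = [0] * 500_000 + [1] * 500_000    # a million tickets

# Both boxes have the same proportions, so draws with replacement
# from either box follow the same distribution.
print(proportions(small_box))
print(proportions(small_box) == proportions(large_box))
```

Exact fractions are used so that the comparison does not depend on floating-point rounding.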

Many random variables, however, can in theory have an infinite number of outcomes.  A Normally-distributed random variable, for example, can have any real number for its outcome.  How do we compute the proportions?  One answer, developed about a hundred years ago, is to agree that we will answer questions about finite proportions only and see where we can go from there.  A finite proportion is one that is not zero.  Thus, we will not ask about the proportion of zeros in a box describing a normal distribution.  That proportion, if we could even talk about it, must be zero, because the chance that a normal random variable will exactly equal zero is zero.  However, we can compute the proportion of tickets less than zero, or the proportion with values greater than 2, or the proportion with values between 0.7071 and 0.7072.

For experimental outcomes that are real numbers, we can reduce all questions about proportions to questions of the form "what is the proportion of tickets with values equal to or less than Z?" where Z is some definite number.  The answer to this question is the value of the cumulative distribution function, or CDF, of the box for the number Z.  Evidently every box of real-valued tickets has a CDF.  The only theoretical issue is whether the CDF is a sufficiently rich description of the box's contents for modeling experimental results.

We can see that the CDF is effective for modeling experimental outcomes because all practical questions of probability concern intervals of real numbers. (I will dodge a technical point by allowing that a particular number can be modeled as a very small interval around it, allowing us to consider the probability of specific outcomes occurring.) 

We find the most common questions are of the form "what is the probability that X is between a and b or equal to b," or "what is the probability that X is less than or equal to a," or "what is the probability that X is greater than b?"  Using interval notation we might write Prob((a, b]) for the first, Prob((-infinity, a]) for the second, and Prob((b, infinity)) for the third.  These can be written in terms of the CDF as Prob((a, b]) = CDF(b) - CDF(a), Prob((-infinity, a]) = CDF(a), and Prob((b, infinity)) = 1 - CDF(b).  These follow from the axioms of probability (see Bunge, op. cit.), which are:

P1    There is at least one possible experimental outcome.
P2    All probabilities are zero or positive.
P3    The probability that some outcome occurs is 1 (100%).
P4    Probabilities are additive.

Axiom P4 says that if A is a set of outcomes and B is a set of different outcomes (A and B are said to be disjoint), then the probability that an experimental outcome is in A or B is the probability that it is in A plus the probability that it is in B: Prob(A union B) = Prob(A) + Prob(B).

It is clear that using proportions of tickets in a box produces values that meet these axioms, provided we put at least one ticket into the box (P1).  The last axiom, P4, is an obvious property of proportions.  It is true because in computing Prob(A union B) we are counting all tickets with values in A or B.  We can count them by counting those in A and adding that to the count in B, because A and B have no tickets in common.

The most delicate part of all this is that it is not possible, in general, to talk about the probability of some arbitrary sets A or B.  These sets must be constructed from intervals in a controlled way (using unions, intersections, and possibly infinite sums, if we extend axiom P4 to cover countably infinite sums).  This restriction to sets formed from intervals, however, poses no difficulty in practical applications.
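
The three interval formulas given earlier in this section can be sketched in a few lines.  The box of ten numbered tickets and all function names here are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

def prob_half_open(cdf, a, b):
    """Prob((a, b]) = CDF(b) - CDF(a)."""
    return cdf(b) - cdf(a)

def prob_at_most(cdf, a):
    """Prob((-infinity, a]) = CDF(a)."""
    return cdf(a)

def prob_greater(cdf, b):
    """Prob((b, infinity)) = 1 - CDF(b)."""
    return 1 - cdf(b)

# Hypothetical box for illustration: ten tickets numbered 1 through 10.
tickets = list(range(1, 11))

def box_cdf(x):
    """Proportion of tickets with values equal to or less than x."""
    return Fraction(sum(1 for t in tickets if t <= x), len(tickets))

p1 = prob_half_open(box_cdf, 3, 7)   # tickets 4, 5, 6, 7
p2 = prob_at_most(box_cdf, 3)        # tickets 1, 2, 3
p3 = prob_greater(box_cdf, 7)        # tickets 8, 9, 10
print(p1, p2, p3)

# The three intervals are disjoint and together contain every ticket,
# so by additivity (P4) and axiom P3 the probabilities sum to 1.
print(p1 + p2 + p3 == 1)
```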

Examples of distributions

You can appreciate the power and subtlety of the CDF formulation of probability by considering some examples.

Discrete distributions

The very simplest box of tickets has a single ticket.  Let's suppose it has the value "0" written on it.  (In general, I will say an "x-box", such as a "7-box", is a box with one ticket containing the number x.)  Evidently if x is less than zero, CDF(x) = 0 and if x is greater than or equal to zero, CDF(x) = 1.

The next CDF describes a box with four tickets.  One ticket says "0", two tickets say "1", the last ticket says "2", so the proportions are 0.25, 0.50, and 0.25, respectively.  This is the "Binomial(2, 0.5)" box.

These CDFs show examples of discrete distributions.  The CDF for a discrete distribution consists entirely of vertical jumps.  The jumps occur at the values of the tickets.  Their heights are the proportions of tickets in the box.
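
As a rough check of these statements, the following sketch computes the CDF of the four-ticket Binomial(2, 0.5) box and the height of its jump at each ticket value.  The function names are invented for this illustration.

```python
from fractions import Fraction

binomial_box = [0, 1, 1, 2]   # the four-ticket box described above

def box_cdf(box, x):
    """Proportion of tickets with values equal to or less than x."""
    return Fraction(sum(1 for t in box if t <= x), len(box))

def left_limit(box, x):
    """Proportion of tickets with values strictly less than x."""
    return Fraction(sum(1 for t in box if t < x), len(box))

def jump_height(box, x):
    """Height of the vertical jump of the CDF at x (zero if there is none)."""
    return box_cdf(box, x) - left_limit(box, x)

for x in (-1, 0, 0.5, 1, 1.5, 2, 3):
    print(x, box_cdf(binomial_box, x), jump_height(binomial_box, x))
```

The jump is nonzero only at 0, 1, and 2, where it equals the proportion of tickets bearing that value.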

It is evident from axioms P2 and P3 that every CDF ranges in value from 0 to 1. (We don't allow -infinity as an outcome.  If, as x approaches -infinity, the CDF approaches a nonzero limit, that would be telling us the probability of no outcome is positive.  This would contradict axioms P3 and P4.  Therefore the CDF must equal or at least approach zero in the limit as x approaches -infinity.) 

It should be almost as evident that every CDF increases monotonically.  This statement means that if x1 > x0, then CDF(x1) >= CDF(x0).  But CDF(x1) is defined as the probability that the outcome is less than or equal to x1.  This event can be expressed as the union of two disjoint events; namely, that the outcome is less than or equal to x0 (event 1) and that the outcome is in the interval (x0, x1] (event 2).  (We could not define event 2 unless x1 > x0.)  The additivity axiom (P4) and the positivity axiom (P2) then imply CDF(x1) = CDF(x0) + a number which is greater than or equal to zero, which is another way of stating CDF(x1) >= CDF(x0).

Here's another interesting example of great importance.  Any batch of numbers determines a distribution: just write down each number on a ticket and put those tickets into a box.  The distribution of this box is the empirical distribution function, or EDF, of the batch.  Thus its CDF is defined by

CDF(x) = (Number of values less than or equal to x) / (Size of the batch)  [see formula 3.20 in the text]

For example, consider the batch (0.8, 1.3, 1.3, 1.4, 2.5).  Here is the CDF of its EDF:

This example is important because it shows we can create a probability distribution from any batch of numbers whatsoever.  Just write each number on a ticket and put the tickets into a box.  We can study properties of this distribution in order to understand the batch of numbers better.
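
The EDF formula above translates almost directly into code.  This is a minimal sketch, assuming the batch fits in memory; the name edf is chosen here for illustration.

```python
from bisect import bisect_right

def edf(batch):
    """Return the CDF of the empirical distribution of a batch of numbers."""
    ordered = sorted(batch)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts the values less than or equal to x.
        return bisect_right(ordered, x) / n

    return cdf

batch = (0.8, 1.3, 1.3, 1.4, 2.5)
cdf = edf(batch)
for x in (0.5, 0.8, 1.3, 1.4, 2.0, 2.5):
    print(x, cdf(x))
```

The repeated value 1.3 produces a jump twice as high as the others, because two of the five tickets carry it.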

Distributions describing infinitely many outcomes

CDFs that do not have large finite leaps do not describe discrete distributions and do not correspond to boxes with a finite number of tickets.  Here, for example, is part of the CDF of the standard normal distribution.  It is smooth: it has no leaps, it has a slope at every point, and indeed has derivatives of all orders.  This is just the opposite of discrete: a continuous distribution.

The next CDF describes a uniform distribution.  Consider a box whose tickets have values between -1 and 2 with any such value just as likely to occur as any other.  Its CDF must equal 0 at x = -1 (no tickets have values less than -1) and 1 at x = 2 (all tickets have values less than or equal to 2).  In between, it must rise at a uniform rate of 1/3 per unit x because each unit interval within [-1, 2] contains 1/3 of all the tickets.

Probability density functions (PDFs)

It is useful to have different ways to represent and think about things.  By analogy with a histogram, which represents frequency by area, we can attempt to represent probability (or proportions of tickets in a box) by area as well.  When we can do this, we have a probability density function (PDF).

The x-axis of a PDF represents the values of a random variable: parts per million, dose, count, area, or whatever.  The units of the y-axis must therefore be probability per unit (of x) so that areas in the plot (computed in units of y times x) are in the correct units of probability.  We can also think of the y-axis as representing the proportion (of tickets) per unit x or percent per unit x.

For a simple example, consider the uniform distribution above.  It has a PDF:

The probability that this random variable has an outcome between 0 and 1.5, for example, is represented by the area under the PDF between x=0 and x=1.5.  Evidently this area is 0.5.
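
The area can be confirmed numerically in two ways: by summing thin rectangles under the PDF, and by differencing the CDF.  The sketch below assumes the uniform distribution on [-1, 2] described earlier; the function names and the number of rectangles are arbitrary choices.

```python
def uniform_pdf(x, lo=-1.0, hi=2.0):
    """PDF of the uniform distribution on [lo, hi]: constant height 1/(hi - lo)."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def uniform_cdf(x, lo=-1.0, hi=2.0):
    """CDF of the uniform distribution on [lo, hi]."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def area_under(pdf, a, b, steps=10_000):
    """Approximate the area under pdf between a and b with a midpoint Riemann sum."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

area = area_under(uniform_pdf, 0.0, 1.5)
by_cdf = uniform_cdf(1.5) - uniform_cdf(0.0)
print(area, by_cdf)   # both are 0.5, up to floating-point rounding
```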

In general, you use a PDF like this to compute probabilities using areas.  The probability of the outcome being in some set E (called an event) is the area of the PDF over E.  It follows immediately from the definition of CDF that the CDF at any point x is the area to the left of x under the PDF:

This is the standard Normal N(0,1) PDF.  The area to the left of x=1 is about 84%. On the CDF, we can just look up the value at x=1.  It is about 84%.

Visualizing the PDF usually makes it easier to compute probabilities.  For example, to find the probability (for the standard Normal distribution N(0,1)) that the outcome is either less than -1 or greater than 1, we visually add and subtract areas.

[Figure: area in the two tails] = CDF(-1) + 1 - CDF(1)

That is, 

Prob(|X| >= 1) = CDF(-1) + 1 - CDF(1).
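
This formula is easy to evaluate with the standard normal CDF in Python's standard library (statistics.NormalDist, available in Python 3.8 and later).  The variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Prob(|X| >= 1) = CDF(-1) + 1 - CDF(1)
two_tail = std_normal.cdf(-1) + 1 - std_normal.cdf(1)
print(round(two_tail, 4))   # about 0.3173
```

Roughly 32% of the probability therefore lies more than one standard deviation from the mean.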

Not every distribution has a PDF.  The simplest ones do not.  Consider the 0-box with one ticket labeled "0".  The area under the PDF to the left of any negative number must be zero, whereas the area to the left of zero must be one.  Thus all the area is concentrated at zero.  We could cover this region with a rectangle of very, very small base.  If the base has width e, the height must be 1/e to make the area come out to 1.  The smaller we make e, the larger 1/e must get.  Thus the value of the PDF at 0 cannot exist and the value of the PDF everywhere else is zero.

The same reasoning implies that no distribution containing any vertical leap has a PDF.

Memorize these (approximate) values for the CDF of the standard normal distribution, N(0, 1):

X    CDF(X) - CDF(-X)    CDF(X) = 50% + (CDF(X) - CDF(-X))/2
1    68[.3]%             84.1%
2    95[.45]%            97.725%
3    99.7[3]%            99.865%

(You don't really need to know the last digits, shown in brackets, but it can help.)

Visualize CDF(X) - CDF(-X) as the area between -X and X: it is the probability that a Normal variate will be X standard deviations or less from its mean.  Evidently, the probability of being more than three standard deviations from the mean is small.
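
If you want to verify the table rather than take it on faith, the standard normal CDF can be written in terms of the error function.  The sketch below uses that well-known identity; the function name normal_cdf is chosen for this example.

```python
import math

def normal_cdf(x):
    """CDF of N(0, 1), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (1, 2, 3):
    central = normal_cdf(x) - normal_cdf(-x)   # area between -x and x
    print(f"{x}  {100 * central:.2f}%  {100 * normal_cdf(x):.3f}%")
```

The printed values should agree with the table to the digits shown there.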

Mixture distributions

An important way of constructing new distributions out of old is by mixing them.  Suppose we have two distributions F and G represented by tickets in boxes.  If F and G can be represented by a finite number of tickets each, then a mixture of F and G is obtained by dumping the contents of F's box and G's box into a new box.  If p is the proportion of all tickets originating from F's box, then the proportion of tickets from G's box is 1-p and we use the expression p*F + (1-p)*G to represent the mixture.  The proportion p is just a number; it is fixed; it is one of the parameters of the mixture; it is not a random variable.

We can achieve different mixing proportions by exploiting the fact, noted above, that the numbers of tickets in a box for a distribution are not uniquely defined: we could, for example, make three copies of every ticket in F's box and five copies of every ticket in G's box before combining them all into a new box, thereby changing their mixture proportions.

This intuition is made formal in terms of the CDFs.  Let's use F to denote the CDF of distribution F, and likewise for G.  The proportion of values in the mixture less than some number x must be p*F(x) + (1-p)*G(x).  This both defines and justifies the use of the p*F + (1-p)*G expression.  Similarly, let f be the PDF for F and g the PDF for G (assuming F and G have PDFs).  Then the PDF for the mixture at any number x is p*f(x) + (1-p)*g(x).
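
For finite boxes, the agreement between "dumping tickets together" and the formula p*F + (1-p)*G can be checked directly.  The two small boxes below are hypothetical; the copy counts (three and five) follow the example in the text.

```python
from fractions import Fraction

def box_cdf(box, x):
    """Proportion of tickets with values equal to or less than x."""
    return Fraction(sum(1 for t in box if t <= x), len(box))

# Hypothetical boxes for illustration.
f_box = [0, 1]       # distribution F: two tickets
g_box = [1, 2, 3]    # distribution G: three tickets

# Three copies of every F ticket and five copies of every G ticket.
mixed_box = f_box * 3 + g_box * 5
p = Fraction(len(f_box) * 3, len(mixed_box))   # proportion of tickets from F

for x in (0, 1, 2, 3):
    direct = box_cdf(mixed_box, x)
    formula = p * box_cdf(f_box, x) + (1 - p) * box_cdf(g_box, x)
    print(x, direct, formula, direct == formula)
```

Here p works out to 6/21 = 2/7, a fixed number determined by how many copies of each box were combined.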

For example, the Binomial(2, 0.5) box shown above is a mixture of three distributions: a 0-box, a 1-box, and a 2-box, mixed in proportions of 25%, 50%, and 25%.  Every discrete distribution is similarly a mixture of x-boxes.

Let's mix a continuous with a discrete distribution.  The so-called "delta distribution" (as it is named in some US EPA documents) is a mixture of a zero-box and a Normal distribution.  (The term "delta," in the sense of distributions, is more commonly reserved for the distribution with 100 percent probability situated at zero: the zero-box.)

This is a 60% mixture of a zero-box with 40% of an N(3, 1) distribution.
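
The CDF of this particular mixture can be sketched by applying the mixture formula to a zero-box and an N(3, 1) distribution.  The function names are invented for this illustration, and the evaluation points are arbitrary.

```python
from statistics import NormalDist

def zero_box_cdf(x):
    """CDF of the box holding a single ticket labeled 0."""
    return 1.0 if x >= 0 else 0.0

normal_part = NormalDist(mu=3.0, sigma=1.0)

def delta_mixture_cdf(x, p_zero=0.6):
    """CDF of p_zero * (zero-box) + (1 - p_zero) * N(3, 1)."""
    return p_zero * zero_box_cdf(x) + (1 - p_zero) * normal_part.cdf(x)

for x in (-1.0, -1e-9, 0.0, 3.0, 6.0):
    print(x, round(delta_mixture_cdf(x), 4))
```

The CDF leaps by 0.6 at zero (the discrete part) and rises smoothly everywhere else (the continuous part).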

This mixture is neither discrete nor continuous: it shares aspects of both.  The Lebesgue decomposition theorem (pronounced Luh-beg') states that all distributions can be expressed as a mixture of a purely discrete distribution (vertical leaps only) and a continuous distribution (no vertical leaps at all).  See Paul Halmos, Measure Theory, 1974, Springer-Verlag, NY, Section 32.  This seems obvious, but its subtleties are concerned with distributions that may contain an infinite number of vertical leaps.  It's nice to know about this theorem, because it says we can understand distributions very generally once we understand discrete ones, continuous ones, and how to mix them.

The Excel spreadsheet that made these plots of CDFs and PDFs is available here.

The extended discussion of logarithms and how to compute them has been moved to a separate page at Logarithms.

This page is copyright (c) 2001 Quantitative Decisions.  Please cite it as

This page was created 16 February and last updated 18 February 2003 (minor clarifications).