Probability

Links to web resources on probability models and density functions

http://www.epa.gov/superfund/programs/risk/rags3adt/appends.pdf  These are the appendices to "RAGS 3A," the US EPA's "Process for Conducting Probabilistic Risk Assessment."  See Appendix C, "Probability Distributions for PRA," for examples of distributions.

Terminology

"Box model" A short term for "ticket in a box model," which describes random phenomena in terms of independent random draws of slips of paper ("tickets") in a box.
Expected value When tickets in a box have numerical values, the expected value of the box is the sum of the values divided by the number of tickets in the box.  If the values represent an amount "won" when a ticket is drawn, then the expected value represents the average amount that will be won in many, many independent draws (with replacement) from the box.

If we write down all the different values written on the tickets (some tickets may have the same value, so there often will be fewer different values than tickets), then the values will have frequencies corresponding to the proportion of tickets containing each value.  Writing V(i) for the ith value and F(i) for its frequency, we get the formula

Expected value = Sum { V(i) * F(i) }
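The two ways of computing an expected value described above (averaging the tickets directly, or weighting each distinct value by its frequency) can be checked against each other with a short Python sketch.  The box contents and the function names here are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

def expected_value(tickets):
    """Sum of the ticket values divided by the number of tickets."""
    return Fraction(sum(tickets), len(tickets))

def expected_value_by_frequency(tickets):
    """The same quantity computed as Sum { V(i) * F(i) } over distinct values."""
    counts = Counter(tickets)
    n = len(tickets)
    return sum(Fraction(value) * Fraction(count, n) for value, count in counts.items())

# Hypothetical box: five tickets carrying three distinct values.
box = [0, 0, 1, 1, 5]
print(expected_value(box))               # 7/5
print(expected_value_by_frequency(box))  # 7/5
```

Exact fractions are used so that the two results can be compared without rounding error.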

NPDES National Pollutant Discharge Elimination System (U.S.).  This program requires any facility discharging wastes to a surface water body (creek, river, estuary, or ocean) to apply for a permit.  The permit specifies maximum allowable concentrations and "loadings" (mass per unit time) of chemical parameters in the waste stream.  The facility must monitor its waste stream periodically, measuring the permitted parameters.  Any values exceeding the permit limits can cause a "notice of violation" and stiff daily fines to be assessed.

A box model can be used to understand and predict these "exceedances."  The tickets in the box say either "violation" or "no violation".  The proportions of tickets in the box determine the chances that any measurement will violate the permit.  This model can be criticized on many grounds, but it shows why even the simplest theoretical considerations of "unfair coins" (which is really the same kind of thing, because the box model for an unfair coin is exactly the same) have immediate application.
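As a rough illustration of how such a box model might be used, the following Python sketch draws tickets with replacement and compares the simulated chance of at least one exceedance with the value computed directly.  The proportion of "violation" tickets (5 percent) and the number of measurements (12) are assumptions chosen only for the example; they do not come from any permit.

```python
import random

def simulate_exceedances(p_violation, n_measurements, rng):
    """Draw n tickets with replacement; count those saying 'violation'."""
    return sum(1 for _ in range(n_measurements) if rng.random() < p_violation)

def chance_of_at_least_one(p_violation, n_measurements):
    """Chance that n independent draws include at least one 'violation'."""
    return 1 - (1 - p_violation) ** n_measurements

rng = random.Random(1)  # arbitrary seed, for repeatability
p = 0.05                # hypothetical proportion of 'violation' tickets
n = 12                  # hypothetical: one measurement per month for a year

print(chance_of_at_least_one(p, n))
counts = [simulate_exceedances(p, n, rng) for _ in range(10000)]
print(sum(1 for c in counts if c > 0) / len(counts))
```

Under these assumed values the chance of at least one exceedance in a year is about 46 percent, even though any single measurement has only a 5 percent chance of violating the permit.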

Discussion

Paradoxes in Probability

The sibling mystery

We discussed several of the paradoxes in probability, beginning with the Sibling Mystery:

A boy you meet on the street tells you he comes from a family of two children.  What is the probability he has a sister?  (What assumptions are needed for this question even to make sense?)

We agreed that in two-child families, the frequencies of boys among the first children are about 0.50 and the frequencies of boys among the second children are also about 0.50.  These values cannot be derived theoretically, but are statements of fact about the world that have to be learned by observation.

We also assumed that the gender of the second child in a family is independent of the gender of the first.  This assumption, too, is subject to empirical testing, but our experience indicates this is at least approximately true.

These frequencies and this independence assumption let us determine the frequencies of the four kinds of two-child family.  Writing the gender of the eldest sibling first, these four kinds are boy-boy, boy-girl, girl-boy, and girl-girl, each occurring with a frequency of 0.50 * 0.50 = 0.25.  Therefore, the frequency of two-boy families is 0.25, of two-girl families is 0.25, and of boy-girl families (in any order) is 0.25 + 0.25 = 0.50.

We modeled this problem using tickets in a box.  The box has one ticket for every two-child family.  On the ticket is written the family composition.  We have just deduced that about 25 percent of those tickets say "two boys," about 25 percent say "two girls," and the remaining 50 percent say "boy and girl."

Warning!  The following analysis is incorrect.  Keep reading to find out why.

The problem situation can now be rephrased entirely in terms of drawing a ticket out of the box.  The ticket has the word "boy" on it.  What are the chances that it is one of the "boy-girl" tickets?  (The extensive debate over the "correct" solution to this problem revolves ultimately around whether this model is appropriate.  It is possible to construct alternative scenarios that require a different probability model.)

If we were to repeat the drawing many, many times (each time replacing the previous ticket, to leave the box contents unchanged), then evidently we would observe about twice as many boy-girl tickets as boy-boy tickets, because there are twice as many boy-girl tickets as boy-boy tickets in the box.  Thus the chance that the boy has a sister is 2/3 (about 67 percent).  In terms of the proportions in the box, this number is 0.50 / (0.25 + 0.50).

* * * * * * * * *

Nick Hobson (to whom I am most grateful for his time and attention) has been kind enough to point out the flaw in the previous argument.  The ticket-in-a-box approach was not incorrect; it was just incorrectly executed!  The process of encountering a boy (which is known as a "convenience sample" in the statistical literature) is not the same as noticing the ticket has the word "boy" on it.  The reason is that when we encounter the boy on the street, we see his gender, but not his sibling's gender.  In terms of tickets, we chance to notice only half the information on the ticket.

Because this can be confusing, let's use a ticket-in-a-box model that more faithfully represents what is going on.  Previously, we let tickets represent families.  In order to model our sampling correctly, the tickets need to add one more piece of information: namely, which sibling we encounter on the street.

Let's do this carefully.  The purpose is to model the encounter in the street, without yet taking into account that it is a boy we meet.  So, in a quarter of the cases (corresponding again to a quarter of all two-child families), the ticket will say "two boys".  We replace each of those tickets with two tickets.  Each says "two boys" on the back, but on the front one of them, corresponding to meeting the older child, says "you meet a boy" and the other, corresponding to meeting the younger child, also says "you meet a boy".

In another quarter of the cases, the ticket will say "boy and girl."  We replace each of those again with two tickets.  On the front one of them says "you meet a boy" but the other one says "you meet a girl."  We continue like this with the other two kinds of tickets, "girl and boy" and "two girls."

In this fashion the box becomes populated as follows:

Proportion   Back            Front
1/8          Two boys        "You meet a boy."
1/8          Two boys        "You meet a boy."
1/8          Boy and girl    "You meet a boy."
1/8          Boy and girl    "You meet a girl."
1/8          Girl and boy    "You meet a girl."
1/8          Girl and boy    "You meet a boy."
1/8          Two girls       "You meet a girl."
1/8          Two girls       "You meet a girl."

Meeting a boy on the street is tantamount to removing all the tickets that say "you meet a girl."  The contents of the box are now

Proportion   Back            Front
1/4          Two boys        "You meet a boy."
1/4          Two boys        "You meet a boy."
1/4          Boy and girl    "You meet a boy."
1/4          Girl and boy    "You meet a boy."

That leaves half of the tickets saying either "boy and girl" or "girl and boy" on the back.  We conclude that the probability the boy has a sister is 1/4 + 1/4 = 1/2, not 2/3.

I was able to reconcile the first (mistaken) analysis with this (correct) analysis by realizing that a two-boy family is twice as likely to have a boy on the street as a boy-girl family.  This doubles the probability of encountering a boy from a two-boy family, thereby raising the chance he has a brother from 1/3 = 0.25 / (0.25 + 0.50) to [2*0.25] / ([2*0.25] + 0.50) = 1/2, implying the chance he has a sister is 1 - 1/2 = 1/2.
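Both box models are small enough to enumerate exactly.  The Python sketch below is one way to do so: the first calculation treats families as tickets (the mistaken model) and the second treats each (family, child met) pair as a ticket (the corrected model).  The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Each two-child family type (elder, younger) has frequency 1/4.
families = list(product("BG", repeat=2))

# Mistaken model: tickets are families; keep those containing a boy.
with_boy = [f for f in families if "B" in f]
mistaken = Fraction(sum(1 for f in with_boy if "G" in f), len(with_boy))

# Corrected model: tickets are (family, index of the child met);
# keep those where the child met is a boy.
tickets = [(f, i) for f in families for i in (0, 1)]
met_boy = [(f, i) for f, i in tickets if f[i] == "B"]
corrected = Fraction(sum(1 for f, i in met_boy if f[1 - i] == "G"), len(met_boy))

print(mistaken)   # 2/3
print(corrected)  # 1/2
```

The only difference between the two calculations is what counts as a ticket, which is exactly the point of the correction.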

By the way, Hobson's analysis was much simpler.  He reasoned by analogy with flipping coins, arguing that if "A friend grabs one (without looking at the other) and announces that it shows heads... [then] the probability that [the other coin shows tails] is 1/2."  That's clear, because the coin tosses were independent.  However, the whole point of this page is to show how we can use ticket-in-a-box models to solve problems in probability so that when we encounter much trickier situations, where coin-flipping analogies and the like become suspect (or harder to prove correct), we have a hope of deriving a correct solution.

The lesson I learned in making this mistake is that one must be careful to ensure that the process of drawing the tickets from the box perfectly emulates the process by which information is actually obtained; it does not suffice just to populate the box with the correct proportions of tickets.

(Updated 26 August 2004.)

* * * * * * * * *

The three coins problem

To solve this problem, we modeled its outcome using a box with tickets.  There are six outcomes, which we deemed equally likely, so we put six tickets in the box.  Each ticket has three items on it, although we are interested only in the last.  These items are the coin (two-headed, two-tailed, or normal), the side facing up, and the side facing down.  To distinguish the sides of the two-headed and two-tailed coin, we imagined they were painted red on one side, blue on the other.  This would not change the probabilities.  The third thing written on each ticket, the side facing down, is what we're ultimately interested in.

This table shows the tickets.

Ticket #   Coin        Side up       Side down
1          Two heads   Head (blue)   Head (red)
2          Two heads   Head (red)    Head (blue)
3          Two tails   Tail (blue)   Tail (red)
4          Two tails   Tail (red)    Tail (blue)
5          Normal      Head          Tail
6          Normal      Tail          Head

The problem tells us heads are up.  In other words, we know the ticket just drawn is either the first, second, or fifth.  Of these, two--numbers 1 and 2--have a head on the other side.  Therefore the probability that the other side is a head is 2/3.  A computer simulation bears this out:

Number of simulations               10000
Number of times head is up          4963
Head is up AND other face is head   3301
Frequency                           66.5%

That is, tickets were drawn from the box 10,000 times (and replaced each time, of course, for the next draw).  Of those, 4,963 corresponded to a coin landing heads up.  3,301 (66.5%) of those indicated the other face was a head.

This was not just a matter of luck.  Additional, independent simulations (in Excel) produced comparable results.  The table summarizes them all:

Number of simulations               10000  10000  10000  10000  10000  10000  10000  10000  10000  10000
Number of times head is up          4936   4963   5007   5037   4962   5032   4877   4995   5060   4989
Head is up AND other face is head   3265   3301   3354   3307   3311   3360   3237   3380   3418   3362
Frequency                           66.1%  66.5%  67.0%  65.7%  66.7%  66.8%  66.4%  67.7%  67.5%  67.4%

The median frequency is 66.75%, the H-spread is 1.0 percentage point, and the range is 2.0 percentage points: very consistent with the predicted value of 2/3 (66.67%) and not at all consistent with the "intuitively obvious" value of 1/2 (50.00%).
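The simulations reported here were run in Excel.  For readers who prefer a script, the following Python sketch draws from the same six-ticket box; it is an independent re-creation under the same assumptions (six equally likely tickets, drawn with replacement), not the original spreadsheet, and the seed is arbitrary.

```python
import random

# Six equally likely tickets: (coin, side up, side down).
TICKETS = [
    ("two heads", "H", "H"),
    ("two heads", "H", "H"),
    ("two tails", "T", "T"),
    ("two tails", "T", "T"),
    ("normal", "H", "T"),
    ("normal", "T", "H"),
]

def simulate(n_draws, rng):
    """Return (times head is up, times head is up and other face is head)."""
    head_up = both_heads = 0
    for _ in range(n_draws):
        _, up, down = rng.choice(TICKETS)
        if up == "H":
            head_up += 1
            if down == "H":
                both_heads += 1
    return head_up, both_heads

rng = random.Random(2004)  # arbitrary seed, for repeatability
head_up, both_heads = simulate(10000, rng)
print(head_up, both_heads, both_heads / head_up)
```

The last number printed is the simulated frequency, which should fall near 2/3.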

The envelope problem

We found this one straightforward.  To model the situation, we supposed there were $X in one envelope (ticket) and $2X in the other.  In drawing tickets from the box with replacement, we would get $X half the time and $2X the other half.  A strategy of opening the selected envelope would therefore have an expected value of $1.5X.  A strategy of switching envelopes would get $2X half the time and $X the other half, again yielding an expected value of $1.5X.  Therefore the two strategies have the same expected value.
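The expected-value comparison can be written out as a tiny Python sketch.  The amount used for X is hypothetical, and the function name is illustrative.

```python
from fractions import Fraction

def expected_winnings(x, switch):
    """Expected amount when the envelopes hold x and 2x and the first pick is random."""
    outcomes = []
    # Each of the two possible first picks has chance 1/2.
    for first, other in ((x, 2 * x), (2 * x, x)):
        outcomes.append(other if switch else first)
    return Fraction(sum(outcomes), len(outcomes))

x = 100  # hypothetical amount in the smaller envelope
print(expected_winnings(x, switch=False))  # 150
print(expected_winnings(x, switch=True))   # 150
```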

In our discussion we recognized that the size of $X might govern the strategy.  For example, if you were in dire need of $20,000, and the first envelope drawn contained $10,000, then you would probably switch.  If, however, $10,000 would change your life, you might decide to "take the money and run" to avoid the risk of losing half of it.  Therefore probability calculations and expected values are only part of the information a (rational) decision maker will use to determine their action.

For an interesting recent analysis of this problem, see Samet et al., One Observation Behind Two-Envelope Puzzles. American Mathematical Monthly 111 (April 2004) pp 347-351.

Simulation with Excel

We spent a period in the computer laboratory.  The purpose was to build a simulation of the Monty Hall problem.  We had partially analyzed this (using a box model) but the analysis was not entirely convincing.  Most people maintained that switching doors had a 50% chance of winning, because the prize evidently was behind one of two doors (1/2 = 50%).

Some of the fundamental principles of using Excel include:

Save your work early and often (the Louisiana voting principle).
Label the cells to document your formulas (the decoration principle).
Use Excel's drag-and-drop interface to create large simulations out of one row of formulas.
Use zeros and ones (see below) to record discrete (yes-no, true-false) events rather than Excel's TRUE and FALSE values.  This will let you count results by summing the zeros and ones.

The functions used for the Monty Hall simulation are

RAND():    Produces a pseudo-random number between 0 and 1 (but never equal to 1).  These numbers are supposed to be uniformly and independently distributed, but they are not.  RAND() is the worst pseudo-random number generator (PRNG) in existence.  It repeats after about 1,000,000 calls and it is not even uniformly distributed.  So don't use it for formal research.  It's ok for testing out ideas in a spreadsheet.  See http://www.quantdec.com/arcview.htm for a formal evaluation, with explanations.

INT():    Returns the largest integer not exceeding its argument.  For example, INT(1.3) = 1; INT(-1.3) = -2; INT(3) = 3.  The expression INT(6*RAND()) produces uniformly distributed numbers in the set {0, 1, 2, 3, 4, 5}.

IF():    Computes a conditional result.  This lets you select among alternative formulas depending on some other value.  For example, IF(RAND() < 0.2, 1, 0) produces the value 1 about 20% of the time and 0 the remaining 80% of the time.

MOD():    Computes a remainder after division.  For example, MOD(7, 3) = 1.
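For readers working outside a spreadsheet, these four functions have close Python counterparts.  The sketch below is an approximate translation; the helper names excel_int and excel_mod are invented for this example, and Python's own random number generator stands in for RAND().

```python
import math
import random

rng = random.Random(0)  # arbitrary seed; rng.random() plays the role of RAND()

def excel_int(x):
    """Like INT(): the largest integer not exceeding x."""
    return math.floor(x)

def excel_mod(n, d):
    """Like MOD() for a positive divisor d: a remainder in 0..d-1."""
    return n % d

die = excel_int(6 * rng.random())       # like INT(6*RAND()): one of 0..5
flag = 1 if rng.random() < 0.2 else 0   # like IF(RAND() < 0.2, 1, 0)

print(excel_int(1.3), excel_int(-1.3), excel_int(3))  # 1 -2 3
print(excel_mod(7, 3), excel_mod(-1, 3))              # 1 2
```

The second MOD example matters for the door arithmetic: with a positive divisor the remainder is never negative, so a result of -1 wraps around to 2.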

The simulation contains 10 columns:

1.    Randomly put the prize behind either door 0, door 1, or door 2: INT(RAND()*3).

2.    Let the player choose a door.  You may use any algorithm you like.  We selected to choose at random, so the formula again is INT(RAND()*3).

3-8.    We need to open a door that (a) is not chosen and (b) does not have the prize.  To do this, we found the unchosen doors using the MOD() function: if I is the number of the chosen door, then MOD(I+1, 3) is either I+1 or I-2 and MOD(I-1, 3) is either I-1 or I+2.  Thus, the two unchosen doors are MOD(I+1, 3) and MOD(I-1, 3).  If either one of these is the prize door, we replace it with the other unchosen door.  This is done using the IF() function.  Finally, we may use any algorithm to simulate Monty Hall's choice of door to open: we may always pick the first one we compute, or we may choose among them at random, for example.  The spreadsheet on these web pages chooses at random.

9-10.    We coded the results using zeros and ones.  For the no-switching strategy, we put a 1 if the chosen door contains the prize.  For the switching strategy (not done in class, but included here), we put a 1 if the remaining unopened door (the one neither chosen by the player nor opened by Monty Hall) contains the prize.  This again uses the IF() function.

(The spreadsheet provided here includes two more columns to check the validity of the computations.  One of them verifies that Monty Hall never opens the door first chosen by the contestant; another verifies that one of the two strategies wins in every situation.  These checks help assure that we have not made stupid mistakes in some of the important columns.)

Summing the values in columns 9 and 10 counts the simulations in which these strategies won the prize.  We performed 1,000 simulations (many times over), finding that switching tends to win about 2/3 of the time: in line with the box model analysis and contradicting our intuition.
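The column-by-column logic can also be sketched in Python.  This is a re-creation of the steps described above, not the spreadsheet itself; the function name and seed are arbitrary, and the two assert statements play the role of the spreadsheet's validity-check columns.

```python
import random

def play_once(rng):
    """One row of the simulation; returns (stay_wins, switch_wins) as 0/1."""
    prize = rng.randrange(3)    # column 1: like INT(RAND()*3)
    chosen = rng.randrange(3)   # column 2: the player chooses at random
    unchosen = [(chosen + 1) % 3, (chosen - 1) % 3]
    # Monty Hall may open any unchosen door that does not hide the prize;
    # when both qualify, he picks between them at random.
    openable = [d for d in unchosen if d != prize]
    opened = rng.choice(openable)
    remaining = next(d for d in unchosen if d != opened)
    assert opened != chosen and opened != prize   # validity check
    stay = 1 if chosen == prize else 0
    switch = 1 if remaining == prize else 0
    assert stay + switch == 1                     # exactly one strategy wins
    return stay, switch

rng = random.Random(3)  # arbitrary seed, for repeatability
rows = [play_once(rng) for _ in range(1000)]
stay_total = sum(s for s, _ in rows)
switch_total = sum(w for _, w in rows)
print(stay_total, switch_total)
```

As in the spreadsheet, recording wins as zeros and ones means the totals are obtained by simple summation.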

Binomial and Normal Probability Distributions

The binomial distribution is the histogram of the idealized probabilities associated with independent coin flips.  We saw how N independent flips of a coin have 2^N possible outcomes.  The binomial distribution tracks only the numbers of heads in the outcomes, rather than the outcomes themselves.  Thus, although each possible outcome has the same probability (of 1/2 * 1/2 * ... * 1/2 = 2^(-N)), the numbers of heads have varying probabilities.

For example, two independent flips of a coin have outcomes HH, HT, TH, and TT with equal probabilities of 1/2 * 1/2 = 2^(-2) = 25%.  There are two different ways to achieve one head: HT and TH, so the probability of one head is 2 * 25% = 50%.  Three independent flips have outcomes HHH, HHT, HTH, HTT, THH, THT, TTH, and TTT with equal probabilities of 1/2 * 1/2 * 1/2 = 2^(-3) = 12.5%.  Three of these outcomes have two heads and three have one head, so the probability of two heads and the probability of one head are both 3 * 12.5% = 37.5%, whereas the probability of three heads and the probability of no heads are each 12.5%.

Figuring these probabilities is just a counting problem, really, not a probability problem.

The number of sequences of N coins that have exactly K heads is variously written C(N,K), Comb(N,K), Binom(N,K), NChooseK, and as N stacked over K inside large parentheses.  (You can tell that this counting problem has occurred in many contexts.)  It has a straightforward formula: C(N,K) = N*(N-1)* ... *(N-K+1) / [K * (K-1) * ... * (2) * (1)], which is equivalent to N!/(K! * (N-K)!) when we write, as usual, N! = N*(N-1)* ... *2*1.  For example, C(3,2) = 3*2/(2*1) = 3 is the number of sequences with K = 2 heads in N = 3 independent flips of a coin.  In general, begin by writing down the value 1.  Multiply by the fraction N/K, reduce both numerator and denominator by 1 (getting (N-1)/(K-1)), multiply by that, and continue in this manner until you can go no further (because you would end up with a denominator of zero).  To illustrate, compute C(5,3) = 1 * 5/3 * 4/2 * 3/1 = 10; C(10,0) = 1 (you stop immediately, because the fraction 10/0 is meaningless).
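The "multiply by N/K, then reduce both by 1" procedure translates directly into a loop.  The following Python sketch uses exact fractions so that intermediate values such as 5/3 cause no rounding trouble; the function name choose is an illustrative choice, and the standard library's math.comb is used only as a cross-check.

```python
from fractions import Fraction
from math import comb

def choose(n, k):
    """C(n, k) computed as 1 * n/k * (n-1)/(k-1) * ... until k reaches 0."""
    result = Fraction(1)
    while k > 0:
        result *= Fraction(n, k)
        n -= 1
        k -= 1
    return int(result)

print(choose(5, 3), choose(10, 0), choose(3, 2))  # 10 1 3
print(choose(5, 3) == comb(5, 3))                 # True
```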

The binomial probabilities therefore are C(N,K) * 2^(-N) = the probability that a sequence of N (N=1, 2, ...) independent flips of a fair coin contains exactly K (K=0, 1, ..., N) heads.  More generally, if the coin is not fair, we may still write p = the constant probability of heads, q = the probability of tails (so q = 1-p), getting C(N,K) * p^K * q^(N-K) for the probability of K heads.  (For a fair coin, p = q = 1/2, so p^K * q^(N-K) = (1/2)^K * (1/2)^(N-K) = 2^(-N), as before.)  You will see this formula in the textbook's description of the binomial distribution (p. 179) and you will see one like it in the description of the related hypergeometric distribution (p. 182).  Whenever you see this formula, you should remember that the C(N,K) term counts the sequences with K heads and the p^K * q^(N-K) term computes the probability of each individual sequence.
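As a quick check of the formula, this Python sketch computes the binomial probabilities exactly.  The function name is illustrative, and math.comb supplies C(N,K).

```python
from fractions import Fraction
from math import comb

def binomial_probability(n, k, p=Fraction(1, 2)):
    """C(n, k) * p^k * q^(n-k): chance of exactly k heads in n independent flips."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

print([str(binomial_probability(3, k)) for k in range(4)])
# ['1/8', '3/8', '3/8', '1/8']
```

The output reproduces the three-flip example: 12.5%, 37.5%, 37.5%, and 12.5% for zero through three heads.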

As N gets very large, the binomial probabilities more and more closely approximate the Normal distribution.  This approximation must be understood in a very specific sense, because--clearly--as N increases, there will tend to be more and more heads on the average and the variation in the number of heads will also increase.  It is only after standardizing the results, by making the mean the center and the standard deviation the scale, that we see the histograms approach a stable limit.  (This result was first stated and proven by Abraham De Moivre in the first half of the eighteenth century.)
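The role of standardization can be seen numerically.  In the Python sketch below, each binomial bar is rescaled so that its center is at (K - mean)/sd and its width is 1/sd (so its height is the probability times sd), and the largest gap between these heights and the standard Normal curve is reported.  The function name and the choice of N values are arbitrary.

```python
from math import comb, exp, pi, sqrt

def compare_to_normal(n, p=0.5):
    """Largest gap between standardized binomial bar heights and the Normal curve."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    worst = 0.0
    for k in range(n + 1):
        prob = comb(n, k) * p**k * (1 - p)**(n - k)
        z = (k - mean) / sd      # standardized number of heads
        height = prob * sd       # bar height once the bar width is 1/sd
        normal = exp(-z * z / 2) / sqrt(2 * pi)
        worst = max(worst, abs(height - normal))
    return worst

for n in (10, 100, 1000):
    print(n, compare_to_normal(n))
```

The gap shrinks as N grows, which is the sense in which the standardized histograms approach a stable limit.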

The same result--that is, achieving a Normal limiting distribution--holds even when the coins involved have varying probabilities of heads, provided we make some assumptions.  This is the Central Limit Theorem (CLT).  Many statisticians are all too ready to apply this theorem--which is strictly theoretical--to real data.  If you are tempted, please remember:

1.    The Central Limit Theorem makes an asymptotic statement.  This means that it is rarely exactly true, but is only approximately true provided N is "big enough."  How big is big enough depends on the circumstance.  In some cases N=5 is fine (rolling five dice is a quick and dirty way to approximate a normal variable), but in other cases you need N=1000, N=10000, or even larger.

2.    The conditions for the CLT to apply, even asymptotically, do not always hold.  For example, a sample of soil obtained at random from a definite site will be subjected to laboratory measurements that themselves fluctuate due to the combined contributions of many, many small errors.  However, the combined measurement uncertainty often is much smaller than the variation in true concentration.  The CLT, naively applied, would predict that the batch of results will be almost Normally distributed.  That quite clearly is not the case in most applications.  One of the CLT's assumptions is that no single random factor influencing the outcome may have a variability much larger than all the others.  That condition is violated in this instance, so the CLT just does not apply.  Assuming its conclusion is a mistake.
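The five-dice remark in point 1 can be explored with a short Python sketch that builds the exact distribution of the total of five fair dice and reports how much probability lies within one standard deviation of the mean.  For a Normal variable that figure is about 0.68; because dice totals are whole numbers, the match should be expected to be rough rather than exact.  The function name is illustrative.

```python
from collections import Counter
from math import sqrt

def dice_sum_distribution(n_dice):
    """Exact distribution of the total of n fair six-sided dice."""
    dist = Counter({0: 1.0})
    for _ in range(n_dice):
        nxt = Counter()
        for total, prob in dist.items():
            for face in range(1, 7):
                nxt[total + face] += prob / 6
        dist = nxt
    return dist

dist = dice_sum_distribution(5)
mean = sum(t * p for t, p in dist.items())
sd = sqrt(sum((t - mean) ** 2 * p for t, p in dist.items()))
within_one_sd = sum(p for t, p in dist.items() if abs(t - mean) <= sd)
print(mean, sd, within_one_sd)
```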

Rules and shortcuts, tips and tricks

Reading statistical formulas

The expression for the standard Normal distribution is exp(-x^2/2) dx.  (Remember, distributions, like histograms, represent probability by area, so we need to multiply the height--exp(-x^2/2)--by the width: dx.)  The area under this curve is not 1 (it is the square root of 2*Pi), so to make it a valid probability density function (pdf) we have to multiply it by a constant to reduce its area to unity.  For the same reason, you should expect to see some constant expression multiplying any pdf.  The value of this constant is obtained by integrating an expression like exp(-x^2/2) dx, so obtaining it is a matter of calculus.  Its actual value is not of much interest in applications.
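The claim about the area can be verified numerically without any calculus.  This Python sketch approximates the area under exp(-x^2/2) with a simple midpoint rule; the integration limits of -10 and 10 and the step count are arbitrary choices that are more than wide and fine enough here.

```python
from math import exp, pi, sqrt

def area_under(f, lo, hi, steps=100000):
    """Midpoint-rule approximation to the area under f between lo and hi."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

area = area_under(lambda x: exp(-x * x / 2), -10.0, 10.0)
print(area, sqrt(2 * pi))  # the two values agree closely
```

Dividing the curve by this area is what turns it into a valid pdf.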

As we have already seen, you can change the location and scale of a variable x with an expression like (x - mu)/sigma.  Look for this kind of expression in most statistical formulas.  If you replace it by a single variable, the formula will usually become much simpler, but its essence--its mathematical "shape"--will be unchanged.

We used these ideas to understand the Gamma distribution (x^Alpha * exp(-x) dx / x) as an example.  The variable Alpha determines the shape of the distribution.  The variables mu and sigma determine its location and scale, as usual.  These three variables are the parameters of the distribution.
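Putting the pieces together, a Gamma density with all three parameters might be sketched in Python as follows.  The normalizing constant is the Gamma function evaluated at Alpha (math.gamma in Python), and the extra division by sigma accounts for the change of scale.  The function names and the parameter values in the check are illustrative assumptions.

```python
from math import exp, gamma

def gamma_density(x, alpha, mu=0.0, sigma=1.0):
    """Gamma density with shape alpha, location mu, and scale sigma."""
    z = (x - mu) / sigma   # the standardized variable
    if z <= 0:
        return 0.0
    # z^alpha * exp(-z) / z, divided by the normalizing constant.
    return z ** (alpha - 1) * exp(-z) / (gamma(alpha) * sigma)

def total_area(alpha, mu, sigma, steps=200000):
    """Midpoint-rule check that the density has area close to 1."""
    lo, hi = mu, mu + 60 * sigma   # the tail beyond 60 scale units is negligible
    width = (hi - lo) / steps
    return sum(gamma_density(lo + (i + 0.5) * width, alpha, mu, sigma)
               for i in range(steps)) * width

print(total_area(alpha=2.5, mu=3.0, sigma=2.0))
```

Replacing (x - mu)/sigma by the single variable z, as in the first line of gamma_density, leaves the simple shape z^Alpha * exp(-z) / z.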


This page is copyright (c) 2001-4 Quantitative Decisions.  Please cite it as

This page was created 10 February 2001 and last updated 26 August 2004 to point out the error in the Sibling Mystery analysis and to provide a reference to a paper on the Envelope problem.