http://www.epa.gov/superfund/programs/risk/rags3adt/appends.pdf These are the appendices to "RAGS 3A," the US EPA's "Process for Conducting Probabilistic Risk Assessment." See Appendix C, "Probability Distributions for PRA," for examples of distributions.

"Box
model" |
A short term for "ticket in a box model," which describes random phenomena in terms of independent random draws of slips of paper ("tickets") in a box. |

Expected value | When tickets in a box have numerical values, the expected value of the box is the sum of the values divided by the number of tickets in the box. If the values represent an amount "won" when a ticket is drawn, then the expected value represents the average amount that will be won in many, many independent draws (with replacement) from the box. If we write down all the different values V(i) appearing on the tickets (some tickets may have the same value, so there often will be fewer different values than tickets), together with the proportion F(i) of tickets bearing each value, then Expected value = Sum { V(i) * F(i) }. |
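A few lines of code can confirm that the two descriptions of the expected value agree. The ticket values below are a made-up example; `Fraction` keeps the arithmetic exact:

```python
from collections import Counter
from fractions import Fraction

# A hypothetical box of six tickets with numerical values.
box = [1, 1, 1, 5, 5, 10]

# First description: sum of the values divided by the number of tickets.
ev_direct = Fraction(sum(box), len(box))

# Second description: Sum { V(i) * F(i) } over the distinct values V(i),
# where F(i) is the proportion of tickets bearing value V(i).
counts = Counter(box)
ev_weighted = sum(Fraction(v * c, len(box)) for v, c in counts.items())

print(ev_direct, ev_weighted)  # both 23/6
```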

NPDES |
National
Pollutant Discharge Elimination System (U.S.). This program requires
any facility discharging wastes to a surface water body (creek, river,
estuary, or ocean) to apply for a permit. The permit specifies
maximum allowable concentrations and "loadings" (mass per unit
time) of chemical parameters in the waste stream. The facility must
monitor its waste stream periodically, measuring the permitted
parameters. Any values exceeding the permit limits can cause a
"notice of violation" and stiff daily fines to be assessed.
A box model can be used to understand and predict these "exceedances." The tickets in the box say either "violation" or "no violation". The proportions of tickets in the box determine the chances that any measurement will violate the permit. This model can be criticized on many grounds, but it shows why even the simplest theoretical considerations of "unfair coins" (which is really the same kind of thing, because the box model for an unfair coin is exactly the same) have immediate application. |
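This exceedance box model is simple enough to sketch in a few lines of code. The 5 percent "violation" proportion below is purely hypothetical, as are the seed and measurement count:

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility

# Hypothetical box: suppose 5% of the tickets say "violation".
p_violation = 0.05
n_measurements = 10_000

# Each draw (with replacement) models one monitoring measurement.
exceedances = sum(random.random() < p_violation for _ in range(n_measurements))
print(exceedances / n_measurements)  # close to 0.05
```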

We discussed several of the paradoxes in probability, beginning with the Sibling Mystery:

A boy you meet on the street tells you he comes from a family of two children. What is the probability he has a sister? (What assumptions are needed for this question even to make sense?)

We agreed that in two-child families, the frequencies of boys among the first children are about 0.50 and the frequencies of boys among the second children are also about 0.50. These values cannot be derived theoretically, but are statements of fact about the world that have to be learned by observation.

We also assumed that the gender of the second child in a family is independent of the gender of the first. This assumption, too, is subject to empirical testing, but our experience indicates this is at least approximately true.

These frequencies and this independence assumption let us determine the frequencies of the four kinds of two-child family. Writing the gender of the eldest sibling first, these four kinds are boy-boy, boy-girl, girl-boy, and girl-girl, each occurring with a frequency of 0.50 * 0.50 = 0.25. Therefore, the frequency of two-boy families is 0.25, of two-girl families is 0.25, and of boy-girl families (in any order) is 0.25 + 0.25 = 0.50.

We modeled this problem using tickets in a box. The box has one ticket for every two-child family. On the ticket is written the family composition. We have just deduced that about 25 percent of those tickets say "two boys," about 25 percent say "two girls," and the remaining 50 percent say "boy and girl."

**Warning! The following analysis is incorrect.
Keep reading to find out why.**

The problem situation can now be rephrased entirely in terms of drawing a ticket out of the box. The ticket has the word "boy" on it. What are the chances that it is one of the "boy-girl" tickets? (The extensive debate over the "correct" solution to this problem revolves ultimately around whether this model is appropriate. It is possible to construct alternative scenarios that require a different probability model.)

If we were to repeat the drawing many, many times (each time replacing the previous ticket, to leave the box contents unchanged), then evidently we would observe about twice as many boy-girl tickets as boy-boy tickets, because there are twice as many boy-girl tickets as boy-boy tickets in the box. Thus the chance that the boy has a sister is 2/3 (about 67 percent). In terms of the proportions in the box, this number is 0.50 / (0.25 + 0.50).

*** * * * * * * * ***

Nick Hobson (to whom I am most grateful for his time and
attention) has been kind enough to point out the flaw in the previous
argument. The ticket-in-a-box approach was not incorrect; it was just
incorrectly executed! The process of encountering a boy (which is known as
a "convenience sample" in the statistical literature) is not the same
as noticing the ticket has the word "boy" on it. The reason is
that when we encounter the boy on the street, we see his gender, but not his
sibling's gender. In terms of tickets, we chance to notice *only half
the information on the ticket*.

Because this can be confusing, let's use a ticket-in-a-box
model that more faithfully represents what is going on. Previously, we let
tickets represent families. In order to model our sampling correctly, the
tickets need to add one more piece of information: namely, *which sibling we
encounter on the street*.

Let's do this carefully. The purpose is to model the
encounter on the street, without yet taking into account that it's a boy we
meet. So, in a quarter of the cases (corresponding again to a quarter of
all two-child families), the ticket will say "two boys". We
replace each of those tickets with *two* tickets. Each says "two
boys" on the back, but on the front one of them, corresponding to meeting
the older child, says "you meet a boy" and the other,
corresponding to meeting the younger child, also says "you meet a
boy".

In another quarter of the cases, the ticket will say "boy
and girl." We replace each of those again with two tickets. On
the front one of them says "you meet a boy" but the other one says
"you meet a *girl*." We continue like this with the other
two kinds of tickets, "girl and boy" and "two girls."

In this fashion the box becomes populated as follows:

Proportion | Back | Front |
1/8 | Two boys | "You meet a boy." |
1/8 | Two boys | "You meet a boy." |
1/8 | Boy and girl | "You meet a boy." |
1/8 | Boy and girl | "You meet a girl." |
1/8 | Girl and boy | "You meet a girl." |
1/8 | Girl and boy | "You meet a boy." |
1/8 | Two girls | "You meet a girl." |
1/8 | Two girls | "You meet a girl." |

Meeting a boy on the street is tantamount to removing all the tickets that say "you meet a girl." The contents of the box are now

Proportion | Back | Front |
1/4 | Two boys | "You meet a boy." |
1/4 | Two boys | "You meet a boy." |
1/4 | Boy and girl | "You meet a boy." |
1/4 | Girl and boy | "You meet a boy." |

That leaves half of the tickets saying either "boy and girl" or "girl and boy" on the back. We conclude that the probability the boy has a sister is 1/4 + 1/4 = 1/2, not 1/3.

I was able to reconcile the first (mistaken) analysis with this (correct) analysis by realizing that a two-boy family is twice as likely to have a boy on the street as a boy-girl family. This doubles the probability of encountering a boy from a two-boy family, thereby raising the chance he has a brother from 1/3 = 0.25 / (0.25 + 0.50) to [2*0.25] / ([2*0.25] + 0.50) = 1/2, implying the chance he has a sister is 1 - 1/2 = 1/2.
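The corrected box model lends itself to direct simulation. The sketch below draws family tickets and emulates the street encounter, noticing only the gender of the child we meet (the seed and trial count are arbitrary choices):

```python
import random

random.seed(2004)  # arbitrary seed, for reproducibility

n_trials = 100_000
boy_count = sister_count = 0

for _ in range(n_trials):
    # One ticket per family: each child's gender, independent and 50/50.
    children = [random.choice("BG"), random.choice("BG")]
    met = random.choice([0, 1])       # which sibling we happen to meet
    if children[met] == "B":          # all we notice is that we met a boy
        boy_count += 1
        if children[1 - met] == "G":  # does he have a sister?
            sister_count += 1

print(sister_count / boy_count)  # close to 1/2, not 2/3
```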

By the way, Hobson's analysis was much simpler. He reasoned by analogy with flipping coins, arguing that if "A friend grabs one (without looking at the other) and announces that it shows heads... [then] the probability that [the other coin shows tails] is 1/2." That's clear, because the coin tosses were independent. However, the whole point of this page is to show how we can use ticket-in-a-box models to solve problems in probability so that when we encounter much trickier situations, where coin-flipping analogies and the like become suspect (or harder to prove correct), we have a hope of deriving a correct solution.

*The lesson I learned in making this mistake is that one
must be careful to ensure that the process of drawing the tickets from the box
perfectly emulates the process by which information is actually obtained; it
does not suffice just to populate the box with the correct proportions of
tickets.*

(Updated 26 August 2004.)

*** * * * * * * * ***

To solve this problem, we modeled its outcome using a box with
tickets. There are six outcomes, which we deemed equally likely, so we put
six tickets in the box. Each ticket has three items on it, although we are
interested only in the last. These items are the coin (two-headed,
two-tailed, or normal), the side facing up, and the side facing down. To distinguish the sides of
the two-headed and two-tailed coin, we imagined they were painted red on one
side, blue on the other. This would not change the probabilities.
The third thing written on each ticket, the side facing *down*, is what we're ultimately interested in.

This table shows the tickets.

Ticket # | Coin | Side up | Side down |
1 | Two heads | Head (blue) | Head (red) |
2 | Two heads | Head (red) | Head (blue) |
3 | Two tails | Tail (blue) | Tail (red) |
4 | Two tails | Tail (red) | Tail (blue) |
5 | Normal | Head | Tail |
6 | Normal | Tail | Head |

The problem tells us heads are up. In other words, we know the ticket just drawn is either the first, second, or fifth. Of these, two--numbers 1 and 2--have a head on the other side. Therefore the probability that the other side is a head is 2/3. A computer simulation bears this out:

Number of simulations | 10000 |

Number of times head is up | 4963 |

Head is up AND other face is head | 3301 |

Frequency | 66.5% |

That is, tickets were drawn from the box 10,000 times (and replaced each time, of course, for the next draw). Of those, 4,963 corresponded to a coin landing heads up. 3,301 (66.5%) of those indicated the other face was a head.

This was not just a matter of luck. Additional, independent simulations (in Excel) produced comparable results. The table summarizes them all:

Number of simulations | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |

Number of times head is up | 4936 | 4963 | 5007 | 5037 | 4962 | 5032 | 4877 | 4995 | 5060 | 4989 |

Head is up AND other face is head | 3265 | 3301 | 3354 | 3307 | 3311 | 3360 | 3237 | 3380 | 3418 | 3362 |

Frequency | 66.1% | 66.5% | 67.0% | 65.7% | 66.7% | 66.8% | 66.4% | 67.7% | 67.5% | 67.4% |

The median frequency is 66.75%, the H-spread is 1.0%, and the range is 2.0%: very consistent with the predicted value of 2/3 (66.67%) and not at all consistent with the "intuitively obvious" value of 1/2 (50.00%).
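The same six-ticket box is easy to simulate outside Excel. Here is a sketch in Python (the seed and trial count are arbitrary):

```python
import random

random.seed(42)  # arbitrary

# The six equally likely tickets: (coin, side up, side down).
tickets = [
    ("Two heads", "Head (blue)", "Head (red)"),
    ("Two heads", "Head (red)",  "Head (blue)"),
    ("Two tails", "Tail (blue)", "Tail (red)"),
    ("Two tails", "Tail (red)",  "Tail (blue)"),
    ("Normal",    "Head",        "Tail"),
    ("Normal",    "Tail",        "Head"),
]

n = 10_000
heads_up = heads_down_too = 0
for _ in range(n):
    coin, up, down = random.choice(tickets)
    if up.startswith("Head"):
        heads_up += 1
        if down.startswith("Head"):
            heads_down_too += 1

print(heads_down_too / heads_up)  # close to 2/3
```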

We found this one straightforward. To model the
situation, we supposed there were $X in one envelope (ticket) and $2X in the
other. In drawing tickets from the box with replacement, we would
get $X half the time and $2X the other half. A strategy of opening
the selected envelope would therefore have an *expected value* of
$1.5X. A strategy of switching envelopes would get $2X half the time and
$X the other half, again yielding an expected value of $1.5X. Therefore
the two strategies have the same expected value.
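The two expected values can be verified with a few lines of exact arithmetic, measuring all amounts in units of X:

```python
from fractions import Fraction

X = Fraction(1)  # measure all amounts in units of X

# Opening the selected envelope: $X half the time, $2X the other half.
keep = Fraction(1, 2) * X + Fraction(1, 2) * (2 * X)

# Switching envelopes: $2X half the time, $X the other half.
switch = Fraction(1, 2) * (2 * X) + Fraction(1, 2) * X

print(keep, switch)  # both 3/2, i.e. $1.5X
```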

In our discussion we recognized that the size of $X might govern the strategy. For example, if you were in dire need of $20,000, and the first envelope drawn contained $10,000, then you would probably switch. If, however, $10,000 would change your life, you might decide to "take the money and run" to avoid the risk of losing all of it. Therefore probability calculations and expected values are only part of the information a (rational) decision maker will use to determine their action.

For an interesting recent analysis of this problem, see Samet
et al., *One Observation Behind Two-Envelope Puzzles*. American
Mathematical Monthly 111 (April 2004) pp 347-351.

We spent a period in the computer laboratory. The purpose was to build a simulation of the Monty Hall problem. We had partially analyzed this (using a box model) but the analysis was not entirely convincing. Most people maintained that switching doors had a 50% chance of winning, because the prize evidently was behind one of two doors (1/2 = 50%).

Some of the fundamental principles of using Excel include:

Save your work early and often (the Louisiana voting principle).

Label the cells to document your formulas (the decoration principle).

Use Excel's drag-and-drop interface to create large simulations out of one row of formulas.

Use zeros and ones (see below) to record discrete (yes-no, true-false) events rather than Excel's TRUE and FALSE values. This will let you count results by summing the zeros and ones.

The functions used for the Monty Hall simulation are

**RAND**(): Produces a pseudo-random
number between 0 and 1 (but never equal to 1). These numbers are supposed
to be uniformly and independently distributed, but they are not. RAND() is
the worst pseudo-random number generator (PRNG) in existence. It repeats
after about 1,000,000 calls and it is not even uniformly distributed. So
don't use it for formal research. It's ok for testing out ideas in a
spreadsheet. See http://www.quantdec.com/arcview.htm
for a formal evaluation, with explanations.

**INT**(): Returns the largest integer
not exceeding its argument. For example, INT(1.3) = 1; INT(-1.3) = -2;
INT(3) = 3. The expression INT(6*RAND()) produces uniformly distributed
numbers in the set {0, 1, 2, 3, 4, 5}.

**IF**(): Computes a conditional
result. This lets you select among alternative formulas depending on some
other value. For example, IF(RAND() < 0.2, 1, 0) produces the value 1
about 20% of the time and 0 the remaining 80% of the time.

**MOD**(): Computes a remainder after
division. For example, MOD(7, 3) = 1.

The simulation contains 10 columns:

1. Randomly put the prize behind either door 0, door 1, or door 2: INT(RAND()*3).

2. Let the player choose a door. You may use any algorithm you like. We elected to choose at random, so the formula again is INT(RAND()*3).

3-8. We need to open a door that (a) is not
chosen and (b) does not have the prize. To do this, we found the unchosen
doors using the MOD() function: if I is the number of the chosen door, then
MOD(I+1, 3) is either I+1 or I-2 and MOD(I-1, 3) is either I-1 or I+2.
Thus, the two unchosen doors are MOD(I+1, 3) and MOD(I-1, 3). If either
one of these is the prize door, we *replace* it with the other unchosen
door. This is done using the IF() function. Finally, we may use any
algorithm to simulate Monty Hall's choice of door to open: we may always pick
the first one we compute, or we may choose among them at random, for
example. The spreadsheet on these web pages chooses at random.

9-10. We coded the results using zeros and ones. For the no-switching strategy, we put a 1 if the chosen door contains the prize. For the switching strategy (not done in class, but included here), we put a 1 if the unopened door contains the prize. This again uses the IF() function.

(The spreadsheet provided here includes two more columns to check the validity of the computations. One of them verifies that Monty Hall never opens the door first chosen by the contestant; another verifies that one of the two strategies wins in every situation. These checks help assure that we have not made stupid mistakes in some of the important columns.)

Summing the values in columns 9 and 10 counts the simulations
in which these strategies won the prize. We performed 1,000 simulations
(many times over), finding that __switching tends to win about 2/3 of the time__:
in line with the box model analysis and contradicting our intuition.
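For readers without Excel, the ten-column scheme can be condensed into a short Python sketch. The seed and the 10,000-trial count are arbitrary choices; as in the spreadsheet, Monty chooses at random when both unchosen doors are openable:

```python
import random

random.seed(3)  # arbitrary

def trial():
    prize = random.randrange(3)         # column 1: prize door
    chosen = random.randrange(3)        # column 2: player's choice
    # Columns 3-8: the two unchosen doors; if one holds the prize,
    # Monty must open the other.
    a, b = (chosen + 1) % 3, (chosen - 1) % 3
    if a == prize:
        opened = b
    elif b == prize:
        opened = a
    else:
        opened = random.choice([a, b])  # Monty picks either at random
    unopened = 3 - chosen - opened      # door labels 0+1+2 sum to 3
    # Columns 9-10: 1 if each strategy wins, else 0.
    return (1 if chosen == prize else 0,
            1 if unopened == prize else 0)

n = 10_000
stay_wins, switch_wins = map(sum, zip(*(trial() for _ in range(n))))
print(stay_wins / n, switch_wins / n)  # about 1/3 and 2/3
```

Note that the two win counts must sum to n, which reproduces one of the spreadsheet's validity checks: exactly one strategy wins in every situation.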

The binomial distribution is the histogram of the idealized probabilities associated with independent coin flips. We saw how N independent flips of a coin have 2^{N} possible outcomes. The binomial distribution tracks only the numbers of heads in the outcomes, rather than the outcomes themselves. Thus, although each possible outcome has the same probability (of 1/2 * 1/2 * ... * 1/2 = 2^{-N}), the numbers of heads have varying probabilities.

For example, two independent flips of a coin have outcomes HH,
HT, TH, and TT with equal probabilities of 1/2 * 1/2 = 2^{-2} =
25%. There are two different ways to achieve one head: HT and TH, so the
probability of one head is 2 * 25% = 50%. Three independent flips have
outcomes HHH, HHT, HTH, HTT, THH, THT, TTH, and TTT with equal probabilities of
1/2 * 1/2 * 1/2 = 2^{-3} = 12.5%. Three of these outcomes have two
heads and three have one head, so the probability of two heads and the
probability of one head are both 3 * 12.5% = 37.5%, whereas the probability of
three heads or the probability of no heads are each 12.5%.

Figuring these probabilities is just a counting problem, really, not a probability problem.

The number of sequences of N coins that have exactly K heads is variously written C(N,K), Comb(N,K), or Binom(N,K). The binomial probabilities therefore are C(N,K) * 2^{-N}.
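The counting itself is easy to check in code. Python's `math.comb` computes C(N,K), and dividing by 2^{N} gives the binomial probabilities from the three-flip example:

```python
from math import comb

N = 3
# C(N,K) * 2^(-N) for K = 0, 1, 2, 3 heads
probs = [comb(N, K) / 2**N for K in range(N + 1)]
print(probs)  # [0.125, 0.375, 0.375, 0.125], matching the three-flip example
```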

As N gets very large, the binomial probabilities more and more
closely approximate the *Normal* distribution. This approximation
must be understood in a very specific sense, because--clearly--as N increases,
there will tend to be more and more heads on the average and the variation in
the number of heads will also increase. It is only after *standardizing*
the results, by making the mean the center and the standard deviation the scale,
that we see the histograms approach a stable limit. (This result was first
stated and proven by Abraham De Moivre in the first half of the eighteenth
century.)
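This limiting behavior can be glimpsed numerically. The sketch below compares the cumulative probability of the standardized number of heads with the Normal cumulative probability at a single cutoff (the choice of z = 1 and of the N values is arbitrary):

```python
from math import comb, erf, sqrt

def binomial_cdf_standardized(N, z):
    """P((heads - N/2) / (sqrt(N)/2) <= z) for N fair-coin flips."""
    mean, sd = N / 2, sqrt(N) / 2
    return sum(comb(N, k) / 2**N for k in range(N + 1) if (k - mean) / sd <= z)

def normal_cdf(z):
    """Standard Normal cumulative probability via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# The standardized binomial probabilities approach the Normal limit.
for N in (10, 100, 1000):
    print(N, binomial_cdf_standardized(N, 1.0), normal_cdf(1.0))
```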

The same result--that is, achieving a Normal limiting
distribution--holds even when the coins involved have varying probabilities of
heads, __provided we make some assumptions__. This is the *Central
Limit Theorem* (CLT). Many statisticians are all too ready to apply
this theorem--which is strictly theoretical--to real data. If you are
tempted, please remember

1. The Central Limit Theorem makes an *asymptotic* statement. This means that it is rarely exactly true, but is only approximately true provided N is "big enough." How big is big enough depends on the circumstance. In some cases N=5 is fine (rolling five dice is a quick and dirty way to approximate a normal variable), but in other cases you need N=1000, N=10000, or even larger.

2. The conditions for the CLT to apply, even asymptotically, do not always hold. For example, a sample of soil obtained at random from a definite site will be subjected to laboratory measurements that themselves fluctuate due to the combined contributions of many, many small errors. However, the combined measurement uncertainty often is much smaller than the variation in true concentration. The CLT, naively applied, would predict that the batch of results will be almost Normally distributed. That quite clearly is not the case in most applications. One of the CLT's assumptions is that no single random factor influencing the outcome may have a variability much larger than all the others. That condition is violated in this instance, so the CLT just does not apply. Assuming its conclusion is a mistake.
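The five-dice remark can be tested directly. The sketch below estimates how often the sum of five dice falls within one standard deviation of its mean; a Normal variable would give about 68 percent (the seed and trial count are arbitrary):

```python
import random
from math import sqrt

random.seed(7)  # arbitrary

# Sum of five fair dice: mean 17.5, variance 5 * 35/12.
mu, sigma = 17.5, sqrt(5 * 35 / 12)

n = 100_000
within_one_sd = sum(
    abs(sum(random.randint(1, 6) for _ in range(5)) - mu) <= sigma
    for _ in range(n)
) / n
print(within_one_sd)  # near 0.70; rough agreement with the Normal 0.68
```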

The expression for the standard Normal distribution is exp(-x^{2}/2)dx.
(Remember, distributions, like histograms, represent probability by area, so we
need to multiply the height--exp(-x^{2}/2)--by the width: dx.) The
area under this curve is not 1 (it is the square root of 2*Pi), so to make it a
valid probability distribution function (pdf) we have to multiply it by a
constant to reduce its area to unity. For the same reason, *you should
expect to see some constant expression multiplying any pdf*. The
value of this constant is obtained by integrating an expression like exp(-x^{2}/2)dx,
so obtaining it is a matter of calculus. Its actual value is not of much
interest in applications.
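The square-root-of-2*Pi claim is easy to verify numerically, for instance with the trapezoid rule over a wide interval (the tails beyond +/-10 are negligible; the step count is an arbitrary choice):

```python
from math import exp, pi, sqrt

# Trapezoid rule for the area under exp(-x^2/2); the tails beyond
# +/-10 contribute less than 1e-21 and can be ignored.
a, b, n = -10.0, 10.0, 100_000
h = (b - a) / n
area = h * (0.5 * (exp(-a**2 / 2) + exp(-b**2 / 2)) +
            sum(exp(-(a + i * h) ** 2 / 2) for i in range(1, n)))

print(area, sqrt(2 * pi))  # both about 2.5066
```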

As we have already seen, you can change the location and scale of a variable x with an expression like (x - mu)/sigma. Look for this kind of expression in most statistical formulas. If you replace it by a single variable, the formula will usually become much simpler, but its essence--its mathematical "shape"--will be unchanged.

We used these ideas to understand the Gamma distribution (x^{Alpha} * exp(-x) dx/x) as an example. The variable *Alpha* determines the shape of the distribution. The variables mu and sigma determine its location and scale, as usual. These three variables are the *parameters* of the distribution.
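As a check, numerically integrating the Gamma kernel recovers its normalizing constant, the Gamma function of Alpha, which is exactly the constant that divides the pdf to give unit area (a sketch; the upper limit of 60 and the step count are arbitrary choices, and the tail beyond 60 is negligible for these shapes):

```python
from math import exp, gamma

def gamma_kernel_area(alpha, upper=60.0, n=200_000):
    """Midpoint-rule integral of x^(alpha) * exp(-x) / x over (0, upper)."""
    h = upper / n
    # The midpoint rule avoids evaluating the kernel at x = 0.
    return h * sum(((i + 0.5) * h) ** (alpha - 1) * exp(-(i + 0.5) * h)
                   for i in range(n))

for alpha in (1.0, 2.0, 3.5):
    print(alpha, gamma_kernel_area(alpha), gamma(alpha))
```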


*This page is copyright (c) 2001-4 Quantitative Decisions.
Please cite it as
*

*This page was created 10 February 2001 and last updated 26
August 2004 to point out the error in the *Sibling Mystery*
analysis and to provide a reference to a paper on the *Envelope*
problem.*