http://risk.lsd.ornl.gov/homepage/bjc_or416.pdf Detailed discussion of parameter and interval estimators for normal and lognormal distributions, including how to cope with censoring (nondetects).
| (statistical) sample | A collection of observations obtained in a way that can be accurately modeled by draws of tickets from one or more boxes. |
| stochastic simulation | The process of replacing numerical inputs to a mathematical model by probability distributions, drawing random values from those distributions in repeated independent runs of the model, and reporting on the distributions of model outputs. |
To demystify probability distributions, we constructed several of our own.
Take a bunch of numbers. You can use infinitely many, but keep them separated from one another. Draw a horizontal axis with these numbers marked. Choose one of these numbers. It is your start number.
Now draw a vertical axis. Mark it off from 0 to 100%. Choose any height between 0 and 100% (but not equal to 0 or 100). Associate that height with your start number. This will be the value of the CDF for the start number.
You now have a bunch of numbers less than the start and another bunch greater than the start. (One or both of these bunches might be empty, for example if your start was the smallest number. That's ok.) Let's call these the "smaller bunch" and the "greater bunch," respectively.
Repeat the process for the smaller bunch and the greater bunch. The only difference is that for the smaller bunch, you will select a height between 0 and the CDF of the start number. For the greater bunch, you will select a height between the CDF of the start number and 100.
Do this procedure recursively until you have assigned a height to every one of your chosen numbers. There are only two simple rules: (1) you have to make the heights approach 0 as you approach the smallest (leftmost) horizontal value and (2) you have to make the heights approach 100 as you approach the largest (rightmost) horizontal value.
You can now fill in the rest of the CDF so that it leaps upwards only at the horizontal values you originally chose. It defines a discrete probability distribution because (a) it is monotonic, (b) it rises from 0 to 100%, and (c) it rises only in leaps at distinct horizontal points.
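For readers who like to see procedures as code, here is a minimal Python sketch of this recursive construction. The function name is our own invention, and we use the `random` module to stand in for the "choose any height" steps:

```python
import random

def build_discrete_cdf(points):
    """Recursively assign CDF heights to a set of distinct numbers.

    Follows the procedure in the text: pick a start number, give it a
    height strictly between the current bounds, then recurse on the
    smaller and greater bunches with correspondingly narrowed bounds.
    """
    points = sorted(points)
    heights = {}

    def assign(lo_i, hi_i, lo_h, hi_h):
        # lo_i..hi_i: index range of points still needing heights;
        # (lo_h, hi_h): open interval of allowed CDF heights.
        if lo_i > hi_i:
            return
        mid = random.randrange(lo_i, hi_i + 1)   # the "start number"
        h = random.uniform(lo_h, hi_h)           # height between the bounds
        heights[points[mid]] = h
        assign(lo_i, mid - 1, lo_h, h)           # smaller bunch
        assign(mid + 1, hi_i, h, hi_h)           # greater bunch

    assign(0, len(points) - 1, 0.0, 1.0)
    # For a finite set of points, a valid CDF must reach 100% at the
    # largest point, so we set its height to 1 explicitly.
    heights[points[-1]] = 1.0
    return heights
```

The returned dictionary maps each chosen number to its CDF value; the jump at each point (the difference from the previous height) is that point's probability.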
Building a continuous distribution is even easier. We will build its PDF. To do this, draw any kind of continuous curve in the plane. Well, almost. There are a few simple rules. First, a vertical line through any point of the curve must not intersect the curve at any other point (the curve must be the graph of a function). Second, the curve's points must all be above some minimum height. Third, the area between the curve and a line of constant minimum height must be finite. (This is an issue if you extend your curve infinitely far to the right or left.)
To finish the process, draw the X-axis at the line of minimum height. Change the scale on the Y-axis to make the curve's area equal to 1 (100%). You now have a PDF. It defines a continuous distribution.
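The finishing step is just a rescaling, as this Python sketch shows. `normalize_to_pdf` is a hypothetical helper of our own, and the midpoint-rule sum is just one simple way to approximate the area:

```python
import math

def normalize_to_pdf(curve, lo, hi, baseline=0.0, n=100_000):
    """Turn a curve lying at or above `baseline` on [lo, hi] into a PDF.

    Per the text: shift so the baseline becomes the X-axis, then rescale
    the vertical axis so the enclosed area equals 1 (100%).
    """
    dx = (hi - lo) / n
    # Midpoint rule: approximate the area between curve and baseline.
    area = sum(curve(lo + (i + 0.5) * dx) - baseline for i in range(n)) * dx
    if area <= 0 or math.isinf(area):
        raise ValueError("curve must enclose a finite, positive area")
    return lambda x: (curve(x) - baseline) / area

# Example: an arbitrary hand-drawn-style curve on [0, 2].
pdf = normalize_to_pdf(lambda x: 1 + math.sin(3 * x) ** 2, 0.0, 2.0)
```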
Note: There exist continuous distributions whose PDFs are not constructed in this fashion. PDFs do not have to be continuous and indeed they can have vertical asymptotes ("singularities").
Many distributions have at least one of two nice features. First, they have nice mathematical properties. Second, they arise through consideration of some physical process, much as the Normal distribution arises in the theory of errors or the Binomial distribution arises by studying sequences of experimental "successes" and "failures." This makes them useful tools for modeling many phenomena. A distribution you make up might or might not be so useful.
We have to stay sharp when we read statistics books because they use nouns like "mean" in many distinct ways. Some of the uses we have encountered include:
1. The mean of a batch of numbers: their sum divided by their count.
2. The mean of a probability distribution (a box of tickets): its expected value.
3. An estimator of a distribution's mean: a formula, applied to a sample, intended to approximate the value in sense #2.
To illustrate the distinctions made in point 3, suppose you have a sample (x1, x2, ..., xN) from a box (N>=2). That box has a mean (sense #2), but you do not know its value. The sample, considered as a batch, has a mean (sense #1): it is equal to (x1 + x2 + ... + xN)/N. There are many possible estimators (sense #3) of the box's mean (sense #2) that can be constructed from the sample. Some of the better ones among these are:
| The midrange (x[1] + x[N])/2 | |
| The midpoint of the interquartile range | |
| The mean of the hinges | |
| The trimean | |
| The mean of the 16th and 84th percentiles of the sample | |
| The median of the sample | |
| And, of course, the mean (sense #1) of the sample. | |
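Each of these estimators is just a short formula, as the following Python sketch illustrates. The `percentile` helper and its linear-interpolation rule are our own choices (conventions vary), and we approximate the hinges by the 25th and 75th percentiles:

```python
def percentile(xs, p):
    """The p-th percentile (0 <= p <= 100) by linear interpolation."""
    s = sorted(xs)
    k = (len(s) - 1) * p / 100
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)

def mean_estimators(xs):
    """Several competing estimators of a box's mean, per the list above.

    The hinges are approximated here by the 25th and 75th percentiles.
    """
    s = sorted(xs)
    q1, q2, q3 = (percentile(s, p) for p in (25, 50, 75))
    return {
        "midrange": (s[0] + s[-1]) / 2,
        "mean of hinges": (q1 + q3) / 2,
        "trimean": (q1 + 2 * q2 + q3) / 4,
        "mean of 16th/84th percentiles": (percentile(s, 16) + percentile(s, 84)) / 2,
        "median": q2,
        "mean": sum(s) / len(s),
    }
```

For a symmetric sample all of these agree; they diverge (and perform differently) as soon as the data are skewed or contain outliers.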
You can already see from the illustration above that many, many estimators of any distribution's properties can and do exist. Each estimator is just a formula intended to be applied to the values in a statistical sample. To use an estimator, you look up and apply the formula. That's all there is to it!
You can even make up your own estimator without any knowledge of probability and statistics whatsoever. (We will see some examples later. Creativity is not limited to people: even government agencies have gotten in on the act.) Nobody will stop you. You only need to specify two things: (1) the formula to apply to the sample values and (2) the distributional property your formula claims to estimate.
No, there does not have to be any connection between #1 and #2. It helps if there is, though, at least intuitively. This makes it easier to persuade people of your prowess as a statistical expert.
The challenge lies in choosing the estimator. If there are so many, are they all equally good? (No!) So then how do you measure how good an estimator is? How do you compare estimators? How do you select the best one if you have a bunch to choose from? How do you even determine whether an estimator is adequate for your needs? These are questions we will take up shortly.
Statistical sampling can be complex and often looks it. The trick to dealing with complexity is to hide it. We saw an example of this in class.
Our example concerns a sample of four values. We assumed each of those values was independently drawn from an N(0, 2) distribution (Normal with mean 0 and standard deviation 2). From that sample we can compute the standard deviation statistic.
The procedure just described ultimately produces one number: a standard deviation. One way to model the entire process is to generate a very large number of four-value samples. Write each sample's standard deviation on a ticket and put that ticket into a box. This new box approximates the sampling distribution of the standard deviation.
Another useful way to contemplate the sampling distribution is to think of drawing tickets out of a box as a process. We don't actually need to know what is inside the box. We just need to know that the process acts like draws of tickets, with replacement, from some box of definite, unchanging composition.
The process of obtaining four independent N(0, 2) values and computing their standard deviation defines a single new probability distribution, the sampling distribution of the standard deviation.
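This process is easy to imitate without spreadsheet add-ins. Here is a Python sketch (the function name and defaults are our own) that fills the box of tickets:

```python
import random
import statistics

def sd_sampling_distribution(runs=10_000, n=4, mu=0.0, sigma=2.0, seed=1):
    """Fill the 'box' with tickets: each ticket is the sample standard
    deviation of n independent N(mu, sigma) draws."""
    rng = random.Random(seed)
    return [
        statistics.stdev([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(runs)
    ]

tickets = sd_sampling_distribution()
# The mean of these tickets lands near 1.84, noticeably below sigma = 2.
```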
We used Crystal Ball software in class to see how this process works. The next figure, however, was produced with @Risk software, which is also an Excel add-in that performs essentially the same tasks as Crystal Ball.

Both these software products enable you to replace spreadsheet input cells with random variables--tickets in boxes. They monitor other cells, usually containing calculated values. With the push of a button these products independently draw tickets from all the boxes, recalculate the spreadsheet, and make a record of the output cells. They repeat this process many, many times. This creates a large statistical sample of the output cells.
The figure above, produced by @Risk, is similar to the one Crystal Ball produced in class. It displays a histogram of 10,000 standard deviations, each one computed from four independent values drawn from an N(0, 2) distribution. In this figure, the vertical axis evidently shows probability per unit interval, as a histogram should.
The histogram contains some interesting information. In particular, its mean of 1.83 is noticeably lower than 2, the standard deviation of the "underlying" N(0, 2) distribution.
The @Risk software lets us treat this entire simulation process as if it were the drawing of a single ticket--the mean--from one box. To speed things up, I reduced the simulation size from 10,000 to 1,000. The software then drew 100 tickets from this box. That is, it ran the 1,000-draw simulation (using different pseudo-random numbers each time) 100 times, collecting the mean from each run. The following histogram summarizes the results.

Evidently, almost any statistical sample of 1,000 standard deviations will itself have a mean noticeably less than 2. This is strong evidence that the standard deviation formula tends to underestimate the true standard deviation, at least for samples of four values from a Normal distribution.
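The two-level simulation can be sketched the same way. `mean_sd_ticket` is a hypothetical helper of ours producing one "ticket" (the mean of 1,000 simulated standard deviations); we spell out the n−1 standard deviation formula inline for speed:

```python
import random

def sample_sd(xs):
    """Plain sample (n-1) standard deviation formula."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

def mean_sd_ticket(rng, runs=1_000, n=4, mu=0.0, sigma=2.0):
    """One 'ticket': the mean of `runs` standard deviations, each
    computed from n independent N(mu, sigma) draws."""
    sds = [sample_sd([rng.gauss(mu, sigma) for _ in range(n)])
           for _ in range(runs)]
    return sum(sds) / len(sds)

rng = random.Random(7)
# Run the 1,000-draw simulation 100 times, collecting the mean each time.
tickets = [mean_sd_ticket(rng) for _ in range(100)]
```

In our runs, every one of the 100 ticket means falls well below 2, mirroring the histogram's message.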
(Theoretical calculations show the expected value of the standard deviation of four independent N(0, 2) values is 2 * sqrt(2/3) * Gamma(2) / Gamma(3/2) = 1.843, approximately. The general formula for n independent N(mu, sigma) values is sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2). This is always less than sigma, but approaches sigma as n gets large.)
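The theoretical value is easy to check with Python's `math.gamma`; this sketch evaluates sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), which reduces to the n = 4 case quoted above:

```python
from math import gamma, sqrt

def expected_sd(n, sigma=1.0):
    """Expected value of the sample standard deviation of n independent
    N(mu, sigma) values. The factor multiplying sigma is always below 1
    and rises toward 1 as n grows."""
    return sigma * sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
```

For n = 4 and sigma = 2 this gives approximately 1.843, matching both the theory and the simulated histogram mean.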
Here are some things we learned:
| There's nothing mysterious about a probability distribution. It's easy to create your own. | |
| To understand a word like "mean" or "standard deviation," you need to pay special attention to the context. First determine whether a probability model is involved. Then identify the purpose of the word: is it to characterize a batch of numbers? Characterize a distribution? Estimate a distribution's properties? Or something else? | |
| A statistical estimator is just a mathematical formula. To use it, you look it up or have a computer program calculate it. | |
| Well-defined mathematical procedures let us create new probability distributions--in some very complex ways--from old. Simulation software helps us understand the new distributions. |
Return to the Environmental Statistics home page
This page is copyright (c) 2001 Quantitative Decisions. Please cite it as
This page was created 3 March 2001 and last updated 1 April 2001 (to state the theoretical bias in the standard deviation estimate).