Hypothesis Tests

Links to introductory web resources on hypothesis tests

http://www.quantdec.com/Articles/pcbpipe/pcbpipe.pdf  This report, which I wrote for an industrial client, illustrates how many ideas from this course--estimation, testing, the meaning of probability, rational decision theory--can be applied in a practical environmental setting.  The sensitive details have been removed, of course.  What remains is "a self-contained analysis of a tolerance limit test ...  An appendix develops the theory, beginning with definitions, and assesses the performance of the test in two ways."

http://www.statistics.com/content/teaching/Biology/HYPOTEST.htm  Notes from a U. of Waterloo course on "Statistics and experimental design."  Thoughtful discussion of hypothesis tests.

http://www.cnr.colostate.edu/~anderson/thompson1.html  A lengthy bibliography on articles and books about statistical hypothesis testing.

http://www.sjsu.edu/faculty/gerstman/EpiInfo/basics.htm#The%20Basis%20of%20Inference  "A brief review of some principles" includes some choice remarks about hypothesis testing.  (From San Jose State University, California.)

http://ericae.net/edo/ED366654.htm  "The Concept of Statistical Significance Testing: ... This Digest will help you better understand the concept of significance testing."

http://www.cnr.colostate.edu/~anderson/nester.html  "A few quotes regarding hypothesis testing" (most of them arguing against it).

Discussion

The mythology of hypothesis testing

An amazing amount of nonsense has been written about statistical tests of hypotheses.  Some of the myths include

Each myth below is quoted first, followed by an explanation of what is wrong with it.

"In order to formulate [an hypothesis] test, usually some theory has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved..."

No theory at all is needed to formulate an hypothesis test.  Indeed, many tests in environmental statistics are formulated in the absence of any theory or basis for argument.  Consider tests that determine whether a mean concentration meets a standard or not, for example.

"We give special consideration to the null hypothesis. This is due to the fact that the null hypothesis relates to the statement being tested...

This is meaningless, because both the null hypothesis and its alternative "relate" to what is being tested.

"The alternative hypothesis, H1, is a statement of what a statistical hypothesis test is set up to establish."  [This and the preceding two statements appear at http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#hypothtest]

This is out and out false.  The null hypothesis typically is set up to enable a sampling distribution to be computed.  The alternative hypothesis, by definition, comprises every other possible state of nature.

"Null hypothesis: This posits that there is no difference whatsoever between variables being tested." [http://www.niwa.cri.nz/pgsf/stats/defn.html ]

The null hypothesis can be any set of states of nature.  Positing "no difference" is just a common application of hypothesis testing, but does not define the null hypothesis.

"The convention in science is to reject the Null Hypothesis when there is a 95% chance that it is wrong (or conversely a 5% chance that it's true)." [http://biology.soton.ac.uk/bs209/revision2.shtml]

This is a criterion for a work to appear in some scientific publications, but has no basis in scientific or statistical theory.

"A Type I error happens if the null hypothesis is rejected when it should not be (the probability of this is called "alpha")..." [http://ericae.net/edo/ED410231.htm

The probability of the null hypothesis being rejected when it is true is at most alpha, by definition; often, in fact, it is much less than alpha.
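To see why the actual rejection probability can fall well below alpha, consider a discrete test.  Here is a small computational sketch (my own illustration, in Python, not drawn from any of the quoted pages): an exact one-sided binomial test with nominal alpha = 0.05 whose true Type I error rate is only about two percent.

    # Sketch: for a discrete test statistic, the chance of rejecting a true
    # null hypothesis is typically strictly less than the nominal alpha.
    # Example: exact one-sided binomial test of p = 0.5 with n = 20 trials.
    from scipy.stats import binom

    n, p0, alpha = 20, 0.5, 0.05

    # Smallest critical value c with P(X >= c | p = p0) <= alpha.
    c = next(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha)

    actual_size = binom.sf(c - 1, n, p0)   # P(X >= c) when the null is true
    print(f"Reject when X >= {c}; actual Type I error = {actual_size:.4f}")
    # Reject when X >= 15; actual Type I error = 0.0207 (nominal alpha = 0.05)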

"Sometimes scientists state their hypothesis in such a way that they expect the prediction to be false.  Why?  Because it may be difficult to get test results that are absolutely "Yes" or "No"."  [http://members.home.net/kfuller/project.html]

This outrageous statement is entirely contrary to scientific method.  A scientist's ability to achieve unambiguous experimental outcomes is completely unrelated to the issue of what to choose as the null hypothesis.

The logic of hypothesis tests

Statistics is empirical: it uses observations to help us draw conclusions.  The goals of statistics are to improve how we reason with data and to help us understand our reasoning.

Empirical deductions do not have the same finality as logical proofs in mathematics.  Using data alone we can never incontrovertibly establish that something is true.  There is always the possibility of exception.  Uncertainty attends all empirical reasoning.

Therefore statistical tests cannot, and do not, aim to establish that something is the case.  They can only evaluate how strongly the data support or contradict a conclusion.

We hypothesize a class of probability models for an experimental outcome.  These are the possible "states of nature."  Estimation (discussed previously) consists of identifying the model most consistent with the outcome.  By contrast, hypothesis testing consists of identifying some special model or models within the class (the "null hypothesis") and assessing the degree to which the experimental outcome constitutes evidence against the special model, relative to all other models we are considering.

If the test concludes that the outcome is strong evidence against the null hypothesis, then at least one, and maybe more than one, of the following is true:

The null hypothesis contains the correct model, but the observed outcome was a rare event
The null hypothesis is not a correct model of the experiment
The entire class of hypothetical models does not contain any model that adequately describes the state of nature
The experimental outcome is not adequately described in terms of a probability model.

If the test concludes that the outcome is not strong evidence against the null hypothesis, then any of the following may be the case:

The null hypothesis contains the correct model
Some other model in the class is correct
Some model we have not considered at all is correct
The experimental outcome is not adequately described in terms of a probability model.

There are several schools of thought concerning hypothesis tests and many different views about what they are for and what they should tell us.

One view is that a test assigns a number, the "confidence," to the data and that the confidence is the probability that a particular state of nature (the "hypothesis") is true.  That is not the view taken in this course, in part because environmental statistics are used by many different stakeholders in any real application: property owners or managers, local, state, and national regulatory agencies, people who live nearby, and people whose livelihoods depend on the outcome.  It is not possible to provide an objective meaning to the probability of a hypothesis in this setting.

To maintain objectivity, we assert that an hypothesis is either an effective model of the experiment's behavior or it is not.  We then try to select procedures that have a good chance of interpreting the data correctly in this regard.  One function of the statistician is to select good decision procedures.  One merit of this approach is that the applicability and desirability of a statistical procedure can usually be evaluated before all the data are available.  This makes it much easier to achieve agreement among the stakeholders about which procedure to use.

(For instance, it is commonplace for investigators to get into trouble by choosing the procedure after reviewing all the data.  As we have seen, different procedures can give different results.  A naive or biased or dishonest statistician can sometimes "steer" the statistical results in a desired direction by a clever post hoc choice of test.  Everybody knows this, which is the basis for the adage that statistics is the "third lie" (after white lies and black lies).  This possibility leaves the honest investigator in an indefensible position.

A better approach is to formalize the choice of statistical procedure.  That is the role of the investigation work plan.)

From our point of view, then, the "confidence" is the chance--computed before the data are gathered--that our procedure will arrive at the correct decision provided a particular probability model (the null hypothesis) is sufficiently accurate.  Clearly the confidence does not tell us everything we would like to know.  It might not even be particularly relevant: often the null hypothesis is not an accurate model of the experiment at all.

These considerations tell us that

We should select a null hypothesis that has some relevance to the decision making even if it turns out to be false.
We should evaluate how well the procedure might work for a reasonable selection of possible states of nature (the "alternative hypothesis"--note the singular, even though it comprises many possible states), not just for the null hypothesis.

The latter is measured by the "power" of the test.  U.S. EPA guidance (see DQO or RCRA GW guidance, for instance) frames the test selection process in terms of power.
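As a concrete illustration of what a pre-data power calculation looks like, here is a Python sketch for a one-sided z-test of whether a mean concentration exceeds a standard.  The standard, the assumed standard deviation, and the sample size are invented for the example; they do not come from any EPA guidance.

    # Sketch: pre-data power of a one-sided z-test of H0: mu <= standard.
    # All numbers below are hypothetical.
    from scipy.stats import norm

    standard = 10.0      # hypothetical cleanup standard (e.g., mg/kg)
    sigma    = 4.0       # assumed known standard deviation of one measurement
    n        = 16        # planned number of samples
    alpha    = 0.05      # nominal Type I error rate

    se = sigma / n ** 0.5
    critical_mean = standard + norm.ppf(1 - alpha) * se   # reject if xbar exceeds this

    def power(true_mean):
        """Chance the test rejects H0 when the true mean is true_mean."""
        return norm.sf((critical_mean - true_mean) / se)

    for mu in (10.0, 11.0, 12.0, 13.0):
        print(f"true mean = {mu:5.1f}   power = {power(mu):.3f}")
    # The power equals alpha at the standard itself and climbs toward 1 as the
    # true mean moves away from the null hypothesis.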

An example

The need to distinguish a special class of models, or null hypothesis, is not universal, as the following example shows.  This example uses the language and concepts described in the Decision Theory article.

Parts of the southeastern United States consist of limestone formations that contain caves and solution channels which may rapidly conduct groundwater for great distances beneath the surface.  In the 1930's, the Tennessee Valley Authority created a large system of dams in this area, inundating many natural springs and seeps where this groundwater was ultimately conducted to the surface.

In the 1990's, an investigation of groundwater contamination included a dye test to track groundwater flows.  Groundwater was believed to flow beneath a nearby range of hills to emerge beneath the surface of a large nearby TVA lake.  Historical maps showed (and here I simplify the actual problem) that two large springs had existed there but were inundated.

A non-toxic dye was introduced into sinkholes and wells onsite.  Any dye subsequently emerging at the lake bottom from a former spring would appear as a diffuse colored spot near the surface.  One objective of the test was to look for these colored spots to determine whether the groundwater was flowing into the lake and, if so, to establish which spring was carrying it.

The dye was indeed detected in the lake within a few days of its introduction into the groundwater.  Let us, therefore, consider the secondary objective, that of identifying the spring from which the dye emerged.  To do this, physical theory suggests the dye spot will be located randomly above the spring with a Normal distribution centered at the true spring location.  The depth of water over each spring was about the same, so the variance of the distribution should be the same, regardless of the spring.

Let's set up a system of local coordinates and length measurements so that  the springs are at coordinates (-1, 0) [spring A] and (1, 0) [spring B].  The probability model is that the dye spot is governed by one of two (bivariate) Normal distributions, N( (-1, 0), sigma ) or N( (1, 0), sigma ).  A dye spot observation consists of its coordinates (x, y).

We contemplate two decisions, although there are really three:

  1. The groundwater emerges at spring A.
  2. The groundwater emerges at spring B.
  3. (The groundwater emerges at some unexpected location.)

Without additional information (such as some idea of what sigma is), we will not know when it's appropriate to make the third decision.  Therefore, until we get such information, we shall limit our decisions to one of the first two.

The hypothesis test in this case therefore consists of some rule that tells us, for any possible dye spot location (x, y), which spring (A or B) produced it.

The test divides all coordinates (x, y) into two sets: the set A associated with decision A and the set B associated with decision B.  Conversely, any subset of the plane defines a test: associate that subset with spring A and its complement with spring B.

To decide among possible tests, we need a loss function.  Classical hypothesis testing uses a simple loss function that is 0 when the decision is correct and 1 when it is not.

                 Spring from which the dye is emerging
    Decision              A              B
        A                 0              1
        B                 1              0

The risk function is a pair of values: one for when spring A is the correct choice and another for when spring B is the correct choice.  Let's compute it.

When spring A is the correct choice, the probability distribution of the dye spot is modeled by N( (-1, 0), sigma ).  For simplicity, let's just call this distribution FA (with a similar notation for spring B's distribution).  If this model is at least approximately correct, then the probability of making decision B is close to FA(B), the total probability associated with the set where decision B will be made.  The probability of making decision A is FA(A).  The risk is FA(A) * Loss(make decision A when A is correct) + FA(B) * Loss(make decision B when A is correct) = FA(A) * 0 + FA(B) * 1 =  FA(B).  This is r(A), the risk for state A.

Similarly, r(B), the risk for state B, is FB(A) = 1 - FB(B)  (the probability of making decision A when B is correct).
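To make these risk calculations concrete, here is a short simulation sketch (my own, in Python; it uses the coordinate system above and treats sigma as the standard deviation of each coordinate, an assumption the text leaves implicit).  It estimates the risk pair for any candidate decision rule:

    # Sketch: Monte Carlo estimate of the risk pair (r(A), r(B)) under 0-1 loss.
    import numpy as np

    rng = np.random.default_rng(1)
    SPRING_A, SPRING_B = np.array([-1.0, 0.0]), np.array([1.0, 0.0])

    def risk_pair(decide, sigma, n=100_000):
        """decide(x, y) returns 'A' or 'B'; each risk is the chance of a wrong decision."""
        spots_A = rng.normal(SPRING_A, sigma, size=(n, 2))   # dye spots if A is the source
        spots_B = rng.normal(SPRING_B, sigma, size=(n, 2))   # dye spots if B is the source
        r_A = np.mean([decide(x, y) == 'B' for x, y in spots_A])   # estimates FA(B)
        r_B = np.mean([decide(x, y) == 'A' for x, y in spots_B])   # estimates FB(A)
        return r_A, r_B

Any proposed partition of the plane can be supplied as the decide function, and its two risks compared with those of competing rules, before any dye is ever released.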

To reduce the risk, we therefore need simultaneously to reduce FA(B) and increase FB(B).  Since the forms of FA and FB are beyond our control, the only variable left to consider is set B.

We can make FB(B) as large as we like by making set B very, very large.  Unfortunately, this makes FA(B) very large too.  The trick is to focus set B in the regions where spring B is the most likely source and to have set B avoid regions where spring A is the most likely source.

The balanced (minimax and invariant) solution chooses set B to coincide with those points where spring B is more likely than spring A to have been the source of the dye.  Those are exactly the points closer to spring B than to spring A (the ones with positive x coordinate).  Set A therefore coincides with the points closer to spring A than to spring B (the ones with negative x coordinate).  (Points on the line equidistant from the springs--the y axis--may belong to either set, but it doesn't matter: the probability that the dye appears exactly halfway between the springs is so small we needn't worry about it.)

If the springs are very close compared to the dye dispersivity (sigma), then r(A) and r(B) can be almost as large as 0.50--the test is simply too imprecise to distinguish the dye's source and choosing among the springs is little better than flipping a coin.  If the springs are very far apart compared to the dispersivity, then r(A) and r(B) are both practically zero.  The risk cannot exceed 0.50 for the minimax solution.
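Under the same assumption as the sketch above (sigma is the standard deviation of each coordinate), the balanced rule's risk has a simple closed form, Phi(-1/sigma): a wrong decision just requires the dye spot to fall on the far side of the y axis.  The limiting behavior described above is then easy to tabulate:

    # Sketch: risk of the balanced rule (decide B exactly when x > 0) as the
    # dispersivity sigma varies; r(A) = r(B) by symmetry.
    from scipy.stats import norm

    for sigma in (0.25, 1.0, 4.0, 16.0):
        risk = norm.cdf(-1.0 / sigma)
        print(f"sigma = {sigma:5.2f}   r(A) = r(B) = {risk:.4f}")
    # sigma =  0.25   r(A) = r(B) = 0.0000
    # sigma =  1.00   r(A) = r(B) = 0.1587
    # sigma =  4.00   r(A) = r(B) = 0.4013
    # sigma = 16.00   r(A) = r(B) = 0.4750

These values can also be checked by plugging the balanced rule into the risk_pair function sketched earlier.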

There are some very interesting features to this example:

We did not need to make any assumption about which spring was the correct one.  That is, there are no "expectations" about what is true and what is false.  We are using the dye to find out!
There is no "null hypothesis."  There is no "alternative hypothesis."
There are no alphas or betas, "significance levels" or "power".  Indeed, we don't even know the exact values of the risk function, because we don't know sigma.
We did not assert that the probability model is absolutely true.  The Normal distribution is just a model, meaning we know it isn't quite correct, but we have reason to believe it is approximately correct.  As long as it is nearly correct (and, with a deeper analysis, we could quantify what we mean by "nearly"), the analysis carries through and the decision procedure is essentially the same.  (In fact, all we really need to assume is that FA(E) is approximately the same as FB(E-) for all measurable sets E; E- is the set of coordinates (-x, y) where (x, y) is in E.)
There is a clear, direct connection between the test procedure and the loss function.
The situation is a "composite" hypothesis test, meaning that the decision is distinguishing an entire class of states of nature N( (-1,0), sigma ) (sigma could be any positive value) from another large class of states N( (1,0), sigma ).  In particular, neither decision is associated with a state of nature where we know the exact probability distribution of the test statistic (x,y).
No matter where the dye emerges, we have to make a decision about its source.  Even at points extremely far from either spring (relative to sigma), a decision must be made.  At such points the presence of the dye actually constitutes strong evidence against either spring being the source.  The decision procedure is merely settling on the least unlikely: the best of a bad set of options.  Thus, deciding that the spring was A, for instance, does not necessarily mean that A has a "high probability" of being the correct source.

This practical example shows how the statements quoted in the introduction can be false or misleading.


This page is copyright (c) 2001 Quantitative Decisions.

This page was created 4 April 2001 and last updated 9 May 2001.