# Decision Theory

## Links to introductory web resources on decision theory

I cannot find any.  Please write with your suggestions to decisiontheory @ quantdec.com.

## Terminology

Memorize these.

- Admissible: A procedure is admissible when no other procedure has risk at least as small for every state of nature and strictly smaller for at least one.
- Loss function: A non-negative numerical value associated with every possible combination of state of nature and decision.
- Minimax: Any procedure whose largest risk is the smallest among the maximum risks attained by all procedures.
- Non-parametric: A statistical problem where the states of nature do not have a useful finite parameterization.
- Parametric: A statistical problem where the states of nature are readily parameterized (described) by a finite number of real values.
- Risk: The expected loss of a statistical procedure.  This is a function that assigns a numerical value to every state of nature.
- Sample space: Set of possible outcomes of an experiment.
- States of nature: Set of possible probability distributions that could realistically describe the behavior of an experiment.  Each state of nature is a single, fully specified distribution.
- Statistical procedure: A recipe for assigning a decision to any possible experimental outcome.
- Subjective Bayes: An approach to selecting procedures that uses beliefs about the possible states of nature.

## Discussion

### A rational decision theory framework

A framework for evaluating estimators, tests, and other statistical procedures has at least four parts.

1.    S, a "sample space."  An experiment produces outcomes.  The set of possible outcomes is the sample space.

2.    Omega, the "states of nature."  Omega contains two or more probability distributions defined on the sample space S.  These distributions are possible models of the experiment.  Exactly one will be "true"--the best model.  However, we do not know which one is the true one.

Here are some examples of states of nature.  In a problem where the underlying distribution is Binomial, the distribution must be B(n,p).  The value of n is determined by the experiment, but p--the binomial probability--is unknown.  The states of nature are the set of all B(n,p) where p ranges from 0 to 1.  In some cases we might restrict the possible values of p based on prior knowledge; for example, we may only consider values of p between 1/2 and 1, or we may have just a few--a finite number--of candidates in mind for p.

In a problem where the underlying distribution is Normal, the distribution must be N(mu, sigma).  If we do not know mu or sigma, then Omega must consist of all distributions N(mu, sigma) where mu is any real number and sigma is any positive real number.

In other problems we may make few or no assumptions about the underlying distribution.  For example, Omega could be the set of all distributions.  It could be the set of all distributions of zero mean.  It could be the set of all distributions with finite variance.  The possibilities are endless.

Omega captures a lot of prior knowledge about how the experiment works.  If we limit Omega to the set of Normal distributions, for instance, we are making a very strong assumption about the nature of the experimental outcomes.

On a technical note, we usually do not want to include in S any experimental outcome x that has no probability of occurring, regardless of the state of nature.  That is, there must be at least one distribution in Omega whose pdf at x has a nonzero value.

A state of nature is a definite probability distribution; nothing about it is left unspecified.  For instance, when we are investigating the health of an ecosystem, we might be tempted to divide nature into two "states": "healthy" and "impaired".  However, both these conditions likely include many different degrees of health and impairment.  It is unlikely we could identify a single probability distribution to describe either condition.  Thus "healthy" is not a state of nature in the sense defined here, nor is "impaired."

3.    D, the "decision space."  D is the set of possible actions that we (or the stakeholders we advise) could take after observing the experimental results.

Decisions are not abstract things.  For instance, a landfill permitted under the U.S. RCRA statute is required to monitor groundwater nearby at regular intervals, typically one to eight times a year.  Initially, any new landfill is presumed not to have leaked.  This is the "detection monitoring" program.  If monitoring data eventually supply strong evidence of leakage, then the facility is required to move into "assessment [compliance] monitoring."  This can consist of additional water sampling and performing more expensive chemical analyses.  It incurs additional compliance and reporting costs.  Often consultants have to be hired to investigate groundwater conditions.  Moving from detection monitoring into assessment monitoring is a decision.  Staying in detection monitoring is also a decision.  Both decisions have evident consequences to the landfill owners and operators.

4.    L, the "loss function."  For each possible combination of a state of nature, f, and a decision, d, the loss function provides a numerical value.  It describes how much we (or the stakeholders we advise) lose if in fact the true state of nature is f and decision d is taken.  Usually the loss is expressed relative to the best decision that could possibly be made for each state of nature.  The best loss is zero; all other losses are zero or greater.

In the simplest interesting case, there are two possible states of nature--say f0 and f1--and two possible decisions, d0 and d1.  We can completely describe the loss function L with a two-by-two table.  Let's suppose decision d0 is the best one for state f0, while decision d1 is the best one for state f1.  Here's the table.

#### Table 1: The General Loss Function

| Loss | State is f0 | State is f1 |
| --- | --- | --- |
| Decision is d0 | 0 | L(f1, d0) |
| Decision is d1 | L(f0, d1) | 0 |

In many practical cases, however, there is a continuum of states of nature, so we cannot easily create a table to describe the loss function.  The loss function is more usually given by a formula.

For instance, if the states of nature are all possible Normal distributions and the decision consists of our best estimate of the mean, then a typical loss function will be of the form (d - mu)^2 or maybe |d - mu| (absolute value).  These functions have the intuitively necessary characteristics of being close to zero when the estimate (d) is close to the true value (mu) and of increasing as the deviation d - mu increases in size.

There is no requirement for the loss function to be symmetrical.  The cost of underestimating mu may be very different from the cost of overestimating mu.
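As a rough illustration, the kinds of loss mentioned above can be written as small Python functions. The function names and the unit costs `under` and `over` are invented for this sketch; they do not come from any particular application.

```python
def quadratic_loss(d, mu):
    """Squared error: (d - mu)^2."""
    return (d - mu) ** 2


def absolute_loss(d, mu):
    """Absolute error: |d - mu|."""
    return abs(d - mu)


def asymmetric_loss(d, mu, under=3.0, over=1.0):
    """Absolute error with different (hypothetical) unit costs for
    underestimating (d < mu) and overestimating (d > mu)."""
    if d < mu:
        return under * (mu - d)
    return over * (d - mu)


# Compare the three losses for estimates below, at, and above mu = 10.
for d in (8.0, 10.0, 12.0):
    print(d, quadratic_loss(d, 10.0), absolute_loss(d, 10.0), asymmetric_loss(d, 10.0))
```

All three are zero when d equals mu; only the asymmetric version treats an estimate of 8 differently from an estimate of 12.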

A statistical procedure is a definite recipe for assigning a decision d to any possible outcome x of an experiment.  The outcome x is an element of the sample space S and d is an element of the decision space D, so we can think of a procedure t as a function from S to D: t: S --> D.  We use notation like t(x) to denote the decision made when the experimental outcome is x.

### The framework in terms of tickets-in-boxes: RCRA groundwater detection monitoring

The sample space S describes what can be written on a ticket.  The states of nature are different boxes.  Each box contains tickets on which are written elements of S: experimental outcomes.  Different boxes contain different proportions of the experimental outcomes.

A statistical procedure t assigns decisions to experimental outcomes.  Thus, we might as well examine every ticket in every box and on each ticket we will write the decision assigned by the procedure.  At the same time, for completeness, let's also write down the name of the box from which the ticket was obtained.  A typical ticket therefore has three entries  | f | x | d |, where f is the name of the box the ticket is in, x is the original outcome printed on the ticket, and d is t(x).  Notice that d does not depend on f.

For instance, in the simplest case of RCRA detection monitoring a landfill operator will sample three downgradient wells each year and analyze those samples for a chemical parameter related to the landfill contents.  Suppose this chemical parameter indicates the presence of a compound that does not naturally occur in the groundwater near the facility.  Further suppose, to simplify (but not weaken) our discussion, that the analytical result is not a concentration, but just a detect/nondetect result (somewhat like a pregnancy test or an AIDS test.)

An intuitively reasonable decision procedure, t, is then of the form "if all three results are nondetects, then stay in detection monitoring; otherwise, move into assessment monitoring."

There are eight possible experimental outcomes corresponding to the 2 * 2 * 2 combinations of detect/nondetect at the three wells.

Suppose the laboratory analysis has a fifty percent "false positive" rate.  This means that any sample not containing the chemical parameter of interest still has a fifty percent chance of resulting in a detect result.  Further suppose that any leak from the landfill quickly will affect water in all three wells, which will thereafter consistently yield detect results when analyzed.

We can create two boxes to describe the "clean" (hasn't leaked yet) and "leaky" states of the landfill.  The leaky box is simplest: it contains one ticket.  The experimental outcome written on it consists of three detects, one for each well.  Next to those three detects we write the procedure's determination, which is to move into assessment monitoring.  Here's the fully marked ticket, showing the box name (f), the experimental outcome (x), and the decision (d):

| Leaky | detect, detect, detect | assessment monitoring |

The clean box is more complex.  It contains eight tickets.  They are shown below, fully annotated.  But first let's consider the loss function.  If the landfill is leaking and the decision is to move into assessment monitoring, no loss is incurred: this is the correct decision.  Similarly, if the landfill is not leaking and the decision is to stay in detection monitoring, then again no loss is incurred.  However, the other two cases: clean-->assessment monitoring, leaky-->detection monitoring, cause losses.

Let's look at this from the facility owner's point of view.  To this corporation, the only evident loss (in the near term) arises from the additional costs of the assessment monitoring program if it turns out to be unwarranted.  Let's suppose those costs are expected to total one dollar.   (OK, that's unrealistic, but it's an easy number to work with!  If you want, think of it as one million dollars, or one gazillion pesos, or one of your favorite large units of money: it will not change the analysis.)

Here, then, are the tickets in the clean box, showing f, x, d, and the loss L(f, d):

| Clean | detect, detect, detect | assessment monitoring | \$1 |
| Clean | detect, detect, nondetect | assessment monitoring | \$1 |
| Clean | detect, nondetect, detect | assessment monitoring | \$1 |
| Clean | detect, nondetect, nondetect | assessment monitoring | \$1 |
| Clean | nondetect, detect, detect | assessment monitoring | \$1 |
| Clean | nondetect, detect, nondetect | assessment monitoring | \$1 |
| Clean | nondetect, nondetect, detect | assessment monitoring | \$1 |
| Clean | nondetect, nondetect, nondetect | detection monitoring | \$0 |

The last two entries on each ticket (the decision and the loss) depend on the procedure.  If we consider a different procedure, we will have to recompute those entries.

The risk of any box is its expected loss.  Remember that this is the average loss written on all tickets.  The risk of the leaky box (in this example) is zero.  The risk of the clean box is (7 * \$1 + 1 * \$0)/8 = \$7/8.
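The ticket bookkeeping above can be reproduced with a short Python sketch. It assumes, as the eight equally weighted tickets imply, that the three well results are independent and that each clean-site result is a detect with probability one half. The names `t`, `loss`, `boxes`, and `risk` are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

STAY, MOVE = "detection monitoring", "assessment monitoring"


def t(outcome):
    """Stay in detection monitoring only if all three wells are nondetects."""
    return STAY if all(r == "nondetect" for r in outcome) else MOVE


def loss(box, decision):
    """Owner's loss: $1 for an unwarranted move to assessment monitoring."""
    return 1 if (box == "clean" and decision == MOVE) else 0


# Each box is a list of tickets; every ticket in a box is equally likely.
boxes = {
    "clean": list(product(["detect", "nondetect"], repeat=3)),  # 8 tickets
    "leaky": [("detect", "detect", "detect")],                  # 1 ticket
}


def risk(box):
    """Average of the losses written on the tickets in the box."""
    tickets = boxes[box]
    return Fraction(sum(loss(box, t(x)) for x in tickets), len(tickets))


for box in boxes:
    print(box, risk(box))
```

Swapping in a different procedure only requires changing `t`; the boxes and the loss function stay the same.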

Statistical decision theory compares procedures in terms of their risks.

The risk is a function: it is a number r_t(f) assigned to each state of nature f.  The name of the procedure, t, is written as a subscript to remind us that each procedure determines its risk.  The risk function worked out just above is simple because it can be described by just two assignments: clean-->\$7/8, leaky-->\$0.

To summarize the example, we have

- S = {"detect, detect, detect", "detect, detect, nondetect", ..., "nondetect, nondetect, nondetect"} is the set of all possible test outcomes from the three wells.
- Omega = {clean box, leaky box}
- D = {stay in detection monitoring, move to assessment monitoring}
- L = {\$0 except when the decision is to move into assessment monitoring when in fact no leak has occurred, in which case the loss is \$1}

In practice the possibilities for Omega are more complex, but this simple example captures many of the fundamental considerations involved in evaluating RCRA groundwater detection monitoring programs.

The main source for this material is chapter two of Jack C. Kiefer, Introduction to Statistical Inference.  (See the Links page.)  Kiefer uses "W" for the loss function instead of "L".

### A Simple Worked Example

The simplest possible case of practical interest has two possible experimental outcomes, two possible states of nature, and two possible decisions. Despite its simplicity, analyzing this situation uncovers some fundamental aspects of the decision theory approach.

So, to be concrete, suppose we have the opportunity to conduct a single experiment with two possible outcomes, which we arbitrarily label "success" (S) and "failure" (F).

Based perhaps on prior information about the experiment, we know the probability of success (p) is either 50% or 95%, but we do not know which.  These probabilities label, or "parameterize," the states of nature, which are binomial distributions B(1, 0.50) and B(1, 0.95), respectively.

We will guess which is the correct state of nature.  If the true state is B(1, 0.50) let's call the experiment "fair" (f) but otherwise we will call it "biased" (b).

The loss function always equals zero when we make the best possible decision for a given state of nature.  Thus, making decision f when the state of nature p = 0.50 incurs no loss.  Making decision b when p = 0.95 likewise incurs no loss.  A table succinctly provides the full loss function.

#### Table 2: A Particular Loss Function

| Decision (d) | State of nature p = 0.50 | State of nature p = 0.95 |
| --- | --- | --- |
| f | 0 | \$3 |
| b | \$2 | 0 |

Other loss functions will differ from this one only by changing the values of the \$2 and \$3 entries.

The (unrandomized) statistical procedures are definite rules telling us what decision to make for any possible experimental outcome.  Formally, a procedure is a function t: S --> D.  Because S and D are so simple, we can tabulate all possible procedures.  The rows correspond to experimental outcomes, the columns to procedures, and the entries are decisions.

#### Table 3: All Possible Procedures

| Outcome | t1 | t2 | t3 | t4 |
| --- | --- | --- | --- | --- |
| S | f | b | f | b |
| F | f | b | b | f |

For instance, procedures t1 and t2 would be made by people with "blinders" on: regardless of the outcome, t1 decides the experiment is fair, t2 decides the experiment is biased.  Procedure t3 is a "cynical" procedure: although an F outcome would be evidence against the biased state of nature (which almost always produces an S), t3 decides the experiment is biased anyway.  Otherwise it decides the experiment is fair.  Procedure t4 is an intuitively good procedure.  The outcome S is more likely to be produced by the p=0.95 state (b), whereas the outcome F is more likely to be produced by the p=0.50 state (f).  (This makes t4 the maximum likelihood estimator.)

How do we decide which procedure to use?

Think of the states of nature as two boxes containing tickets.  In the p=0.50 box are two tickets, one labeled S, the other labeled F.  In the p=0.95 box are 20 tickets, 19 labeled S, the last one labeled F.

Let's analyze procedure t3.  The analysis begins by relabeling the tickets in the boxes.  Because t3 assigns a decision to each outcome, we can show the decision next to each "S" or "F" on every ticket.  This process is purely mechanical, requiring a simple lookup in the t3 column of Table 3.

Next, we compare the new ticket labels to the name of the box.  The box determines a column and the label determines a row in the loss table, Table 2.  We write the loss on the ticket.  The illustration shows the relabeled tickets.

If we were to use procedure t3 many times over, then when the true state is p=0.50, we would undergo a loss of 0 half the time (because half of its tickets have zero loss) and we would undergo a loss of 2 the other half of the time (because half of its tickets have a loss of 2).  The expected loss, or average loss on the tickets in this box, is 1.  This value depends on the procedure, the loss function, and the proportions of tickets in the box.

The expected loss for the p=0.95 box under procedure t3 is 2.85.
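The relabeling of tickets for t3 is mechanical enough to script. The sketch below is one possible encoding, with the loss table and ticket counts copied from the text; the dictionary layout and the name `expected_loss` are choices made here, not part of the original notes.

```python
from fractions import Fraction

# Loss table (Table 2): LOSS[state][decision], in dollars.
LOSS = {"0.50": {"f": 0, "b": 2}, "0.95": {"f": 3, "b": 0}}

# Boxes of tickets: outcome -> number of tickets carrying that outcome.
BOXES = {"0.50": {"S": 1, "F": 1}, "0.95": {"S": 19, "F": 1}}

# Procedure t3 from Table 3: outcome -> decision.
t3 = {"S": "f", "F": "b"}


def expected_loss(state, procedure):
    """Average of the losses written on the relabeled tickets in one box."""
    box = BOXES[state]
    total = sum(count * LOSS[state][procedure[outcome]]
                for outcome, count in box.items())
    return Fraction(total, sum(box.values()))


for state in BOXES:
    for outcome, count in BOXES[state].items():
        d = t3[outcome]
        # One line per kind of ticket: box, outcome, decision, loss, count.
        print(f"| p={state} | {outcome} | {d} | ${LOSS[state][d]} |  x{count}")
    print("expected loss:", float(expected_loss(state, t3)))
```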

Recall that the risk of any state of nature for a given procedure is the expected loss of that state of nature.  We use the risk to evaluate and compare procedures.  If t is a procedure and p is a state of nature, we may write r_t(p) for the risk of using t when the state is p.

Repeating these calculations for procedures t1, t2, and t4 gives the following results (check them!):

#### Table 4: Risk functions of all (nonrandomized) procedures

| Procedure | State of nature p = 0.50 | State of nature p = 0.95 |
| --- | --- | --- |
| t1 | 0 | 3 |
| t2 | 2 | 0 |
| t3 | 1 | 2.85 |
| t4 | 1 | 0.15 |
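One way to check Table 4 is to enumerate every function from outcomes to decisions and compute its risk directly from the probabilities. This is a sketch under the loss values of Table 2; the helper names `prob` and `risk` are illustrative. Note that the enumeration order produced by `itertools.product` is t1, t3, t4, t2 in the labeling of Table 3.

```python
from itertools import product

OUTCOMES = ("S", "F")
DECISIONS = ("f", "b")
LOSS = {0.50: {"f": 0, "b": 2}, 0.95: {"f": 3, "b": 0}}  # Table 2


def prob(outcome, p):
    """Chance of each outcome when the success probability is p."""
    return p if outcome == "S" else 1 - p


def risk(procedure, p):
    """Expected loss of a procedure (dict outcome -> decision) in state p."""
    return sum(prob(x, p) * LOSS[p][procedure[x]] for x in OUTCOMES)


# Every function from outcomes to decisions: 2 ** 2 = 4 procedures.
procedures = [dict(zip(OUTCOMES, ds)) for ds in product(DECISIONS, repeat=2)]

for t in procedures:
    print(t, [round(risk(t, p), 2) for p in LOSS])
```

The same loop works unchanged for larger outcome or decision sets, although the number of procedures grows quickly.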

It is helpful to graph these results.  In the graph, the horizontal axis designates the states of nature.  The vertical axis shows expected loss.  The points denote risk.  Their coordinates are (p, r_t(p)), where t varies over the four possible procedures.  The dotted lines are not part of the graph, but are shown only to help you connect the two points in each graph.

The risk graph makes it apparent that t3 is dominated by t4: that is, t4 never has higher risk than t3, and in some cases (p=0.95) has distinctly lower risk.  This is a compelling argument against ever using t3 (at least once we are aware that t4 exists and know how to compute it).  This is a formal demonstration in support of our intuition that the MLE (t4) works better than the cynical estimator (t3).  Procedures like t3, that are dominated by some other procedure, are called inadmissible.
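The dominance comparison can also be done without a graph. The following sketch takes the risks from Table 4 as given and checks every pair of procedures; `dominates` is a name introduced here, and the check covers only the four nonrandomized procedures.

```python
# Risks copied from Table 4: RISK[procedure] = (risk at p=0.50, risk at p=0.95).
RISK = {"t1": (0, 3), "t2": (2, 0), "t3": (1, 2.85), "t4": (1, 0.15)}


def dominates(a, b):
    """True if a is never riskier than b and strictly less risky somewhere."""
    pairs = list(zip(RISK[a], RISK[b]))
    return all(ra <= rb for ra, rb in pairs) and any(ra < rb for ra, rb in pairs)


# A procedure is inadmissible (within this set) if some other one dominates it.
inadmissible = {b for b in RISK for a in RISK if a != b and dominates(a, b)}
print(sorted(inadmissible))  # ['t3']
```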

(Surprisingly, many common procedures are inadmissible.  For instance, Andrew Rukhin, in Improved Estimation in Lognormal Models (JASA 81 no 396 pp 1046-1049, Dec. 1986), demonstrates the inadmissibility of the MVUE and MLE of the mean, median, and moments of lognormal distributions under quadratic loss (the usual loss function).  His numerical studies demonstrate that the MVUE is extremely poor (relative to other available estimators) for small sample sizes when the logarithmic standard deviation exceeds 1.0.  The MVUE is the basis of Land's confidence limit procedure.  Read more about this in the next class notes, Statistical tests.)

Most of these four procedures incur a large risk for some state of nature.  Remember, we do not know what the true state of nature is.  Thus, procedures t1, t2, and t3 could achieve risks of 2 or greater, depending on the true state of nature.  The risk of procedure t4, however, never exceeds 1.  Its maximum risk is smallest among the maxima for all procedures: it is minimax.  A pessimistic decision maker, or one who simply cannot tolerate a risk higher than 1, would prefer this procedure.
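Finding the minimax procedure among a finite set is a two-step calculation: take the worst risk of each procedure, then pick the procedure whose worst risk is smallest. A minimal sketch, again using the Table 4 values and restricted to the four nonrandomized procedures:

```python
# Risks copied from Table 4: RISK[procedure] = (risk at p=0.50, risk at p=0.95).
RISK = {"t1": (0, 3), "t2": (2, 0), "t3": (1, 2.85), "t4": (1, 0.15)}

# Step 1: the largest risk each procedure can incur over the states of nature.
max_risk = {t: max(risks) for t, risks in RISK.items()}

# Step 2: the procedure whose largest risk is smallest.
minimax = min(max_risk, key=max_risk.get)

print(max_risk)
print("minimax procedure:", minimax)
```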

Procedure t4 is not the best in all cases, however.  If the true state of nature is B(1, 0.50), then the decision maker using procedure t1 will be much better off.  (She will always be correct!)  The difficulty is that neither decision maker knows what the true state is, so neither can claim their procedure is uniformly superior.

### A More Complex Example: Estimating a Proportion

To illustrate what can happen in practice, consider the problem of estimating a proportion (or probability) based on limited data.

For instance, we might wish to estimate the proportion of a site that is clean.  To do so, we will take surface soil samples at locations selected randomly and independently of each other.  Taking the samples acts just like drawing tickets from a box.  The tickets are labeled with the measured concentrations of some chemical (or chemicals) in the soils.  They will be relabeled simply as "clean" or "dirty," depending on whether the concentrations are all below their cleanup standards or not.

The most extreme case of limited data--one sample only--contains all the elements of the general problem, so we will analyze this.  (As an exercise you may analyze the situation where more data are available.)

The experimental outcomes are "clean" and "dirty" (the analogs of "success" and "failure" above).  There is a continuum of states of nature, parameterized by the true proportion (p) of site soils that are dirty.  The value of p may be anywhere between 0% and 100%.  The decision is a number, d, which is our estimate of the proportion.  In principle d can be any value, but it makes sense to restrict d also to lie between 0% and 100%, because anything else would be obviously wrong.

We can no longer use a table to show the loss function.  Instead, we need a formula for the loss, given any possible combination of p and d.  This is a general mathematical function of two variables.  There are many possible forms it can take.

A reasonable form will be zero whenever the decision d exactly agrees with the true proportion p.  Otherwise, the loss should increase as the difference between p and d grows.  One of the simplest, mathematically "nice," expressions with this property is L(d, p) = (d-p)^2.  For any given state of nature p, its graph as a function of d is an upward curving parabola whose vertex touches zero at d = p.  This is the "quadratic loss" function.

As before, a procedure is a definite process for deciding, on the basis of a single soil sample that is labeled either "clean" or "dirty," what proportion of the site is dirty.  That's a tough decision to make with one sample result, but having one result is better than no data at all!

(It is commonplace in the application of environmental statistics in the private sector to have very little data.  We do not always have the luxury of demanding more data.  The challenge is to make the best possible decision with the data available.)

Again, each state of nature contains only two kinds of tickets, labeled "clean" and "dirty".  We can, as before, imagine relabeling these tickets according to a given procedure.  If the procedure decides the proportion is p0 when the sample is clean, then we relabel the "clean" tickets with p0.  Then we compute the ticket's loss (p0 - p)^2 by referring to the state's value p--the true proportion of dirty soils.  We may never learn the value of p, but that does not prevent us from computing risk functions and using them to evaluate potential procedures.  Similarly, if the procedure decides the proportion is p1 when the sample is dirty, then we relabel the "dirty" tickets with p1 and compute the loss (p1 - p)^2.  Here is a table showing the computations.

#### Table 5: Calculation of expected loss for procedure  t(p0, p1)

| Sample | Decision | Loss | Chance of loss | Contribution to expected loss |
| --- | --- | --- | --- | --- |
| Clean | p0 | (p0 - p)^2 | 1-p | (1-p) * (p0 - p)^2 |
| Dirty | p1 | (p1 - p)^2 | p | p * (p1 - p)^2 |

When the state of nature is p, then, a proportion p of the tickets are labeled with a loss of (p1 - p)^2 and the remaining proportion (1-p) are labeled with a loss of (p0 - p)^2.  The expected loss is just p * (p1 - p)^2 + (1-p) * (p0 - p)^2.  The values p0, p1 are constants determined by the procedure in question and p is the variable.  This is the risk function for the procedure, which is designated t(p0, p1) in the next figure.
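To see that the formula really is the long-run average loss, one can compare it with a simulation that draws many tickets from a box. The sketch below does this for one arbitrary choice of p0, p1, and p; these values, the number of draws, and the function names are hypothetical and picked only for illustration.

```python
import random


def risk(p0, p1, p):
    """Risk of t(p0, p1) under quadratic loss when a proportion p is dirty."""
    return (1 - p) * (p0 - p) ** 2 + p * (p1 - p) ** 2


def simulated_risk(p0, p1, p, draws=100_000, seed=0):
    """Average quadratic loss over many independent single-sample experiments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        dirty = rng.random() < p   # draw one ticket from the box
        d = p1 if dirty else p0    # apply the procedure t(p0, p1)
        total += (d - p) ** 2      # quadratic loss for this ticket
    return total / draws


p0, p1, p = 0.1, 0.8, 0.3          # hypothetical procedure and state of nature
print("formula:   ", risk(p0, p1, p))
print("simulation:", simulated_risk(p0, p1, p))
```

The two numbers should agree to about two or three decimal places; the simulation is only a sanity check, since the formula is exact.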

The figure shows the risk functions for some of the more interesting procedures:

- t(0, 1) decides the entire site is clean when the sample is clean and it decides the entire site is dirty when the sample is dirty.  It is the maximum likelihood procedure.  The graph of its risk function is the large upside down parabola.  The risk is highest when half the site is dirty and lowest of all in the extreme cases.
- t(0.5, 0.5) decides that half the site is clean, regardless of the sample result.  This might be the response of someone who recognizes that one sample result is very little evidence and wishes to "split the difference" by guessing a middle value for the proportion.  The graph of its risk function is the large dashed parabola.  It achieves very good (low) risk for sites that are about half clean, but has very high risk for sites that are very clean or very dirty.
- t(0.25, 0.75) has constant risk (of 1/16).  It is the minimax procedure: every other procedure achieves a risk greater than 1/16 for some states of nature.
- t(0.6, 0.99) might be used by someone who has a strong belief that the site is very dirty.  If their belief is correct, it achieves low risk.  If their belief is incorrect, however, the risk becomes very high.  (The risk exceeds 0.25 for proportions less than about 0.2, and so cannot even be shown on the figure.)
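The claims about these four procedures can be checked numerically by evaluating the quadratic-loss risk function p * (p1 - p)^2 + (1-p) * (p0 - p)^2 on a grid of states of nature. This is a sketch only; the grid spacing of 0.01 is an arbitrary choice, so the reported extremes are grid values rather than exact optima.

```python
def risk(p0, p1, p):
    """Risk of t(p0, p1) under quadratic loss when a proportion p is dirty."""
    return (1 - p) * (p0 - p) ** 2 + p * (p1 - p) ** 2


PROCEDURES = [(0.0, 1.0), (0.5, 0.5), (0.25, 0.75), (0.6, 0.99)]
GRID = [i / 100 for i in range(101)]  # p = 0.00, 0.01, ..., 1.00

for p0, p1 in PROCEDURES:
    risks = [risk(p0, p1, p) for p in GRID]
    print(f"t({p0}, {p1}): min risk {min(risks):.4f}, max risk {max(risks):.4f}")
```

A procedure whose minimum and maximum coincide has constant risk, and the procedure with the smallest maximum is the minimax choice among those listed.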

There is nothing inherently wrong about using beliefs about site conditions to guide the choice of procedure.  (People who do this are using a "subjective Bayes" approach to decision making.)  This does not invalidate the analysis based on risk functions.  Indeed, the risk function analysis helps us see what an unfounded dependence on a belief could cost, if nature does not conform to that belief.

Using a different loss function can change the results dramatically.  As an example, suppose we are more concerned about relative error than absolute error.  Thus, the difference between p = 0.01 and d = 0.02, or between p = 0.99 and d = 0.98, might cost more than the difference between p = 0.40 and d = 0.60.  One way to model this is to divide the quadratic loss expression both by p (to increase the loss when p is small) and by 1-p (to increase the loss when p is close to 1).  The next figure uses the previous graphical symbols to plot the new risk functions for the same five procedures.

(It makes no sense to compare these risks with the risks in the previous figure.  They are based on different loss functions.  We could rescale either loss function by some constant without changing the analysis.  For example, the quadratic loss function could be expressed in millions of dollars and the relative loss function in tenths of pesetas.  So ignore the absolute values on the risk axes and concentrate on the relative heights of the risk functions.)

For the relative loss function, procedure t(0, 1) (the horizontal line with constant risk of 1.00) is minimax.  It is the only procedure without infinite loss at either p=0 or p=1.  The procedure that formerly was minimax is no longer minimax.
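The effect of the relative loss function can be explored with a small variation of the quadratic-loss calculation. The sketch below assumes the loss is (d - p)^2 / (p * (1 - p)), as described above, and evaluates it only for p strictly between 0 and 1, where the division is defined. The function name `risk_relative_loss` is an illustrative choice.

```python
def risk_relative_loss(p0, p1, p):
    """Risk of t(p0, p1) under the loss (d - p)^2 / (p * (1 - p)), 0 < p < 1."""
    quadratic = (1 - p) * (p0 - p) ** 2 + p * (p1 - p) ** 2
    return quadratic / (p * (1 - p))


# Compare t(0, 1) with the former minimax procedure t(0.25, 0.75).
for p in (0.001, 0.25, 0.5, 0.75, 0.999):
    print(p,
          round(risk_relative_loss(0.0, 1.0, p), 6),
          round(risk_relative_loss(0.25, 0.75, p), 3))
```

The risk of t(0, 1) stays at 1 across the grid, while the risk of t(0.25, 0.75) grows without bound as p approaches 0 or 1.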

### How and Why People Lie With Statistics

Finally, let's consider an asymmetric loss function.  This loss equals the difference between the decision, d, and the true proportion p, only when d overestimates p.  Otherwise, the loss is zero: there is no penalty for underestimating the true proportion.  (A decision maker who values an accurate estimate, but who would otherwise prefer a result that exaggerates the cleanness of the site, might use a loss function closely approximating this one.)

The figure shows the risk of a new procedure, t(0, 0.3).  This procedure decides the site is perfectly clean if the sample is clean.  Otherwise it decides 30% of the site is dirty.  It is clearly superior to the other five procedures shown.  This is intuitive: if the site is more than 30% dirty, t(0, 0.3) cannot overestimate the value, so its risk is zero.  Otherwise, it cannot overestimate the proportion by much.

For this loss function there is a unique best procedure, t(0, 0).  It is minimax and it is the only admissible procedure at all.   t(0, 0) always decides the site is clean, regardless of the sample result.  Its risk is uniformly zero, because it can never overestimate the true value.  (This procedure has one of the worst possible risks for the loss functions previously considered, however.)
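A short calculation shows how the asymmetric loss rewards procedures that never overestimate. The sketch assumes the loss is max(d - p, 0), as described above, and compares a few procedures on a grid of states; the function names and the grid are choices made for this illustration.

```python
def overestimate_loss(d, p):
    """Loss equals d - p when d overestimates p, and zero otherwise."""
    return max(d - p, 0.0)


def risk(p0, p1, p):
    """Risk of t(p0, p1) under the asymmetric loss when a proportion p is dirty."""
    return (1 - p) * overestimate_loss(p0, p) + p * overestimate_loss(p1, p)


GRID = [i / 100 for i in range(101)]  # p = 0.00, 0.01, ..., 1.00

for p0, p1 in [(0.0, 1.0), (0.25, 0.75), (0.0, 0.3), (0.0, 0.0)]:
    worst = max(risk(p0, p1, p) for p in GRID)
    print(f"t({p0}, {p1}): max risk {worst:.4f}")
```

The procedure t(0, 0) has zero risk everywhere under this loss, which is exactly why a loss function of this shape is so dangerous.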

I have seen procedure t(0, 0)--or rather, its generalization to larger data sets--used many, many times.  It happens frequently in long-term RCRA groundwater monitoring.  An environmental consultant establishes a monitoring program and creates a template for the quarterly (or annual) monitoring report.  The report includes the monitoring data as a separate appendix, but no serious effort is put into analyzing the data.  At most, the data are dumped into a (boilerplate) spreadsheet that purports to conduct the required statistical test.  However, if the test indicates action is required, the consultant inserts additional boilerplate to "explain" why this is not evidence of a release from the facility.  Usually the explanation impugns the test (which, more often than not, was selected by a previous consultant and truly is a bad test).  No effort is made to change the test, however: that would require a formal modification of the RCRA permit.

If the regulatory agency, represented perhaps by an overworked or inexperienced or cynical case worker, never subjects these reports to serious review, then this condition can continue for years or decades.  I have seen this occur in states and EPA regions across the U.S.

The principal source for the worked examples is section 4.2 of Lehmann's book, Theory of Point Estimation.  Lehmann analyzes the n-sample case.