Looking at Data

Links to web resources on stem-and-leaf plots and letter summaries

EDA Short Course- Part 1- Examining Univariate Distributions (Michael Friendly, York University): Step by step explanations with examples.  Includes material we will encounter in the next several classes, too.

Penn State University Statistics (Department of Educational and School Psychology): Lots of examples, little applets provide interaction, shows many graphics, and some good links.  However: explanations of hinges are not technically correct and there are some other lapses to watch out for.

Examples of stem-and-leaf plots: Includes sample software and data.

Terminology

Adjacent

The value in a batch closest to, but still inside, an inner fence.  Each end of the batch has one adjacent value.

Batch

A set of data under examination.

Box

A graphic drawn from the median and hinges of a batch.  Used in boxplots (also known as box-and-whisker plots).

Count

The number of values in a batch.  Symbols "N" or "n" are often used to signify the count in formulas.

CV

Coefficient of variation: the ratio of the standard deviation to the mean.  Used only for batches of positive numbers.

Depth

How far "in" a value is within a batch.  Specifically, every batch can be ordered from lowest to highest value, or highest to lowest.  Each ordering assigns an order to the value.  The value's depth is the smaller of the two orders.

Descriptive statistic

A number, derived from a batch, used to convey some property of the batch as a whole.  In some sense all statistics are "descriptive," but by convention this term is applied to statistics with simple interpretations as measures of "central tendency" or "location," of "spread" or "dispersion," and of "shape."

Eighth

Every batch has two fourths (hinges).  The upper eighth is the median of the sub-batch of values that equal or exceed the upper fourth.  The lower eighth is the median of the sub-batch of values that equal or are less than the lower fourth.

Extreme

A maximum or minimum of a batch.

Far out

Any value as or more extreme than an outer fence.

Fence

Every batch has four fences: the "inner" fences are one Step beyond the hinges.  The "outer" fences are two Steps beyond the hinges.
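As a rough illustration of the fence rule, the sketch below computes the four fences from a pair of hinges and labels each value of a batch.  The function names `fences` and `classify`, the label strings, and the sample batch are hypothetical; the Step is taken to be 1.5 times the H-spread, as in this glossary's definition of Step.

```python
def fences(lower_hinge, upper_hinge):
    """Return the (inner, outer) fence pairs for the given hinges."""
    step = 1.5 * (upper_hinge - lower_hinge)  # Step = 1.5 x H-spread
    inner = (lower_hinge - step, upper_hinge + step)
    outer = (lower_hinge - 2 * step, upper_hinge + 2 * step)
    return inner, outer


def classify(batch, lower_hinge, upper_hinge):
    """Label each value 'far out', 'outside', or 'within fences'."""
    (inner_lo, inner_hi), (outer_lo, outer_hi) = fences(lower_hinge, upper_hinge)
    labels = []
    for value in batch:
        if value <= outer_lo or value >= outer_hi:
            labels.append((value, "far out"))        # at or beyond an outer fence
        elif value <= inner_lo or value >= inner_hi:
            labels.append((value, "outside"))        # at or beyond an inner fence only
        else:
            labels.append((value, "within fences"))
    return labels


# Hypothetical batch whose hinges are 3.5 and 8.5 (H-spread 5, Step 7.5).
batch = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100]
inner, outer = fences(3.5, 8.5)
labels = classify(batch, 3.5, 8.5)
```

Here the inner fences fall at -4 and 16 and the outer fences at -11.5 and 23.5, so 20 is outside and 100 is far out.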

Fourth

The median divides a batch into the upper data (those equal to or greater than the median) and the lower data (those equal to or less than the median).  A fourth is a median of the upper or lower data.
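A minimal sketch of this definition, assuming the split is made by position in the sorted batch so that the median value belongs to both halves when the count is odd.  The function name `fourths` is hypothetical.

```python
from statistics import median


def fourths(batch):
    """Return (lower fourth, upper fourth) as medians of the lower and upper data."""
    ascending = sorted(batch)
    n = len(ascending)
    half = (n + 1) // 2            # size of each half; the middle value is shared when n is odd
    lower_data = ascending[:half]
    upper_data = ascending[n - half:]
    return median(lower_data), median(upper_data)


lower, upper = fourths([0, 7, 8, 12, 14])
h_spread = upper - lower           # difference between the upper and lower fourths
```

For the batch {0, 7, 8, 12, 14} the lower data are {0, 7, 8} and the upper data are {8, 12, 14}, giving fourths of 7 and 12.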

h

Shorthand for one-half; for example, 17h is 17.5.  Using this shorthand implies we will ignore any fractions smaller than one-half.  It has a special meaning when used with order statistics (see below).

H-spread

The difference between the upper and lower hinge.  This is one resistant measure of spread, or "dispersion," of data in a batch.

Hinge

A fourth.

Inside

A value (from a batch) is "inside" another number when the value lies between the number and the median of the batch.  For example, the value 4 in the batch {0, 2, 3, 4, 5} is inside 4.5 because 4 is between 3 (the median) and 4.5.

Leaf

What remains after the "stem" has been split off of a value, usually truncated to a single digit.  For example, when summarizing data in groups of 10, the stem of 413.7 is 41 and the leaf is 3 (the .7 is dropped).

Letter statistic

A median, fourth (hinge), eighth, etc.  These statistics are abbreviated by single letters M, F (or H), E, D, etc., proceeding backwards through the alphabet to A, then continuing Z, Y, X, etc.

N-letter summary

A tableau laying out the median, the two hinges, and optionally additional letter statistics, plus the extremes.  The "N" here refers to the total number of letters; the smallest effective N-letter summary is the 5-letter summary of median, hinges, and extremes.
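One way to tabulate such a summary is sketched below.  It assumes the common depth-halving rule, in which the depth of each letter statistic is (whole part of the previous depth + 1)/2, starting from the median at depth (N+1)/2; for batches without ties this agrees with taking medians of successive sub-batches.  The names `value_at_depth` and `letter_summary` are hypothetical, and the batch is assumed to be non-empty.

```python
import math


def value_at_depth(ordered, depth):
    """Value at a whole or half depth, counting from the start of `ordered`."""
    below = ordered[math.floor(depth) - 1]
    above = ordered[math.ceil(depth) - 1]
    return (below + above) / 2


def letter_summary(batch, letters="MHED"):
    """Return rows of (label, depth, lower value, upper value)."""
    ascending = sorted(batch)
    descending = ascending[::-1]
    rows = []
    depth = (len(ascending) + 1) / 2          # depth of the median
    for letter in letters:
        if depth <= 1:                        # this letter would coincide with the extremes
            break
        rows.append((letter, depth,
                     value_at_depth(ascending, depth),
                     value_at_depth(descending, depth)))
        depth = (math.floor(depth) + 1) / 2   # assumed halving rule for the next letter
    rows.append(("1", 1, ascending[0], ascending[-1]))  # the extremes, at depth 1
    return rows


for row in letter_summary(range(1, 10)):
    print(row)
```

For the batch 1 through 9 this lists the median at depth 5, the hinges at depth 3, the eighths at depth 2, the D statistics at depth 1h, and the extremes at depth 1.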

MAD

Median absolute deviation.  This is the median of the absolute values of residuals relative to a batch's median.
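A minimal sketch of the computation, taking residuals about the batch's median (the usual convention for the MAD).  The function name `mad` is hypothetical.

```python
from statistics import median


def mad(batch):
    """Median of the absolute residuals about the batch's median."""
    center = median(batch)
    return median([abs(value - center) for value in batch])


print(mad([0, 7, 8, 12, 14]))    # absolute residuals about 8 are 8, 1, 0, 4, 6
print(mad([0, 7, 8, 12, 1400]))  # one wild value leaves the MAD unchanged
```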

Maximum

The largest value; equivalently, the 1st order statistic.  Abbreviated "max."

Mean

Arithmetic average: the sum of the values divided by the count.

Median

The (N+1)/2 order statistic of a batch of N values: a very robust measure of central tendency.

Midrange

The mean of the extremes: a (non-robust) measure of central tendency.

Minimum

The smallest value; equivalently, the Nth order statistic.  Abbreviated "min."

Order statistic

Let {X1, X2, ..., XN} be a batch.  When the values are ordered from highest to lowest we change indexes and write the values X[1], X[2], ..., X[N].  For an integer i between 1 and N, the ith order statistic is just X[i].  For a half-integer i+1/2 between 1 and N, the (i+1/2)th order statistic is (X[i] + X[i+1])/2, the average of the two order statistics surrounding i+1/2.  For example, in the batch {0, 7, 8, 12, 14}, the 2nd order statistic is 12 and the 2h (= 2 1/2) order statistic is (12 + 8)/2 = 10.
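The definition can be sketched in a few lines, following this page's convention of counting from the highest value down.  The function name `order_statistic` is hypothetical.

```python
import math


def order_statistic(batch, k):
    """k-th order statistic counting from the highest value; k may be a half-integer."""
    if (2 * k) % 1 != 0 or not 1 <= k <= len(batch):
        raise ValueError("k must be a whole or half number between 1 and N")
    descending = sorted(batch, reverse=True)
    first = descending[math.floor(k) - 1]
    second = descending[math.ceil(k) - 1]   # same element as `first` when k is whole
    return (first + second) / 2


batch = [0, 7, 8, 12, 14]
print(order_statistic(batch, 2))     # the 2nd order statistic
print(order_statistic(batch, 2.5))   # the 2h order statistic
```

Applied to the batch {0, 7, 8, 12, 14}, the calls reproduce the values 12 and 10 from the example.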

Outlier

Any value singled out in a batch because it is "far away" from most of the other values.

Outside

Any value between the inner and outer fences: specifically, a value as or more extreme than the inner fence, but not far out.

Range

The difference between the maximum and the minimum.

Residual

The difference between a value and some reference value.  We can always write Residual = Value - Reference or, equivalently, Value = Reference + Residual.

Resistant

A statistic is "resistant" when it changes very little even when one or more values in the batch is altered by quite a lot.  Some statistics, such as the median, are resistant to changes in up to 50% of the data in a batch.  Other statistics, such as the mean or standard deviation, have no resistance at all.
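A small demonstration of the contrast, using a hypothetical batch in which a single value is altered by a large amount:

```python
from statistics import mean, median

batch = [0, 7, 8, 12, 14]
altered = [0, 7, 8, 12, 1400]   # the largest value changed by quite a lot

# The median stays put; the mean is dragged far from its original value.
print("median:", median(batch), "->", median(altered))
print("mean:  ", mean(batch), "->", mean(altered))
```

The median is 8 for both batches, while the mean moves from 8.2 to 285.4.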

Robust

Resistant.

Standard deviation

The square root of the variance.  This is a conventional measure of spread in a batch.  Its units of measurement are the same as the original units.  Abbreviated "sd."

Statistic

The output of some numerical algorithm applied to a batch of values.  Conventional statistics include the mean, median, extremes, variance, and H-spread, for example.

Stem

The sequence of initial digits forming the spine of a stem-and-leaf diagram.  See "leaf" for an example.
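To make the stem and leaf split concrete, here is a minimal sketch that builds a display for non-negative data summarized in groups of 10.  The function name `stem_and_leaf`, the layout of each line, and the sample batch are hypothetical; leaves are truncated rather than rounded.

```python
def stem_and_leaf(batch):
    """Return the lines of a stem-and-leaf display for non-negative values."""
    rows = {}
    for value in batch:
        stem = int(value // 10)   # leading digits, e.g. 41 for 413.7
        leaf = int(value % 10)    # next digit, truncated, e.g. 3 for 413.7
        rows.setdefault(stem, []).append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):   # keep empty stems so gaps show
        leaves = "".join(str(leaf) for leaf in sorted(rows.get(stem, [])))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


for line in stem_and_leaf([413.7, 402, 398, 415, 411.2, 431]):
    print(line)
```

Grouping by stem does most of the sorting; the leaves on each line are sorted here as a final touch.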

Step

1.5 times the H-spread.  Used to establish the fences.

Value

One number in a batch.

Variance

The variance is formed by first squaring the residuals relative to the mean.  The sum of these squared residuals, divided by one less than the count, is the variance.  Abbreviated "var."
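The recipe translates directly into code.  The sketch below also derives the standard deviation and the CV from it; the function names are hypothetical, and the batch is assumed to hold at least two values (and, for the CV, only positive ones).

```python
import math


def variance(batch):
    """Sum of squared residuals about the mean, divided by (count - 1)."""
    count = len(batch)
    mean = sum(batch) / count
    squared_residuals = [(value - mean) ** 2 for value in batch]
    return sum(squared_residuals) / (count - 1)


def standard_deviation(batch):
    """Square root of the variance; same units as the data."""
    return math.sqrt(variance(batch))


def coefficient_of_variation(batch):
    """Ratio of the standard deviation to the mean."""
    return standard_deviation(batch) / (sum(batch) / len(batch))


batch = [1, 2, 3, 4, 5]
print(variance(batch), standard_deviation(batch), coefficient_of_variation(batch))
```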

Whisker

A line extending from one end of the box to the "adjacent" value in a box-and-whisker plot.

Xi

A notation for referencing values in a batch according to their place (i) within the batch, taking the values in the order originally listed (before any sorting).

X[i]

A notation for referencing values in a batch according to their place (i) within the batch, after its values have been sorted from highest to lowest: the ith order statistic.

1

The notation on an N-letter summary for the extremes.

Rules and shortcuts, tips and tricks

Here are some things we learned, in no particular order:

Constructing a stem-and-leaf plot automatically sorts the data very efficiently (a "radix sort").  Therefore you need not bother to sort a batch if you begin by making a stem-and-leaf plot.
Often the step will be about twice the standard deviation.
The classical descriptive statistics mean, variance, standard deviation, and CV are not robust.  Neither are the max, min, or range.  Other order statistics and statistics derived from them (like the H-spread and Step) have varying degrees of resistance.
The mean is always between the min and the max.
Use the "h" notation to save time and avoid fussing with halves.
Check your work by counting things.  For example, in a stem-and-leaf plot add the three middle lines (two depths plus the count of leaves in the middle) to confirm they sum to the count.
Watch out for transcription errors when manipulating data.  Unpublished studies consistently indicate about 5% of all transcribed data records will be in error.  If errors occur independently, this means the odds are greater than 50-50 that any set of 14 or more records you transcribe will contain some error (1 - 0.95^14 is about 0.51; for 10 records the chance is already about 40%).  When your work matters, double-check all transcriptions.
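The arithmetic behind the transcription warning can be checked with a short sketch.  It assumes each record is in error independently with probability 5%; the function names are hypothetical.

```python
def chance_of_some_error(records, error_rate=0.05):
    """Probability that at least one of `records` transcribed records is wrong."""
    return 1 - (1 - error_rate) ** records


def records_for_even_odds(error_rate=0.05):
    """Smallest number of records for which an error is more likely than not."""
    records = 1
    while chance_of_some_error(records, error_rate) <= 0.5:
        records += 1
    return records


print(round(chance_of_some_error(10), 3))   # chance of some error in 10 records
print(records_for_even_odds())              # records needed to pass 50-50
```

Under that independence assumption the chance of at least one error is about 0.40 for 10 records and first exceeds one half at 14 records.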

Notes on Chapters 1 and 3


This page is copyright (c) 2001 Quantitative Decisions.  Please cite it as

This page was created 11 January and last updated 26 February 2001 (the "Notes on the Text" has been moved).