Smoothing

Home
Up

 

Overview

This software contains a set of VBA macros for Excel (versions 97/2000/2002/2003) that implement a single new statistical function, SMOOTH.  This function can perform all the standard smoothing methods of exploratory data analysis with a high degree of flexibility.  (By "standard" we mean every one described by John Tukey in his book "Exploratory Data Analysis".)  These include:

  • Local smoothing by medians, of any window size.
  • Indefinitely repeated median smoothing.
  • Skip means.
  • Hanning.
  • Splitting.
  • Twicing, thricing, ...
  • Re-roughing.
  • Simultaneous smoothing of multiple coordinates.
  • Choice of copying-on or extrapolating end values.

You can accomplish such a wide variety of things with just one function by providing it a smoothing command, which is a string, like "3RSSH", that describes the sequence of operations to perform on a series of numbers.

The software is available here as an Excel add-in (96KB, not zipped).  Do not open it directly: save it somewhere on your computer, then see the installation instructions below for how to install it.

Syntax

SMOOTH is an array function.  This means it operates on a collection of data within a row or column (or array of rows and columns) and its result is an array of the same size.  To accomplish this, you first select the array where the result will go, you type the function name and its arguments in the formula line, and then you press Ctrl-Shift-Enter to complete the formula (rather than the usual Enter).  Excel will confirm this by putting the formula in braces {}.  Note, though, that you do NOT type the braces themselves.

For example, to smooth the series of numbers in cells B2 through B30 with the 3RSSH smoother, you would type

=SMOOTH(B2:B30, "3RSSH")

after selecting cells B2 through B30.  The formula will appear in Excel as {=SMOOTH(B2:B30, "3RSSH")}.

When the selection is a portion of a single row, SMOOTH will operate as expected and smooth the sequence across the row.

When the selection is an array, then by default, SMOOTH will operate separately on each column of the array.  If instead you want it to operate separately on the rows, then set the optional third argument to true, as in:

=SMOOTH(A1:Z2, "3RSSH", true)

which smooths two rows of 26 values each.

SMOOTH does not accept disjoint ranges of input: it expects a single, contiguous array for its first argument.

How to Specify a Smooth

An EDA smooth is a sequence of separate operations.  Each operation is specified by a short code, called a command.  The smooth specification is just the list of commands, from left to right.  In most cases no spaces or delimiters of any kind are needed.  Where a delimiter is needed, use a space.

For example, "3" is the command for a median-of-three smooth and "H" is the command for Hanning.  To perform a median-of-three followed by Hanning, use "3H" or "3 H" (the number of spaces separating the two commands does not matter).

The number "5" is the command for a median-of-five smooth.  To perform a median-of-three followed by a median-of-five, you MUST put a space in: "3 5" is right; "35" is wrong--it specifies a median-of-35 smooth!  Following one median smooth immediately by another is the only situation where you need to insert spaces between the commands.

Here are the available commands.

0, 1, 2, 3, 4, 5, ...  Perform a median smooth with the given window size.  For medians of three, smooth the two end values using Tukey's extrapolation method.  Of course, no smoothing is done for window sizes of 0 and 1, but these sizes are accepted as valid commands anyway.
R When a median smooth is immediately followed by an "R" (repeat) command, then continue to apply the median smooth until no more changes occur.
' "Copy on".  When a median-of-three smooth (or a following "R") command is followed immediately by a prime, then do not smooth the end values (which themselves cannot directly be smoothed).  " 3'R " and "3 R' " mean the same thing.
S "Split" the sequence.  This process dissects the sequence into shorter subsequences at all places where two successive values are identical, applies a 3R smooth to each subsequence, reassembles them, and polishes the result with a 3 smooth.
H "Hann" the sequence.  This is convolution with a symmetrical kernel having weights (1/4, 1/2, 1/4).  That is, each value x[i] is replaced by x[i-1]/4 + x[i]/2 + x[i+1]/4.  The end values are not changed.
> "Skip mean" the sequence.  This is convolution with a symmetrical kernel having weights (1/2, 0, 1/2).  That is, each value x[i] is replaced by x[i-1]/2 + x[i+1]/2, the average of its neighbors.  The end values are not changed.
, "Re-roughing".  At the first appearance of a comma, the smoothed sequence will be saved, to be replaced (temporarily) by its "rough", which is the series of residuals (differences between the original and smoothed values).  Until the next appearance of a comma, the commands will be applied to the rough.  Another comma (or the end of the commands) will cause the original smooth to be added back to the smoothed rough.  This process can be repeated as many times as desired for repeated re-roughing.

SMOOTH does NOT implement a common shorthand, "twice".  For example, the specification "3RSSH,twice" is shorthand for "3RSSH,3RSSH": that is, the smoother 3RSSH is applied to the original data, then to its rough, and the smoothed rough is added back to the original smooth.  As you can see, this shorthand--although sometimes convenient--is not necessary.

The letter commands are not case sensitive: 3rssH is as good as 3RSSH, for instance.  Nor do any extra spaces matter; "3 r s s   h" has the same meaning as "3rssh".

If SMOOTH encounters any command it cannot process, it returns an error.  This is intended to prevent mis-interpretations.

Examples

Tukey uses a series of annual United States coal production, in millions of net tons per year, as an ongoing example.  The data for the years 1920 through 1968 inclusive are 

569 416 422 565 484 520 573 518 501 505 468 382 310 334 359 372 439 446 349 395 461 511 583 590 620 578 534 631 600 438 516 534 467 457 392 467 500 493 410 412 416 403 422 459 467 512 534 552 545

Suppose these have been entered into a spreadsheet in cells R2:R50.  For the 3RSSH,3RSSH,>,3RSSH smooth the formula is

=SMOOTH(R2:R50, "3RSSH,3RSSH,>,3RSSH")

For even greater flexibility you can type the smooth specification in another cell, say S1, and reference it:

=SMOOTH(R2:R50, S1)

Now it's fast and easy to experiment with the smoothing just by typing commands into cell S1.

Here is an Excel graph of these data along with two smooths produced this way.  About the last one Tukey writes, 'If arithmetic is no effort--perhaps because we have a friendly computer all programmed--we can be serious about trying "3RSSH3RSSH3,twice" or even "(3RSSH3RSSH3,3RSSH3RSSH3),3RSSH3RSSH3"..."  (Notice that SMOOTH does not need the parentheses.)

Interpretation: The first smooth--3RSSH,3RSSH,>,3RSSH--consists of a 3RSSH smooth (repeated medians of 3 with end smoothing, two splits, and Hanning) followed by applying the same smooth to the rough and adding that back in ("twicing"), skip-meaning the result, then smoothing the new rough yet again with 3RSSH and adding that back to the smooth for the final result.  Despite all the smoothing going on, this one does a very good job of following most of the data in detail, even at the endpoints, while ignoring a few large fluctuations around the general trend (in the 1920's especially).

Advice

Good smooths to start with include 3RSSH and 4 2 5 3H.  Experiment with twicing them: 3RSSH,3RSSH and 4 2 5 3H,4 2 5 3H.

To analyze the effects of individual commands, apply them separately.  For instance, first compute the 3 smooth, then the 3R smooth, then compute the S smooth of that, then the S smooth of that, and finally the H smooth.  (This is fast and easy to do by dragging and dropping the array formulas across a bunch of columns and referencing the commands in header cells.)

To smooth several sequences of the same length at once, put them into neighboring columns (or rows) and smooth the entire array in one go.  Tukey recommends this approach when observations are made at irregular times: smooth both the sequence of times and the sequence of data using the same smooth.  (We have discovered some problems near the ends using this approach.  Experimenting with SMOOTH will quickly give you insight into what works and what does not.)

There is no right or wrong smooth for your data.  The point is to glean useful information.  Depending on what you are trying to discover, you may be satisfied with a very light smooth (3, for instance) or an extremely heavy one (as shown in the figure above).

You will want to analyze the residuals separately.  They are straightforward to compute as the differences between the original values and the smoothed values.  Elsewhere on these pages we have posted Excel spreadsheets that compute useful exploratory statistics, such as the hinges, letter values, and so on.

Installation

Download the software at  smooth.xla into any folder you like.  It is an Excel add-in.  To install it, open any workbook.  Select the "Tools|Add-Ins..." menu.  Press the "Browse..." button.  Navigate to smooth.xla.  Select it.  Press OK, the OK again.

To uninstall the software, delete, move, or rename the smooth.xla file.

Quick Start

Test the SMOOTH function: type the values 1, 2, 3, 4, and 5 in cells A2 through A6.  Select cells B2 through B6.  In the Formula box type

=SMOOTH(A1:A5, B1)

Press Ctrl-Shift-Enter to complete the formula.  Copies of the values in A2:A6 should appear in the selection: because nothing is specified in cell B1, no smoothing has been done.  Now experiment by typing smooth specifiers in cell B1.  For instance, after typing

3R

in cell B1, you should see the repeated medians-of-three smooth in cells B2:B6.  (The values will be 2, 2, 3, 4, 5, 5).

When you are comfortable with this process, apply it to smooth any set of rows or columns in any of your spreadsheets.  To smooth along rows, remember to include a third argument, set to true, as in

=SMOOTH(A2:Z2, "3RSSH", true)

You can also use Excel's "Insert|Function..." dialog.  As before, select the cells where the results will go.  Open the "Insert|Function..." dialog.  Look for SMOOTH among the statistical functions.  Fill in its arguments.  When completing this dialog, hold down the control and shift keys while pressing OK.

Technical notes

When the median window size is odd, we can write it as 2*K+1.  The smoothed value at location i in the series x[1], x[2], ..., x[N]  is the middle value of {x[i-K], x[i-K+1], ..., x[i+K]}, provided i-K >= 1 and i+K <= N.  Otherwise, the smoothed value is the original value (except for medians of three).  For medians of three without copying on (that is, for the "3" smoother), the end values are smoothed last.  The first end value y[1] is smoothed by taking the median of {y[2], y[3], 3*y[2] - 2*y[3]} where y[] is the series of smoothed values.  The last end value y[N] is found in a symmetric fashion.

When the median window size is even, we can write it as 2*K.  The smoothed value at location i in the series is the average of the two middlemost values among {x[i-K+1], x[i-K+2], ..., x[i+K]}.  Thus this window "reaches out" slightly further to the right than to the left in the series.

Splitting might not always work as expected when your data are arbitrary floating-point numbers.  This is because two numbers can appear to be identical to you when in fact they might differ in one of the least significant digits.  Splitting works best when all original values are integers or at least are not too "complicated" when written in binary.  Preceding splitting by odd-length median smooths--which is standard--will usually avoid this potential problem.

SMOOTH produces results that do not always agree exactly with Tukey's calculations, even after accounting for his rounding methods.  In every case I have found, this is due to an error in the book, not in the software.  (Tukey performed tens of thousands of calculations for this book, probably all of them in his head or with pencil and paper, so it is no surprise some errors crept in.  Indeed, it is testimony to the robustness of his techniques that such errors were never of any consequence in the analysis.)

The 2, 3, and 3' smooths are computed with code specially optimized for them.  For wider medians (4, 5, ...), the algorithm moves a window across the data.  At each step, the value to be removed is searched for (using a binary search) and replaced by the new value in the window.  The window is then sorted using a heapsort.  Because most of the values in the window are already in sorted order after the first step, this process (search, then sort) should take about O(K) effort rather than the nominal O(Kln(K)), where K is the window size.  Thus the total effort is approximately proportional to O(NK), which should be reasonably fast even for large window sizes and long series.  (Excel seems to be able to process about NK = 150,000 such operations per second per GHz of CPU speed.  I have tested with N up to 5461 and K up to 101R.)

The length of any series you can process cannot exceed 2^15 / 6 = 5461 elements.  This is an artificial limitation apparently built into Excel.

The macros implementing SMOOTH make minimal use of the Excel object model.  Thus, they can easily be ported to any other VBA environment, such as Access, Manifold, or ArcGIS, with very little change.

Google
ColorRamp, Memorized Calculations, Rotate, Sample, XSect, and Tissot  are  trademarks of Quantitative Decisions.  All other products mentioned are registered trademarks or trademarks of their respective companies.
Questions or problems regarding this web site should be directed to information (@quantdec.com).
Copyright © 2000-2002 Quantitative Decisions.  All rights reserved.
Last modified: Wednesday March 03, 2004.