|
|
|
OverviewThis software contains a set of VBA macros for Excel (versions 97/2000/2002/2003) that implement a single new statistical function, SMOOTH. This function can perform all the standard smoothing methods of exploratory data analysis with a high degree of flexibility. (By "standard" we mean every one described by John Tukey in his book "Exploratory Data Analysis".) These include:
You can accomplish such a wide variety of things with just one function by providing it a smoothing command, which is a string, like "3RSSH", that describes the sequence of operations to perform on a series of numbers. The software is available here as an Excel add-in (96KB, not zipped). Do not open it directly: save it somewhere on your computer, then see the installation instructions below for how to install it. SyntaxSMOOTH is an array function. This means it operates on a collection of data within a row or column (or array of rows and columns) and its result is an array of the same size. To accomplish this, you first select the array where the result will go, you type the function name and its arguments in the formula line, and then you press Ctrl-Shift-Enter to complete the formula (rather than the usual Enter). Excel will confirm this by putting the formula in braces {}. Note, though, that you do NOT type the braces themselves. For example, to smooth the series of numbers in cells B2 through B30 with the 3RSSH smoother, you would type
after selecting cells B2 through B30. The formula will appear in Excel as {=SMOOTH(B2:B30, "3RSSH")}. When the selection is a portion of a single row, SMOOTH will operate as expected and smooth the sequence across the row. When the selection is an array, then by default, SMOOTH will operate separately on each column of the array. If instead you want it to operate separately on the rows, then set the optional third argument to true, as in:
which smooths two rows of 26 values each. SMOOTH does not accept disjoint ranges of input: it expects a single, contiguous array for its first argument. How to Specify a SmoothAn EDA smooth is a sequence of separate operations. Each operation is specified by a short code, called a command. The smooth specification is just the list of commands, from left to right. In most cases no spaces or delimiters of any kind are needed. Where a delimiter is needed, use a space. For example, "3" is the command for a median-of-three smooth and "H" is the command for Hanning. To perform a median-of-three followed by Hanning, use "3H" or "3 H" (the number of spaces separating the two commands does not matter). The number "5" is the command for a median-of-five smooth. To perform a median-of-three followed by a median-of-five, you MUST put a space in: "3 5" is right; "35" is wrong--it specifies a median-of-35 smooth! Following one median smooth immediately by another is the only situation where you need to insert spaces between the commands. Here are the available commands.
SMOOTH does NOT implement a common shorthand, "twice". For example, the specification "3RSSH,twice" is shorthand for "3RSSH,3RSSH": that is, the smoother 3RSSH is applied to the original data, then to its rough, and the smoothed rough is added back to the original smooth. As you can see, this shorthand--although sometimes convenient--is not necessary. The letter commands are not case sensitive: 3rssH is as good as 3RSSH, for instance. Nor do any extra spaces matter; "3 r s s h" has the same meaning as "3rssh". If SMOOTH encounters any command it cannot process, it returns an error. This is intended to prevent mis-interpretations. ExamplesTukey uses a series of annual United States coal production, in millions of net tons per year, as an ongoing example. The data for the years 1920 through 1968 inclusive are 569 416 422 565 484 520 573 518 501 505 468 382 310 334 359 372 439 446 349 395 461 511 583 590 620 578 534 631 600 438 516 534 467 457 392 467 500 493 410 412 416 403 422 459 467 512 534 552 545 Suppose these have been entered into a spreadsheet in cells R2:R50. For the 3RSSH,3RSSH,>,3RSSH smooth the formula is
For even greater flexibility you can type the smooth specification in another cell, say S1, and reference it:
Now it's fast and easy to experiment with the smoothing just by typing commands into cell S1. Here is an Excel graph of these data along with two smooths produced this way. About the last one Tukey writes, 'If arithmetic is no effort--perhaps because we have a friendly computer all programmed--we can be serious about trying "3RSSH3RSSH3,twice" or even "(3RSSH3RSSH3,3RSSH3RSSH3),3RSSH3RSSH3"..." (Notice that SMOOTH does not need the parentheses.)
Interpretation: The first smooth--3RSSH,3RSSH,>,3RSSH--consists of a 3RSSH smooth (repeated medians of 3 with end smoothing, two splits, and Hanning) followed by applying the same smooth to the rough and adding that back in ("twicing"), skip-meaning the result, then smoothing the new rough yet again with 3RSSH and adding that back to the smooth for the final result. Despite all the smoothing going on, this one does a very good job of following most of the data in detail, even at the endpoints, while ignoring a few large fluctuations around the general trend (in the 1920's especially). AdviceGood smooths to start with include 3RSSH and 4 2 5 3H. Experiment with twicing them: 3RSSH,3RSSH and 4 2 5 3H,4 2 5 3H. To analyze the effects of individual commands, apply them separately. For instance, first compute the 3 smooth, then the 3R smooth, then compute the S smooth of that, then the S smooth of that, and finally the H smooth. (This is fast and easy to do by dragging and dropping the array formulas across a bunch of columns and referencing the commands in header cells.) To smooth several sequences of the same length at once, put them into neighboring columns (or rows) and smooth the entire array in one go. Tukey recommends this approach when observations are made at irregular times: smooth both the sequence of times and the sequence of data using the same smooth. (We have discovered some problems near the ends using this approach. Experimenting with SMOOTH will quickly give you insight into what works and what does not.) There is no right or wrong smooth for your data. The point is to glean useful information. Depending on what you are trying to discover, you may be satisfied with a very light smooth (3, for instance) or an extremely heavy one (as shown in the figure above). You will want to analyze the residuals separately. They are straightforward to compute as the differences between the original values and the smoothed values. Elsewhere on these pages we have posted Excel spreadsheets that compute useful exploratory statistics, such as the hinges, letter values, and so on. InstallationDownload the software at smooth.xla into any folder you like. It is an Excel add-in. To install it, open any workbook. Select the "Tools|Add-Ins..." menu. Press the "Browse..." button. Navigate to smooth.xla. Select it. Press OK, the OK again. To uninstall the software, delete, move, or rename the smooth.xla file. Quick StartTest the SMOOTH function: type the values 1, 2, 3, 4, and 5 in cells A2 through A6. Select cells B2 through B6. In the Formula box type
Press Ctrl-Shift-Enter to complete the formula. Copies of the values in A2:A6 should appear in the selection: because nothing is specified in cell B1, no smoothing has been done. Now experiment by typing smooth specifiers in cell B1. For instance, after typing
in cell B1, you should see the repeated medians-of-three smooth in cells B2:B6. (The values will be 2, 2, 3, 4, 5, 5). When you are comfortable with this process, apply it to smooth any set of rows or columns in any of your spreadsheets. To smooth along rows, remember to include a third argument, set to true, as in
You can also use Excel's "Insert|Function..." dialog. As before, select the cells where the results will go. Open the "Insert|Function..." dialog. Look for SMOOTH among the statistical functions. Fill in its arguments. When completing this dialog, hold down the control and shift keys while pressing OK. Technical notesWhen the median window size is odd, we can write it as 2*K+1. The smoothed value at location i in the series x[1], x[2], ..., x[N] is the middle value of {x[i-K], x[i-K+1], ..., x[i+K]}, provided i-K >= 1 and i+K <= N. Otherwise, the smoothed value is the original value (except for medians of three). For medians of three without copying on (that is, for the "3" smoother), the end values are smoothed last. The first end value y[1] is smoothed by taking the median of {y[2], y[3], 3*y[2] - 2*y[3]} where y[] is the series of smoothed values. The last end value y[N] is found in a symmetric fashion. When the median window size is even, we can write it as 2*K. The smoothed value at location i in the series is the average of the two middlemost values among {x[i-K+1], x[i-K+2], ..., x[i+K]}. Thus this window "reaches out" slightly further to the right than to the left in the series. Splitting might not always work as expected when your data are arbitrary floating-point numbers. This is because two numbers can appear to be identical to you when in fact they might differ in one of the least significant digits. Splitting works best when all original values are integers or at least are not too "complicated" when written in binary. Preceding splitting by odd-length median smooths--which is standard--will usually avoid this potential problem. SMOOTH produces results that do not always agree exactly with Tukey's calculations, even after accounting for his rounding methods. In every case I have found, this is due to an error in the book, not in the software. (Tukey performed tens of thousands of calculations for this book, probably all of them in his head or with pencil and paper, so it is no surprise some errors crept in. Indeed, it is testimony to the robustness of his techniques that such errors were never of any consequence in the analysis.) The 2, 3, and 3' smooths are computed with code specially optimized for them. For wider medians (4, 5, ...), the algorithm moves a window across the data. At each step, the value to be removed is searched for (using a binary search) and replaced by the new value in the window. The window is then sorted using a heapsort. Because most of the values in the window are already in sorted order after the first step, this process (search, then sort) should take about O(K) effort rather than the nominal O(Kln(K)), where K is the window size. Thus the total effort is approximately proportional to O(NK), which should be reasonably fast even for large window sizes and long series. (Excel seems to be able to process about NK = 150,000 such operations per second per GHz of CPU speed. I have tested with N up to 5461 and K up to 101R.) The length of any series you can process cannot exceed 2^15 / 6 = 5461 elements. This is an artificial limitation apparently built into Excel. The macros implementing SMOOTH make minimal use of the Excel object model. Thus, they can easily be ported to any other VBA environment, such as Access, Manifold, or ArcGIS, with very little change. |
ColorRamp, Memorized Calculations, Rotate, Sample, XSect, and Tissot are trademarks of Quantitative Decisions. All other products mentioned are registered trademarks or trademarks of their respective companies.
Questions or problems regarding this web site should be directed to information
(@quantdec.com).
|