Exercise 9c

Comments

These comments are based on the homework solutions you submitted.

The answers to questions 2-5 are general ones, based on general principles.  For a particular data distribution a general answer may be wrong and another answer correct; but that should be considered an accident of the data, not a quality of the procedure used to visualize the data.

In question 1, "histogram" means in the sense we used in the workshop: a bar chart of frequency versus bin.  It is helpful to label the bins by their breakpoints rather than their sequence (1, 2, 3, etc.).  A chart with one bar extended to the length of each value may be called a "histogram" by some but it is not very helpful for data analysis, nor relevant to the kinds of questions we are addressing here.
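As a rough illustration of this sense of "histogram," the sketch below counts values per equal-width bin and labels each bin by its breakpoints rather than by its sequence number.  The function name and the sample values are made up for illustration; this is not how Systat or ArcView builds its histograms.

```python
def histogram(values, num_bins):
    """Count values per equal-width bin; label each bin by its breakpoints."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: a single degenerate bin holds everything.
        return [(f"{lo:g} to {hi:g}", len(values))]
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin, not one past the end.
        i = min(int((v - lo) / width), num_bins - 1)
        counts[i] += 1
    labels = [f"{lo + i * width:g} to {lo + (i + 1) * width:g}"
              for i in range(num_bins)]
    return list(zip(labels, counts))


# Hypothetical data values, three bins.
for label, count in histogram([1, 2, 2, 3, 9, 10], 3):
    print(f"{label:>8}  {'#' * count}")
```

The empty middle bin in this made-up example is the kind of gap discussed in question 3.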

In question 2, please note that a quantile method by its very nature cannot single out extreme values unless you use a ridiculously large number of classes (more than N/2 for N data values).  The number of features assigned to any quantile must lie between Floor(N/K) and Ceiling(N/K), where there are K classes.  (Floor(x) = greatest integer less than or equal to x; Ceiling(x) = least integer greater than or equal to x.)  So, for K <= N/2, every quantile must contain at least two features.
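The counting argument can be checked with a short sketch.  It assumes the N features are split as evenly as possible among K classes and ignores tied values; the function name is hypothetical, and which classes receive the extra features is an arbitrary choice here.

```python
def quantile_class_sizes(n, k):
    """Sizes of k quantile classes holding n features, split as evenly as possible."""
    base, extra = divmod(n, k)  # base = Floor(n/k); extra classes get one more
    # Assumption: the first 'extra' classes take the additional feature.
    return [base + 1 if i < extra else base for i in range(k)]


print(quantile_class_sizes(10, 3))  # sizes lie between Floor(10/3) and Ceiling(10/3)
print(quantile_class_sizes(10, 5))  # K = N/2: every class still holds two features
print(quantile_class_sizes(10, 6))  # K > N/2: some classes finally hold one feature
```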

In question 5, almost all of you misinterpreted "reverse the order of the classes" and reversed the symbols instead.  (That suggests I made an error in communicating the distinction, so I want to fix it here.)  Make sure you understand the difference: the class is a bin, the symbol is a "crayon" (a property of how a shape is drawn, but not of the shape itself).  Review the figure beginning the class notes for Chapter 9 if you are still not sure.

Solutions

1.    Create a histogram of the data.

I used a single command in Systat 6.0 to produce histograms of all 53 numerical variables--Systat, like most software now, directly reads dBase files.  You can view the histograms on one page.  This means they are small--you can see the shapes, but you cannot read the variable names unless you zoom the view by 300 percent or more.  The histograms are arranged left to right, top to bottom, in the same order their variables appear in the table.

2.    Select a classification method to single out the highest or lowest value.

It will depend on the data.  Where a single value stands apart from the others in the histogram, natural breaks, standard deviation, and equal interval will do well.  Equal area may not work and quantile definitely will not work.

Where a single value does not stand apart, the only way to segregate it visually is to force it into its own class.  You do that by choosing a large number of classes or by manually specifying the class break.  (Sometimes you will get lucky--an extreme but not isolated value may just get clipped off from the other values by a standard deviation classification.  You cannot rely on this always happening, though.)
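One way to pick a manual class break is sketched below: put the break halfway between the two largest distinct values, so that only the maximum lies above it.  This is an illustrative choice, not a rule from the workshop, and the function name and sample values are hypothetical.  If several features tie for the maximum, they will all land in the top class together.

```python
def break_isolating_maximum(values):
    """Return a break halfway between the two largest distinct values."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    return (distinct[-2] + distinct[-1]) / 2


# Hypothetical data: the break falls between 7 and 12.
print(break_isolating_maximum([3, 7, 7, 12]))
```

The same idea, applied to the two smallest distinct values, isolates the minimum.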

3.    How to deal with gaps in the histogram?

The natural breaks method will tend to place class breaks in the gaps.  Quantile and equal area classifications will often not do this.  Only when the interval size in standard deviation or equal interval methods is smaller than a gap can you be assured that a class break will fall within the gap.
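The interval-size condition can be illustrated for the equal interval method with a small sketch.  The data values, the gap limits, and the function names are all made up for illustration.  With two classes the interval (10) is wider than the gap (3) and the break misses it; with ten classes the interval (2) is narrower than the gap and a break must land inside it.

```python
def equal_interval_breaks(values, num_classes):
    """Interior break points that split the data range into equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_classes
    return [lo + i * width for i in range(1, num_classes)]


def breaks_in_gap(breaks, gap_lo, gap_hi):
    """Breaks that fall strictly inside the gap (gap_lo, gap_hi)."""
    return [b for b in breaks if gap_lo < b < gap_hi]


# Hypothetical data with a gap between 4 and 7.
values = [0, 1, 2, 3, 4, 7, 8, 9, 10, 20]
print(breaks_in_gap(equal_interval_breaks(values, 2), 4, 7))   # interval 10: no break in gap
print(breaks_in_gap(equal_interval_breaks(values, 10), 4, 7))  # interval 2: a break in gap
```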

All histograms will show gaps when the bins are sufficiently narrow.  What constitutes a "gap" is difficult to define and depends on many things.  Sometimes people interpret gaps as evidence that the data derive from two or more underlying "populations."  In that case there are advanced statistical methods available to obtain useful information about those populations.

4.    Which classification methods help reveal spatial trends?

Consider the equal area method first to see any global trend (in the usual sense of a systematic variation with location).  If instead you want to see anomalous areas, then you will need to be able to distinguish extreme values as well as gaps--natural breaks might be a good starting choice.

5.    What happens when the order of classifications is reversed?

The natural breaks method uses actual data values for break points.  This means that each break point value belongs to at least two classes (normally the two adjacent ones), so a feature having that value could be assigned to either of them.  Recall that ArcView determines how to render each feature by a top-to-bottom linear search of the classes; the feature gets the first class that contains its value.  Therefore, when you reverse the order of the classes, ArcView will necessarily change the class assignment for any feature having a break point value.
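The effect of the linear search can be modeled in a few lines.  This is only a sketch of the behavior described above, not ArcView's actual code: it assumes each class is a closed range (both end values included), and the class ranges, data values, and function names are hypothetical.

```python
def first_matching_class(value, classes):
    """Top-to-bottom search: return the first (lo, hi) range containing value."""
    for lo, hi in classes:
        if lo <= value <= hi:  # assumption: ranges are closed at both ends
            return (lo, hi)
    return None


def changed_by_reversal(values, classes):
    """Values whose class assignment differs once the class order is reversed."""
    reversed_classes = list(reversed(classes))
    return [v for v in values
            if first_matching_class(v, classes)
            != first_matching_class(v, reversed_classes)]


# Hypothetical classes whose break points (3.0 and 5.0) are shared by neighbors.
classes = [(1.0, 3.0), (3.0, 5.0), (5.0, 9.0)]
values = [1.0, 3.0, 4.0, 5.0, 9.0]
print(changed_by_reversal(values, classes))  # only the break point values change
```

In this model only the values that sit exactly on a break point move to a different class; every other value is found in the same range regardless of search order.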

For example, consider the [PCT22-24YR] field with five classifications (ArcView's default).  There is an interesting story here, where one county has an unusually low value (percentage of young adults in this case) but it is nestled right against the counties with the highest value:

Now reverse the classification order (use the Legend Editor's button).  This does not change the symbol assigned to each class.  The resulting map is this:

Can you identify the five counties whose symbols (colors) changed?  (There are five because two of them have a value of 4.63, one of the breakpoints.)  The outlying county is now visually grouped with five others.  This map misses the message in the data!

Only the original classification order corresponds to the intended "natural breaks" visualization.  A subtle change in how the data are displayed can have significant consequences: therefore, to do an adequate job, you need to have a deep, clear understanding of the data and of their method of display.