Chapter 9: Classifying and Displaying Features

Summary of principles

The feature/symbol dichotomy: A GIS divides the functions of representing data on a map into at least four conceptually different parts: the attributes are classified (classification) and associated with geometric figures (projection), the classifications are associated with graphic symbols, and those symbols determine how the geometric figures will appear (symbolization of the graphic shapes).
Emphasize substance over form: When working with complex software, like a GIS, deal with the important things first and take care of the details as late in the process as possible.
Look for the familiar within the strange terminology: Many disciplines have, or could have, contributed to the development and evolution of GIS.  Many techniques may already be familiar to you under different names.  Do not let the terminology daunt you!
Maps are statistics: Symbolizing features on a map to reflect underlying data is a statistical procedure.  You can use statistical techniques to help determine appropriate symbolization methods.
You are in charge: Always have a reason for your choice of data symbolization: do not let the computer choose.

Exercises 9a and b—Introducing the Legend Editor

Background

The Legend Editor is a complex dialog that provides access to almost all of ArcView’s capabilities to classify features and modify their display.  Every feature has a geographic location.  The view draws a graphic shape at each feature’s location in order to show it to a person.  A graphic shape is a combination of a geometric shape and a symbol, which records the visual properties of the shape, such as its colors, line styles, and so on.  The shape is determined by the geographic data, but the means of displaying the shape are not.  Determining how the shape will display is the role of the Legend Editor.  The Legend Editor can make this determination based on the values of one or more attributes of each feature.  In this manner ArcView can show data on a map.

The Legend Editor sets up a sequence of processes.   The first is classification.   This assigns features to groups according to the values of one or more of the feature attributes.   The second is symbolization.   This associates a symbol with each class.   The figure sketches this process.

By glancing at the figure you can work out exactly what must be specified in the Legend Editor:

The attributes whose values will determine the symbols
The method by which attributes will be classified
What each classification will look like (colors, line styles, hatch patterns, point symbols, text fonts)

In addition, the Legend Editor controls how the classifications will be named or labeled in the View’s table of contents and provides capabilities to save and reload a legend.

GIS software differs in how it implements the Legend Editor capabilities, but it must implement these capabilities at some level.
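
The two-stage process described above (classify, then symbolize) can be sketched in a few lines of Python.  This is only an illustration, not ArcView's implementation: the feature names, attribute values, class breaks, and symbol names are all hypothetical, and returning None for a value outside every class is an arbitrary choice made for this sketch.

```python
from bisect import bisect_left

def classify(value, breaks):
    """Return the index of the class containing value, or None.

    breaks = [b0, b1, ..., bK] defines K classes.  Class i covers
    (b_i, b_{i+1}], except that the first class also includes b0.
    """
    if value < breaks[0] or value > breaks[-1]:
        return None  # assumption: values outside all classes get no class
    return max(bisect_left(breaks, value) - 1, 0)

def symbolize(features, breaks, symbols):
    """Map each feature name to the symbol of its class (None if unclassified)."""
    legend = {}
    for name, value in features.items():
        index = classify(value, breaks)
        legend[name] = None if index is None else symbols[index]
    return legend

# Hypothetical attribute values for four features.
features = {"A": 3.0, "B": 12.5, "C": 7.2, "D": 19.9}
breaks = [0.0, 5.0, 10.0, 20.0]                      # three classes
symbols = ["light red", "medium red", "dark red"]    # one symbol per class

legend = symbolize(features, breaks, symbols)
print(legend)
```

Note that the three items a legend needs (the attribute, the classification, and the symbol for each class) appear here as three separate inputs, so any one can be changed without touching the others.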

ArcView implementation

Invoke the Legend Editor through a menu item (Theme|Edit Legend), a toolbar button, or most simply by double-clicking on a theme’s legend in the view.

Things to watch out for:

You have to press the Apply button before any changes will take effect.
The Legend Editor dialog is in a separate window that can be moved outside the ArcView window.  However, it will never disappear behind ArcView.  It can remain open while you do other ArcView operations (it is “modeless,” not a child window).
The Legend Editor will act differently on different kinds of themes: feature themes, image themes, and grid themes.
The Legend Editor dialog has long been in need of improvement.  Typically, changes to the legend type, values field, or classification type will cause unwanted side-effects, often destroying any colors or symbols you have already specified.  The trick lies in specifying this fundamental information first.  Only when you are sure you have chosen the right kind of legend, the correct attribute(s) to display, and the correct number of classifications should you bother with the details of specifying the symbols themselves.

This last rule applies to all complex software: deal with the important things first and take care of the details as late in the process as possible (or never, if you can get away with it).

Laboratory Exercises

Find the answers by guessing and experimenting.

  1. Describe what the following Legend Editor buttons do.
  2. How many distinct colors appear in the “Red monochromatic” color scheme?  Can you find a way to obtain more?
  3. Practice editing symbols.  (You double-click on one to change it.)   Can you find a way to edit multiple symbols simultaneously?
  4. What happens when you type something into the “Value” column?
  5. What happens when you type something into the “Label” column?
  6. Suppose two classifications overlap; for example, suppose one classification is 0-10 and the next is 5-20.   How does ArcView resolve this conflict; that is, how does it decide which classification to use for values within the overlapping region?
  7. What happens when an attribute’s value does not fall into any classification in a legend?
  8. Figure out what this button does and how it is relevant to the previous question.
  9. (Optional)  The “Statistics” button is handy.  Experiment with it.  Note that there are two definitions of “standard deviation” (SD).   One is the “root mean square” (of the deviations from the mean), or “population” standard deviation.   The other has been adjusted to reduce its estimation bias; it is √(N/(N-1)) times larger than the population SD when N values are involved.   Which SD does ArcView compute?
  10. When is the “Advanced” button enabled?   Optional (for now): determine what the advanced options do.
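
For reference when working on question 9, the two definitions of standard deviation can be compared with Python's standard library.  The data values below are made up for illustration, and the sketch says nothing about which definition ArcView reports; it only shows how to tell the two apart on a small data set you can check by hand.

```python
import math
import statistics

# Hypothetical data: eight values with mean 5.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)

# "Population" SD: root mean square of the deviations from the mean.
population_sd = statistics.pstdev(values)

# Adjusted SD: divides the sum of squares by N - 1 instead of N.
sample_sd = statistics.stdev(values)

# The adjusted SD is sqrt(N / (N - 1)) times the population SD.
ratio = sample_sd / population_sd

print(f"population SD = {population_sd:.4f}")
print(f"adjusted SD   = {sample_sd:.4f}")
print(f"ratio         = {ratio:.4f}")
print(f"sqrt(N/(N-1)) = {math.sqrt(n / (n - 1)):.4f}")
```

Entering the same small data set into a table and comparing the reported SD with these two numbers is one way to answer the question.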

Exercise 9c—Selecting a Numerical Classification Method

Every map is to some extent a distortion of reality.   Through this distortion some maps reveal patterns and other maps lie [see Monmonier or Tufte, for instance].   One of the subtlest forms of lying with maps is associated with the method of classifying numerical attributes.

We will experiment with different classification methods.   The material here is background.   For more information see http://www.colorado.edu/geography/gcraft/notes/cartocom/cartocom_f.html (section 6).

Types of numerical classifications

For each classification type below, N numerical values are to be divided into K classifications (“classes”) according to the rules to be described.   Suppose the ordered numerical values are X1 ≤ X2 ≤ … ≤ XN and that they are associated with features F1, F2, …, FN respectively.  In every case the classifications consist of non-overlapping intervals of numbers.  The endpoints of the classes are the “breaks” or “cut points.”

Quantile

The K classes will be constructed to contain as near to N/K values as practicable.  For example, if N is 18 and K is 5, then N/K = 3.6, so the classes will contain either 3 or 4 values each.  The cut points will always coincide with one of the original data values.
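
A minimal Python sketch of the quantile rule follows.  It assumes there are no tied values and that the larger classes come first; the function names are hypothetical, and real software may distribute the leftover values differently.

```python
def quantile_class_sizes(n, k):
    """Split n values into k classes whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    # Assumption: the first `extra` classes receive one additional value.
    return [base + 1 if i < extra else base for i in range(k)]

def quantile_breaks(values, k):
    """Return the upper cut point of each class; each is an actual data value."""
    xs = sorted(values)
    breaks = []
    end = 0
    for size in quantile_class_sizes(len(xs), k):
        end += size
        breaks.append(xs[end - 1])
    return breaks

sizes = quantile_class_sizes(18, 5)
print(sizes)                                # class sizes for N = 18, K = 5
print(quantile_breaks(range(1, 19), 5))     # cut points for the values 1..18
```

With N = 18 and K = 5 every class holds 3 or 4 values, as in the example above.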

Equal interval

The range from the smallest value X1 to the largest XN is divided into K intervals of equal length.  The length is therefore L = (XN - X1)/K and the endpoints are X1, X1 + L, X1 + 2L, …, XN - L, and XN.  Some intervals may contain none of the original values (can you think of a simple example?).
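
The equal interval rule is simple enough to sketch directly: divide the range of the data into equal-length intervals, one per class.  The function names and the sample data are hypothetical, and the sketch assumes the values are not all identical.

```python
def equal_interval_breaks(values, k):
    """Return the k + 1 endpoints of k equal-length classes."""
    lo, hi = min(values), max(values)
    length = (hi - lo) / k
    # Use hi itself as the last endpoint to avoid rounding drift.
    return [lo + i * length for i in range(k)] + [hi]

def class_counts(values, k):
    """Count how many values fall in each equal-length class."""
    lo, hi = min(values), max(values)
    length = (hi - lo) / k
    counts = [0] * k
    for x in values:
        # The largest value belongs to the last class.
        counts[min(int((x - lo) / length), k - 1)] += 1
    return counts

data = [0, 1, 2, 100]   # hypothetical values with one large outlier
print(equal_interval_breaks(data, 4))
print(class_counts(data, 4))
```

Printing the counts alongside the endpoints makes it easy to see when an outlier leaves some classes empty.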

Standard deviation

This is almost another kind of equal interval classification.  However, the interval length is set to be some multiple of the standard deviation of the data.  Multiples of 1 and 0.5 are common; smaller multiples may be used with lots of data.  The starting value for laying off multiples is the mean of the data.  Chebysheff’s Theorem states that no more than 1/L² of the data values can lie beyond ±L standard deviations of the mean, so typically the intervals beyond L=±3 or L=±4 are merged since they will not contain many data at all.
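
As a rough sketch, the break points can be laid off from the mean in both directions, and the Chebysheff bound can be checked on any data set.  The choice of the population standard deviation, the function names, and the sample data are all assumptions made for this illustration.

```python
import statistics

def sd_breaks(values, multiple=1.0, max_sd=3.0):
    """Break points at mean + i * multiple * SD, out to +/- max_sd SDs."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)   # assumption: population SD
    steps = int(max_sd / multiple)
    return [mean + i * multiple * sd for i in range(-steps, steps + 1)]

def fraction_beyond(values, l):
    """Fraction of values lying more than l SDs from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(x - mean) > l * sd for x in values) / len(values)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical: mean 5, SD 2
print(sd_breaks(data))               # one-SD intervals centered on the mean
for l in (1, 2, 3):
    # Chebysheff: this fraction can never exceed 1 / l**2.
    print(l, fraction_beyond(data, l), 1 / l**2)
```

Everything below the first break and above the last break would form the two merged outer classes.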

Equal area

The breaks between the classes are set so that the total area of the features in each class is as close as possible to 1/K times the total area of all features.  (This of course makes sense only for features with areas—that is, for polygonal features.)  The result usually is a map that has about the same amount of every symbol on it—a kind of visual balance.  (Thought question: could you describe, in detail, an algorithm for determining the equal area cutpoints?)

Equal length and equal number legends are conceivable for polyline and multipoint themes, but ArcView does not implement these.  (The features of a "multipoint" theme consist of zero, one, or more points.)
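
One possible answer to the thought question above is sketched here; skip it if you would rather work out your own first.  It sorts the features by attribute value, accumulates their areas, and places each cut point where the running total comes closest to the next multiple of 1/K of the total area.  This is a simple heuristic under stated assumptions (no tied values, hypothetical data and function name), not a description of how ArcView does it.

```python
def equal_area_breaks(features, k):
    """features: list of (value, area) pairs.  Returns k - 1 interior cut values.

    Each cut value is the largest attribute value in its class.
    """
    ordered = sorted(features)            # sort by attribute value
    total = sum(area for _, area in ordered)

    # Cumulative area after including each feature, in value order.
    cumulative = []
    running = 0.0
    for _, area in ordered:
        running += area
        cumulative.append(running)

    breaks = []
    for i in range(1, k):
        target = i * total / k
        # Pick the feature whose cumulative area is nearest the target.
        j = min(range(len(cumulative)), key=lambda idx: abs(cumulative[idx] - target))
        breaks.append(ordered[j][0])
    return breaks

# Hypothetical polygons: (attribute value, area).
polygons = [(1, 10.0), (2, 10.0), (3, 40.0), (4, 20.0), (5, 20.0)]
print(equal_area_breaks(polygons, 2))
```

Because each cut point is chosen independently, two cut points can coincide when one very large polygon dominates the total area; a careful implementation would have to decide what to do in that case.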

Natural breaks

Given a desired number of classes, K, the Natural Breaks method partitions the data into K subsets that minimize the sum of the "spreads" within each subset.  ("Spread" is an informal term employed solely for this description.)

When the data represent a random sample from a population consisting of two or more distinctly different subpopulations, and you know (or can accurately guess) how many different subpopulations there are, then Natural Breaks can do a good job of choosing classes which reflect the subpopulation groupings.

Any nonempty set of numbers can be centered by subtracting the mean of those numbers from each value.  The resulting values are known as residuals.  One measure of spread is the sum of squares of the residuals.  It is a measure of how much the values vary about their mean, and so is appropriate for identifying clusters of data.
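
The sum-of-squares measure of spread is easy to compute by hand or in code.  The short sketch below uses made-up numbers to show why it helps identify clusters: splitting clustered data at the gap sharply reduces the total spread.

```python
def spread(values):
    """Sum of squared residuals of the values about their mean."""
    mean = sum(values) / len(values)
    residuals = [x - mean for x in values]
    return sum(r * r for r in residuals)

data = [1, 2, 3, 10, 11, 12]         # hypothetical data with two clusters

one_class = spread(data)
two_classes = spread(data[:3]) + spread(data[3:])

print(one_class)      # spread when everything is in one class
print(two_classes)    # total spread after splitting at the gap
```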

Statistical methods exist to identify clusters in data.  A common one is “K-means,” which (given K) minimizes the total within-cluster variance.  This method, which has been around a long time, was rediscovered by cartographers in 1963 and so goes by the name “Jenks’ Method” in that community.

Borden Dent (Cartography: Thematic Map Design, Fifth Edition, 1999, pages 147-149) relates that the Natural Breaks classes are found by an iterative search to minimize the sum of spreads of the classes.  Thus it is a one-dimensional example of the K-means clustering method.  Natural breaks, Jenks' Method, and K-means are all the same in this instance.
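
To make the objective concrete, here is a small Python sketch that finds the classes minimizing the total within-class sum of squares.  It is not the iterative search Dent describes, and it is not ArcView's code: for simplicity it uses an exhaustive dynamic-programming search, which is practical in one dimension because classes must be contiguous runs of the sorted values.  The function name and data are hypothetical, and the sketch assumes there are at least K values.

```python
def natural_breaks(values, k):
    """Partition sorted values into k contiguous classes of minimal total spread.

    Returns (classes, total_spread).
    """
    xs = sorted(values)
    n = len(xs)

    # Prefix sums of x and x^2 give the spread of any run xs[i:j] in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def spread(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: least total spread when xs[:j] is split into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):       # last class is xs[i:j]
                candidate = cost[c - 1][i] + spread(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    split[c][j] = i

    # Walk back through the recorded split points to recover the classes.
    classes = []
    j = n
    for c in range(k, 0, -1):
        i = split[c][j]
        classes.append(xs[i:j])
        j = i
    classes.reverse()
    return classes, cost[k][n]

data = [1, 2, 3, 10, 11, 12, 20, 21, 22]   # hypothetical, three clear clusters
classes, total = natural_breaks(data, 3)
print(classes, total)
```

An iterative method such as K-means starts from a guess and improves it step by step, so it can stop at a partition that is good but not the best; the exhaustive search above always returns a minimum, at the cost of more computation.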

There are many other types of classification, but (for data sets with no “ties”) all can be reduced to equal intervals after applying a preliminary transformation of the data, Y = f(X).   For example, dividing Y = log(X) into equal intervals is equivalent to dividing the range of the X’s into intervals of equal ratios, such as 1-2, 2-4, 4-8, and so on.  Therefore classification methods for numerical data are usually determined by their effects on the resulting map rather than by some statistical method based purely on the data.
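
The logarithm example can be verified with a few lines of Python: take equal intervals of Y = log(X) and transform the endpoints back to the X scale.  The function name is hypothetical and the sketch assumes all values are positive.

```python
import math

def equal_ratio_breaks(lo, hi, k):
    """Equal intervals of log(X) between lo and hi, mapped back to X."""
    y_lo, y_hi = math.log(lo), math.log(hi)
    step = (y_hi - y_lo) / k
    return [math.exp(y_lo + i * step) for i in range(k + 1)]

breaks = equal_ratio_breaks(1, 8, 3)
print([round(b, 6) for b in breaks])   # endpoints on the original X scale

# Each endpoint is the same multiple of the one before it.
ratios = [b / a for a, b in zip(breaks, breaks[1:])]
print([round(r, 6) for r in ratios])
```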

The classification method is an easy target for criticism when you make a map.  Therefore, always have a reason for selecting your method.  It is not sufficient to say, “the computer chose it.”   Who is in charge, you or the machine?

Laboratory Exercises

Experiment with the data provided in GTKAV chapter 9 (counties.shp).   Apply all five classification methods and vary the numbers of classes.   Each team (computer) will select one of the several dozen numerical attributes to study.

  1. Create a histogram of the data, by hand or using software on your computer (such as Excel or a statistics package).   Use this as a frame of reference for evaluating the maps you create.   Create at least two versions of this histogram by substantially varying the bin size (or, equivalently, by varying the number of histogram bars).   Using these histograms alone, predict your answers to the next three questions.  Then proceed to answer those questions through experimentation with ArcView’s Legend Editor.  Check your answers against your predictions.
  2. Which classification method works best for singling out the highest value in the data?  The lowest?
  3. Are there evident “gaps” in your histogram?   If so, which classification method(s) help reveal them on the map?   Which classification method(s) hide the gaps?
  4. Is there any obvious spatial trend to your data?   If so, which classification methods help reveal it?   Which ones obscure it?
  5. Create a “natural breaks” classification using two or more classes.   Look closely at the map.   Now reverse the order of the classes in the legend (one of the Legend Editor buttons will do this, for example).   What changes?   Can you discover why?
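
If no spreadsheet or statistics package is at hand for exercise 1, a text histogram is enough.  The following sketch counts values in equal-width bins and prints one row of asterisks per bin; the data are made up, the function names are hypothetical, and the sketch assumes the values are not all identical.

```python
def histogram_counts(values, bins):
    """Count the values falling in each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in values:
        # The largest value belongs to the last bin.
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts

def print_histogram(values, bins):
    """Print one row per bin: its range followed by a bar of asterisks."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    for i, count in enumerate(histogram_counts(values, bins)):
        left = lo + i * width
        print(f"{left:6.1f} to {left + width:6.1f} | {'*' * count}")

data = [1, 2, 2, 3, 3, 3, 4, 10, 11, 12, 12, 13]   # hypothetical attribute values

print_histogram(data, 3)     # coarse bins
print()
print_histogram(data, 12)    # fine bins
```

Comparing the coarse and fine versions shows how much the apparent shape of the data, including any gaps, depends on the bin size.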

This page was last updated 11 March 2004. It was reformatted.  The "Summary of principles" section was added.  Minor editorial changes were made to clarify portions of the text.

This page was updated 20 October 2002 to expand on the Natural Breaks description.  Thanks to Daniel Karnes of Dartmouth College for pointing out the need for improvement.