Chapter 15: Tables (part 3)

Once upon a time the meat and potatoes of database programming consisted of data summaries.  At Time-Life, where I worked as a database programmer one summer long ago, marketing and circulation managers spent their days poring over thick stacks of database reports.  Each report detailed sales and circulation of hundreds of magazines, broken down by region, salesman, and so on.  The summaries listed total sales, percentages of magazines returned from grocery store racks, and so on.  These kinds of operations consumed the majority of all mainframe resources through the 1980's, until workers obtained direct access to the data through terminals or PCs.

Exercises 15c and 14c--Summarizing Tables

The idea is simple: you group records in a table according to the values of a field (such as magazine) and compute statistics on the records in each group.  The resulting "report" is really just another table having one record for each unique field value (one record per magazine for example) and fields showing the summary statistics.

If you conceive of a "statistic" as being the result of any computation performed on a group of records, you can do some powerful things.  In addition to the usual things such as averages, minima, maxima, and so on, ArcView also offers one geographic summary: the "merge" operation to "summarize" shapes.

From this point of view, Exercise 14c ("merging features") is just a special instance of Exercise 15c ("summarizing a table").  Indeed, the interface is exactly the same.  The only difference is that ArcView must save the result of a spatial merge in a shapefile (so it can store the shapes) instead of a data file.

Exercise 15b--Computing statistics

ArcView's "field statistics" option differs from summaries in several important ways, some of which are evident and some of which are hidden:

It produces text in a "report" box instead of producing a tabular result.  The only way to save the results is to copy this text to the clipboard (select it with the mouse and press ctrl-C) and paste it into a savable document somewhere.
It summarizes the selected records.  It does not break the summary down by field values.
It is executed with an Avenue script.  The script explicitly loops over each selected record.  This is much slower than the built-in summarize operation.

Things to watch out for

Creating an extensive summary with many statistics takes some effort at the interface and may consume some computing time as well.  It's easy to overlook a crucial statistic or to mis-specify one.  Carefully review your summary specifications before pressing the OK button.
Take the time to specify where the summary file will go and what it will be named.  Summaries are often worth saving and should not just automatically get dumped into a temporary directory.
Like most other ArcView operations, if there is a selection on the Table then the summary will be computed only for the selected records.  With summaries, though, you usually want every record summarized.  So get in the habit of clearing the selection before summarizing.
The spatial merge of a point theme produces a multipoint theme.  Its features are indivisible groups of points.  Each group shares a common symbol in the legend.  This can be useful, but note that ArcView does not provide tools for editing multipoints (see GTKAV Chapters 23 and 24 for more on editing shapes).
The spatial merge of a theme of adjacent or overlapping polygons produces features that look simple but internally are very complex.  The problem is that where two merged polygons share a common arc along their boundaries, the arc will disappear--but it's still in the shapefile, twice: once for each original polygon.  Merging does nothing to simplify the internal representation of regions.
With a spatial merge it is possible to create polygons or polylines that have more than one connected component.  These are to polygons (or polylines) as multipoints are to points.  Using ArcView's editing tools you cannot separate the components.  There are scripts available (see ESRI's ArcScripts pages) to "explode" such features into their component parts while retaining all attributes.
The "standard deviations" and "variances" reported by the Field|Statistics and Field|Summarize menu items are not the usual estimators.  The variances are mean squares of deviations from the mean; the standard deviations are their square roots.  To convert a variance to the usual estimator, multiply it by N/(N-1) where N is the count.  To convert a standard deviation to the usual estimator multiply it by the square root of N/(N-1).

Tips and tricks

People get confused about merges.  It is helpful to visualize what the result will be.  What field in the result will uniquely identify each record?  [Magazine]?  [Township]?  [Shape]?  That is the field to activate before beginning the summary.
A quick way to obtain a list of all distinct values of a field is to activate the field and press the summary button.  You do not need to specify any statistics.  The resulting summary will have one record for each field value and a count of the number of occurrences of that field.  This is a useful technique for identifying unwanted duplicate records.
The ArcView summarize operation provides only one level of summary.  What about nested summaries, such as statistics of magazine circulation broken down by year within region?  To do this, create a new field by concatenating two or more fields.  Summarize on the new field.  Include in your statistics the first instances of each of the component fields so that they get carried over as fields in the summary.  You can then repeat the process on the summary to get nested statistics.
You can view and even modify the field statistics script.  Open a new Script Editor document (see GTKAV Chapter 27) and load the "Table.Statistics" system script using the Load system script button .  Name this script editor "Table.Statistics" if you want any changes you make to it to override the system script.  Take a quick look at this script just to appreciate how much extra work it takes to code a loop over selected records in a table.  The Field Calculator is so much easier to use (and, with some cleverness, could actually be used to compute field statistics).