Chapter 16: Joins and Links

You use joins and links in distinctly different ways.  Joins, along with the controls Table documents give you for selectively hiding and reordering fields and records, provide the fundamental tools for constructing "virtual" tables from separate disk data sources.  Links (and "hot links") are primarily an interactive exploration tool.

How they work

It can help to know something about how the software implements joins and links.  Consider the situation of Exercise 16a, which joins data in income.dbf to the attributes of the [California Counties] theme.

Consider how you would create this join using pencil and paper:

For each California county ([Name] on the left side) you would

Look up the record in income.dbf having a matching [County] value, and then

Copy the [Per capita income, USD] value

end.

The result would look like this:

Record Name Matching record (on right) Income
#1 Siskiyou #2 (result of search) 11610 (copied)
#2 Del Norte #1 (result of search) 10625 (copied)
#3 Modoc #4 (result of search) 10971 (copied)
... ... ... ...

Analysis of the algorithms

Looping ("for each") and copying are fast operations.  Searching ("look up") might not be.  Generally, records in a table have no particular order--they appear in the order in which they were entered into the table.  This is usually not useful for a search; you simply have to read the table record-by-record to find the matching value.  On average you would scan half way through the table to find a match.  For example, if the income table had 100,000 records (not an unusual number), then on average you would have to read 50,000 records to find the match for each destination (county) record.

More generally, to perform the pencil-and-paper join for K destination records (such as counties) and N source records (such as income) would require reading K*N/2 destination records on average.  That factor of 1/2 is not terribly relevant--we have already absorbed some constants of proportionality in the K*N expression by assuming it takes a constant amount of time to read a record, for example--but the K*N terms are important.  They show how the algorithm "scales" with changes in the sizes of the data tables.  We say that this is a O(K*N) (read "big-Oh of K times N") algorithm.  This terminology means "up to a factor that can be taken to be constant, the time required is directly proportional to K*N."  (For a very readable discussion of big-O analysis of algorithms, consult Bentley, Jon, Programming Pearls.  1986, Addison-Wesley, MA (paperback).)

Database software does much better than this and so can you.  Dr. W. C. Minor, the nineteenth century physician institutionalized for insanity and murder, was one of the ablest and most prolific contributors to the Oxford English Dictionary (OED).  His technique was systematically to read works of early English literature and to index (that is, cross-reference) and alphabetically sort occurrences of interesting words.  When the OED editors submitted lists of words to him, Minor could usually respond with definitions and examples of usage within a day.  In short, several years of diligent reading and indexing paid off handsomely during the following decades.  (See Winchester, S., The Professor and the Madman.  Harper Perennial, 1999.)

To implement a join or link, then, ArcView first indexes the source table (income.dbf).  This involves sorting its keys (the [Name] values in the example) and cross-referencing the ordered keys with their record numbers.  With the keys sorted, ArcView can then search the index using the dictionary algorithm.  This is what you would do to find a word in a dictionary; namely, guess at the location, compare the word to the first word on the guessed page, and narrow the search according to whether the desired word occurs before the guessed page or after.

On average, each new guess in the dictionary algorithm (technically known as "binary search") will halve the number of words remaining to search.  You can therefore search a table having N=1 words with no guesses, N=2 with one guess (at most), N=4 with two guesses, and generally, N = 2^M with at most M guesses.  An equivalent way of writing this is that M = log2(N) approximately.  Whence, searching a list of N sorted keys is a O(lg(N)) algorithm (where we write lg(N) := log2(N)).

Consequently, K searches of an indexed table of N records will require O(K*lg(N)) time.  Suppose your computer takes one second to join a source table of N=1 records to a destination table of K = 10,000 records.  Suppose further that the extra work in managing a binary search causes binary searching to take ten times longer to process each source record (after all, we have to access an index table and execute the additional code needed to manage the binary search).  To make the comparison simple, let's consider a join involving K=10,000 destination records and N source records.  Here are the expected times for the two approaches: the original scanning search and the indexed search.

  N (number of source records) Scan time (hours:minutes:seconds) Indexed search time
  10 00:00:10 00:33
  100 00:01:40 01:06
  1000 00:16:40 01:40
  10,000 02:46:40 02:13
  100,000 27:46:40 02:46
  1,000,000 over 11 days 03:19

The scan time (linear search) is directly proportional to N while the indexed search time (binary search) is proportional only to lg(N).  The message is clear: if we can create the index for the source table in reasonable time, then for almost any number of source records using the index will be tremendously faster than not using it.  Moreover, joining simply is not practicable with large files unless an index is used.

There are many ways to create an index; they are all tantamount to sorting the keys.  There are many ways to sort N keys in O(N*lg(N)) time.  (See Knuth, D., The Art of Computer Programming, Volume 3: Sorting and Searching.  Second Edition, 1998, Addison-Wesley, MA.  IBSN 0-201-89685-0.)  This is fast enough to make creating an index the best way to go in almost any situation.

Things to Watch Out For

Almost all string comparisons in ArcView are not case sensitive.  For example, ArcView will usually report that "Penn State" is equal to "PENN STATE".  However, the one place this is not true is in joins (but not links!).  "SISKIYOU", for example, will not match "Siskiyou" in a join (try it--edit the California counties attribute table and redo the join with income.dbf).
ArcView joins and links must match fields of the same type (it will not automatically convert types for you).  However, the actual field names are not important.
You can join many sources into one destination.  However, you cannot use a joined table as the source of another join.

We may illustrate this by showing each table as a graphic, each join as a crooked arrow, and an intended join as a thick straight arrow.  Tables constructed from joins are shown with dashed boxes; base tables are shown as shaded document icons.

Not allowed

Allowed

One way to execute the allowed configuration is

  1. Join Base 2 to Base 1
  2. Join Source 1 to Base 1
  3. Join Source 2 to Base 1 (which now contains the needed matching fields from Base 2).

Note that in every case the join sources (first tables listed) are "base" (unjoined) tables.  Table 2 in the disallowed configuration contains a join--it is not a base table.

Joins can create tables with duplicate field names or aliases.  For example, if "X" is a field name in both Base 1 and Source 1 above, then the resulting join will have two [X] fields.  Usually you solve this problem by re-aliasing one (or both) of the resulting fields.  Unfortunately, some ArcView operations (I have not systematically studied which) use field names rather than aliases, in which case you are out of luck: only the first (leftmost) field with a given name will be accessible; the other will appear not even to exist.
It is useful to alias fields before you start creating joins.  ArcView copies field aliases from a source table into the resulting joined table, thereby saving you some work--if you created the aliases before doing the join.
Joins (and links) do take some time to execute.  If you create a project using many joins (this is common), it will take longer for your project to open.  Some people report projects taking 30 minutes or more to open--they have hundreds of joins involving large tables.
ArcView re-executes all joins in a table every time you save any edits.  It does not use an efficient algorithm that redoes the lookups for the changed records: it just redoes the entire join.  This can take a lot of time if many tables are involved or if the base table is large.  Therefore editing a table with joins can be cumbersome.
Links slow down selections: every time you make a selection in the destination table, ArcView has to access the source index(es) and make a corresponding selection.  This can be annoying when performing an extended data analysis.
ArcView stores joins and links (in its project file) by referencing the file and field names involved.  Naturally, if you change the name of a field in a file using some other software--or even using ArcView from another project--then the project file will contain erroneous information.  ArcView may then refuse to open the project file at all.  Editing data can be dangerous!
Joins will work even when the source has multiple records matching a given destination key.  Apparently, ArcView will join the first source record relative to the physical record order in the source file, but you should not rely on this behavior.
Only base fields can be made editable.  A base field is a field in the base table (original destination) of a join.  All other fields (the source, or "foreign," fields) cannot be edited in a table.  (They can be edited from their own tables, though.)
You cannot remove joined sources from a table one by one.  It's an all-or-nothing proposition, using the Table|Remove All Joins menu item.  Therefore you should plan and execute complex joins carefully--if you make a mistake, you have to start over from the beginning.
There appear to be no artificial limits on the number of source tables that can be joined to a base table.  I have created joins involving over 100 tables and 3000 fields without problem.