# A Beginner’s Introduction to Data Algebra

The purpose of this section is to provide a brief introduction to data algebra. In particular, we would like to motivate the various definitions and terms in a way that is not daunting to beginners and that shows the usefulness of the data algebra framework. This introduction makes use of set notation; a short primer is available at Set Notation if needed.

We will do this by taking a dataset and analyzing it using data algebra. Consider the following table of information:

| year | providedScientificName | ITISscientificName | ITIScommonName | ITIStsn | validAcceptedITIStsn | decimalLatitude | decimalLongitude |
|---|---|---|---|---|---|---|---|
| 1970 | Micrurus tener Baird & Girard, 1853 | Micrurus tener | Texas Coralsnake | 683040 | 683040 | 30.43099976 | -98.05999756 |
| 2008 | Masticophis taeniatus Hallowell, 1852 | Masticophis taeniatus | Culebra-chirriadora adornada;Striped Whipsnake | 174240 | 174240 | 30.43399048 | -97.96090698 |
| 1951 | Kinosternon flavescens Agassiz, 1857 | Kinosternon flavescens | Tortuga-pecho quebrado amarilla;Yellow Mud Turtle | 173766 | 173766 | 30.43678093 | -97.66889191 |
| 1951 | Acris crepitans Baird, 1854 | Acris crepitans | Northern Cricket Frog;Rana-grillo norte | 173520 | 173520 | 30.43678093 | -97.66889191 |
| 2011 | Parus bicolor Linnaeus, 1766 | Baeolophus bicolor | Carbonero cresta negra;Tufted Titmouse | 178738 | 554138 | 30.4736805 | -97.96916962 |

Table Sightings is taken from BISON (Biodiversity Information Serving Our Nation) [1], a publicly available dataset from the US Geological Survey, and consists of animal sightings in Travis County, TX.

## Couplets: The Basic Pieces of Data

Every data entry, or datum, is represented in data algebra using couplets. The idea is that any piece of data consists of two pieces of information. For example, in table Sightings, 2011 refers to a year, and 30.4736805 is a decimalLatitude. In data algebra notation we write these as

\[year{\mapsto}2011\]

and

\[decimalLatitude{\mapsto}30.4736805\]

The object before the arrow is called the left component, and the object after the arrow is called the right component. For example, the couplet \(year{\mapsto}2011\) has left component \(year\) and right component \(2011\).
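As a rough illustration of the left/right structure, a couplet can be modelled in plain Python as a two-field record. This is only a sketch: the `Couplet` class and its field names are hypothetical stand-ins chosen for this example, not part of any data algebra library.

```python
from typing import NamedTuple


class Couplet(NamedTuple):
    """Hypothetical stand-in for a couplet: left component -> right component."""
    left: object
    right: object


# The two couplets taken from the last row of the Sightings table.
year = Couplet(left="year", right=2011)
latitude = Couplet(left="decimalLatitude", right=30.4736805)

print(year.left, year.right)          # year 2011
print(latitude.left, latitude.right)  # decimalLatitude 30.4736805
```

A named tuple is used here because couplets are immutable and hashable, which lets them be placed in sets later on.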

## Relations: Sets of Couplets

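Each row of Sightings is a set of couplets, which we call a relation. For example, the first row of Sightings, let us denote it by \(R_1\), is the relation

\[\begin{aligned}
R_1 = \{ & year{\mapsto}1970, \\
& providedScientificName{\mapsto}Micrurus\ tener\ Baird\ \&\ Girard,\ 1853, \\
& ITISscientificName{\mapsto}Micrurus\ tener, \\
& ITIScommonName{\mapsto}Texas\ Coralsnake, \\
& ITIStsn{\mapsto}683040, \\
& validAcceptedITIStsn{\mapsto}683040, \\
& decimalLatitude{\mapsto}30.43099976, \\
& decimalLongitude{\mapsto}-98.05999756 \}
\end{aligned}\]

While there are other ways of forming relations from the table, for our purposes we will use rows to form relations. One reason is that each row relation here is in fact a function: no left component appears in more than one couplet, so each left component determines exactly one right component. Row relations are often functional in this way.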
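The claim that a row relation is a function can be checked mechanically. The sketch below assumes a relation is stored as a `frozenset` of `(left, right)` tuples; the helper name `is_functional` is made up for this example and is not taken from any library.

```python
def is_functional(relation):
    """Return True if no left component is paired with two different right components."""
    lefts = [left for left, _ in relation]
    # A set holds each couplet once, so a repeated left component
    # means the same left is paired with two different rights.
    return len(lefts) == len(set(lefts))


# First row of the Sightings table as a set of (left, right) couplets.
R1 = frozenset({
    ("year", 1970),
    ("providedScientificName", "Micrurus tener Baird & Girard, 1853"),
    ("ITISscientificName", "Micrurus tener"),
    ("ITIScommonName", "Texas Coralsnake"),
    ("ITIStsn", 683040),
    ("validAcceptedITIStsn", 683040),
    ("decimalLatitude", 30.43099976),
    ("decimalLongitude", -98.05999756),
})

# A relation that is not a function: "year" has two right components.
not_a_function = frozenset({("year", 1970), ("year", 2008)})

print(is_functional(R1))              # True
print(is_functional(not_a_function))  # False
```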

### Getting Data from a Relation

One of our primary methods of extracting information from a dataset is composition. Let us say we want to know the \(ITIScommonName\) of the first row of Sightings. What we can do is compose \(R_1\) with the relation

\[\{ITIScommonName{\mapsto}ITIScommonName\}\]

Just like function composition, the output of the first relation becomes the input for the next relation. In this case, our first relation has only one output, or right component, which corresponds to only one input, or left component, in \(R_1\), hence

\[R_1 \circ \{ITIScommonName{\mapsto}ITIScommonName\} = \{ITIScommonName{\mapsto}Texas\ Coralsnake\}\]

which tells us that \(ITIScommonName\) for the first row is \(Texas\ Coralsnake\). (Note that, just like with functions, compositions are evaluated from right to left. In particular, given the relation composition \(r_2 \circ r_1\), we would apply \(r_1\) first, and then \(r_2\).)
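The composition described above can be sketched in a few lines of plain Python. The representation (a relation as a `frozenset` of `(left, right)` tuples) and the function name `compose` are assumptions made for illustration, and only three of the first row's couplets are included to keep the example short.

```python
def compose(r2, r1):
    """Return r2 composed with r1: apply r1 first, then r2.

    A couplet (a, b) in r1 and a couplet (b, c) in r2 combine into (a, c);
    pairs whose middle components differ contribute nothing.
    """
    return frozenset(
        (left1, right2)
        for left1, right1 in r1
        for left2, right2 in r2
        if right1 == left2
    )


# Part of the first row of Sightings.
R1 = frozenset({
    ("year", 1970),
    ("ITISscientificName", "Micrurus tener"),
    ("ITIScommonName", "Texas Coralsnake"),
})

# The one-couplet relation that selects the common name.
selector = frozenset({("ITIScommonName", "ITIScommonName")})

result = compose(R1, selector)
print(result)  # frozenset({('ITIScommonName', 'Texas Coralsnake')})
```

The argument order of `compose(r2, r1)` mirrors the written form \(r_2 \circ r_1\), so the relation applied first appears on the right.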

## Clans: Sets of Relations

A set of relations is called a clan. In particular, any table can be divided up into a set of row relations, which means any table can be represented by a clan. We will refer to the table Sightings as a clan whose relations are the row relations. Once again, we can use composition to extract data out of our clan.

### Getting Data from a Clan

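For example, if we want the projection (in terms of relational algebra) of Sightings over \(ITISscientificName\) and \(ITIScommonName\), we can form the relation

\[D = \{ITISscientificName{\mapsto}ITISscientificName,\ ITIScommonName{\mapsto}ITIScommonName\}\]

Let us use \(\mathbb{S}\) to denote the Sightings clan. If we use \(R_k\) to denote the \(k\)th row of Sightings, then

\[\mathbb{S} = \{R_1, R_2, R_3, R_4, R_5\}\]

Note that

\[\mathbb{S} \circ D = \{R_1 \circ D,\ R_2 \circ D,\ R_3 \circ D,\ R_4 \circ D,\ R_5 \circ D\}\]

and \(R_k \circ D\) gives the \(ITISscientificName\) and \(ITIScommonName\) in the \(k\)th row of Sightings. In particular we have

\[\begin{aligned}
\mathbb{S} \circ D = \{ & \{ITISscientificName{\mapsto}Micrurus\ tener,\ ITIScommonName{\mapsto}Texas\ Coralsnake\}, \\
& \{ITISscientificName{\mapsto}Masticophis\ taeniatus,\ ITIScommonName{\mapsto}Culebra\text{-}chirriadora\ adornada;Striped\ Whipsnake\}, \\
& \{ITISscientificName{\mapsto}Kinosternon\ flavescens,\ ITIScommonName{\mapsto}Tortuga\text{-}pecho\ quebrado\ amarilla;Yellow\ Mud\ Turtle\}, \\
& \{ITISscientificName{\mapsto}Acris\ crepitans,\ ITIScommonName{\mapsto}Northern\ Cricket\ Frog;Rana\text{-}grillo\ norte\}, \\
& \{ITISscientificName{\mapsto}Baeolophus\ bicolor,\ ITIScommonName{\mapsto}Carbonero\ cresta\ negra;Tufted\ Titmouse\} \}
\end{aligned}\]

which is the projection of Sightings onto \(ITISscientificName\) and \(ITIScommonName\), as we wanted.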
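The projection can be reproduced with a small plain-Python sketch. It assumes a relation is a `frozenset` of `(left, right)` tuples and a clan is a `frozenset` of relations; the names `compose`, `compose_clan`, and `row` are hypothetical helpers for this example, and each row carries only three of the eight columns for brevity.

```python
def compose(r2, r1):
    """Return r2 composed with r1 (apply r1 first, then r2)."""
    return frozenset(
        (left1, right2)
        for left1, right1 in r1
        for left2, right2 in r2
        if right1 == left2
    )


def compose_clan(clan, relation):
    """Compose every relation in the clan with the given relation."""
    return frozenset(compose(r, relation) for r in clan)


def row(year, scientific, common):
    """Build a row relation from three of the Sightings columns."""
    return frozenset({
        ("year", year),
        ("ITISscientificName", scientific),
        ("ITIScommonName", common),
    })


# The Sightings clan: one relation per table row.
S = frozenset({
    row(1970, "Micrurus tener", "Texas Coralsnake"),
    row(2008, "Masticophis taeniatus", "Culebra-chirriadora adornada;Striped Whipsnake"),
    row(1951, "Kinosternon flavescens", "Tortuga-pecho quebrado amarilla;Yellow Mud Turtle"),
    row(1951, "Acris crepitans", "Northern Cricket Frog;Rana-grillo norte"),
    row(2011, "Baeolophus bicolor", "Carbonero cresta negra;Tufted Titmouse"),
})

# D keeps exactly the two columns we want to project onto.
D = frozenset({
    ("ITISscientificName", "ITISscientificName"),
    ("ITIScommonName", "ITIScommonName"),
})

projection = compose_clan(S, D)
for relation in projection:
    print(dict(relation))
```

Because the result is itself a set of relations, two rows that agree on both projected columns would collapse into a single relation; that does not happen for the five rows shown here.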

[1] BISON can be accessed at http://bison.usgs.ornl.gov. To obtain the data in the table, click on Texas on the map, then on Travis County; all of the wildlife sightings recorded for Travis County can then be downloaded. The above table is only a small subset of the many sightings in Travis County.