A Beginner’s Introduction to Data Algebra

The purpose of this section is to provide a brief introduction to data algebra. In particular, we would like to motivate the various definitions and terms in a way that is not daunting to beginners and that shows the usefulness of the data algebra framework. This introduction makes use of set notation; a short primer is available at Set Notation if needed.

We will do this by taking a dataset and analyzing it using data algebra. Consider the following table of information:

Sightings
year | providedScientificName | ITISscientificName | ITIScommonName | ITIStsn | validAcceptedITIStsn | decimalLatitude | decimalLongitude
1970 | Micrurus tener Baird & Girard, 1853 | Micrurus tener | Texas Coralsnake | 683040 | 683040 | 30.43099976 | -98.05999756
2008 | Masticophis taeniatus Hallowell, 1852 | Masticophis taeniatus | Culebra-chirriadora adornada;Striped Whipsnake | 174240 | 174240 | 30.43399048 | -97.96090698
1951 | Kinosternon flavescens Agassiz, 1857 | Kinosternon flavescens | Tortuga-pecho quebrado amarilla;Yellow Mud Turtle | 173766 | 173766 | 30.43678093 | -97.66889191
1951 | Acris crepitans Baird, 1854 | Acris crepitans | Northern Cricket Frog;Rana-grillo norte | 173520 | 173520 | 30.43678093 | -97.66889191
2011 | Parus bicolor Linnaeus, 1766 | Baeolophus bicolor | Carbonero cresta negra;Tufted Titmouse | 178738 | 554138 | 30.4736805 | -97.96916962

Table Sightings is taken from BISON (Biodiversity Information Serving Our Nation) [1], a publicly available dataset from the US Geological Survey, and consists of animal sightings in Travis County, TX.

Couplets: The Basic Pieces of Data

Every data entry, or datum, is represented in data algebra using couplets. The idea is that any piece of data consists of two pieces of information. For example, in table Sightings, 2011 refers to a year, and 30.4736805 is a decimalLatitude. In data algebra notation we write these as

\[year{\mapsto}2011\]

and

\[decimalLatitude{\mapsto}30.4736805\]

The object before the arrow is called the left component, and the object after the arrow is called the right component. For example, the couplet \(year{\mapsto}2011\) has left component \(year\) and right component \(2011\).
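As an illustration only, a couplet can be modeled in plain Python as a two-field tuple. The class name `Couplet` and its field names `left` and `right` are assumptions made for this sketch; they are not taken from any particular library. The values come from the last row of Sightings.

```python
from typing import Any, NamedTuple


class Couplet(NamedTuple):
    """A single datum: a left component paired with a right component."""
    left: Any
    right: Any


# Two couplets built from the last row of the Sightings table.
year = Couplet(left="year", right=2011)
longitude = Couplet(left="decimalLongitude", right=-97.96916962)

print(year.left, year.right)
print(longitude.left, longitude.right)
```

Using an immutable tuple means couplets are hashable, so they can later be collected into sets.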

Relations: Sets of Couplets

Each row of Sightings is a set of couplets, which we call a relation. For example, the first row of Sightings, let us denote it by \(R_1\), is the relation

\[\begin{split}\begin{align*} R_1 =\ & \{ \\ & year{\mapsto}1970, \\ & providedScientificName{\mapsto}Micrurus\ tener\ Baird\ \&\ Girard,\ 1853, \\ & ITISscientificName{\mapsto}Micrurus\ tener, \\ & ITIScommonName{\mapsto}Texas\ Coralsnake, \\ & ITIStsn{\mapsto}683040, \\ & validAcceptedITIStsn{\mapsto}683040, \\ & decimalLatitude{\mapsto}30.43099976, \\ & decimalLongitude{\mapsto}-98.05999756 \\ & \} \end{align*}\end{split}\]

While there are other ways of forming relations from the table, for our purposes we will use rows to form relations. One reason is that each row relation here is in fact a function: no left component (column name) is paired with more than one right component. Row relations are often functional in this sense.
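The following is a minimal sketch of the row relation \(R_1\) and of a check that it is a function. It assumes a relation is modeled as a `frozenset` of `(left, right)` tuples; the helper name `is_functional` is hypothetical and chosen only for this example.

```python
def is_functional(relation):
    """Return True if every left component occurs in exactly one couplet."""
    lefts = [left for left, _ in relation]
    # A set cannot hold the same couplet twice, so a repeated left
    # component means it is paired with two different right components.
    return len(lefts) == len(set(lefts))


# The first row of Sightings as a set of (left, right) couplets.
R1 = frozenset({
    ("year", 1970),
    ("providedScientificName", "Micrurus tener Baird & Girard, 1853"),
    ("ITISscientificName", "Micrurus tener"),
    ("ITIScommonName", "Texas Coralsnake"),
    ("ITIStsn", 683040),
    ("validAcceptedITIStsn", 683040),
    ("decimalLatitude", 30.43099976),
    ("decimalLongitude", -98.05999756),
})

# A relation that is not a function: "year" has two right components.
not_a_function = frozenset({("year", 1970), ("year", 2008)})

print(is_functional(R1))
print(is_functional(not_a_function))
```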

Getting Data from a Relation

One of our primary methods of extracting information from a dataset is composition. Let us say we want to know the \(ITIScommonName\) of the first row of Sightings. What we can do is compose \(R_1\) with the relation

\[\{ ITIScommonName{\mapsto}ITIScommonName \}\]

Just like function composition, the output of the first relation becomes the input for the next relation. Here the relation applied first is \(\{ ITIScommonName{\mapsto}ITIScommonName \}\). It has only one output, or right component, which matches exactly one input, or left component, in \(R_1\), hence

\[R_1 \circ \{ ITIScommonName{\mapsto}ITIScommonName \} = \{ITIScommonName{\mapsto}Texas\ Coralsnake\}\]

which tells us that \(ITIScommonName\) for the first row is \(Texas\ Coralsnake\). (Note that, just like with functions, compositions are evaluated from right to left. In particular, given relation composition \(r_2 \circ r_1\) we would apply \(r_1\) first, and then \(r_2\).)
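The composition above can be sketched in a few lines of Python. This assumes relations are modeled as sets of `(left, right)` tuples; the function name `compose` and its argument order (mirroring \(r_2 \circ r_1\)) are choices made for this example rather than a reference implementation.

```python
def compose(r2, r1):
    """Return r2 after r1: all a->c such that a->b is in r1 and b->c is in r2."""
    return frozenset(
        (a, c)
        for a, b in r1          # r1 is applied first
        for b2, c in r2         # then r2 is applied to r1's outputs
        if b == b2
    )


# The first row of Sightings (abbreviated to three couplets).
R1 = frozenset({
    ("year", 1970),
    ("ITISscientificName", "Micrurus tener"),
    ("ITIScommonName", "Texas Coralsnake"),
})

# The relation that selects the ITIScommonName couplet.
selector = frozenset({("ITIScommonName", "ITIScommonName")})

result = compose(R1, selector)
print(result)
```

Note that the order matters: `compose(selector, R1)` applies \(R_1\) first, and since none of its right components equals `ITIScommonName`, that composition is empty.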

Clans: Sets of Relations

A set of relations is called a clan. In particular, any table can be divided up into a set of row relations, which means any table can be represented by a clan. We will refer to the table Sightings as a clan whose relations are the row relations. Once again, we can use composition to extract data out of our clan.

Getting Data from a Clan

For example, if we want the projection (in terms of relational algebra) of Sightings over \(ITISscientificName\) and \(ITIScommonName\), we can form the relation

\[D = \{ ITISscientificName{\mapsto}ITISscientificName, ITIScommonName{\mapsto}ITIScommonName \}\]

Let us use \(\mathbb{S}\) to denote the Sightings clan. If we use \(R_k\) to denote the \(k\)th row of Sightings, then

\[\mathbb{S} = \{ R_1, R_2, R_3, R_4, R_5 \}\]

Note that

\[\mathbb{S} \circ D = \{ R_1 \circ D, R_2 \circ D, R_3 \circ D, R_4 \circ D, R_5 \circ D \}\]

and \(R_k \circ D\) will give you the \(ITISscientificName\) and \(ITIScommonName\) in the \(k\)th row of Sightings. In particular we have

\[\begin{split}\mathbb{S} \circ D =\ & \{ \\ & \{ ITISscientificName{\mapsto}Micrurus\ tener, ITIScommonName{\mapsto}Texas\ Coralsnake \}, \\ & \{ ITISscientificName{\mapsto}Masticophis\ taeniatus, ITIScommonName{\mapsto}Culebra-chirriadora\ adornada;Striped\ Whipsnake \}, \\ & \{ ITISscientificName{\mapsto}Kinosternon\ flavescens, ITIScommonName{\mapsto}Tortuga-pecho\ quebrado\ amarilla;Yellow\ Mud\ Turtle \}, \\ & \{ ITISscientificName{\mapsto}Acris\ crepitans, ITIScommonName{\mapsto}Northern\ Cricket\ Frog;Rana-grillo\ norte \}, \\ & \{ ITISscientificName{\mapsto}Baeolophus\ bicolor, ITIScommonName{\mapsto}Carbonero\ cresta\ negra;Tufted\ Titmouse \} \\ & \}\end{split}\]

which is the projection of Sightings onto \(ITISscientificName\) and \(ITIScommonName\) as we wanted.
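The projection can be sketched in Python by composing every row relation of the clan with \(D\), as in the formula for \(\mathbb{S} \circ D\). The helper names `compose` and `compose_clan` are hypothetical, and the rows are abbreviated to three columns to keep the example short; the values are taken from the Sightings table.

```python
def compose(r2, r1):
    """Return r2 after r1: all a->c such that a->b is in r1 and b->c is in r2."""
    return frozenset((a, c) for a, b in r1 for b2, c in r2 if b == b2)


def compose_clan(clan, relation):
    """Compose each relation in the clan with the given relation."""
    return frozenset(compose(row, relation) for row in clan)


def row(year, scientific, common):
    """Build an abbreviated row relation from three column values."""
    return frozenset({
        ("year", year),
        ("ITISscientificName", scientific),
        ("ITIScommonName", common),
    })


# The Sightings clan: one relation per table row.
S = frozenset({
    row(1970, "Micrurus tener", "Texas Coralsnake"),
    row(2008, "Masticophis taeniatus",
        "Culebra-chirriadora adornada;Striped Whipsnake"),
    row(1951, "Kinosternon flavescens",
        "Tortuga-pecho quebrado amarilla;Yellow Mud Turtle"),
    row(1951, "Acris crepitans", "Northern Cricket Frog;Rana-grillo norte"),
    row(2011, "Baeolophus bicolor", "Carbonero cresta negra;Tufted Titmouse"),
})

# D keeps only the two columns we want to project onto.
D = frozenset({
    ("ITISscientificName", "ITISscientificName"),
    ("ITIScommonName", "ITIScommonName"),
})

projection = compose_clan(S, D)
for relation in projection:
    print(sorted(relation))
```

Because a clan is a set, two rows that agree on every projected column would collapse into a single relation in the result. That does not happen here, since all five rows have distinct scientific names.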

[1] BISON can be accessed at http://bison.usgs.ornl.gov. To obtain the data in the table, click on Texas on the map, then on Travis County; one can then download all of the wildlife sightings recorded for Travis County. The above table is only a small subset of the many sightings in Travis County.