Joins

Today we’ll talk about joins. (Material/slides taken from R4DS2, Chapter 19)

It’s rare that a data analysis involves only a single data frame.

Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.

Two important types of joins:

  • Mutating joins, which add new variables to one data frame from matching observations in another.
  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.

We’ll begin by discussing keys, the variables used to connect a pair of data frames in a join.

Keys

library(tidyverse)
library(nycflights13)

To understand joins, you need to first understand how two tables can be connected through a pair of keys, within each table.

Primary and foreign keys

Every join involves a pair of keys: a primary key and a foreign key. A primary key is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a compound key.

Keys

For example, in nycflights13:

  • airlines records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two letter carrier code, making carrier the primary key.

    airlines
    # A tibble: 16 × 2
       carrier name                       
       <chr>   <chr>                      
     1 9E      Endeavor Air Inc.          
     2 AA      American Airlines Inc.     
     3 AS      Alaska Airlines Inc.       
     4 B6      JetBlue Airways            
    ...
  • airports records data about each airport. You can identify each airport by its three letter airport code, making faa the primary key.

    airports
    # A tibble: 1,458 × 8
       faa   name                    lat    lon   alt    tz dst   tzone
       <chr> <chr>                 <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
     1 04G   Lansdowne Airport      41.1  -80.6  1044    -5 A     Amer…
     2 06A   Moton Field Municipa…  32.5  -85.7   264    -6 A     Amer…
     3 06C   Schaumburg Regional    42.0  -88.1   801    -6 A     Amer…
     4 06N   Randall Airport        41.4  -74.4   523    -5 A     Amer…
    ...

Keys

  • planes records data about each plane. You can identify a plane by its tail number, making tailnum the primary key.

    planes
    # A tibble: 3,322 × 9
       tailnum  year type        manufacturer model engines seats speed
       <chr>   <int> <chr>       <chr>        <chr>   <int> <int> <int>
     1 N10156   2004 Fixed wing… EMBRAER      EMB-…       2    55    NA
     2 N102UW   1998 Fixed wing… AIRBUS INDU… A320…       2   182    NA
     3 N103US   1999 Fixed wing… AIRBUS INDU… A320…       2   182    NA
    ...
  • weather records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making origin and time_hour the compound primary key.

    weather
    # A tibble: 26,115 × 15
       origin  year month   day  hour  temp  dewp humid wind_dir
       <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>
     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
    ...

Keys

A foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example:

  • flights$tailnum is a foreign key that corresponds to the primary key planes$tailnum.
  • flights$carrier is a foreign key that corresponds to the primary key airlines$carrier.
  • flights$origin is a foreign key that corresponds to the primary key airports$faa.
  • flights$dest is a foreign key that corresponds to the primary key airports$faa.
  • flights$origin-flights$time_hour is a compound foreign key that corresponds to the compound primary key weather$origin-weather$time_hour.
The relationships between airports, planes, flights, weather, and
airlines datasets from the nycflights13 package. airports$faa
connected to the flights$origin and flights$dest. planes$tailnum
is connected to the flights$tailnum. weather$time_hour and
weather$origin are jointly connected to flights$time_hour and
flights$origin. airlines$carrier is connected to flights$carrier.
There are no direct connections between airports, planes, airlines,
and weather data frames.

Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.

Checking primary keys

It’s good practice to verify that primary keys indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one. This reveals that planes and weather both look good:

planes |> 
  count(tailnum) |> 
  filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: tailnum <chr>, n <int>
weather |> 
  count(time_hour, origin) |> 
  filter(n > 1)
# A tibble: 0 × 3
# ℹ 3 variables: time_hour <dttm>, origin <chr>, n <int>

You should also check for missing values in your primary keys — if a value is missing then it can’t identify an observation!

planes |> 
  filter(is.na(tailnum))
# A tibble: 0 × 9
# ℹ 9 variables: tailnum <chr>, year <int>, type <chr>, manufacturer <chr>,
#   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
weather |> 
  filter(is.na(time_hour) | is.na(origin))
# A tibble: 0 × 15
# ℹ 15 variables: origin <chr>, year <int>, month <int>, day <int>, hour <int>,
#   temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, wind_speed <dbl>,
#   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>

Exercises

  1. We forgot to draw the relationship between weather and airports in @fig-flights-relationships. What is the relationship and how should it appear in the diagram?

  2. weather only contains information for the three origin airports in NYC. If it contained weather records for all airports in the USA, what additional connection would it make to flights?

  3. The year, month, day, hour, and origin variables almost form a compound key for weather, but there’s one hour that has duplicate observations. Can you figure out what’s special about that hour?

  4. We know that some days of the year are special and fewer people than usual fly on them (e.g., Christmas eve and Christmas day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?

  5. Draw a diagram illustrating the connections between the Batting, People, and Salaries data frames in the Lahman package. Draw another diagram that shows the relationship between People, Managers, AwardsManagers. How would you characterize the relationship between the Batting, Pitching, and Fielding data frames?

Mutating joins

  • A mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other.

  • Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones.

  • Let’s focus on restricted flights dataset

flights2 <- flights |> 
  select(year, time_hour, origin, dest, tailnum, carrier)
flights2
# A tibble: 336,776 × 6
    year time_hour           origin dest  tailnum carrier
   <int> <dttm>              <chr>  <chr> <chr>   <chr>  
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA     
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA     
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA     
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6     
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL     
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA     
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6     
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV     
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6     
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA     
# ℹ 336,766 more rows

Mutating joins

  • Four types of mutating join, but one you’ll use almost all of the time: left_join().
  • Output will always have the same rows as x, the data frame you’re joining to
  • The primary use of left_join() is to add in additional metadata.
  • For example, we can use left_join() to add the full airline name to the flights2 data:
flights2 |>
  left_join(airlines)
Joining with `by = join_by(carrier)`
# A tibble: 336,776 × 7
    year time_hour           origin dest  tailnum carrier name                  
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>                 
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines Inc. 
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines Inc. 
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines Inc.
...

More examples of left_join()’s

Temperature and wind speed when each plane departed:

flights2 |> 
  left_join(weather |> select(origin, time_hour, temp, wind_speed))
Joining with `by = join_by(time_hour, origin)`
# A tibble: 336,776 × 8
    year time_hour           origin dest  tailnum carrier  temp wind_speed
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <dbl>      <dbl>
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA       39.0       12.7
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA       39.9       15.0
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA       39.0       15.0
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6       39.0       15.0
...

What size of plane was flying:

flights2 |> 
  left_join(planes |> select(tailnum, type, engines, seats))
Joining with `by = join_by(tailnum)`
# A tibble: 336,776 × 9
    year time_hour           origin dest  tailnum carrier type     engines seats
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>      <int> <int>
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed w…       2   149
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed w…       2   149
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed w…       2   178
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed w…       2   200
...

left_join() when rows are missing

When left_join() fails to find a match for a row in x, it fills in the new variables with missing values.

For example, there’s no information about the plane with tail number N3ALAA so the type, engines, and seats will be missing:

flights2 |> 
  filter(tailnum == "N3ALAA") |> 
  left_join(planes |> select(tailnum, type, engines, seats))
Joining with `by = join_by(tailnum)`
# A tibble: 63 × 9
    year time_hour           origin dest  tailnum carrier type  engines seats
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>   <int> <int>
 1  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 2  2013 2013-01-02 18:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 3  2013 2013-01-03 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 4  2013 2013-01-07 19:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 5  2013 2013-01-08 17:00:00 JFK    ORD   N3ALAA  AA      <NA>       NA    NA
 6  2013 2013-01-16 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 7  2013 2013-01-20 18:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 8  2013 2013-01-22 17:00:00 JFK    ORD   N3ALAA  AA      <NA>       NA    NA
 9  2013 2013-10-11 06:00:00 EWR    MIA   N3ALAA  AA      <NA>       NA    NA
10  2013 2013-10-14 08:00:00 JFK    BOS   N3ALAA  AA      <NA>       NA    NA
# ℹ 53 more rows

We’ll return to this problem later.

Specifying join keys

  • By default, left_join() will use all variables that appear in both data frames as the join key, the so called natural join.
  • This is a useful heuristic, but it doesn’t always work.
  • For example, what happens if we try to join flights2 with the complete planes dataset?
flights2 |> 
  left_join(planes)
Joining with `by = join_by(year, tailnum)`
# A tibble: 336,776 × 13
    year time_hour           origin dest  tailnum carrier type  manufacturer
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>       
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>        
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>        
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>        
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        
...
  • We get a lot of missing matches because our join is trying to use tailnum and year as a compound key.
  • Both flights and planes have a year column but they mean different things: flights$year is the year the flight occurred and planes$year is the year the plane was built.
  • We only want to join on tailnum so we need to provide an explicit specification with join_by():

Specifying join keys

  • We can specify the join keys using join_by():
flights2 |> 
  left_join(planes, join_by(tailnum))
# A tibble: 336,776 × 14
   year.x time_hour           origin dest  tailnum carrier year.y type          
    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>         
 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing mu…
 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing mu…
 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing mu…
 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing mu…
...

Note that the year variables are disambiguated in the output with a suffix (year.x and year.y), which tells you whether the variable came from the x or y argument. You can override the default suffixes with the suffix argument.

Specifying join keys

  • join_by(tailnum) is short for join_by(tailnum == tailnum).
flights2 |> 
  left_join(planes, join_by(tailnum == tailnum))
# A tibble: 336,776 × 14
   year.x time_hour           origin dest  tailnum carrier year.y type          
    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>         
 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing mu…
 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing mu…
 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing mu…
 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing mu…
...
  • More generally:
original_df %>%
  left_join(new_df, join_by(df1_variable == df2_variable))
  • For example: we can join the flight2 and airports table either by dest or origin.
flights2 |> 
  left_join(airports, join_by(dest == faa)) # adds columns for dest carrier name

flights2 |> 
  left_join(airports, join_by(origin == faa)) # adds columns for origin carrier name

Filtering joins

  • Primary action of a filtering join is to filter the rows.
  • Two types: semi-joins and anti-joins.
  • Semi-joins keep all rows in x that have a match in y.
  • For example, semi-join to filter the airports dataset to show just the origin airports:
airports |> 
  semi_join(flights2, join_by(faa == origin))
# A tibble: 3 × 8
  faa   name                  lat   lon   alt    tz dst   tzone           
  <chr> <chr>               <dbl> <dbl> <dbl> <dbl> <chr> <chr>           
1 EWR   Newark Liberty Intl  40.7 -74.2    18    -5 A     America/New_York
2 JFK   John F Kennedy Intl  40.6 -73.8    13    -5 A     America/New_York
3 LGA   La Guardia           40.8 -73.9    22    -5 A     America/New_York

Anti-joins

  • Anti-joins return all rows in x that don’t have a match in y.
  • Useful for finding missing values that are implicit in the data
    • Implicitly missing values don’t show up as NAs but instead only exist as an absence.
  • For example, we can find rows that are missing from airports by looking for flights that don’t have a matching destination airport:
flights2 |> 
  anti_join(airports, join_by(dest == faa)) |> 
  distinct(dest)
# A tibble: 4 × 1
  dest 
  <chr>
1 BQN  
2 SJU  
3 STT  
4 PSE