class: center, middle, inverse, title-slide .title[ # Introduction to R for Data Analysis ] .subtitle[ ## Appendix: Relational Data ] .author[ ### Johannes Breuer, Stefan Jünger, & Veronika Batzdorfer ] .date[ ### 2022-08-16 ] --- layout: true --- ## Relational data In some cases, you may need to combine different data sets (e.g., if you have seperate data sets from waves of a longitudinal survey study or data on the same subjects from different sources). The simplest case is that the data frames you want to combine either have the same cases/rows in the same order but different variables/columns, or the same columns/variables (with the same names) but different cases/rows. In those cases, you can use the `base R` functions `rbind()` to add/combine rows or `cbind()` to add/combine columns, or, better yet, `bind_rows()` or `bind_cols()` from `dplyr`. --- ## Relational data For all other situations, you can use the `merge()` function from `base R`. However, a more convenient option is using the the so-called *two-table verbs* from `dplyr` which allow you to combine two (or more) tabular data sets in various ways. We will look at two categories of those verbs in the following: 1. Mutating joins 2. Filtering joins If you have ever worked with (or at least seen) `SQL` code, many of the following things will look familiar. For a more in-depth introduction, you can have a look at the [chapter on relational data](https://r4ds.had.co.nz/relational-data.html) in *R for Data Science*. --- ## Superhero data 🦇⚡🕸🧜♂️ As we cannot use the data from the *ALLBUS* 2021 data to illustrate the different joins from `dlpyr`, we will, instead, the [Super Heroes data set from *Kaggle*](https://www.kaggle.com/claudiodavi/superhero-set). .pull-left[ .center[ **Superheroes** <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ **Publishers** <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] --- ## Mutating joins > A mutating join [...] .highlight[first matches observations by their keys], then .highlight[copies across variables from one table to the other] ([R for Data Science](https://r4ds.had.co.nz/relational-data.html#mutating-joins)). - `inner_join()` - `left_join()` - `right_join()` - `full_join()` --- ## Inner join > All rows from `x` where there are matching values in `y`, and all columns from `x` and `y`. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/inner-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Inner join example .pull-left[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> ```r heroes %>% inner_join(publishers, by = "Publisher") ``` <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> </tbody> </table> --- ## Left join > All rows from `x`, and all columns from `x` and `y`. Rows in `x` with no match in `y` will have `NA` values in the new columns. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/left-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Left join example .pull-left[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> .small[ ```r heroes %>% left_join(publishers, by = "Publisher") ``` ] <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> </tbody> </table> --- ## Right join > All rows from y, and all columns from `x` and `y`. Rows in `y` with no match in `x` will have `NA` values in the new columns. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/right-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Right join example .pull-left[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> .small[ ```r heroes %>% right_join(publishers, by = "Publisher") ``` ] <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> --- ## Full join > All rows and all columns from both `x` and `y`. Where there are not matching values, returns `NA` for the one missing. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/full-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Full join example .pull-left[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> .small[ ```r heroes %>% full_join(publishers, by = "Publisher") ``` ] <table class="table" style="font-size: 9px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> --- ## Filtering joins > Filtering joins .highlight[match observations in the same way as mutating joins], but .highlight[affect the observations, not the variables] ([R for Data Science](https://r4ds.had.co.nz/relational-data.html#filtering-joins)). - `semi_join()` - `anti_join()` --- ## Semi join > All rows from `x` where there are matching values in `y`, keeping just columns from `x`. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/semi-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Semi join example .pull-left[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> ```r heroes %>% semi_join(publishers, by = "Publisher") ``` <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> --- ## Anti join > All rows from `x` where there are not matching values in `y`, keeping just columns from `x`. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/anti-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Anti join example .pull-left[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> ```r heroes %>% anti_join(publishers, by = "Publisher") ``` <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> </tbody> </table> --- ## Expanding the joy of joins Even more join operations (including fuzzy joins) are possible with the [`powerjoin` package](https://github.com/moodymudskipper/powerjoin).