Introduction to R for Data Analysis

.title[
# Introduction to R for Data Analysis
]
.subtitle[
## Exploring missingness & outliers
]
.author[
### Johannes Breuer, Stefan Jünger, & Veronika Batzdorfer
]
.date[
### 2022-08-17
]

---

---

## Missings & outliers

Two things that can influence summary statistics as well as univariate and bivariate distributions are missings and outliers. Hence, before we can analyze our data, we should check whether we have clear patterns of missingness or extreme outliers in our data.

---

## Missing data

In the second session on data wrangling we have already discussed how we can define specific values as missings (`NA`) and how we can recode `NA` into something else. As we have seen in that session, the *ALLBUS* 2021 data set contains quite a few codes for different types of missing data. However, when we collect data ourselves or if we (re-)use data sets that are not as well-documented as the *ALLBUS*, we may need to explore potential patterns of missingness to see if there may be identifiable reasons why data for certain variables and/or observations are missing.

---

## Summarizing missingness

The [`naniar` package](http://naniar.njtierney.com/index.html) that we have used for recoding values as `NA` in the second session wrangling also provides a [collection of functions for summarizing missingness](http://naniar.njtierney.com/reference/miss_var_summary.html) in variables or whole data sets, of which we will show three examples in the following.

```r
allbus_2021_eda <- readRDS("./data/allbus_2021_eda.rds")

library(naniar)

allbus_2021_eda %>% 
  select(xeno1:xeno4) %>% 
  miss_var_summary(order = TRUE)
```

```
## # A tibble: 4 x 3
##   variable n_miss pct_miss
##   <chr>     <int>    <dbl>
## 1 xeno2      2078     38.9
## 2 xeno3      2073     38.8
## 3 xeno4      2070     38.7
## 4 xeno1      2065     38.7
```

---

## Summarizing missingness with `naniar`

```r
allbus_2021_eda %>% 
  pct_complete_case()
```

```
## [1] 19.88019
```

---

## Summarizing missingness with `naniar`

```r
allbus_2021_eda %>% 
  miss_case_table()
```

```
## # A tibble: 15 x 3
##    n_miss_in_case n_cases pct_cases
##             <int>   <int>     <dbl>
##  1              0    1062   19.9   
##  2              1    1411   26.4   
##  3              2     451    8.44  
##  4              3     106    1.98  
##  5              4     111    2.08  
##  6              5     125    2.34  
##  7              6      42    0.786 
##  8              7      21    0.393 
##  9              8       6    0.112 
## 10              9      10    0.187 
## 11             10    1384   25.9   
## 12             11     493    9.23  
## 13             12      95    1.78  
## 14             13      22    0.412 
## 15             14       3    0.0562
```

---

## Outliers

There are many ways of identifying and dealing with outliers. Here, we will only look at two methods (one univariate, one multivariate): [Interquartile range (IQR)](https://en.wikipedia.org/wiki/Interquartile_range) and [Mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance).

---

## Outliers

As outlier detection works better with numeric data on an [interval scale](https://en.wikipedia.org/wiki/Level_of_measurement#Interval_scale), we will use data from [*Gapminder*](https://www.gapminder.org/) on life expectancy GDP per capita in 2018 for identifying outliers.

```r
life_exp_2018 <- read_csv("./data/life_expectancy_years.csv") %>%
  pivot_longer(-country,
               names_to = "year",
               values_to = "life_exp") %>% 
  filter(year == "2018") %>% 
  select(-year)

gdp_pc_2018 <- read_csv("./data/gdppercapita_us_inflation_adjusted.csv") %>%
  pivot_longer(-country,
               names_to = "year",
               values_to = "gdp_percap") %>% 
  filter(year == "2018") %>% 
  select(-year)

gapminder_2018 <- life_exp_2018 %>% 
  left_join(gdp_pc_2018, by = "country")
```
]

---

## Detecting outliers: IQR

For identifying univariate outliers we can specify lower and upper cutoffs, e.g., using the formula 25th percentile - 1.5 x interquartile range (IQR) for the lower and 75th percentile + 1.5 x IQR for the upper limit. In `R` this can be done as follows:

```r
q2575 <- quantile(gdp_pc_2018$gdp_percap, 
                  probs=c(.25, .75), 
                  na.rm = TRUE)

iqr <- IQR(gdp_pc_2018$gdp_percap,
           na.rm = TRUE)

ul <-  q2575[2]+1.5*iqr  
ll <- q2575[1]-1.5*iqr

gdp_pc_2018_cut <- gdp_pc_2018 %>%
  filter(gdp_percap <= ul)

nrow(gdp_pc_2018) - nrow(gdp_pc_2018_cut)
```

```
## [1] 31
```
]

.small[
**NB**: The value of 1.5 x IQR is a rule of thumb. You should always check whether this makes sense for your data. In the above example with the *Gapminder* data, e.g., it most likely does not. An alternative approach is using mean + 2 x SD (or 3 x SD), but, again, this is nothing more than a rule of thumb.
]

---

## Detecting outliers: Mahalanobis distance

A common method for identifying multivariate outliers is Mahalanobis distance. While there is a `base R` function called `mahalanobis()` that you can use to calculate Mahalanobis distance, a more convenient option is the `mahalanobis_distance()` function from the [`rstatix` package](https://rpkgs.datanovia.com/rstatix/index.html).

```r
library(rstatix)

md <- gapminder_2018 %>% 
  drop_na() %>% 
  mahalanobis_distance() %>% 
  select(-c(life_exp, gdp_percap))

names(md)

gapminder_2018 %>% 
  drop_na() %>% 
  bind_cols(md) %>% 
  arrange(desc(mahal.dist))
```

---

```
## [1] "mahal.dist" "is.outlier"
```

```
## # A tibble: 176 x 5
##    country                  life_exp gdp_percap mahal.dist is.outlier
##    <chr>                       <dbl>      <dbl>      <dbl> <lgl>     
##  1 Luxembourg                   81.8     111000      31.4  TRUE      
##  2 Norway                       82.5      92100      18.7  TRUE      
##  3 Switzerland                  84.1      79200      11.8  FALSE     
##  4 Central African Republic     52.4        379      11.7  FALSE     
##  5 Ireland                      82.1      76700      11.4  FALSE     
##  6 Lesotho                      55.5       1410       8.32 FALSE     
##  7 Denmark                      80.9      63900       7.06 FALSE     
##  8 Qatar                        80.3      63300       7.03 FALSE     
##  9 Papua New Guinea             58.7       2420       5.41 FALSE     
## 10 Singapore                    85        58200       5.26 FALSE     
## # ... with 166 more rows
```