class: center, middle, inverse, title-slide .title[ # Introduction to R for Data Analysis ] .subtitle[ ## Exploring missingness & outliers ] .author[ ### Johannes Breuer, Stefan Jünger, & Veronika Batzdorfer ] .date[ ### 2022-08-17 ] --- layout: true --- ## Missings & outliers Two things that can influence summary statistics as well as univariate and bivariate distributions are missings and outliers. Hence, before we can analyze our data, we should check whether we have clear patterns of missingness or extreme outliers in our data. --- ## Missing data In the second session on data wrangling we have already discussed how we can define specific values as missings (`NA`) and how we can recode `NA` into something else. As we have seen in that session, the *ALLBUS* 2021 data set contains quite a few codes for different types of missing data. However, when we collect data ourselves or if we (re-)use data sets that are not as well-documented as the *ALLBUS*, we may need to explore potential patterns of missingness to see if there may be identifiable reasons why data for certain variables and/or observations are missing. --- ## Summarizing missingness The [`naniar` package](http://naniar.njtierney.com/index.html) that we have used for recoding values as `NA` in the second session wrangling also provides a [collection of functions for summarizing missingness](http://naniar.njtierney.com/reference/miss_var_summary.html) in variables or whole data sets, of which we will show three examples in the following. ```r allbus_2021_eda <- readRDS("./data/allbus_2021_eda.rds") library(naniar) allbus_2021_eda %>% select(xeno1:xeno4) %>% miss_var_summary(order = TRUE) ``` ``` ## # A tibble: 4 x 3 ## variable n_miss pct_miss ## <chr> <int> <dbl> ## 1 xeno2 2078 38.9 ## 2 xeno3 2073 38.8 ## 3 xeno4 2070 38.7 ## 4 xeno1 2065 38.7 ``` --- ## Summarizing missingness with `naniar` ```r allbus_2021_eda %>% pct_complete_case() ``` ``` ## [1] 19.88019 ``` --- ## Summarizing missingness with `naniar` ```r allbus_2021_eda %>% miss_case_table() ``` ``` ## # A tibble: 15 x 3 ## n_miss_in_case n_cases pct_cases ## <int> <int> <dbl> ## 1 0 1062 19.9 ## 2 1 1411 26.4 ## 3 2 451 8.44 ## 4 3 106 1.98 ## 5 4 111 2.08 ## 6 5 125 2.34 ## 7 6 42 0.786 ## 8 7 21 0.393 ## 9 8 6 0.112 ## 10 9 10 0.187 ## 11 10 1384 25.9 ## 12 11 493 9.23 ## 13 12 95 1.78 ## 14 13 22 0.412 ## 15 14 3 0.0562 ``` --- ## Outliers There are many ways of identifying and dealing with outliers. Here, we will only look at two methods (one univariate, one multivariate): [Interquartile range (IQR)](https://en.wikipedia.org/wiki/Interquartile_range) and [Mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance). --- ## Outliers As outlier detection works better with numeric data on an [interval scale](https://en.wikipedia.org/wiki/Level_of_measurement#Interval_scale), we will use data from [*Gapminder*](https://www.gapminder.org/) on life expectancy GDP per capita in 2018 for identifying outliers. .small[ ```r life_exp_2018 <- read_csv("./data/life_expectancy_years.csv") %>% pivot_longer(-country, names_to = "year", values_to = "life_exp") %>% filter(year == "2018") %>% select(-year) gdp_pc_2018 <- read_csv("./data/gdppercapita_us_inflation_adjusted.csv") %>% pivot_longer(-country, names_to = "year", values_to = "gdp_percap") %>% filter(year == "2018") %>% select(-year) gapminder_2018 <- life_exp_2018 %>% left_join(gdp_pc_2018, by = "country") ``` ] --- ## Detecting outliers: IQR For identifying univariate outliers we can specify lower and upper cutoffs, e.g., using the formula 25th percentile - 1.5 x interquartile range (IQR) for the lower and 75th percentile + 1.5 x IQR for the upper limit. In `R` this can be done as follows: .small[ ```r q2575 <- quantile(gdp_pc_2018$gdp_percap, probs=c(.25, .75), na.rm = TRUE) iqr <- IQR(gdp_pc_2018$gdp_percap, na.rm = TRUE) ul <- q2575[2]+1.5*iqr ll <- q2575[1]-1.5*iqr gdp_pc_2018_cut <- gdp_pc_2018 %>% filter(gdp_percap <= ul) nrow(gdp_pc_2018) - nrow(gdp_pc_2018_cut) ``` ``` ## [1] 31 ``` ] .small[ **NB**: The value of 1.5 x IQR is a rule of thumb. You should always check whether this makes sense for your data. In the above example with the *Gapminder* data, e.g., it most likely does not. An alternative approach is using mean + 2 x SD (or 3 x SD), but, again, this is nothing more than a rule of thumb. ] --- ## Detecting outliers: Mahalanobis distance A common method for identifying multivariate outliers is Mahalanobis distance. While there is a `base R` function called `mahalanobis()` that you can use to calculate Mahalanobis distance, a more convenient option is the `mahalanobis_distance()` function from the [`rstatix` package](https://rpkgs.datanovia.com/rstatix/index.html). ```r library(rstatix) md <- gapminder_2018 %>% drop_na() %>% mahalanobis_distance() %>% select(-c(life_exp, gdp_percap)) names(md) gapminder_2018 %>% drop_na() %>% bind_cols(md) %>% arrange(desc(mahal.dist)) ``` .right[↪️] --- class: middle ``` ## [1] "mahal.dist" "is.outlier" ``` ``` ## # A tibble: 176 x 5 ## country life_exp gdp_percap mahal.dist is.outlier ## <chr> <dbl> <dbl> <dbl> <lgl> ## 1 Luxembourg 81.8 111000 31.4 TRUE ## 2 Norway 82.5 92100 18.7 TRUE ## 3 Switzerland 84.1 79200 11.8 FALSE ## 4 Central African Republic 52.4 379 11.7 FALSE ## 5 Ireland 82.1 76700 11.4 FALSE ## 6 Lesotho 55.5 1410 8.32 FALSE ## 7 Denmark 80.9 63900 7.06 FALSE ## 8 Qatar 80.3 63300 7.03 FALSE ## 9 Papua New Guinea 58.7 2420 5.41 FALSE ## 10 Singapore 85 58200 5.26 FALSE ## # ... with 166 more rows ```