class: center, middle, inverse, title-slide .title[ # Introduction to R for Data Analysis ] .subtitle[ ## Exploratory Data Analysis ] .author[ ### Johannes Breuer, Stefan Jünger, & Veronika Batzdorfer ] .date[ ### 2022-08-17 ] --- layout: true --- ## What is Exploratory Data Analysis (EDA)? After wrangling our data, the next thing we should do is explore them. In practice, of course, these steps are often done iteratively. There is no single definition of what exploratory data analysis is and which steps and methods belong to this category. For example, while descriptive (or summary) statistics are almost always part of EDA, analysis methods like correlation, chi-square tests, or principal component analysis (PCA) can be part of exploratory or confirmatory analyses (which we will cover in the next session). --- ## What is Exploratory Data Analysis (EDA)? For this course, we decided not to separate the analysis sessions into descriptive vs. inferential statistics (in part also because, as we pointed out in the Introduction, this is not a statistics workshop). Instead, the distinction we make between EDA and confirmatory analysis is more akin to that between exploratory and confirmatory research: While the former can be employed to develop hypotheses or research questions, the goal of the latter typically is to test/answer them.<sup>1</sup> .small[ .footnote[ [1] Of course, this does not mean that EDA techniques cannot be used to answer (exploratory) research questions. ] ] --- ## Exploring EDA As stated before, exploratory data analysis can take many shapes and forms. In this session, we will focus on the following: - summary statistics (for numeric variables) - frequencies & proportions (for categorical variables) - cross-tabulations & correlations *Note*: In this session, we will focus on numeric summaries and descriptions of our data. Notably, data visualization is typically also a big part of EDA.
We will cover that in the following session on data visualization with `R`. --- ## Disclaimer: A flood 🌊 of packages 📦 In the previous sessions, we have tried to focus on a small set of packages (mostly from `base R` and the `tidyverse`) and took a deep dive 🥽 exploring their functionalities. By comparison, in this session, we will browse through a broader range of packages. As we have said before, there typically is more than one option (and package) for doing things in `R`. Given the high number of packages that have been developed for EDA, this topic is a very suitable example for demonstrating this. --- ## Choosing packages 📦 for EDA Many of the EDA options that we are going to discuss in this session are quite similar across the different packages. However, there are differences, e.g., regarding the type of output they produce, their ease of use, and their flexibility. In practice, you will probably not need more than 1 to 3 packages for EDA most of the time. It is, however, worth exploring the different options to find packages that (best) suit your needs. Of course, what the best options are strongly depends on your goals and the output you want to get. For example: Do you only want to quickly check the data for yourself, create a report or documentation for internal use, or do you want to produce publication-ready output (tables, plots, etc.)? --- ## Data As using the full data set can become somewhat unwieldy for the examples in this section, we will create/use a subset of the *ALLBUS* 2021 data. More specifically, we will select a subset of variables on the following: - basic demographics (age categories, sex) - political leaning (left-right) - party vote - satisfaction with democracy - xenophobia - contact with immigrants --- ## Repetition: Data wrangling pipeline As a repetition and reminder, in the following, we will quickly go over some of the wrangling steps we discussed before for the variables listed on the previous slide.
Of course, it is possible to do the whole wrangling in one pipe. However, to check if everything worked, it is advisable to break up the pipe into smaller chunks (a nice tool for checking and debugging pipes that also provides an *RStudio* addin is the package [`ViewPipeSteps`](https://github.com/daranzolin/ViewPipeSteps)). Also, splitting up the wrangling pipe steps allows us to show them on the slides. .small[ *Note*: To reduce the number of objects in our working environment, we will overwrite the `allbus_2021_eda` object multiple times in the import and wrangling process. In your actual work, however, you may want to keep some of the intermediate objects (e.g., the version of the dataframe before it is wrangled). Remember that this is easily done by assigning different object names. ] --- ## Wrangling pipeline: Import data, select, & rename variables .small[ ```r library(sjlabelled) library(tidyverse) library(haven) allbus_2021_eda <- read_sav("./data/allbus_2021/ZA5280_v1-0-0.sav") %>% # user-defined missings are converted to NA (per default) remove_all_labels() ``` ```r allbus_2021_eda <- allbus_2021_eda %>% select(agec, sex, left_right = pa01, party_vote = pv01, sat_dem = ps03, xeno1 = ma01b, xeno2 = ma02, xeno3 = ma03, xeno4 = ma04, contact1 = mc01, contact2 = mc02, contact3 = mc03, contact4 = mc04) ``` ] --- ## Wrangling pipeline: Change variable types & recode values .small[ ```r allbus_2021_eda <- allbus_2021_eda %>% mutate(sat_dem = recode(sat_dem, `1` = 6, `2` = 5, `3` = 4, `4` = 3, `5` = 2, `6` = 1), sex = recode_factor(sex, `1` = "Male", `2` = "Female", `3` = "Non-binary"), agec = recode_factor(agec, `1`= "<= 25 years", `2`= "26 to 30 years", `3` = "31 to 35 years", `4` = "36 to 40 years", `5` = "41 to 45 years", `6` = "46 to 50 years", .ordered = TRUE), party_vote = recode_factor(party_vote, `1`= "CDU-CSU", `2`= "SPD", `3` = "FDP", `4` = "Gruene", `6` = "Linke", `42` = "AfD", `90` = "Other party", `91` = "Would not vote")) ``` ] --- ## Wrangling
pipeline: Compute new aggregate variables ```r allbus_2021_eda <- allbus_2021_eda %>% mutate( xenophobia = rowMeans(across( xeno1:xeno4))) %>% mutate(across(contact1:contact4, ~recode( .x, `2` = 0))) %>% mutate(contact = rowSums(across( contact1:contact4))) ``` --- ## Saving the resulting data set As we will use this reduced and wrangled data set in the exercises for this session (and the appendix slides on exploring missingness and outliers), we should save it (as an `.rds` file). ```r saveRDS(allbus_2021_eda, "./data/allbus_2021_eda.rds") ``` We can then (later on) load the data set again with ```r allbus_2021_eda <- readRDS("./data/allbus_2021_eda.rds") ``` --- ## Explore your data: First look 👀 To get a first impression of the data set, we can use some of the functions we discussed in the sessions on *Data Import & Export* and *Data Wrangling Basics*, such as `dim()`, `head()`, or `str()` from `base R`, `glimpse()` from `dplyr`, or `View()` in *RStudio*. While looking at the full data set can give us a general understanding of the data and their format and also show if (and how) we may need to wrangle them (further), it is difficult to make sense of the data just by looking at them. --- ## Summary statistics To make sense of quantitative data we can reduce their information to a few key values. -- .center[ ~ **That's a simple definition of summary statistics** ~] -- As such, we can use summarizing functions of - location (e.g., the mean), - spread (e.g., standard deviation), - the shape of the distribution (e.g., skewness), and - relations between variables (e.g., correlation coefficients) --- ## Summary statistics: `summary()` A quick and easy way to check some summary statistics for your data set is the `base R` function `summary()` which can be applied to individual variables... ```r summary(allbus_2021_eda$left_right) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max.
NA's ## 1.000 4.000 5.000 4.865 6.000 10.000 210 ``` as well as whole data frames: ```r summary(allbus_2021_eda) ``` .right[↪️] --- class: middle .small[ ``` ## agec sex left_right party_vote sat_dem ## <= 25 years : 618 Male :2614 Min. : 1.000 CDU-CSU:1043 Min. :1.000 ## 26 to 30 years:1136 Female :2705 1st Qu.: 4.000 Gruene : 944 1st Qu.:4.000 ## 31 to 35 years:1448 Non-binary: 3 Median : 5.000 SPD : 595 Median :5.000 ## 36 to 40 years:1450 NA's : 20 Mean : 4.865 FDP : 475 Mean :4.392 ## 41 to 45 years: 617 3rd Qu.: 6.000 Linke : 305 3rd Qu.:5.000 ## 46 to 50 years: 32 Max. :10.000 (Other): 664 Max. :6.000 ## NA's : 41 NA's :210 NA's :1316 NA's :1819 ## xeno1 xeno2 xeno3 xeno4 contact1 ## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00 Min. :0.0000 ## 1st Qu.:4.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:0.0000 ## Median :5.000 Median :2.000 Median :2.000 Median :1.00 Median :0.0000 ## Mean :5.114 Mean :2.579 Mean :2.845 Mean :1.86 Mean :0.2845 ## 3rd Qu.:7.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.00 3rd Qu.:1.0000 ## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.00 Max. :1.0000 ## NA's :2065 NA's :2078 NA's :2073 NA's :2070 NA's :2210 ## contact2 contact3 contact4 xenophobia contact ## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :1.000 Min. :0.000 ## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 ## Median :1.000 Median :1.0000 Median :1.0000 Median :2.750 Median :2.000 ## Mean :0.602 Mean :0.5261 Mean :0.6078 Mean :3.097 Mean :1.953 ## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:3.000 ## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :4.000 ## NA's :2264 NA's :2183 NA's :2129 NA's :2105 NA's :2385 ``` ] --- ## Summary statistics with the `datawizard` package 🧙 The [`datawizard` package](https://easystats.github.io/datawizard/) which we introduced as an interesting alternative or addition for data wrangling before also provides a function for summary statistics. 
This function also offers several options for customizing the output. ```r library(datawizard) allbus_2021_eda %>% select(where(is.numeric)) %>% describe_distribution(quartiles = TRUE) ``` .right[↪️] --- class: middle .small[ ``` ## Variable | Mean | SD | IQR | Min | Max | Q1 | Q3 | Skewness | Kurtosis | n | n_Missing ## --------------------------------------------------------------------------------------------- ## left_right | 4.86 | 1.79 | 2 | 1 | 10 | 4 | 6 | 0.151 | 0.099 | 5132 | 210 ## sat_dem | 4.39 | 1.05 | 1 | 1 | 6 | 4 | 5 | -0.906 | 0.947 | 3523 | 1819 ## xeno1 | 5.11 | 1.77 | 3 | 1 | 7 | 4 | 7 | -0.617 | -0.634 | 3277 | 2065 ## xeno2 | 2.58 | 1.83 | 3 | 1 | 7 | 1 | 4 | 1.041 | 0.060 | 3264 | 2078 ## xeno3 | 2.85 | 2.03 | 3 | 1 | 7 | 1 | 4 | 0.830 | -0.561 | 3269 | 2073 ## xeno4 | 1.86 | 1.66 | 1 | 1 | 7 | 1 | 2 | 1.999 | 2.952 | 3272 | 2070 ## contact1 | 0.28 | 0.45 | 1 | 0 | 1 | 0 | 1 | 0.956 | -1.087 | 3132 | 2210 ## contact2 | 0.60 | 0.49 | 1 | 0 | 1 | 0 | 1 | -0.417 | -1.827 | 3078 | 2264 ## contact3 | 0.53 | 0.50 | 1 | 0 | 1 | 0 | 1 | -0.105 | -1.990 | 3159 | 2183 ## contact4 | 0.61 | 0.49 | 1 | 0 | 1 | 0 | 1 | -0.442 | -1.806 | 3213 | 2129 ## xenophobia | 3.10 | 1.42 | 2 | 1 | 7 | 2 | 4 | 0.800 | 0.065 | 3237 | 2105 ## contact | 1.95 | 1.32 | 2 | 0 | 4 | 1 | 3 | -0.036 | -1.133 | 2957 | 2385 ``` ] --- ## Summary statistics with `dplyr` `dplyr` provides a helpful function for creating summary statistics: `summarize()` `summarize()` is a [vectorized](https://win-vector.com/2019/01/03/what-does-it-mean-to-write-vectorized-code-in-r/) function that can be used to create summary statistics for variables using functions like... - `mean()` - `sd()` - `min()` - `max()` - etc. --- ## Summary statistics with `dplyr` While creating summary statistics using `summarize()` from `dplyr` requires writing more code, it is the most flexible option.
Another nice benefit of `summarize()` is that it produces a `tibble` which can be used for further analyses or for creating plots or tables. --- ## `dplyr::summarize()` .small[ ```r allbus_2021_eda %>% summarize( mean_sat_dem = mean(sat_dem, na.rm = TRUE), sd_sat_dem = sd(sat_dem, na.rm = TRUE), var_sat_dem = var(sat_dem, na.rm = TRUE), min_sat_dem = min(sat_dem, na.rm = TRUE), max_sat_dem = max(sat_dem, na.rm = TRUE) ) %>% round(2) # round to 2 decimal places ``` ``` ## mean_sat_dem sd_sat_dem var_sat_dem min_sat_dem max_sat_dem ## 1 4.39 1.05 1.09 1 6 ``` ] --- ## `dplyr::group_by()` The `dplyr` function `group_by()` creates data frames (tibbles) that are grouped by one or more variables. This can, e.g., be used to produce grouped summary statistics. ```r allbus_2021_eda %>% filter(!is.na(party_vote)) %>% # only use cases where party_vote is not missing group_by(party_vote) %>% summarize( mean_sat_dem = mean(sat_dem, na.rm = TRUE), sd_sat_dem = sd(sat_dem, na.rm = TRUE), var_sat_dem = var(sat_dem, na.rm = TRUE), min_sat_dem = min(sat_dem, na.rm = TRUE), max_sat_dem = max(sat_dem, na.rm = TRUE) ) %>% ungroup() ``` .right[↪️] --- class: middle ``` ## # A tibble: 8 x 6 ## party_vote mean_sat_dem sd_sat_dem var_sat_dem min_sat_dem max_sat_dem ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 CDU-CSU 4.82 0.840 0.706 1 6 ## 2 SPD 4.65 0.849 0.720 1 6 ## 3 FDP 4.33 1.01 1.02 1 6 ## 4 Gruene 4.74 0.797 0.635 3 6 ## 5 Linke 4.09 0.930 0.865 1 6 ## 6 AfD 3.04 1.18 1.40 1 6 ## 7 Other party 3.77 1.22 1.49 1 6 ## 8 Would not vote 3.66 1.11 1.23 1 6 ``` --- ## Grouping and ungrouping It is important to keep in mind that using `group_by()` changes the class of a tibble/dataframe. To avoid (accidentally) continuing to perform operations groupwise, it is necessary to use the `ungroup()` function (typically at the end of a pipe). In that way, this function is similar to the *Split File* option in *SPSS*.
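To illustrate why this matters, here is a minimal sketch with a made-up toy tibble (the object and variable names are ours, not part of the *ALLBUS* data): after `group_by()`, a `mutate()` call computes within groups; after `ungroup()`, it uses all rows.

```r
library(dplyr)

toy <- tibble(g = c("a", "a", "b"),
              x = c(1, 2, 3))

# grouped: mean() is computed within each level of g
toy %>%
  group_by(g) %>%
  mutate(m = mean(x)) # m = 1.5, 1.5, 3

# after ungroup(): mean() is computed across all rows
toy %>%
  group_by(g) %>%
  ungroup() %>%
  mutate(m = mean(x)) # m = 2, 2, 2
```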
```r class(allbus_2021_eda) ``` ``` ## [1] "data.frame" ``` ```r allbus_2021_eda %>% group_by(party_vote) %>% class() ``` ``` ## [1] "grouped_df" "tbl_df" "tbl" "data.frame" ``` --- ## `dplyr::across()` To produce grouped summary statistics for multiple variables we can use the `dplyr` function `across()` which we already saw in action in the second data wrangling session. *Note*: We only use cases without missing data for any of the variables here (= listwise deletion). ```r allbus_2021_eda %>% select(party_vote, starts_with("xeno")) %>% drop_na() %>% group_by(party_vote) %>% summarize(across(starts_with("xeno"), list(mean = mean, sd = sd), .names = "{col}_{fn}")) ``` .right[↪️] --- class: middle ``` ## # A tibble: 8 x 11 ## party_vote xeno1_mean xeno1_sd xeno2_mean xeno2_sd xeno3_mean xeno3_sd xeno4_mean xeno4_sd ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 CDU-CSU 5.46 1.65 2.70 1.80 2.95 1.99 2 1.72 ## 2 SPD 5.07 1.70 2.44 1.76 2.61 1.87 1.68 1.44 ## 3 FDP 5.42 1.64 2.67 1.70 2.92 1.91 1.78 1.50 ## 4 Gruene 4.02 1.70 1.61 1.05 1.79 1.37 1.19 0.755 ## 5 Linke 4.67 1.92 2.09 1.54 2.15 1.58 1.52 1.29 ## 6 AfD 6.43 1.24 4.51 2.06 5.1 2.12 3.4 2.25 ## 7 Other party 5.22 1.62 2.71 1.96 3 2.21 1.84 1.53 ## 8 Would not vo~ 5.43 1.73 3.21 2.09 3.35 2.13 2.45 2.12 ## # ... with 2 more variables: xenophobia_mean <dbl>, xenophobia_sd <dbl> ``` --- class: center, middle # [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_3_1_1_Summary_Statistics.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_3_1_1_Summary_Statistics.html) --- ## Frequencies: `table()` A simple way of looking at frequencies (e.g., for categorical variables) that we have already seen before is the `base R` function `table()`. 
```r table(allbus_2021_eda$party_vote) ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD ## 1043 595 475 944 305 275 ## Other party Would not vote ## 163 226 ``` If you also want to include `NA` in the frequency counts, you need to specify the argument `useNA = "always"`. ```r table(allbus_2021_eda$party_vote, useNA = "always") ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD ## 1043 595 475 944 305 275 ## Other party Would not vote <NA> ## 163 226 1316 ``` --- ## Proportions with `prop.table()` If you want proportions instead of raw counts, you can use the `base R` function `prop.table()`. **NB**: You need to apply this function to an output produced by `table()`. .small[ ```r prop.table(table(allbus_2021_eda$party_vote)) ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD ## 0.25906607 0.14778937 0.11798311 0.23447591 0.07575758 0.06830601 ## Other party Would not vote ## 0.04048684 0.05613512 ``` ```r prop.table(table(allbus_2021_eda$party_vote, useNA = "always")) ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD ## 0.19524523 0.11138151 0.08891801 0.17671284 0.05709472 0.05147885 ## Other party Would not vote <NA> ## 0.03051292 0.04230625 0.24634968 ``` ] --- ## Proportions with `prop.table()` If you want fewer decimal places in the output, you can wrap the `prop.table()` function in a `round()` call. ```r round(prop.table(table(allbus_2021_eda$party_vote, useNA = "always")), 3) # rounded to 3 decimal places ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD ## 0.195 0.111 0.089 0.177 0.057 0.051 ## Other party Would not vote <NA> ## 0.031 0.042 0.246 ``` Or if you want percentages...
```r round((prop.table(table(allbus_2021_eda$party_vote, useNA = "always")) * 100), 2) ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD ## 19.52 11.14 8.89 17.67 5.71 5.15 ## Other party Would not vote <NA> ## 3.05 4.23 24.63 ``` --- ## Frequencies and proportions with `janitor::tabyl()` The [`janitor` package](https://github.com/sfirke/janitor) that we briefly mentioned in the first session on data wrangling also provides the [`tabyl()` function](https://sfirke.github.io/janitor/articles/tabyls.html) for creating frequency and proportion tables: .small[ ```r library(janitor) party_stats <- allbus_2021_eda %>% tabyl(party_vote) %>% adorn_pct_formatting(digits = 2, affix_sign = TRUE) party_stats ``` ``` ## party_vote n percent valid_percent ## CDU-CSU 1043 19.52% 25.91% ## SPD 595 11.14% 14.78% ## FDP 475 8.89% 11.80% ## Gruene 944 17.67% 23.45% ## Linke 305 5.71% 7.58% ## AfD 275 5.15% 6.83% ## Other party 163 3.05% 4.05% ## Would not vote 226 4.23% 5.61% ## <NA> 1316 24.63% - ``` ] --- ## Frequencies and proportions with `janitor::tabyl()` A nice thing about `tabyl()` is that it produces a (special type of) data frame which we can, e.g., use for plotting or creating tables (which we will discuss later on). ```r class(party_stats) ``` ``` ## [1] "tabyl" "data.frame" ``` --- ## Frequencies and proportions with `dplyr` We can also use `group_by()` and `summarize()` to get frequencies and proportions for variables in our data set.
```r allbus_2021_eda %>% filter(!is.na(party_vote)) %>% group_by(party_vote) %>% summarize(n = n()) %>% mutate(proportion = n/sum(n)) %>% ungroup() ``` ``` ## # A tibble: 8 x 3 ## party_vote n proportion ## <fct> <int> <dbl> ## 1 CDU-CSU 1043 0.259 ## 2 SPD 595 0.148 ## 3 FDP 475 0.118 ## 4 Gruene 944 0.234 ## 5 Linke 305 0.0758 ## 6 AfD 275 0.0683 ## 7 Other party 163 0.0405 ## 8 Would not vote 226 0.0561 ``` --- ## Frequencies and proportions with `dplyr` Instead of using `group_by` and `summarize()` to get frequency counts, we can also use `count()` from `dplyr` as a shorthand. ```r allbus_2021_eda %>% filter(!is.na(party_vote)) %>% count(party_vote) %>% mutate(proportion = n/sum(n)) %>% ungroup() ``` ``` ## party_vote n proportion ## 1 CDU-CSU 1043 0.25906607 ## 2 SPD 595 0.14778937 ## 3 FDP 475 0.11798311 ## 4 Gruene 944 0.23447591 ## 5 Linke 305 0.07575758 ## 6 AfD 275 0.06830601 ## 7 Other party 163 0.04048684 ## 8 Would not vote 226 0.05613512 ``` --- ## Tables in `R` There are typically two types of outputs you can produce with `R` for further use in reports or publications: tables and plots. We will cover the creation of plots in the following session(s) on data visualization. However, as summary statistics, frequencies, and proportions are often presented in tables, we will briefly discuss how to create tables as output in `R` in the following. As with almost everything in `R`, there are many options (read: packages) for creating tables. 
We will show examples from three popular options: - [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html) - [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/) + [`flextable`](https://davidgohel.github.io/flextable/index.html) --- ## Summary tables with `stargazer` While there is an ever-growing list of `R` packages for creating tables with many new (promising) contenders (such as [`flextable`](https://davidgohel.github.io/flextable/index.html) or [`gt`](https://gt.rstudio.com/index.html)), `stargazer` is an established and widely-used tool for creating ASCII (plain text), `LaTeX`, and `HTML` table output. --- ## Summary tables with `stargazer` If we, e.g., want to create a summary statistics table as text output (e.g., for printing in the `R` console), we can use `stargazer` for that. **NB**: As the main `stargazer()` function does not work with tibbles, we need to convert our data to a regular data frame. ```r library(stargazer) allbus_2021_eda %>% select(sat_dem, xenophobia, contact) %>% as.data.frame() %>% stargazer(type = "text", digits = 2, title="Descriptive statistics") ``` .right[↪️] --- class: middle ``` ## ## Descriptive statistics ## ========================================================== ## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max ## ---------------------------------------------------------- ## sat_dem 3,523 4.39 1.05 1.00 4.00 5.00 6.00 ## xenophobia 3,237 3.10 1.42 1.00 2.00 4.00 7.00 ## contact 2,957 1.95 1.32 0.00 1.00 3.00 4.00 ## ---------------------------------------------------------- ``` --- ## Summary tables with `stargazer` `stargazer` also supports `HTML` and `LaTeX` as output formats. We can export the output for further use (e.g., with our `LaTeX` editor of choice).
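For an `HTML` file, only the `type` argument and the file extension change (a sketch; the file path is hypothetical and assumes that the `output` folder already exists):

```r
library(stargazer)

# Write the same summary table as an HTML file instead of LaTeX
allbus_2021_eda %>%
  select(sat_dem, xenophobia, contact) %>%
  as.data.frame() %>%
  stargazer(type = "html",
            digits = 2,
            out = "./output/stargazer_summary.html",
            title = "Descriptive statistics")
```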
```r # We create a directory for output files first dir.create("./output") allbus_2021_eda %>% select(sat_dem, xenophobia, contact) %>% as.data.frame() %>% stargazer(type = "latex", digits = 2, out = "./output/stargazer_summary.tex", title="Descriptive statistics") ``` .small[ *Note*: If you look at the help file for the `stargazer()` function (via `?stargazer`), you will see that it provides a lot of customization options. ] --- ## Summary tables with `gtsummary` The `tbl_summary()` function from the [`gtsummary` package](http://www.danieldsjoberg.com/gtsummary/) can, e.g., be used to produce frequency table output. If you run the following code in *RStudio*, the table will be displayed in the `Viewer` pane and can then be exported from there as an image or an `HTML` file. ```r library(gtsummary) allbus_2021_eda %>% select(sex, agec) %>% tbl_summary() ``` *Note*: As with `stargazer`, `tbl_summary()` also provides various customization options and you can save its output in different formats, including `.html`, `.rtf`, and `.tex`. --- ## Summary tables with `gtsummary` You can also save the output of `tbl_summary()` directly in a *Word* file. For that to work, however, you also need to use a function from the [`flextable` package](https://davidgohel.github.io/flextable/index.html) (and an installation of *Word*). ```r library(flextable) allbus_2021_eda %>% select(sex, agec) %>% tbl_summary(label = list(sex ~ "Sex", agec ~ "Age category")) %>% as_flex_table() %>% save_as_docx("Sample descriptives" = ., path = "./output/gt_summary.docx") ``` .small[ *Note*: If you have not done this before, for this code to work, you need to create an `output` folder in your current working directory using `dir.create("./output")`. 
] --- class: center, middle # [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_3_1_2_Frequencies_Proportions.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_3_1_2_Frequencies_Proportions.html) --- ## Relationships between variables In addition to checking summary statistics for individual variables, another thing that you quite possibly also want to look at as part of your EDA is the relationship between (specific) variables in your data set. There are many ways to do so and the appropriate choice of methods, of course, depends on the types of variables you want to explore. In the following, we will briefly discuss some options for two methods of exploring relationships between variables: - crosstabulation (for categorical variables) - correlations (for numeric and/or binary variables) --- ## Crosstabs You can also use the `base R` functions `table()` and `prop.table()` we have used before for creating univariate frequency and proportion tables to generate crosstabs. ```r table(allbus_2021_eda$sex, allbus_2021_eda$party_vote) # rows, columns ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD Other party Would not vote ## Male 528 311 271 426 159 188 84 108 ## Female 511 282 204 516 143 85 77 118 ## Non-binary 0 0 0 2 1 0 0 0 ``` ```r round(prop.table(table(allbus_2021_eda$sex, allbus_2021_eda$party_vote))*100, 2) ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD Other party Would not vote ## Male 13.15 7.75 6.75 10.61 3.96 4.68 2.09 2.69 ## Female 12.73 7.03 5.08 12.86 3.56 2.12 1.92 2.94 ## Non-binary 0.00 0.00 0.00 0.05 0.02 0.00 0.00 0.00 ``` --- ## Crosstabs We can also calculate row or column percentages using these `base R` functions.
```r round(prop.table(table(allbus_2021_eda$sex, allbus_2021_eda$party_vote), 1)*100, 2) # row percentages ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD Other party Would not vote ## Male 25.45 14.99 13.06 20.53 7.66 9.06 4.05 5.20 ## Female 26.39 14.57 10.54 26.65 7.39 4.39 3.98 6.10 ## Non-binary 0.00 0.00 0.00 66.67 33.33 0.00 0.00 0.00 ``` ```r round(prop.table(table(allbus_2021_eda$sex, allbus_2021_eda$party_vote), 2)*100, 2) # column percentages ``` ``` ## ## CDU-CSU SPD FDP Gruene Linke AfD Other party Would not vote ## Male 50.82 52.45 57.05 45.13 52.48 68.86 52.17 47.79 ## Female 49.18 47.55 42.95 54.66 47.19 31.14 47.83 52.21 ## Non-binary 0.00 0.00 0.00 0.21 0.33 0.00 0.00 0.00 ``` .small[ *Note*: If you want to generate tables based on more than two variables, the `base R` function `ftable()` is a good option for prettier printing of results. ] --- ## Crosstabs with `dplyr` We can also use functions from `dplyr` to create crosstabs including frequencies. ```r allbus_2021_eda %>% filter(!is.na(party_vote), !is.na(sex)) %>% count(sex, party_vote) %>% pivot_wider(names_from = party_vote, values_from = n) ``` ``` ## # A tibble: 3 x 9 ## sex `CDU-CSU` SPD FDP Gruene Linke AfD `Other party` `Would not vote` ## <fct> <int> <int> <int> <int> <int> <int> <int> <int> ## 1 Male 528 311 271 426 159 188 84 108 ## 2 Female 511 282 204 516 143 85 77 118 ## 3 Non-binary NA NA NA 2 1 NA NA NA ``` .small[ *Note*: We have only briefly mentioned the function in the first session on data wrangling (when introducing the concept of tidy data), but, as you might guess from its name, the function `pivot_wider()`, which is part of the [`tidyr` package](https://tidyr.tidyverse.org/), changes the format of a data set from long to wide. ] --- ## Crosstabs with `dplyr` We can also use functions from `dplyr` in a similar fashion as we have done for a single variable to create crosstabs including percentages.
```r allbus_2021_eda %>% filter(!is.na(party_vote), !is.na(sex)) %>% count(sex, party_vote) %>% mutate(proportion = n/sum(n)*100) %>% select(-n) %>% pivot_wider(names_from = party_vote, values_from = proportion) ``` ``` ## # A tibble: 3 x 9 ## sex `CDU-CSU` SPD FDP Gruene Linke AfD `Other party` `Would not vote` ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Male 13.2 7.75 6.75 10.6 3.96 4.68 2.09 2.69 ## 2 Female 12.7 7.03 5.08 12.9 3.56 2.12 1.92 2.94 ## 3 Non-binary NA NA NA 0.0498 0.0249 NA NA NA ``` --- ## Crosstabs with the `janitor` package The `tabyl()` function from the `janitor` package provides quite a few options for crosstabs. We will only show one extended example here, but you can learn more in the [`tabyl` vignette](https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html). .small[ ```r library(janitor) allbus_2021_eda %>% filter(!is.na(party_vote), !is.na(sex)) %>% tabyl(sex, party_vote) %>% adorn_totals(where = c("row","col")) %>% adorn_percentages(denominator = "row") %>% adorn_pct_formatting(digits = 2) %>% adorn_ns(position = "front") ``` ``` ## sex CDU-CSU SPD FDP Gruene Linke AfD ## Male 528 (25.45%) 311 (14.99%) 271 (13.06%) 426 (20.53%) 159 (7.66%) 188 (9.06%) ## Female 511 (26.39%) 282 (14.57%) 204 (10.54%) 516 (26.65%) 143 (7.39%) 85 (4.39%) ## Non-binary 0 (0.00%) 0 (0.00%) 0 (0.00%) 2 (66.67%) 1 (33.33%) 0 (0.00%) ## Total 1039 (25.88%) 593 (14.77%) 475 (11.83%) 944 (23.52%) 303 (7.55%) 273 (6.80%) ## Other party Would not vote Total ## 84 (4.05%) 108 (5.20%) 2075 (100.00%) ## 77 (3.98%) 118 (6.10%) 1936 (100.00%) ## 0 (0.00%) 0 (0.00%) 3 (100.00%) ## 161 (4.01%) 226 (5.63%) 4014 (100.00%) ``` ] --- ## Other options for crosstabulation in `R` As with most things in `R`, there are many options for creating crosstabs. 
Some alternatives to the ones we presented before include the `CrossTable()` and `crosstab()` functions from the [`descr` package](https://cran.r-project.org/web/packages/descr/index.html) or the `ctable()` function from the [`summarytools` package](https://github.com/dcomtois/summarytools). Another very versatile option for creating crosstabs as well as output based on those (e.g., in combination with the `flextable` package we have briefly used before) is the [`crosstable` package](https://danchaltiel.github.io/crosstable/). --- ## Chi-Square Test We can use the `summary()` function from `base R` in combination with `table()` to perform a chi-square test. ```r summary(table(allbus_2021_eda$sex, allbus_2021_eda$party_vote)) ``` ``` ## Number of cases in table: 4014 ## Number of factors: 2 ## Test for independence of all factors: ## Chisq = 62.55, df = 14, p-value = 4.172e-08 ## Chi-squared approximation may be incorrect ``` --- ## Chi-Square Test Most of the other packages that include functions for crosstabs that we mentioned before can also be used for chi-square tests: for example, the `janitor` package. ```r allbus_2021_eda %>% filter(!is.na(party_vote), !is.na(sex)) %>% tabyl(sex, party_vote) %>% chisq.test() ``` ``` ## ## Pearson's Chi-squared test ## ## data: . ## X-squared = 62.548, df = 14, p-value = 4.172e-08 ``` --- ## Correlations Again, as with the crosstabs examples, there are many different options for calculating and displaying correlations in `R`. In addition to the `base R` functions, we will look at two packages in this part: [`corrr`](https://corrr.tidymodels.org/) and [`correlation`](https://github.com/easystats/correlation). *Note*: While we will not cover that in this session, `corrr` and `correlation` also offer some nice options for plotting correlations. --- ## Correlations with `base R` The `base R` function `cor()` computes the correlation coefficient(s) between two or more variables.
This function can be used to calculate *Pearson's r*, *Kendall's tau*, and *Spearman's rho*. We also need to specify how we want to deal with missing values (e.g., use pairwise complete observations). For example, let's look at the correlations between the xenophobia variables in our data set: .small[ ```r xenophobia <- allbus_2021_eda %>% select(xeno1:xeno4) cor(xenophobia, use = "pairwise.complete.obs", method = "pearson") ``` ``` ## xeno1 xeno2 xeno3 xeno4 ## xeno1 1.0000000 0.4413642 0.4522287 0.2946897 ## xeno2 0.4413642 1.0000000 0.6218878 0.5345694 ## xeno3 0.4522287 0.6218878 1.0000000 0.5173153 ## xeno4 0.2946897 0.5345694 0.5173153 1.0000000 ``` ] --- ## Correlations with `base R` With `cor.test()` you can display the results of a significance test for a correlation. ```r cor.test(allbus_2021_eda$xeno1, allbus_2021_eda$xeno2, method = "pearson") ``` ``` ## ## Pearson's product-moment correlation ## ## data: allbus_2021_eda$xeno1 and allbus_2021_eda$xeno2 ## t = 28.058, df = 3254, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.4132793 0.4686102 ## sample estimates: ## cor ## 0.4413642 ``` --- ## The `corrr` package The [`corrr` package](https://corrr.tidymodels.org/) is part of the [`tidymodels` suite of packages](https://www.tidymodels.org/) and provides various functions for displaying correlations. The main function is `correlate()` which produces a `tibble` as output. .small[ ```r library(corrr) correlate(xenophobia) ``` ``` ## # A tibble: 4 x 5 ## term xeno1 xeno2 xeno3 xeno4 ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 xeno1 NA 0.441 0.452 0.295 ## 2 xeno2 0.441 NA 0.622 0.535 ## 3 xeno3 0.452 0.622 NA 0.517 ## 4 xeno4 0.295 0.535 0.517 NA ``` ] --- ## The `corrr` package The `corrr` package provides several functions for tweaking/optimizing the output of the `correlate()` function.
Here's one example:

.small[
```r
xenophobia %>% 
  correlate() %>% 
  shave() %>% 
  fashion()
```

```
##    term xeno1 xeno2 xeno3 xeno4
## 1 xeno1                        
## 2 xeno2   .44                  
## 3 xeno3   .45   .62            
## 4 xeno4   .29   .53   .52      
```
]

---

## The `correlation` package

The [`correlation` package](https://easystats.github.io/correlation/) is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) (also called the `Easyverse`).

It provides a much wider range of correlation types than the `base R` function `cor()` and the `correlate()` function from the `corrr` package (e.g., biserial and tetrachoric correlations for factors). Its core is the `correlation()` function.

```r
library(correlation)

correlation(xenophobia)
```

.right[↪️]

---

class: middle

```
## # Correlation Matrix (pearson-method)
## 
## Parameter1 | Parameter2 |    r |       95% CI |     t |   df |         p
## ------------------------------------------------------------------------
## xeno1      |      xeno2 | 0.44 | [0.41, 0.47] | 28.06 | 3254 | < .001***
## xeno1      |      xeno3 | 0.45 | [0.42, 0.48] | 28.94 | 3258 | < .001***
## xeno1      |      xeno4 | 0.29 | [0.26, 0.33] | 17.61 | 3260 | < .001***
## xeno2      |      xeno3 | 0.62 | [0.60, 0.64] | 45.29 | 3252 | < .001***
## xeno2      |      xeno4 | 0.53 | [0.51, 0.56] | 36.07 | 3251 | < .001***
## xeno3      |      xeno4 | 0.52 | [0.49, 0.54] | 34.49 | 3256 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 3253-3262
```

---

## The `correlation` package

Among other things, the `correlation` package allows you to calculate grouped/stratified correlations.
```r allbus_2021_eda %>% select(sex, xeno1:xeno4) %>% filter(sex != "Non-binary") %>% droplevels() %>% # drop unused factor level group_by(sex) %>% correlation() ``` .right[↪️] --- class: center, middle .small[ ``` ## # Correlation Matrix (pearson-method) ## ## Group | Parameter1 | Parameter2 | r | 95% CI | t | df | p ## --------------------------------------------------------------------------------- ## Male | xeno1 | xeno2 | 0.45 | [0.41, 0.49] | 20.01 | 1597 | < .001*** ## Male | xeno1 | xeno3 | 0.46 | [0.42, 0.50] | 20.88 | 1600 | < .001*** ## Male | xeno1 | xeno4 | 0.31 | [0.26, 0.35] | 12.96 | 1603 | < .001*** ## Male | xeno2 | xeno3 | 0.64 | [0.61, 0.67] | 33.35 | 1597 | < .001*** ## Male | xeno2 | xeno4 | 0.56 | [0.52, 0.59] | 26.97 | 1597 | < .001*** ## Male | xeno3 | xeno4 | 0.55 | [0.51, 0.58] | 26.32 | 1602 | < .001*** ## Female | xeno1 | xeno2 | 0.43 | [0.39, 0.47] | 19.53 | 1647 | < .001*** ## Female | xeno1 | xeno3 | 0.44 | [0.40, 0.48] | 19.91 | 1648 | < .001*** ## Female | xeno1 | xeno4 | 0.28 | [0.23, 0.32] | 11.77 | 1647 | < .001*** ## Female | xeno2 | xeno3 | 0.60 | [0.57, 0.63] | 30.23 | 1645 | < .001*** ## Female | xeno2 | xeno4 | 0.50 | [0.46, 0.54] | 23.42 | 1644 | < .001*** ## Female | xeno3 | xeno4 | 0.48 | [0.44, 0.51] | 22.08 | 1644 | < .001*** ## ## p-value adjustment method: Holm (1979) ## Observations: 1599-1650 ``` ] --- class: center, middle # [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_3_1_3_Crosstabs_Correlations.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_3_1_3_Crosstabs_Correlations.html) --- ## Guilty by ~~association~~ correlation While correlation coefficients are useful for exploring relationships between variables, they can also be misleading. For example, if we do correlation analysis and encounter a (Pearson's) correlation coefficient close to 0, we often think of relationships as pictured below. 
<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-1-1.png" width="45%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

This data set has **the same correlation coefficient (Pearson's r of -0.06)** as the one on the previous slide:

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-2-1.png" width="50%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation 🦖

So does this one...

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-3-1.png" width="60%" style="display: block; margin: auto;" />

---

## Guilty by ~~association~~ correlation

We could go on... The previous three examples all come from the [`datasauRus` package](https://jumpingrivers.github.io/datasauRus/), which essentially is an extension of [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) and includes 13 data sets with the same (Pearson) correlation between x and y.

<img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-4-1.png" width="50%" style="display: block; margin: auto;" />

---

## Trust no singular value!

Importantly, the x- and y-variables in the 13 `datasaurus_dozen` data sets also all have the same means and standard deviations.
```r
datasaurus_dozen %>% 
  group_by(dataset) %>% 
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    corr = cor(x, y, method = "pearson")
  )
```

.right[↪️]

---

class: middle

.small[
```
## # A tibble: 13 x 6
##    dataset    mean_x mean_y  sd_x  sd_y    corr
##    <chr>       <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1 away         54.3   47.8  16.8  26.9 -0.0641
##  2 bullseye     54.3   47.8  16.8  26.9 -0.0686
##  3 circle       54.3   47.8  16.8  26.9 -0.0683
##  4 dino         54.3   47.8  16.8  26.9 -0.0645
##  5 dots         54.3   47.8  16.8  26.9 -0.0603
##  6 h_lines      54.3   47.8  16.8  26.9 -0.0617
##  7 high_lines   54.3   47.8  16.8  26.9 -0.0685
##  8 slant_down   54.3   47.8  16.8  26.9 -0.0690
##  9 slant_up     54.3   47.8  16.8  26.9 -0.0686
## 10 star         54.3   47.8  16.8  26.9 -0.0630
## 11 v_lines      54.3   47.8  16.8  26.9 -0.0694
## 12 wide_lines   54.3   47.8  16.8  26.9 -0.0666
## 13 x_shape      54.3   47.8  16.8  26.9 -0.0656
```
]

---

## Plot your data!

The message from the `datasaurus_dozen` examples should be clear. Relying only on singular values that summarize the location or spread of a single variable or the association of two variables is not a good idea.

To avoid reducing a ~~mountain to a molehill~~ dinosaur to a lack of correlation, it is important to plot your data to explore:

- univariate distributions
- grouped univariate distributions (if you want to compare groups)
- bivariate distributions

For that reason (and because it is fun and `R` has a lot to offer in that regard), we will dive 🥽 right into data visualization in the next session...

---

## Other packages for EDA

The [`summarytools` package](https://github.com/dcomtois/summarytools) provides quite a lot of functionality for EDA, offers an exhaustive [introduction](https://htmlpreview.github.io/?https://github.com/dcomtois/summarytools/blob/master/doc/introduction.html), and also [works nicely with `R Markdown`](https://htmlpreview.github.io/?https://github.com/dcomtois/summarytools/blob/master/doc/rmarkdown.html) (which we will cover on Thursday).
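As a quick taste, here is a minimal sketch of two of its core functions (assuming `summarytools` is installed; we use the built-in `mtcars` data instead of the *ALLBUS* data to keep the example self-contained):

```r
library(summarytools)

# descr() computes summary statistics (mean, sd, quartiles, etc.)
# for all numeric variables in a data frame
descr(mtcars)

# freq() produces a frequency table (counts, proportions, valid %)
# for a single variable
freq(mtcars$cyl)
```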
A (minor) downside of this package is that you need to install additional software to use it on *macOS* and *Linux* (it only works out of the box on *Windows*).

---

## Other packages for EDA

Besides the ones we have covered in this session and `summarytools`, there are still plenty of other packages that provide interesting functionalities for EDA. Some of these additional/alternative options are:

- [`inspectdf`](https://alastairrushworth.github.io/inspectdf/)
- [`skimr`](https://docs.ropensci.org/skimr/)
- [`descriptr`](https://descriptr.rsquaredacademy.com/index.html)
- [`DataExplorer`](https://boxuancui.github.io/DataExplorer/)
- [`explore`](https://github.com/rolkra/explore)
- [`dataReporter`](https://github.com/ekstroem/dataReporter)

---

## Automated EDA reports

Some of the EDA packages provide functions for generating automated EDA reports, for example:

- the `dfSummary()` function from the `summarytools` package
- the `skim()` function from the [`skimr` package](https://docs.ropensci.org/skimr/)
- the `make_report()` function from the [`dataReporter` package](https://github.com/ekstroem/dataReporter)
- the `report()` function from the [`explore` package](https://github.com/rolkra/explore)

The `explore()` function from the `explore` package also allows for interactive data exploration.

---

# Extracurricular activities

Check out the appendix slides on exploring missing data and outliers and/or watch the [*YouTube* tutorial *EDA with the tidyverse: Exploring Variation*](https://www.youtube.com/watch?v=26hbyVb00xs) by Mauro Lepore for even more EDA inspiration.

Of course, you can also make use of what we covered in this session to further explore the (full) *ALLBUS* 2021 data set.
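If you want to try one of the automated report functions mentioned earlier as a starting point, here is a one-line sketch with `skimr` (assuming the package is installed; `mtcars` stands in for the *ALLBUS* data):

```r
library(skimr)

# skim() prints a compact, automated overview of every variable
# in a data frame (type, missingness, summary statistics, mini-histogram)
skim(mtcars)
```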