Exercise 3_1_1: Summary statistics

For this exercise, we will use the same subset of the ALLBUS 2021 data as in the lecture. If you have stored that data set as an .rds file as shown in the slides, you can simply load it with the following command:

allbus_2021_eda <- readRDS("./data/allbus_2021_eda.rds")

If you have not yet saved the wrangled data as an .rds file, you need to run the data wrangling pipeline shown in the EDA slides again.

In addition to base R and packages from the tidyverse, we will use the datawizard package in this exercise, so make sure that you have it installed.
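If you are unsure whether everything is installed, a quick check along the following lines may help. This is only a sketch: the package list (dplyr, tidyr, datawizard) is an assumption based on the functions used in the solutions of this exercise, and the install call is left commented out so that nothing is installed without your say-so.

```r
# Packages assumed to be needed for this exercise (based on the solutions)
pkgs <- c("dplyr", "tidyr", "datawizard")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```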

1

To get into the EDA vibe, use a base R function to print some basic summary statistics for the variables xenophobia and contact.

Clues

We can use the dplyr function for selecting variables and pipe the result into the required function.

solution

library(dplyr) # provides select() and the %>% pipe

allbus_2021_eda %>% 
  select(xenophobia, contact) %>% 
  summary()

##    xenophobia       contact     
##  Min.   :1.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.750   Median :2.000  
##  Mean   :3.097   Mean   :1.953  
##  3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :7.000   Max.   :4.000  
##  NA's   :2105    NA's   :2385
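The same kind of result can also be obtained with base R alone, without dplyr or the pipe. The sketch below uses a small made-up data frame (toy) in place of allbus_2021_eda, so the numbers are purely illustrative; with the real data you would index allbus_2021_eda in the same way.

```r
# Toy data standing in for allbus_2021_eda (values are made up)
toy <- data.frame(
  xenophobia = c(1, 2.5, 4, NA, 7),
  contact    = c(0, 1, NA, 3, 4),
  sat_dem    = c(3, 4, 5, 6, NA)
)

# Select the two columns by name, then summarize them
toy_summary <- summary(toy[, c("xenophobia", "contact")])
print(toy_summary)
```

As in the output above, summary() reports the number of missing values (NA's) separately for each variable.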

2

Use a function from the datawizard package to get summary statistics (descriptions of the distribution) for the following variables in our data set: sat_dem, xenophobia, contact. We do not want information on quartiles or the IQR.

Clues

You can check the arguments for the function we need via ?describe_distribution.

solution

library(datawizard)

allbus_2021_eda %>% 
  select(sat_dem,
         xenophobia,
         contact) %>%
  describe_distribution(iqr = FALSE)

## Variable   | Mean |   SD |        Range | Skewness | Kurtosis |    n | n_Missing
## --------------------------------------------------------------------------------
## sat_dem    | 4.39 | 1.05 | [1.00, 6.00] |    -0.91 |     0.95 | 3523 |      1819
## xenophobia | 3.10 | 1.42 | [1.00, 7.00] |     0.80 |     0.07 | 3237 |      2105
## contact    | 1.95 | 1.32 | [0.00, 4.00] |    -0.04 |    -1.13 | 2957 |      2385
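To see what some of these columns mean, it can help to recompute them by hand. The following sketch uses a hypothetical helper, describe_basic(), and made-up toy data; it is not how datawizard computes its output internally, and it leaves out skewness and kurtosis. Note that n counts the non-missing values, which is why n and n_Missing add up to the same total for every variable in the table above.

```r
# Hypothetical helper: basic descriptives for one numeric vector
describe_basic <- function(x) {
  c(
    mean      = mean(x, na.rm = TRUE),
    sd        = sd(x, na.rm = TRUE),
    min       = min(x, na.rm = TRUE),
    max       = max(x, na.rm = TRUE),
    n         = sum(!is.na(x)), # number of non-missing values
    n_missing = sum(is.na(x))   # number of missing values
  )
}

# Toy data standing in for allbus_2021_eda (values are made up)
toy <- data.frame(
  sat_dem    = c(3, 4, 5, 6, NA),
  xenophobia = c(1, 2.5, 4, NA, 7),
  contact    = c(0, 1, NA, 3, 4)
)

# One row per variable, one column per statistic
toy_desc <- t(sapply(toy, describe_basic))
print(round(toy_desc, 2))
```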

3

Now, let’s use functions from dplyr to create grouped summary statistics. Compute separate means for the variables xenophobia and contact for the different age groups in the data set. The resulting summary variables should be called xenophobia_mean and contact_mean. You should exclude respondents with missing values for the variables of interest.

Clues

You need to group and summarize the data.

solution

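library(tidyr) # provides drop_na()

allbus_2021_eda %>% 
  select(agec,
         xenophobia,
         contact) %>% 
  drop_na() %>% # keep only respondents with complete data on all three variables
  group_by(agec) %>% 
  summarize(xenophobia_mean = mean(xenophobia),
            contact_mean = mean(contact))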

## # A tibble: 6 x 3
##   agec           xenophobia_mean contact_mean
##   <ord>                    <dbl>        <dbl>
## 1 <= 25 years               2.45        2.32 
## 2 26 to 30 years            2.68        2.36 
## 3 31 to 35 years            3.01        2.23 
## 4 36 to 40 years            3.42        1.57 
## 5 41 to 45 years            3.80        0.945
## 6 46 to 50 years            4.32        0.647
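Note that drop_na() removes a respondent as soon as any of the selected variables is missing (listwise deletion). An alternative is to handle missing values separately for each variable with na.rm = TRUE. The sketch below contrasts the two approaches on made-up toy data (the group labels and values are invented); which one is appropriate depends on your analysis, and the exercise asks for the listwise version.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for allbus_2021_eda (values are made up)
toy <- tibble(
  agec       = c("young", "young", "young", "old", "old"),
  xenophobia = c(2, 4, NA, 5, 3),
  contact    = c(3, NA, 1, 0, 2)
)

# Listwise deletion, as in the solution: drop a row if ANY variable is missing
listwise <- toy %>%
  drop_na() %>%
  group_by(agec) %>%
  summarize(xenophobia_mean = mean(xenophobia),
            contact_mean = mean(contact))

# Per-variable deletion: each mean uses all non-missing values of that variable
per_variable <- toy %>%
  group_by(agec) %>%
  summarize(across(c(xenophobia, contact),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"))

print(listwise)
print(per_variable)
```

In the toy data, the two approaches give different means for the "young" group because two of its three rows have a missing value on one of the variables.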

Johannes Breuer, Stefan Jünger, & Veronika Batzdorfer

Introduction to R for Data Analysis
