For this exercise, we will use the same subset of the ALLBUS 2021 data as in the lecture. If you have stored that data set as an .rds file as shown in the slides, you can simply load it with the following command:

allbus_2021_eda <- readRDS("./data/allbus_2021_eda.rds")

If you have not saved the wrangled data as an .rds file yet, you need to run the data wrangling pipeline shown in the EDA slides again before continuing.
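
As a small convenience, the check below is one possible way to find out which of the two cases applies. It is only a sketch: the path is the one used in the command above, and the message text is made up for illustration.

```r
# Path taken from the readRDS() command above; adjust it if your project layout differs.
rds_path <- "./data/allbus_2021_eda.rds"

if (file.exists(rds_path)) {
  # The wrangled data was saved earlier, so we can simply load it.
  allbus_2021_eda <- readRDS(rds_path)
} else {
  # Nothing is loaded in this case; the message only tells you what to do next.
  message(
    "File not found. Rerun the data wrangling pipeline from the EDA slides, ",
    "then store the result with saveRDS(allbus_2021_eda, rds_path)."
  )
}
```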

In addition to base R and packages from the tidyverse, we will use the datawizard package in this exercise, so make sure that you have it installed.
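
If you are unsure whether the packages are installed, a quick check along the following lines may help. This is a sketch, not part of the original materials: the vector of package names is an assumption based on the functions used in this exercise (dplyr and tidyr are both part of the tidyverse).

```r
# Packages this exercise relies on (assumed list, see the note above).
needed <- c("dplyr", "tidyr", "datawizard")

# requireNamespace() returns FALSE instead of throwing an error if a package is missing.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```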

1

To get started with the EDA vibe, use a base R function to print some basic summary statistics for the variables xenophobia and contact.
We can use the dplyr function for selecting variables and pipe the result into the required function.
library(tidyverse) # provides %>%, select(), and the other dplyr/tidyr functions used below

allbus_2021_eda %>% 
  select(xenophobia, contact) %>% 
  summary()
##    xenophobia       contact     
##  Min.   :1.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.750   Median :2.000  
##  Mean   :3.097   Mean   :1.953  
##  3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :7.000   Max.   :4.000  
##  NA's   :2105    NA's   :2385
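
For comparison, the same kind of summary can be produced with base R alone, without select() or the pipe. The sketch below uses a small made-up data frame (called toy here) in place of allbus_2021_eda, so the numbers will not match the output above.

```r
# Toy data standing in for allbus_2021_eda (values are made up for illustration).
toy <- data.frame(
  xenophobia = c(1, 2.5, 4, NA, 7),
  contact    = c(0, 1, NA, 3, 4)
)

# Base R only: subset the columns by name, then summarize.
summary(toy[, c("xenophobia", "contact")])

# summary() also works on a single vector and reports the NA count separately.
summary(toy$xenophobia)
```

With the real data, replacing toy by allbus_2021_eda in the first summary() call should give the same table as the dplyr version.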

2

Use a function from the datawizard package to get summary statistics (descriptions of the distribution) for the following variables in our data set: sat_dem, xenophobia, contact. We do not want information on quartiles or the IQR.
You can check the arguments for the function we need via ?describe_distribution.
library(datawizard)

allbus_2021_eda %>% 
  select(sat_dem,
         xenophobia,
         contact) %>%
  describe_distribution(iqr = FALSE)
## Variable   | Mean |   SD |        Range | Skewness | Kurtosis |    n | n_Missing
## --------------------------------------------------------------------------------
## sat_dem    | 4.39 | 1.05 | [1.00, 6.00] |    -0.91 |     0.95 | 3523 |      1819
## xenophobia | 3.10 | 1.42 | [1.00, 7.00] |     0.80 |     0.07 | 3237 |      2105
## contact    | 1.95 | 1.32 | [0.00, 4.00] |    -0.04 |    -1.13 | 2957 |      2385
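
To make the columns of this table more concrete, the following sketch recomputes a few of them by hand with base R. It uses a made-up vector rather than the ALLBUS data, and it is meant as an illustration of what the columns should correspond to, not as a description of how datawizard computes them internally. Note that in the output above, n and n_Missing add up to 5342 for each of the three variables.

```r
# Toy vector (made-up values), not ALLBUS data.
x <- c(1, 2, 2, 3, 5, NA, NA)

n_valid   <- sum(!is.na(x))  # corresponds to the n column
n_missing <- sum(is.na(x))   # corresponds to the n_Missing column

# Missing values have to be removed explicitly in base R.
c(
  mean      = mean(x, na.rm = TRUE),
  sd        = sd(x, na.rm = TRUE),
  min       = min(x, na.rm = TRUE),
  max       = max(x, na.rm = TRUE),
  n         = n_valid,
  n_missing = n_missing
)

# Valid and missing cases together give the total number of observations.
n_valid + n_missing == length(x)
```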

3

Now, let’s use functions from dplyr to create grouped summary statistics. Compute separate means for the variables xenophobia and contact for the different age groups in the data set. The resulting summary variables should be called xenophobia_mean and contact_mean. You should exclude respondents with missing values for the variables of interest.
You need to group and summarize the data.
allbus_2021_eda %>% 
  select(agec,
         xenophobia,
         contact) %>% 
  drop_na() %>% 
  group_by(agec) %>% 
  summarize(xenophobia_mean = mean(xenophobia),
            contact_mean = mean(contact))
## # A tibble: 6 x 3
##   agec           xenophobia_mean contact_mean
##   <ord>                    <dbl>        <dbl>
## 1 <= 25 years               2.45        2.32 
## 2 26 to 30 years            2.68        2.36 
## 3 31 to 35 years            3.01        2.23 
## 4 36 to 40 years            3.42        1.57 
## 5 41 to 45 years            3.80        0.945
## 6 46 to 50 years            4.32        0.647
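
One detail worth keeping in mind: drop_na() removes a respondent as soon as any of the selected variables is missing, so both means are based on the same set of cases. An alternative is to keep all rows and pass na.rm = TRUE to mean(), in which case each mean uses all valid values of its own variable. The sketch below contrasts the two approaches on a small made-up tibble (the object names toy, listwise, and per_variable are only for illustration); which one is appropriate depends on the analysis.

```r
library(dplyr)
library(tidyr)

# Toy data (made-up values), not ALLBUS data.
toy <- tibble(
  agec       = c("young", "young", "young", "old", "old"),
  xenophobia = c(2, 4, NA, 5, 3),
  contact    = c(3, NA, 1, 0, 2)
)

# Listwise: a row is dropped if any of its variables is missing.
listwise <- toy %>%
  drop_na() %>%
  group_by(agec) %>%
  summarize(xenophobia_mean = mean(xenophobia),
            contact_mean = mean(contact))

# Per variable: each mean uses all valid values of that variable.
per_variable <- toy %>%
  group_by(agec) %>%
  summarize(xenophobia_mean = mean(xenophobia, na.rm = TRUE),
            contact_mean = mean(contact, na.rm = TRUE))

listwise
per_variable
```

In this toy example the two results differ for the "young" group, because two of its three rows have a missing value on one of the variables.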