class: center, middle, inverse, title-slide .title[ # Introduction to R for Data Analysis ] .subtitle[ ## Data Visualization - Part 2 ] .author[ ### Johannes Breuer, Stefan Jünger, Veronika Batzdorfer ] .date[ ### 2021-08-18 ] --- layout: true --- ## Content of this session .pull-left[ **What we will do** - Visual checks of model assumptions - Plotting coefficients - Plotting predictions - Plotting multiple models ] .pull-right[ **What we won't do** - Again, no crazy models - No Bayesian statistics ] --- ## Packages in this session .pull-left[ We will, again, use some packages from the `easystats` collection - `parameters` - `performance` - **`see`** ] .pull-right[ <img src="data:image/png;base64,#../img/logo_wall.png" style="display: block; margin: auto;" /> ] .pull-left[ While the `see` package is straightforward to use, it has some limitations. From time to time, we will therefore also switch to the `sjPlot` package. ] .pull-right[ <img src="data:image/png;base64,#../img/sjplot_logo.png" width="40%" style="display: block; margin: auto;" /> ] --- ## Data for this session As in the previous sessions, we will use the data from the *ALLBUS/GGSS 2021*. ```r library(haven) allbus_2021_dvII <- read_spss("./data/allbus_2021/ZA5280_v1-0-0.sav") ``` --- ## Fictional research question We are still interested in xenophobic attitudes and would like to investigate how social trust and contact with foreigners might reduce them, and how party choice may contribute to higher levels of xenophobia.
**Of course, this 'research' is again purely based on ad-hoc hypotheses.** ```r library(dplyr) library(forcats) library(sjlabelled) allbus_2021_dvII <- allbus_2021_dvII %>% remove_all_labels() %>% mutate( sex = recode_factor(sex, `1` = "Male", `2` = "Female", `3` = "Non-binary"), region = recode_factor( eastwest, `1` = "Western Germany", `2` = "Eastern Germany" ), xenophobia = rowMeans(across(ma01b:ma04, ~as.numeric(.x))), trust = ifelse(st01 == 1, 1, 0), contact = rowSums(across(mc01:mc04, ~as.numeric(.x))), afd = ifelse(pv01 == 42, 1, 0), person_weight = as.numeric(wghtpew) ) ``` --- ## Estimating a linear model (OLS) We start with estimating a linear regression model including our variables of interest and two control variables. ```r linear_model <- lm( xenophobia ~ age + sex + region + trust + contact + afd, data = allbus_2021_dvII ) library(parameters) model_parameters(linear_model) ```
| Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.69 | 0.133 | 0.95 | 0.429 | 0.951 | 5.19 | 2133 | 2.34e-07 |
| age | 0.0189 | 0.00154 | 0.95 | 0.0159 | 0.022 | 12.3 | 2133 | 2.02e-33 |
| sexFemale | -0.136 | 0.0515 | 0.95 | -0.237 | -0.0349 | -2.64 | 2133 | 0.00838 |
| sexNon-binary | -0.939 | 1.18 | 0.95 | -3.25 | 1.37 | -0.796 | 2133 | 0.426 |
| regionEastern Germany | 0.0301 | 0.0583 | 0.95 | -0.0842 | 0.144 | 0.517 | 2133 | 0.605 |
| trust | -0.47 | 0.0573 | 0.95 | -0.582 | -0.357 | -8.2 | 2133 | 4.09e-16 |
| contact | 0.223 | 0.0222 | 0.95 | 0.179 | 0.266 | 10 | 2133 | 3.67e-23 |
| afd | 1.82 | 0.102 | 0.95 | 1.62 | 2.02 | 17.9 | 2133 | 1.14e-66 |
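The quantities in this table can also be assembled by hand from `base R` building blocks, which is useful to know for debugging. A minimal sketch, using the built-in `mtcars` data as a stand-in since the ALLBUS file is not bundled here:

```r
# Illustrative model on built-in data (stand-in for linear_model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficients, standard errors, t- and p-values come from summary() ...
est <- summary(fit)$coefficients
# ... and the 95% confidence intervals from confint()
ci <- confint(fit, level = 0.95)

# Assemble a parameters-style table
param_table <- data.frame(
  Parameter   = rownames(est),
  Coefficient = est[, "Estimate"],
  SE          = est[, "Std. Error"],
  CI_low      = ci[, 1],
  CI_high     = ci[, 2],
  t           = est[, "t value"],
  p           = est[, "Pr(>|t|)"],
  row.names   = NULL
)
param_table
```

This is roughly what `model_parameters()` automates (plus labeling, formatting, and methods for many more model classes).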
--- ## Quick inspection using `base R` In this session, we use `base R` plots again, but only to quickly illustrate that you could use them for some basic model checks, too. .pull-left[ ```r par(mfrow = c(2, 2)) plot(linear_model) ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> ] --- ## Using `easystats` `easystats` provides some nice features for inspecting model fit. Here's a simple check for normality of the residuals. It is a combination of using a function from the `performance` package and then using `see` to plot it. .pull-left[ ```r library(performance) library(see) check_normality(linear_model) %>% plot() ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-3-1.png" width="75%" style="display: block; margin: auto;" /> ] --- ## QQ-Plot (`easystats`) You can also draw some classic QQ-plots for this purpose. .pull-left[ ```r check_normality(linear_model) %>% plot(type = "qq") ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-4-1.png" width="85%" style="display: block; margin: auto;" /> ] --- ## Heteroscedasticity (`easystats`) Something that is always a bit of an issue in regression analysis is heteroscedasticity. Again, it's straightforward to check for this using `performance` and `see`. .pull-left[ ```r check_heteroscedasticity( linear_model ) %>% plot() ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-5-1.png" width="85%" style="display: block; margin: auto;" /> ] --- ## There is more In the collection of `easystats` packages, there are [way more procedures](https://easystats.github.io/see/articles/performance.html) you can apply to check your models. 
To replicate the summary plot from `base R` with some nicer formatting, we can use the following function from the `performance` package: .pull-left[ ```r check_model(linear_model) ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-6-1.png" width="80%" style="display: block; margin: auto;" /> ] --- class: center, middle # [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_4_2_1_Plotting_Diagnostics.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_4_2_1_Plotting_Diagnostics.html) --- ## Getting to the substantive part After checking the model assumptions, and pretending everything's fine (it's not...), we will deal with the substantive part of our fictional research question. The traditional way of doing that is printing some regression tables and checking the individual coefficients. That's fine, but plotting is also nice. For that purpose, we will now turn to... - Coefficient Plots / Forest Plots - Prediction Plots --- ## Coefficient plot (`parameters`) `parameters` provides an easy method for creating a coefficient plot in one to two lines of code. .pull-left[ ```r library(parameters) model_parameters(linear_model) %>% plot() ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-7-1.png" width="95%" style="display: block; margin: auto;" /> ] --- ## Changing labels: it's `ggplot2`-based All of the plots produced by `parameters` and `see` are based on `ggplot2`. This means that, after calling the plotting functions from these packages, we can add new plot layers or manipulate existing ones using the usual `+`-structure. .pull-left[ ```r library(ggplot2) model_parameters(linear_model) %>% plot() + scale_y_discrete( labels = # in reversed order! c( "AfD Vote", "Contact", "Social Trust", "Region (Ref. Western Germany)", "Non-binary (Ref. Male)", "Female (Ref. Male)", "Age" ) ) ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ## Paint it black! 🖤 Thus, we can easily manipulate the appearance of the plots, e.g., by turning to a less colorful theme. .pull-left[ ```r library(ggplot2) model_parameters(linear_model) %>% plot() + scale_y_discrete( labels = # in reversed order! c( "AfD Vote", "Contact", "Social Trust", "Region (Ref. Western Germany)", "Non-binary (Ref. Male)", "Female (Ref. Male)", "Age" ) ) + scale_colour_grey(start = 0, end = 0) + guides(color = "none") ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-9-1.png" width="95%" style="display: block; margin: auto;" /> ] --- ## The advantage of prediction plots Regression coefficients are sometimes a bit hard to read - they are based on the scale of the independent variable when unstandardized - it is ambiguous what they actually mean substantially - for example, is a coefficient of .2 or .05 (practically) meaningful or not? Predictions are a solution - they basically draw the regression line into the scatter plot of the Y and X variables - we can read off which value of Y is expected at which level of X --- ## Predictions (`sjPlot`) For the purpose of plotting predictions, we will use the `plot_model()` function from the `sjPlot` package. In principle, it would also be possible with the `easystats` packages, but I find `sjPlot` easier to work with in this case... .pull-left[ ```r library(sjPlot) linear_model %>% plot_model( type = "pred", terms = "age" ) ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ## Customizing `sjPlot`s As for the `easystats` packages, the plots produced by `sjPlot` are `ggplot2`-based.
.pull-left[ ```r linear_model %>% plot_model( type = "pred", terms = "age" ) + xlab("Age") + ylab("Xenophobia") + ggtitle("Linear Model") + theme_bw() ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-11-1.png" width="95%" style="display: block; margin: auto;" /> ] --- ## Interaction effects Let's now turn to a possible interaction between age and contact. Does contact with foreigners moderate the relationship between age and xenophobia? ```r linear_model_interaction <- lm( xenophobia ~ age + sex + region + trust + contact + afd + age:contact, data = allbus_2021_dvII ) model_parameters(linear_model_interaction) ```
| Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 1.68 | 0.406 | 0.95 | 0.885 | 2.48 | 4.14 | 2132 | 3.61e-05 |
| age | 0.000125 | 0.00744 | 0.95 | -0.0145 | 0.0147 | 0.0168 | 2132 | 0.987 |
| sexFemale | -0.14 | 0.0515 | 0.95 | -0.24 | -0.0386 | -2.71 | 2132 | 0.00675 |
| sexNon-binary | -0.921 | 1.18 | 0.95 | -3.23 | 1.39 | -0.783 | 2132 | 0.434 |
| regionEastern Germany | 0.0221 | 0.0583 | 0.95 | -0.0922 | 0.136 | 0.379 | 2132 | 0.704 |
| trust | -0.463 | 0.0573 | 0.95 | -0.575 | -0.351 | -8.08 | 2132 | 1.05e-15 |
| contact | 0.0587 | 0.0673 | 0.95 | -0.0732 | 0.191 | 0.873 | 2132 | 0.383 |
| afd | 1.82 | 0.102 | 0.95 | 1.62 | 2.02 | 17.9 | 2132 | 9.32e-67 |
| age:contact | 0.00305 | 0.00118 | 0.95 | 0.000734 | 0.00536 | 2.58 | 2132 | 0.00985 |
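Before plotting, it can help to see what the interaction term implies numerically: with an `age:contact` term, the slope of age is no longer a single number but depends on the level of contact. A quick sketch using the coefficients from the table above:

```r
# Implied slope of age at different levels of contact:
# d(xenophobia) / d(age) = b_age + b_age_contact * contact
b_age         <- 0.000125  # coefficient of age (from the table above)
b_age_contact <- 0.00305   # coefficient of age:contact

contact_levels <- c(4, 6, 8)
age_slopes <- b_age + b_age_contact * contact_levels
round(age_slopes, 4)
#> [1] 0.0123 0.0184 0.0245
```

So, in this model, the association between age and xenophobia becomes stronger at higher levels of contact, which is exactly what the interaction plot on the next slide visualizes.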
--- ## Plotting interaction effects Regression parameters for interactions are particularly difficult to interpret. Again, `sjPlot` can help in this regard. .pull-left[ ```r linear_model_interaction %>% plot_model( type = "int" ) + xlab("Age") + ylab("Xenophobia") + ggtitle("Interaction Model") + theme_bw() ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-12-1.png" width="95%" style="display: block; margin: auto;" /> ] --- ## Sidenote: colors again Here's an example of the different color types in `ggplot2` (`color` and `fill`) and how you can manipulate them. .pull-left[ ```r linear_model_interaction %>% plot_model( type = "int" ) + xlab("Age") + ylab("Xenophobia") + ggtitle("Interaction Model") + theme_bw() + scale_color_manual( values = c("red", "blue"), labels = c("Little", "A lot"), name = "Contact" ) + scale_fill_manual( values = rep("grey", 3) ) ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-13-1.png" width="95%" style="display: block; margin: auto;" /> ] --- class: center, middle # [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_4_2_2_Plotting_a_Regression.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_4_2_2_Plotting_a_Regression.html) --- ## Estimating multiple models Something you may often do in your research is estimating multiple models because of - changing covariates or dependent variables - different model specifications We can compare models side by side using regression tables, but by now we should know: plotting is better. --- ## Two more example regressions We estimate two versions of our linear model: a plain one including only the main predictors of interest and one that additionally includes the control variables.
```r linear_model_plain <- lm(xenophobia ~ trust + contact + afd, data = allbus_2021_dvII) linear_model_controls <- lm(xenophobia ~ age + sex + region + trust + contact + afd, data = allbus_2021_dvII) model_parameters(linear_model_plain) ```
| Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 1.11 | 0.127 | 0.95 | 0.859 | 1.36 | 8.71 | 2151 | 6.12e-18 |
| trust | -0.454 | 0.0594 | 0.95 | -0.57 | -0.338 | -7.65 | 2151 | 3.03e-14 |
| contact | 0.309 | 0.0204 | 0.95 | 0.269 | 0.349 | 15.2 | 2151 | 2.12e-49 |
| afd | 1.8 | 0.104 | 0.95 | 1.6 | 2.01 | 17.3 | 2151 | 7.84e-63 |
```r model_parameters(linear_model_controls) ```
| Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.69 | 0.133 | 0.95 | 0.429 | 0.951 | 5.19 | 2133 | 2.34e-07 |
| age | 0.0189 | 0.00154 | 0.95 | 0.0159 | 0.022 | 12.3 | 2133 | 2.02e-33 |
| sexFemale | -0.136 | 0.0515 | 0.95 | -0.237 | -0.0349 | -2.64 | 2133 | 0.00838 |
| sexNon-binary | -0.939 | 1.18 | 0.95 | -3.25 | 1.37 | -0.796 | 2133 | 0.426 |
| regionEastern Germany | 0.0301 | 0.0583 | 0.95 | -0.0842 | 0.144 | 0.517 | 2133 | 0.605 |
| trust | -0.47 | 0.0573 | 0.95 | -0.582 | -0.357 | -8.2 | 2133 | 4.09e-16 |
| contact | 0.223 | 0.0222 | 0.95 | 0.179 | 0.266 | 10 | 2133 | 3.67e-23 |
| afd | 1.82 | 0.102 | 0.95 | 1.62 | 2.02 | 17.9 | 2133 | 1.14e-66 |
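Before comparing the two models visually, it can be instructive to compute a few fit indices by hand. A base-R sketch of the kind of comparison that `performance::compare_performance()` automates, again using the built-in `mtcars` data as a stand-in for our two ALLBUS models:

```r
# Two nested example models (stand-ins for the plain/controls models)
m_plain    <- lm(mpg ~ wt, data = mtcars)
m_controls <- lm(mpg ~ wt + hp, data = mtcars)

# A few common fit indices for a fitted lm object
fit_stats <- function(model) {
  c(
    R2   = summary(model)$r.squared,
    AIC  = AIC(model),
    BIC  = BIC(model),
    RMSE = sqrt(mean(residuals(model)^2))  # root mean squared error
  )
}

round(rbind(plain = fit_stats(m_plain), controls = fit_stats(m_controls)), 2)
```

One caveat worth remembering: such comparisons assume both models are fitted to the same observations, which listwise deletion of missing values can silently violate.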
--- ## Comparing model fits (`performance`) The `performance` package gives you a nice spider plot for comparing the performance of models visually. .pull-left[ ```r library(performance) compare_performance( linear_model_plain, linear_model_controls ) %>% plot() ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ## Comparing model parameters (`sjPlot`) To compare coefficients from different regression models, we can also use `sjPlot`. This time, instead of using `plot_model()`, we use `plot_models()`. .pull-left[ ```r plot_models( linear_model_plain, linear_model_controls, std.est = "std" ) ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-15-1.png" width="85%" style="display: block; margin: auto;" /> ] --- ## Adjusting it within the function The nice thing about `sjPlot` is that it provides a lot of options for customizing the plot. For example, to change the legend and axis labels, we can use the `m.labels` and `axis.labels` arguments. .pull-left[ ```r plot_models( linear_model_plain, linear_model_controls, std.est = "std", legend.title = "Models", m.labels = c( "Without controls", "With controls" ), axis.labels = c( "Region (Ref. Western Germany)", "Non-binary (Ref. Male)", "Female (Ref. Male)", "Age", "AfD Vote", "Contact", "Social Trust" ) ) ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-16-1.png" width="95%" style="display: block; margin: auto;" /> ] --- ## More customization As `sjPlot` is also based on `ggplot2`, we can easily add further layers to our plot, such as a theme. .pull-left[ ```r library(ggplot2) plot_models( linear_model_plain, linear_model_controls, std.est = "std", legend.title = "Models", m.labels = c( "Without controls", "With controls" ), axis.labels = c( "Region (Ref. Western Germany)", "Non-binary (Ref. Male)", "Female (Ref. Male)", "Age", "AfD Vote", "Contact", "Social Trust" ) ) + theme_bw() ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-17-1.png" width="85%" style="display: block; margin: auto;" /> ] --- ## Plotting predictions from multiple models Neither the `easystats` packages nor `sjPlot` provides an easy way of plotting predictions from multiple models. For this purpose, we have to dig a bit deeper into the wrangling of the underlying data. Fortunately, `sjPlot` does support extracting the underlying prediction data from a model. With these data at hand, we can simply create a data frame comprising predictions from multiple models. --- ## Extracting the predictions The procedure to extract predictions from a model is similar to using the plotting command, but instead of using `plot_model()`, we use `get_model_data()`. ```r linear_predictions <- get_model_data( linear_model_plain, term = "contact", type = "pred" ) linear_predictions ```
| x | predicted | std.error | conf.low | conf.high | group | group_col |
|---|---|---|---|---|---|---|
| 4 | 2.34 | 0.0491 | 2.25 | 2.44 | 1 | 1 |
| 5 | 2.65 | 0.0338 | 2.59 | 2.72 | 1 | 1 |
| 6 | 2.96 | 0.0264 | 2.91 | 3.01 | 1 | 1 |
| 7 | 3.27 | 0.033 | 3.21 | 3.34 | 1 | 1 |
| 8 | 3.58 | 0.048 | 3.49 | 3.67 | 1 | 1 |
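Under the hood, a table like this can be reproduced with `base R`'s `predict()`: build a grid over the focal variable, hold the other covariates at fixed values, and request confidence intervals. A sketch with the built-in `mtcars` data as a stand-in, since the ALLBUS file is not bundled here:

```r
# Illustrative model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Grid over the focal variable; the other covariate held at its mean
grid <- data.frame(wt = 2:5, hp = mean(mtcars$hp))

# Predicted values with 95% confidence intervals
pred <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
cbind(grid["wt"], round(pred, 2))  # columns: wt, fit, lwr, upr
```

This is essentially what `get_model_data(..., type = "pred")` does for us, with the added convenience of sensible defaults for the grid and the covariate values.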
--- ## More predictions We can extract the predictions from the previous two models the same way. .pull-left[ ```r predictions_plain <- get_model_data( linear_model_plain, term = "contact", type = "pred" ) ``` ] .pull-right[ ```r predictions_controls <- get_model_data( linear_model_controls, term = "contact", type = "pred" ) ``` ] --- ## Combining the predictions .pull-left[ We combine the data using simple data wrangling operations. - `bind_rows()` is a function from `dplyr` for appending rows from one data set to another - there's also `rbind()` in `base R` - we use `mutate()` to create an indicator to know which row belongs to what model ] .pull-right[ ```r library(dplyr) library(tibble) all_predictions <- bind_rows( predictions_plain %>% mutate( model = "Without Controls" ), predictions_controls %>% mutate( model = "With Controls" ) ) %>% as_tibble() ``` ] --- ## Plotting them with `facet_wrap()` To create multiple prediction plots with a subplot for each model we use barebones `ggplot2`. Fortunately, this is not more complex than using `easystats` packages or `sjPlot`. .pull-left[ ```r predictions_simple <- ggplot( all_predictions, aes(x = x, y = predicted) ) + geom_line() + geom_line(aes(y = conf.high), linetype = "dashed") + geom_line(aes(y = conf.low), linetype = "dashed") + facet_wrap(~model) + theme_bw() predictions_simple ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-18-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ## Average marginal effects (AME) In the previous session, we also (briefly) discussed average marginal effects. - parameter estimates based on the 'real' empirical values of all other covariates - handy when relationships are non-linear - way easier to interpret than usual regression coefficients - also nice when using predictions --- ## Extracting AME We can also use the `marginaleffects` package and extract prediction information for our models. 
While the resulting column names are different, the `predictions()` function from that package also gives us information about X-values, estimates, and confidence bands. ```r library(marginaleffects) plain_AME <- predictions(linear_model_plain, newdata = datagrid(contact = 1:4)) controls_AME <- predictions(linear_model_controls, newdata = datagrid(contact = 1:4)) ``` --- ## Combining AME data After extracting the information, we can simply combine the two data sets again to get predictions for all included models. ```r all_AME <- bind_rows( plain_AME %>% mutate(model = "Without Controls"), controls_AME %>% mutate(model = "With Controls") ) %>% as_tibble() ``` --- ## Plotting predictions based on AME Plotting them can then be done the same way as before. .pull-left[ ```r predictions_AME <- ggplot( all_AME, aes(x = contact, y = predicted) ) + geom_line() + geom_line(aes(y = conf.high), linetype = "dashed") + geom_line(aes(y = conf.low), linetype = "dashed") + facet_wrap(~model) + theme_bw() predictions_AME ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-19-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ## Combining them with previous predictions Using the `patchwork` package again, we can even plot the 'simple' predictions together with the AME ones in one plot. *Note*: In principle, this would also have been possible by combining the data. .pull-left[ ```r library(patchwork) (predictions_simple + ggtitle("Simple Predictions") + ylab("Predicted Xenophobia Value") + xlab("Contact") + ylim(4, 11)) / (predictions_AME + ggtitle("AME Predictions") + ylab("Predicted Xenophobia Value") + xlab("Contact") + ylim(4, 11)) ``` ] .pull-right[ <img src="data:image/png;base64,#4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-20-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## What is more? 
Getting to a publication-ready plot can sometimes be a long journey. Fortunately, the `easystats` packages and `sjPlot` shorten it to a comfortable walking distance. There are also other packages that can facilitate the goal of creating publication-ready plots, such as the [`dotwhisker`](https://cran.r-project.org/web/packages/dotwhisker/) package for coefficient plots. Things can get messy if we have to work with statistical models that are not supported by the packages presented here, for example, AME models or results produced with other statistical software like *Stata* (for an example, you can have a look at [my code](https://stefanjuenger.github.io/land_use_disadvantages/Land_Use_Disadvantages_Main_Analysis.html) for a [recent paper](https://doi.org/10.1177/00420980211023206)). --- class: center, middle # [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_4_2_3_Combining_Predictions.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_4_2_3_Combining_Predictions.html) --- ## Extracurricular activities Explore the data from the *ALLBUS/GGSS 2021* a bit further and think about some confirmatory analyses you would find interesting for these data. Also think about how you would (want to) visualize your results. If you have the time and motivation, feel free to actually do some of the analyses and visualizations you find interesting, building on the code from today's lectures and exercises.