class: center, middle, inverse, title-slide

.title[
# Introduction to R for Data Analysis
]
.subtitle[
## Data Wrangling - Part 2
]
.author[
### Johannes Breuer, Stefan Jünger, & Veronika Batzdorfer
]
.date[
### 2022-08-16
]

---
layout: true

---

## Data wrangling continued 🤠

While in the last session we focused on changing the structure of our data by **selecting**, **renaming**, and **relocating** columns and **filtering** and **arranging** rows, in this part we will focus on altering the content of data sets by *adding* and *changing* variables and variable values.

More specifically, we will deal with...

- creating and computing new variables (in various ways)

- recoding the values of a variable

- dealing with missing values

---

## Setup

If you have not yet done so in your current `R` session, first load the required libraries for this part.

```r
library(sjlabelled)
library(tidyverse)
library(haven)
```

---

## Setup

As in the previous session (as well as most of the following ones), we will use the fresh *ALLBUS* data from 2021. In case you have not already done so, before you can (continue to) wrangle the data, you need to import them.

As working with labelled data can be a bit tedious when wrangling them, for the sake of simplicity (and because they are in German), we will remove all labels from the data set using the `remove_all_labels()` function from the `sjlabelled` package. However, we want to keep codes for missing values (more on that later), so we need to set the `user_na` argument of the `read_sav()` function to `TRUE`.

```r
allbus_2021 <- read_sav("./data/allbus_2021/ZA5280_v1-0-0.sav",
                        user_na = TRUE) %>% 
  remove_all_labels()
```

.small[
*Note*: The code assumes that the `.sav` file containing the 2021 *ALLBUS* data is saved in subfolder `\data\allbus_2021` within the course materials folder (which should then also be your working directory for the code to run).
]

---

## Join the World Wrangling Federation

<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2022\content\img\ready-wrangle.png" width="50%" style="display: block; margin: auto;" />

---

## Creating & transforming variables

The simplest case of adding a new variable is creating a constant. You might, e.g., want to do that to add information about the year in which data were collected. This is how you can do this in `base R`:

```r
allbus_2021$year <- 2021

head(allbus_2021$year)
```

```
## [1] 2021 2021 2021 2021 2021 2021
```

*Note*: By default, new variables are added after the last column in the data set.

---

## Creating & transforming variables

Another simple variable transformation is adding or subtracting a constant, which, in `base R`, you can do as follows:

```r
allbus_2021$sex_new <- allbus_2021$sex - 1

table(allbus_2021$sex, allbus_2021$sex_new)
```

```
##     
##       -10    0    1    2
##   -9   20    0    0    0
##   1     0 2614    0    0
##   2     0    0 2705    0
##   3     0    0    0    3
```

---

## Creating & transforming variables

We can also add new variables by changing the data type of an existing variable. The `base R` way of doing this is the following:

```r
allbus_2021$id_char <- as.character(allbus_2021$respid)

typeof(allbus_2021$respid)
```

```
## [1] "double"
```

```r
typeof(allbus_2021$id_char)
```

```
## [1] "character"
```

*Note*: In case you want to overwrite a variable, you can do so by giving the new variable the same name as the old one.

---

## Creating & transforming variables

The `dplyr` package provides a very versatile function for creating and transforming variables: `mutate()`, which you can also use to create a new variable that is a constant, ...

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(year = 2021)

allbus_2021 %>% 
  select(year) %>% 
  head()
```

```
##   year
## 1 2021
## 2 2021
## 3 2021
## 4 2021
## 5 2021
## 6 2021
```

---

## Creating & transforming variables

... applies a simple transformation to an existing variable, ...
```r
allbus_2021 <- allbus_2021 %>% 
  mutate(sex_new = sex - 1)

allbus_2021 %>% 
  select(starts_with("sex")) %>% 
  head()
```

```
##   sex sex_new
## 1   2       1
## 2   1       0
## 3   2       1
## 4   1       0
## 5   2       1
## 6   1       0
```

---

## Creating & transforming variables

... or changes the data type of an existing variable.

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(id_char = as.character(respid))

allbus_2021 %>% 
  select(respid, id_char) %>% 
  glimpse()
```

```
## Rows: 5,342
## Columns: 2
## $ respid  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22~
## $ id_char <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "1~
```

---

## `dplyr::mutate()`

<img src="data:image/png;base64,#https://github.com/allisonhorst/stats-illustrations/blob/main/rstats-artwork/dplyr_mutate.png?raw=true" width="60%" style="display: block; margin: auto;" />

<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---

## Creating & transforming variables

Notably, however, `mutate()` can be used for much more complex variable transformations. We will go through several of those in the following.

One situation in which we might want to transform variables or create new ones, e.g., is when we want to recode their values.

*Note*: We could, of course, also do this in `base R`, but the code for that can get quite convoluted.

---

## Recoding values

Say, for example, we want to recode the item on political interest from the *ALLBUS* 2021, so that higher values represent stronger interest. For that purpose, we can combine the two `dplyr` functions `mutate()` and `recode()`.
.small[
```r
allbus_2021 <- allbus_2021 %>% 
  mutate(pa02a_R = recode(pa02a,
                          `5` = 1, # `old value` = new value
                          `4` = 2,
                          `2` = 4,
                          `1` = 5))

table(allbus_2021$pa02a, allbus_2021$pa02a_R)
```

```
##      
##        -42   -9    1    2    3    4    5
##   -42    8    0    0    0    0    0    0
##   -9     0   16    0    0    0    0    0
##   1      0    0    0    0    0    0  579
##   2      0    0    0    0    0 1528    0
##   3      0    0    0    0 2470    0    0
##   4      0    0    0  585    0    0    0
##   5      0    0  156    0    0    0    0
```
]

---

## Excursus: Harmonization

If we (want to) work with data sets from multiple sources (or also just multiple waves within the same survey program), it may be that the same constructs are measured differently. What we can do in this case is harmonize the different measures.

A helpful tool for this is the *GESIS* service [*QuestionLink*](https://www.gesis.org/en/services/processing-and-analyzing-data/data-harmonization/question-link). *QuestionLink* also offers `R` code (a `base R` as well as a `dplyr` version) for harmonizing various constructs (including political interest) across different survey programs (including the *ALLBUS*).

---

## Harmonization example

Let's say we want to recode the 5-point political interest measure from the *ALLBUS*, so that it matches the 4-point measure from the [*European Values Study* (EVS)](https://europeanvaluesstudy.eu/). We can download [an `HTML` document containing the code we need](https://osf.io/f2uta/download) via the *QuestionLink* website, and then use the code to recode the variable.

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(polint_evs = recode(pa02a,
                             `1` = 0.68,
                             `2` = 1.31,
                             `3` = 2.13,
                             `4` = 3.22,
                             `5` = 4.15))
```

---
class: center, middle

# [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_2_2_1_Create_Transform_Vars.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_2_2_1_Create_Transform_Vars.html)

---

## Missing values

A particular reason why we may want to recode specific values of one or multiple variables is if we have missing data in our data set.
Most of the real data sets we work with have missing data. As the data can be missing for various reasons, we often use codes (and labels) to distinguish between different types of missing data.

---

## Missing value codes

If you look at the codebooks for ALLBUS data sets (the 2021 edition does not yet have one), you will see that there are quite a few codes for missing data. While the missing codes are consistent, typically only a few of them are used for each variable.

We have also already seen examples of this on the previous slides when we computed new variables based on existing ones.

---

## Short excursus: Exploring missing value codes in the *ALLBUS* 2021 data

If we want to explore missing value codes within the data set, we need to import and keep the labels.

```r
allbus_2021_labelled <- read_sav("./data/allbus_2021/ZA5280_v1-0-0.sav")
```

*Note*: Currently, only a German version of the data set is available (English translations are produced later on), so the labels for missing values (as well as other values) will be in German.

---

## Short excursus: Exploring missing value codes in the *ALLBUS* 2021 data

To print all value labels, including those used for missing values, we can make use of the function `print_labels()` from the `haven` package.

```r
print_labels(allbus_2021_labelled$pa02a)
```

```
## 
## Labels:
##  value            label
##    -42 DATENFEHLER: MFN
##     -9     KEINE ANGABE
##      1       SEHR STARK
##      2            STARK
##      3           MITTEL
##      4            WENIG
##      5 UEBERHAUPT NICHT
```

---

## Missing values in `R`

In `R`, missing values are represented by `NA`. `NA` is a reserved term in `R`, meaning that you cannot use it as a name for anything else (this is also the case for `TRUE` and `FALSE`).

**NB**: If we use the `haven` function `read_sav()` for importing *SPSS* data files, by default, user-defined missings are converted to `NA`. If we want to keep the original values, we need to change the argument `user_na` within that function to `TRUE` (the default is that it is set to `FALSE`).
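---

## Missing values in `R`

To illustrate how `NA` behaves, here is a minimal sketch using a made-up vector `x` (not taken from the *ALLBUS* data). It only relies on `base R` functions.

```r
# hypothetical toy vector with one missing value
x <- c(1, NA, 3)

is.na(x)               # FALSE TRUE FALSE: element-wise check for missing values
x == NA                # NA NA NA: comparisons with NA never yield TRUE or FALSE
mean(x)                # NA: missing values propagate through computations
mean(x, na.rm = TRUE)  # 2: missing values are dropped before computing the mean
```

This is why we use `is.na()` (and not `== NA`) to identify missing values, and why many functions offer an `na.rm` argument.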
---

## Wrangling missing values

When we prepare our data for analysis, there are generally two things we might want/have to do with regard to missing values:

- define specific values as missings (i.e., set them to `NA`)

- recode `NA` values into something else (typically to distinguish between different types of missing values)

---

## Recode values as `NA`

With `base R` you can set values to `NA` for specific variables as follows:

.small[
```r
sum(is.na(allbus_2021$pa02a))
```

```
## [1] 0
```

```r
allbus_2021$pa02a[allbus_2021$pa02a == -42] <- NA
allbus_2021$pa02a[allbus_2021$pa02a == -9] <- NA

sum(is.na(allbus_2021$pa02a))
```

```
## [1] 24
```
]

---

## Recode values as `NA`

The `tidyverse` option for setting specific values of individual variables to `NA` is the `dplyr` function `na_if()` combined with `mutate()`.

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(pa02a = na_if(pa02a, -42)) %>% 
  mutate(pa02a = na_if(pa02a, -9))
```

---

## Recode values as `NA`

The `na_if()` function can also be used to recode specific values as `NA` for all (numeric) variables in a data set. As `na_if()` works on vectors, we apply it to each column with the `dplyr` function `across()`.

```r
# passing a whole data frame to na_if() only worked in dplyr versions before 1.1.0,
# so we apply it column by column to all numeric variables
allbus_2021 <- allbus_2021 %>% 
  mutate(across(where(is.numeric), ~na_if(.x, -42))) %>% 
  mutate(across(where(is.numeric), ~na_if(.x, -9)))
```

*Note*: `na_if()` only takes single values as its second argument (i.e., the value to replace with `NA`).

---

## Recode values as `NA`

While `na_if()` can be applied to a specified selection of variables if combined with `across()`, a `dplyr` function that we will cover in a bit, the `base R` and `tidyverse` options for recoding values as `NA` are somewhat difficult to use when they should be used for a selection or range of many different values.
There are, however, functions from two other packages that come in handy here:

- `set_na()` from the [`sjlabelled` package](https://strengejacke.github.io/sjlabelled/index.html)

- `replace_with_na()` and its scoped variants, such as `replace_with_na_all()`, from the [`naniar` package](http://naniar.njtierney.com/index.html) 🦁

---

## `set_na()` from `sjlabelled`

An easy-to-use option for recoding values to `NA` (for individual variables or full data frames) is the function `set_na()` from the `sjlabelled` package.

```r
library(sjlabelled)

allbus_2021 <- allbus_2021 %>% 
  set_na(na = c(-6, -7, -8, -9, -10))
```

*Note*: The `set_na()` function can also be used to replace different values with `NA` for different variables.

---

## The missings of `naniar` 🦁

The `naniar` package provides many useful functions for handling missing data in `R` (and works very well in combination with the `tidyverse`). For example, we can use the function `replace_with_na_all()` to code every value in our data set that is < 0 as `NA`.

```r
library(naniar)

allbus_2021 <- allbus_2021 %>% 
  replace_with_na_all(condition = ~.x < 0)
```

Using the functions `replace_with_na_at()` and `replace_with_na_if()`, we can also recode values as `NA` for a selection or specific type of variables (e.g., all numeric variables).

---

## When to deal with missing values?

While, as the previous examples should have shown, you can include the handling of missing values as part of your data wrangling, the simpler option can be to deal with them already in the data import step.

As we have seen, many data import functions, such as `read_csv()` or `read_sav()`, include arguments that can be used to indicate what values should be specified as `NA`.

However, this is only the (potentially) more comfortable option if the values that should be treated as missing are the same across all variables in the data set.

---

## Dealing with missing values in `R`

As with everything in `R`, there are also many online resources on dealing with missing data.
A fairly new and interesting one is the [chapter on missing values in the work-in-progress 2nd edition of *R for Data Science*](https://r4ds.hadley.nz/missing-values.html).

There also are various packages for different imputation techniques. A popular one is the [`mice` package](https://amices.org/mice/). However, we won't cover the topic of imputation in this course.

---

## Excluding cases with missing values

For some analyses it may make sense (or even be necessary) to exclude cases with missing values on one or more variables.

To demonstrate this, it is helpful to use a version of the data set in which the *SPSS* user-defined missings are set to `NA`. This is the case for the data set with labels that we imported before, as we did not set the `read_sav()` argument `user_na` to `TRUE` there.

---

## Excluding cases with missing values

If you want to exclude observations with missing values for individual variables, you can use `!is.na(variable_name)` with your filtering method of choice. However, there are also methods for only keeping complete cases (i.e., cases without missing data). The `base R` function for that is `na.omit()`.

```r
allbus_2021_complete <- na.omit(allbus_2021_labelled)

nrow(allbus_2021_complete)
```

```
## [1] 0
```

*NB*: Of course, the number of excluded/included cases depends on how you have defined your missing values before.

---

## Excluding cases with missing values

The `tidyverse` equivalent of `na.omit()` is `drop_na()` from the `tidyr` package. You can use this function to remove cases that have missings on any variable in a data set or only on specific variables.

```r
allbus_2021_labelled %>% 
  drop_na() %>% 
  nrow()
```

```
## [1] 0
```

```r
allbus_2021_labelled %>% 
  drop_na(pv01) %>% 
  nrow()
```

```
## [1] 4026
```

*NB*: Of course, the number of excluded/included cases depends on how you have defined your missing values before.
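---

## Excluding cases with missing values

As the *ALLBUS* data contain no complete cases, the differences between these options are easier to see with a small made-up data frame. The following is only a sketch; `toy`, `var1`, and `var2` are hypothetical names.

```r
library(tidyr)

# hypothetical toy data (not from the ALLBUS)
toy <- data.frame(
  id   = 1:4,
  var1 = c(2, NA, 4, 1),
  var2 = c(NA, 3, 5, 2)
)

na.omit(toy)             # complete cases only: ids 3 and 4
drop_na(toy)             # tidyverse equivalent: ids 3 and 4
drop_na(toy, var1)       # only considers var1: ids 1, 3, and 4
toy[!is.na(toy$var2), ]  # base R filter on a single variable: ids 2, 3, and 4
```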
---

## Recode `NA` into something else

An easy option for replacing `NA` with another value for a single variable is the `replace_na()` function from the `tidyr` package in combination with `mutate()`.

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(pa02a = replace_na(pa02a, -99))
```

**NB**: This particular example does not make much sense (so you should probably not execute this code). You can, however, specify different values for different types of missing values. To do this, you probably need to make the recoding dependent on (values in) other variables.

---
class: center, middle

# [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_2_2_2_Missing_Values.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_2_2_2_Missing_Values.html)

---

## Conditional variable transformation

Sometimes, things are a bit more complicated when it comes to creating new variables. Simple recoding can be insufficient when we need to make the values of a new variable conditional on values of (multiple) other variables. Such cases require conditional transformations.

---

## Simple conditional transformation

The simplest version of a conditional variable transformation is using an `ifelse()` statement.

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(sex_char = ifelse(sex == 1, "male", "female"))

allbus_2021 %>% 
  select(sex, sex_char) %>% 
  sample_n(5) # randomly sample 5 cases from the df
```

```
##   sex sex_char
## 1   1     male
## 2   2   female
## 3   2   female
## 4   1     male
## 5   2   female
```

.small[
*Note*: A more versatile option for creating dummy variables is the [`fastDummies` package](https://jacobkap.github.io/fastDummies/).
]

---

## Advanced conditional transformation

For more flexible (or complex) conditional transformations, the `case_when()` function from `dplyr` is a powerful tool.
```r
allbus_2021 <- allbus_2021 %>% 
  mutate(pol_view_cat = case_when(
    between(pa01, 0, 3) ~ "left",
    between(pa01, 4, 7) ~ "center",
    pa01 > 7 ~ "right"
  ))

allbus_2021 %>% 
  select(pa01, pol_view_cat) %>% 
  sample_n(5)
```

```
##   pa01 pol_view_cat
## 1    6       center
## 2    5       center
## 3    5       center
## 4    1         left
## 5    5       center
```

---

## `dplyr::case_when()`

A few things to note about `case_when()`:

- you can have multiple conditions per value

- conditions are evaluated consecutively

- when none of the specified conditions are met for an observation, by default, the new variable will have a missing value `NA` for that case

- if you want some other value in the new variables when the specified conditions are not met, you need to add `TRUE ~ value` as the last argument of the `case_when()` call

- to explore the full range of options for `case_when()` check out its [online documentation](https://dplyr.tidyverse.org/reference/case_when.html) or run `?case_when()` in `R`/*RStudio*

---

## `dplyr::case_when()`

<img src="data:image/png;base64,#https://github.com/allisonhorst/stats-illustrations/blob/main/rstats-artwork/dplyr_case_when.png?raw=true" width="95%" style="display: block; margin: auto;" />

<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---

## Applying the same transformation(s) to multiple variables

The `dplyr` package provides a handy tool for applying transformations, such as recoding values or specifying missing values across a set of variables: `across()`.

---

## Specify missing values across a selection of variables

In the following example, we want to define the same missing values for all variables assessing trust in public institutions.

```r
allbus_2021 <- allbus_2021 %>%
  mutate(
    across(pt01:pt20,
           ~set_na(.x, na = c(-42, -11, -9))))
```

---

## Recode values `across()` defined variables

We can also use `across()` to recode multiple variables.
Here, we want to recode the items measuring trust so that they reflect distrust instead. In this case, we probably want to create new variables. We can do so by using the `.names` argument of the `across()` function (for details, check the help file for the function).

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(
    across(
      pt01:pt20,
      ~recode(
        .x,
        `7` = 1, # `old value` = new value
        `6` = 2,
        `5` = 3,
        `3` = 5,
        `2` = 6,
        `1` = 7
      ),
      .names = "{.col}_R"))
```

---

## Other options for using `across()`

The `across()` function allows you to do (and can facilitate) quite a few things when it comes to variable transformation and creation. For example, it can be used with logical conditions (such as `is.numeric()`) or the `dplyr` selection helpers we encountered in the previous session (such as `starts_with()`) to apply transformations to variables of a specific type or meeting some other criteria (as well as all variables in a data set).

To explore more options, you can check the [documentation for the `across()` function](https://dplyr.tidyverse.org/reference/across.html).

---

## `dplyr::across()`

<img src="data:image/png;base64,#https://github.com/allisonhorst/stats-illustrations/blob/main/rstats-artwork/dplyr_across.png?raw=true" width="95%" style="display: block; margin: auto;" />

<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---

## Aggregate variables

Something we might want to do as part of our data wrangling is to create aggregate variables, such as sum or mean scores based on a set of items.

What is important to keep in mind here is that `dplyr` operations are applied per column. This is a common source of confusion and errors, as what we want to do in the case of creating aggregate variables requires transformations to be applied per row (respondent).
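---

## Aggregate variables

The difference between column-wise and row-wise computation can be seen in a minimal sketch with made-up data. The names `toy`, `item1`, and `item2` are hypothetical and not part of the *ALLBUS*.

```r
library(dplyr)

# hypothetical toy data: three respondents, two items
toy <- tibble(item1 = c(1, 2, 3),
              item2 = c(3, 4, 5))

toy_means <- toy %>% 
  mutate(
    # mean() pools all values of both columns: the same grand mean (3) in every row
    mean_wrong = mean(c(item1, item2)),
    # arithmetic on columns is vectorised, so this is computed per row: 2, 3, 4
    mean_right = (item1 + item2) / 2
  )

toy_means
```

The first variant runs without an error or warning, which is what makes this mistake easy to miss.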
---

## Aggregate variables

The most common type of aggregate variables are sum and mean scores.<sup>1</sup> An easy way to create those is combining the `base R` functions `rowSums()` and `rowMeans()` with `across()` from `dplyr`.

.small[
.footnote[
[1] Of course, `R` offers many other options for dimension reduction, such as PCA, factor analysis, etc. However, we won't cover those in this course.
]
]

---

## Mean score

In this example, we create a mean score for trust in various institutions.

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(mean_trust = rowMeans(across(pt01:pt20),
                               na.rm = TRUE))
```

---

## Sum score

In the same manner, we can also create sum scores. Let's say we want to do this for the questions assessing in which contexts respondents have contact with immigrants. In this case, as we have not done so before, we first need to specify the right values as missing. For the sum score to be easier to interpret, we also want to recode the values, so that 0 means no and 1 means yes.

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(
    across(mc01:mc04,
           ~set_na(.x, na = c(-42, -11, -10, -9)))) %>% 
  mutate(across(mc01:mc04,
                ~recode(.x, `2` = 0)))
```

---

## Sum score

Now we can compute the sum score.

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(sum_contact = rowSums(across(mc01:mc04)))
```

---

## More options for aggregate variables

If you want to use other functions than just `mean()` or `sum()` for creating aggregate variables, you need to use the [`rowwise()` function from `dplyr`](https://dplyr.tidyverse.org/articles/rowwise.html) in combination with [`c_across()`](https://dplyr.tidyverse.org/reference/c_across.html), which is a special variant of the `dplyr` function `across()` for row-wise operations/aggregations.
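---

## More options for aggregate variables

As a rough sketch of how `rowwise()` and `c_across()` work together, the following example computes the maximum and the median across three items for each respondent. The data and all variable names are made up for illustration.

```r
library(dplyr)

# hypothetical toy data: three respondents, three items
toy <- tibble(id    = 1:3,
              item1 = c(1, 5, 2),
              item2 = c(4, 2, 2),
              item3 = c(3, NA, 6))

toy_agg <- toy %>% 
  rowwise() %>%   # treat every row as its own group
  mutate(
    max_item    = max(c_across(item1:item3), na.rm = TRUE),    # 4, 5, 6
    median_item = median(c_across(item1:item3), na.rm = TRUE)  # 3, 3.5, 2
  ) %>% 
  ungroup()       # return to regular column-wise behavior

toy_agg
```

Calling `ungroup()` at the end is advisable, as all following `dplyr` operations would otherwise also be applied row by row.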
---
class: center, middle

# [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_2_2_3_Across_Aggregate.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_2_2_3_Across_Aggregate.html)

---

## Outlook: Other variable types

In the examples in this session, we almost exclusively worked with numeric variables. There are, however, other variable types that occur frequently in data sets in the social sciences:

- factors

- strings

- time and dates

Working with strings in `R` is a topic that would require its own workshop, and the same is essentially true for time and dates. Hence, we will only briefly discuss the basics of factors in this session (also because we will meet them again in the following session).

---

## Factors

Factors are a special type of variable in `R` that represent categorical data. Before `R` version `4.0.0`, the default for `base R` was that all character variables are imported as factors.

Internally, factors are stored as integers, but they have (character) labels (so-called *levels*) associated with them. Hence, if you are not working with the special class of labelled data (e.g., via the packages [`haven`](https://haven.tidyverse.org/), [`labelled`](https://larmarange.github.io/labelled/index.html), or [`sjlabelled`](https://strengejacke.github.io/sjlabelled/index.html)), factors come closest to having variables with value labels as you might know from *SPSS*.

Notably, as factors are a native data type to `R`, they do not cause the issues that labelled variables often do (as labels represent an additional attribute, making them a special class that many functions cannot work with).

---

## Factors

Factors in `R` can be **unordered** - in which case they are similar to **nominal** level variables in *SPSS* - or **ordered** - in which case they are similar to **ordinal** level variables in *SPSS*.
Using factors can be necessary for certain statistical analyses and plots (e.g., if you want to compare groups).

Working with factors in `R` is a big topic, and we will only briefly touch upon it in this workshop. For a more in-depth discussion of factors in `R` you can, e.g., have a look at the [chapter on factors](https://r4ds.had.co.nz/factors.html) in *R for Data Science*.

---

## Factors 4 🐱s

There are many functions for working with factors in `base R`, such as `factor()` or `as.factor()`. However, a generally more versatile and easier-to-use option is the [`forcats` package](https://forcats.tidyverse.org/) from the `tidyverse`.

<img src="data:image/png;base64,#https://github.com/rstudio/hex-stickers/blob/master/PNG/forcats.png?raw=true" width="25%" style="display: block; margin: auto;" />

*Note*: There is a good [introduction to working with factors using `forcats` by Vebash Naidoo](https://sciencificity-blog.netlify.app/posts/2021-01-30-control-your-factors-with-forcats/) and *RStudio* also offers a [`forcats` cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/factors.pdf).

---

## Unordered factor

Previously, we have recoded the numeric values from the `sex` variable to character values. We could also create an unordered factor based on those values. Using the `recode_factor()` function (together with `mutate()`) from `dplyr`, we can create a factor from a numeric (or a character) variable.

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(sex_fac = recode_factor(sex,
                                 `1` = "male",
                                 `2` = "female",
                                 `3` = "non-binary"))

allbus_2021 %>% 
  select(sex, sex_fac) %>% 
  sample_n(5)
```

```
##   sex sex_fac
## 1   2  female
## 2   2  female
## 3   2  female
## 4   1    male
## 5   1    male
```

---

## Ordered factor

For creating factors, the `case_when()` function is a very helpful tool.
If, e.g., we want to create age categories as an ordered factor based on the numeric age variable, we could do so as follows:

```r
allbus_2021 <- allbus_2021 %>% 
  mutate(age_cat = case_when(
    age < 30 ~ "18 to 29",
    age < 50 ~ "30 to 49",
    age < 70 ~ "50 to 69",
    age > 69 ~ "70 and older"
  )) %>% 
  mutate(age_cat = factor(age_cat,
                          levels = c("18 to 29",
                                     "30 to 49",
                                     "50 to 69",
                                     "70 and older"),
                          ordered = TRUE))
```

---
class: center, middle

# [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_2_2_4_Factors_Conditional_Recode.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_2_2_4_Factors_Conditional_Recode.html)

---

## Outlook: Working with strings in `R`

As stated before, we won't be able to cover the specifics of working with strings in `R` in this course. However, it may be good to know that the `tidyverse` package [`stringr`](https://stringr.tidyverse.org/index.html) offers a collection of convenient functions for working with strings.

<img src="data:image/png;base64,#https://github.com/rstudio/hex-stickers/blob/master/PNG/stringr.png?raw=true" width="25%" style="display: block; margin: auto;" />

The `stringr` package provides a good [introduction vignette](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html), the book *R for Data Science* has a whole section on [strings with `stringr`](https://r4ds.had.co.nz/strings.html), and there also is an [*RStudio* Cheat Sheet for `stringr`](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf).
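---

## Outlook: Working with strings in `R`

To give a first impression of what `stringr` code looks like, here is a small sketch that cleans up some made-up free-text answers. The vector `answers` is hypothetical and not part of the *ALLBUS* data.

```r
library(stringr)

# hypothetical free-text answers with inconsistent spacing and capitalization
answers <- c("  Berlin ", "hamburg", "MANNHEIM", "Frankfurt am  Main")

clean <- str_squish(answers)   # trim and collapse superfluous whitespace
clean <- str_to_title(clean)   # capitalize the first letter of each word

clean                          # "Berlin" "Hamburg" "Mannheim" "Frankfurt Am Main"
str_detect(clean, "^Frank")    # FALSE FALSE FALSE TRUE
str_replace(clean, " Am ", " am ")  # undo the title case for "am"
```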
---

## Sidenote: Regular expressions

If you want (or have) to work with [regular expressions](https://en.wikipedia.org/wiki/Regular_expression), there are several packages that can facilitate this process by allowing you to create regular expressions in a (more) human-readable way, e.g., [`rex`](https://github.com/r-lib/rex), [`RVerbalExpressions`](https://rverbalexpressions.netlify.app/index.html), or [`rebus`](https://github.com/richierocks/rebus).

Another helpful tool is the *RStudio* addin [`RegExplain`](https://www.garrickadenbuie.com/project/regexplain/).

---

## Outlook: Times and dates

[Working with times and dates can be quite a pain in programming](https://www.youtube.com/watch?v=-5wpm-gesOY) (as well as data analysis). Luckily, there are a couple of neat options for working with times and dates in `R` that can reduce the headache.

---

## Outlook: Times and dates

<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2022\content\img\excel-time.jpg" width="50%" style="display: block; margin: auto;" />

.small[
Source: https://twitter.com/ExcelHumor/status/1558608440230117384
]

---

## Outlook: Times and dates

If you want/need to work with times and dates in `R`, you may want to look into the [`lubridate` package](https://lubridate.tidyverse.org/), which is part of the `tidyverse`, and for which *RStudio* also provides a [cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf).

<img src="data:image/png;base64,#https://github.com/rstudio/hex-stickers/blob/master/PNG/lubridate.png?raw=true" width="25%" style="display: block; margin: auto;" />

*Note*: If you work with time series data, it is also worth checking out the [`tsibble` package](https://tsibble.tidyverts.org/) for your wrangling tasks.
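---

## Outlook: Times and dates

As a brief sketch of what working with `lubridate` can look like, the following example parses two made-up interview dates stored as text and extracts some information from them. The vector `interview` is hypothetical.

```r
library(lubridate)

# hypothetical interview dates stored as character strings
interview <- c("2021-05-17", "2021-08-02")

d <- ymd(interview)      # parse year-month-day strings into Date objects

year(d)                  # 2021 2021
month(d)                 # 5 8
wday(d, label = TRUE)    # day of the week (labels depend on your locale)
as.numeric(d[2] - d[1])  # 77: number of days between the two interviews
```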
---

## Extracurricular activities

Check out the [appendix slides for today](https://stefanjuenger.github.io/r-intro-gesis-2022/slides/2_3_Appendix_Relational_Data.html) which cover the topic of relational data (i.e., combining multiple data sets).

Have a look at the [*Tidy Tuesday* repository on *GitHub*](https://github.com/rfordatascience/tidytuesday), listen to a few of the very short episodes of the [*Tidy Tuesday* Podcast](https://www.tidytuesday.com/), check out the [#tidytuesday Twitter hashtag](https://twitter.com/hashtag/tidytuesday?lang=en), or watch one (or more) of the [*Tidy Tuesday* screencasts on *YouTube* by David Robinson](https://www.youtube.com/watch?v=E2amEz_upzU&list=PL19ev-r1GBwkuyiwnxoHTRC8TTqP8OEi8).