class: center, middle, inverse, title-slide

.title[
# Introduction to R for Data Analysis
]
.subtitle[
## Appendix: Labelled Data
]
.author[
### Johannes Breuer, Stefan Jünger, & Veronika Batzdorfer
]
.date[
### 2022-08-15
]

---
layout: true
---

## Beyond flat files: labelled data

A lot of data comes in some sort of flat file format, such as `CSV`. In the social sciences, however, we often deal with proprietary file formats, such as *SPSS*'s `.sav` or *Stata*'s `.dta` files.

What these data typically include are labels. These labels are used to describe variables or variable values. They comprise some specific metadata inherent in these proprietary file formats.

*If you were able to travel back ten years in time and ask an `R` geek, she'd say that you cannot use labels in R. You'd either have to import, e.g., value labels as character strings or use their codes as factors. However, these days...*

---

## Not being able to use labelled data is a thing of the past

Nowadays, if you use the `haven` package, labels are built-in for the corresponding file types. For example:

```r
allbus_2021 <- haven::read_sav("./data/allbus_2021/ZA5280_v1-0-0.sav")

allbus_2021["agec"]
```

```
## # A tibble: 5,342 × 1
##    agec
##    <dbl+lbl>
##  1 3 [45-59 JAHRE]
##  2 3 [45-59 JAHRE]
##  3 5 [75-89 JAHRE]
##  4 5 [75-89 JAHRE]
##  5 4 [60-74 JAHRE]
##  6 1 [18-29 JAHRE]
##  7 2 [30-44 JAHRE]
##  8 3 [45-59 JAHRE]
##  9 4 [60-74 JAHRE]
## 10 3 [45-59 JAHRE]
## # … with 5,332 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

## Advantages of using labelled data

One could rejoice in not having to use a codebook anymore, just like in *SPSS* (although just looking at code output for glimpsing feels much more... data-geeky).

An advantage is definitely that you can potentially re-use the labels in figures and plots, and some `R` packages do that automatically, such as the [`sjPlot`](https://strengejacke.github.io/sjPlot/) package.
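To make the point about re-using labels a bit more concrete, here is a minimal sketch with made-up toy data (not the ALLBUS file). The vector `age_cat` and its labels are assumptions for illustration only. It uses `haven::as_factor()` to turn the numeric codes into a factor whose levels are the value labels, which is the text a table or plot would then display.

```r
library(haven)

# Toy stand-in for a labelled survey variable (hypothetical data)
age_cat <- labelled(
  c(3, 3, 5, 1, 2),
  labels = c(
    "18-29 years" = 1,
    "30-44 years" = 2,
    "45-59 years" = 3,
    "60-74 years" = 4,
    "75-89 years" = 5
  ),
  label = "Age, categorized"
)

# Replace the numeric codes with their value labels
age_fct <- as_factor(age_cat)

# Frequency table keyed by label text instead of codes
label_counts <- table(age_fct)
label_counts
```

Categories that do not occur in the data are kept as factor levels, so they still show up (with a count of zero) in the table.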
In addition, when you exchange your data with colleagues who do not use `R` or when you plan to publish your data (which you always should if that is possible), being able to export data you have manipulated in `R` in different formats is great.

**However, be aware of the missing values hell that you may enter due to different missing value definitions in *Stata* and *SPSS*.**

---

## Getting labels

For variables:

```r
sjlabelled::get_label(allbus_2021$agec)
```

```
## [1] "ALTER: BEFRAGTE(R), KATEGORISIERT"
```

For values:

.tinyish[
```r
sjlabelled::get_labels(allbus_2021$agec)
```

```
## [1] "NICHT GENERIERBAR" "18-29 JAHRE"       "30-44 JAHRE"       "45-59 JAHRE"
## [5] "60-74 JAHRE"       "75-89 JAHRE"       "UEBER 89 JAHRE"
```
]

---

## Setting labels: Variables

```r
allbus_2021$agec <-
  sjlabelled::set_label(allbus_2021$agec, label = "Age, categorized")

sjlabelled::get_label(allbus_2021$agec)
```

```
## [1] "Age, categorized"
```

---

## Setting labels: Values

.tinyish[
```r
allbus_2021$agec <-
  sjlabelled::set_labels(
    allbus_2021$agec,
    labels = c(
      "18-29 years",
      "30-44 years",
      "45-59 years",
      "60-74 years",
      "75-89 years",
      "Over 89 years"
    )
  )

sjlabelled::get_labels(allbus_2021$agec)
```

```
## [1] "18-29 years"   "30-44 years"   "45-59 years"   "60-74 years"   "75-89 years"
## [6] "Over 89 years"
```
]
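
---

## Exporting labelled data: a sketch

Once labels are set, they can travel with the data when you export it. The following is a minimal sketch using a made-up toy data frame (the object `survey` and its labels are assumptions, not the ALLBUS data). It writes an *SPSS* file with `haven::write_sav()`, reads it back in, and checks that the labels are still there.

```r
library(haven)

# Toy data frame with one labelled column (hypothetical data)
survey <- data.frame(id = 1:4)
survey$age_cat <- labelled(
  c(1, 2, 2, 3),
  labels = c("18-29 years" = 1, "30-44 years" = 2, "45-59 years" = 3),
  label = "Age, categorized"
)

# Write to a temporary SPSS file and read it back in
sav_path <- tempfile(fileext = ".sav")
write_sav(survey, sav_path)
survey_back <- read_sav(sav_path)

# Variable label and value labels after the round trip
attr(survey_back$age_cat, "label")
attr(survey_back$age_cat, "labels")
```

`haven::write_dta()` does the same for *Stata* files. Missing value definitions are not covered by this sketch, so check them separately after exporting.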