Introduction to R for Data Analysis

class: center, middle, inverse, title-slide

.title[
# Introduction to R for Data Analysis
]
.subtitle[
## Appendix: Labelled Data
]
.author[
### Johannes Breuer, Stefan Jünger, & Veronika Batzdorfer
]
.date[
### 2022-08-15
]

---

layout: true

---

## Beyond flat files: labelled data

A lot of data comes in some sort of flat file format, such as `CSV`. In the social sciences, however, we often deal with proprietary file formats, such as *SPSS*'s `.sav` or *Stata*'s `.dta` files.

These files typically include labels: metadata that describe the variables themselves (variable labels) or the values a variable can take (value labels). Flat files have no dedicated place for this metadata, whereas these proprietary formats store it alongside the data.

*If you could travel back ten years in time and ask an `R` geek, she would say that you cannot use labels in `R`: you would have to either import the value labels as character strings or turn the underlying codes into factors. However, these days...*

---

## Not being able to use labelled data is a thing of the past

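Nowadays, the `haven` package imports the labels together with the data for these file types. In the printed output, `<dbl+lbl>` marks a labelled numeric column, and the value label follows each code in square brackets. For example: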

```r
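# Import the SPSS file; haven keeps the variable and value labels
allbus_2021 <-
  haven::read_sav("./data/allbus_2021/ZA5280_v1-0-0.sav")

# Select a single column; the result is still a tibble
allbus_2021["agec"]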
```

```
## # A tibble: 5,342 × 1
##               agec
##          <dbl+lbl>
##  1 3 [45-59 JAHRE]
##  2 3 [45-59 JAHRE]
##  3 5 [75-89 JAHRE]
##  4 5 [75-89 JAHRE]
##  5 4 [60-74 JAHRE]
##  6 1 [18-29 JAHRE]
##  7 2 [30-44 JAHRE]
##  8 3 [45-59 JAHRE]
##  9 4 [60-74 JAHRE]
## 10 3 [45-59 JAHRE]
## # … with 5,332 more rows
## # ℹ Use `print(n = ...)` to see more rows
```
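
Under the hood, these labels are ordinary `R` attributes attached to the column. The following is a minimal sketch, assuming only that `haven` is installed. The toy vector is built by hand with `haven::labelled()` so that it runs without the ALLBUS file. Its values and labels are retyped from the output above for illustration.

```r
library(haven)

# Hand-built stand-in for an imported column (not read from a real file)
agec <- labelled(
  c(3, 3, 5, 5, 4),
  labels = c(
    "18-29 JAHRE" = 1, "30-44 JAHRE" = 2, "45-59 JAHRE" = 3,
    "60-74 JAHRE" = 4, "75-89 JAHRE" = 5
  ),
  label = "ALTER: BEFRAGTE(R), KATEGORISIERT"
)

# Variable label: a single character string
attr(agec, "label")

# Value labels: a named vector (names are the labels, elements are the codes)
attr(agec, "labels")

# The class that produces the <dbl+lbl> column header
class(agec)
```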

---

## Advantages of using labelled data

One could rejoice in not having to consult a codebook anymore, just like in *SPSS* (although glimpsing at the data via code output feels much more... data-geeky).

A definite advantage is that you can re-use the labels in figures and tables. Some `R` packages, such as [`sjPlot`](https://strengejacke.github.io/sjPlot/), do that automatically.
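
Re-using labels by hand is also possible. The sketch below is one way to do it, assuming `haven` and base graphics rather than `sjPlot`. The ten values are the first ten rows of `agec` printed earlier. The code `6` for the oldest group is an assumption made for illustration.

```r
library(haven)

# First ten values of `agec`; the labels are retyped by hand so that the
# example runs without the ALLBUS file (code 6 is an assumed value)
agec <- labelled(
  c(3, 3, 5, 5, 4, 1, 2, 3, 4, 3),
  labels = c(
    "18-29 JAHRE" = 1, "30-44 JAHRE" = 2, "45-59 JAHRE" = 3,
    "60-74 JAHRE" = 4, "75-89 JAHRE" = 5, "UEBER 89 JAHRE" = 6
  ),
  label = "ALTER: BEFRAGTE(R), KATEGORISIERT"
)

# as_factor() swaps the numeric codes for their value labels
age_groups <- as_factor(agec)
counts <- table(age_groups)

# The variable label doubles as the plot title
barplot(counts, main = attr(agec, "label"), las = 2, cex.names = 0.7)
```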

In addition, when you exchange your data with colleagues who do not use `R`, or when you plan to publish your data (which you always should if that is possible), it is very useful to be able to export the data you have processed in `R` to different formats.
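
As a rough sketch of such an export, the example below writes a small hand-made data frame to temporary *SPSS* and *Stata* files with `haven::write_sav()` and `haven::write_dta()` and reads both back. The data frame `survey` and its contents are hypothetical. They only serve to show that the labels survive the round trip.

```r
library(haven)

# Hypothetical toy data with one labelled column
survey <- data.frame(id = 1:3)
survey$agec <- labelled(
  c(1, 2, 3),
  labels = c("18-29 years" = 1, "30-44 years" = 2, "45-59 years" = 3),
  label = "Age, categorized"
)

# Temporary files, so nothing is written to the project folder
sav_path <- tempfile(fileext = ".sav")
dta_path <- tempfile(fileext = ".dta")

write_sav(survey, sav_path)  # SPSS
write_dta(survey, dta_path)  # Stata

# Read both files back and inspect the labels
from_sav <- read_sav(sav_path)
from_dta <- read_dta(dta_path)

attr(from_sav$agec, "label")
attr(from_dta$agec, "labels")
```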

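**However, be aware of the missing values hell that you may enter, because *SPSS* and *Stata* define missing values differently: *SPSS* declares regular codes as user-defined missing values, whereas *Stata* uses extended missing values (`.a` to `.z`).**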
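
The difference can be sketched with `haven`'s two representations: `labelled_spss()` for user-defined missing values and `tagged_na()` for *Stata*-style extended missing values. The code `-32` for "NICHT GENERIERBAR" is an assumption made for illustration, and the vectors are built by hand rather than imported.

```r
library(haven)

# SPSS style: -32 is a regular code that is declared as missing
agec_spss <- labelled_spss(
  c(3, -32, 1, 2),
  labels = c(
    "NICHT GENERIERBAR" = -32, "18-29 JAHRE" = 1,
    "30-44 JAHRE" = 2, "45-59 JAHRE" = 3
  ),
  na_values = -32
)

# The declared code counts as missing, although it is still stored as -32
is.na(agec_spss)

# zap_missing() turns user-defined missing values into regular NA
agec_plain <- zap_missing(agec_spss)

# Stata style: the value itself is NA, but it carries a tag ("a" for .a)
agec_stata <- c(3, tagged_na("a"), 1, 2)
is.na(agec_stata)
na_tag(agec_stata)
```

When importing real files, `haven::read_sav()` converts user-defined missing values to `NA` by default. Setting its `user_na` argument to `TRUE` keeps them as `labelled_spss()` vectors instead.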

---

## Getting labels

For variables:

```r
sjlabelled::get_label(allbus_2021$agec)
```

```
## [1] "ALTER: BEFRAGTE(R), KATEGORISIERT"
```

For values:

.tinyish[

```r
sjlabelled::get_labels(allbus_2021$agec)
```

```
## [1] "NICHT GENERIERBAR" "18-29 JAHRE"       "30-44 JAHRE"       "45-59 JAHRE"      
## [5] "60-74 JAHRE"       "75-89 JAHRE"       "UEBER 89 JAHRE"
```
]
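
To look up the labels of many variables at once, a small codebook can be assembled from the attributes. This is a minimal sketch in base `R` plus `haven`, not the `sjlabelled` approach shown above. The data frame `survey`, the `sex` variable, and its labels are hypothetical.

```r
library(haven)

# Hypothetical toy data: one unlabelled and two labelled columns
survey <- data.frame(id = 1:4)
survey$agec <- labelled(
  c(3, 5, 1, 2),
  labels = c("18-29 JAHRE" = 1, "30-44 JAHRE" = 2, "45-59 JAHRE" = 3),
  label = "ALTER: BEFRAGTE(R), KATEGORISIERT"
)
survey$sex <- labelled(
  c(1, 2, 2, 1),
  labels = c("MANN" = 1, "FRAU" = 2),
  label = "GESCHLECHT, BEFRAGTE(R)"
)

# Collect the variable label of every column; NA if a column has none
var_labels <- vapply(
  survey,
  function(column) {
    lab <- attr(column, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else lab
  },
  character(1)
)

# One row per variable: a minimal codebook
codebook <- data.frame(
  variable = names(var_labels),
  label = unname(var_labels)
)
codebook
```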

---

## Setting labels: Variables

```r
allbus_2021$agec <- 
  sjlabelled::set_label(allbus_2021$agec, label = "Age, categorized")

sjlabelled::get_label(allbus_2021$agec)
```

```
## [1] "Age, categorized"
```

---

## Setting labels: Values

.tinyish[

```r
allbus_2021$agec <- 
  sjlabelled::set_labels(
    allbus_2021$agec,
    labels = 
      c(
        "18-29 years", "30-44 years", "45-59 years", "60-74 years", 
        "75-89 years", "Over 89 years"
      )
  )

sjlabelled::get_labels(allbus_2021$agec)
```

```
## [1] "18-29 years"   "30-44 years"   "45-59 years"   "60-74 years"   "75-89 years"  
## [6] "Over 89 years"
```
]
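
If you want to state explicitly which code receives which label, one option is to rebuild the vector with `haven::labelled()` and a named vector of labels. The sketch below assumes that the codes `1` to `6` correspond to the six age groups. The toy vector is again built by hand.

```r
library(haven)

# Hand-built stand-in for the imported column (codes 1 to 6 are assumed)
agec <- labelled(
  c(3, 5, 1, 2),
  labels = c(
    "18-29 JAHRE" = 1, "30-44 JAHRE" = 2, "45-59 JAHRE" = 3,
    "60-74 JAHRE" = 4, "75-89 JAHRE" = 5, "UEBER 89 JAHRE" = 6
  ),
  label = "ALTER: BEFRAGTE(R), KATEGORISIERT"
)

# Explicit mapping: each label name points to the code it describes
english_labels <- c(
  "18-29 years" = 1, "30-44 years" = 2, "45-59 years" = 3,
  "60-74 years" = 4, "75-89 years" = 5, "Over 89 years" = 6
)

# Keep the codes, replace the value labels and the variable label
agec_en <- labelled(
  as.numeric(agec),
  labels = english_labels,
  label = "Age, categorized"
)

# Check the result by converting the codes to their new labels
as.character(as_factor(agec_en))
```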