In case you need it, here’s the data loading routine again:
library(dplyr)
library(haven)
allbus_2021_dvI <-
  read_sav(
    "./data/allbus_2021ZA5280_v1-0-0.sav"
  ) %>%
  sjlabelled::set_na(na = c(-1:-99, 97:98))
As we discussed in the session on Exploratory Data Analysis, data exploration is not only about creating numbers and summary statistics. Sometimes a good plot reveals more than a whole data frame filled with numbers (especially to the human eye). In this exercise, we use what we’ve just learned about plots with ggplot2. We will also repeat some of the content from the sessions on data wrangling in the following exercises (as this is typically part of a pipeline for data visualization).
This time we are going to use the Gapminder data on GDP per capita again. Hence, we first need to load the Gapminder GDP data from the CSV file and convert it to long format.
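library(dplyr)
library(tidyr)
gapminder_ggplot_input <-
  readr::read_csv("./data/gdppercapita_us_inflation_adjusted.csv") %>%
  # wide to long: one row per country and year
  # (year comes from the column names, so it is a character column)
  pivot_longer(-country, names_to = "year", values_to = "GDP") %>%
  filter(!is.na(GDP)) %>%
  arrange(year, GDP) %>%
  # average GDP per capita over all countries for each year
  group_by(year) %>%
  summarise(GDP_over_all_countries = mean(GDP)) %>%
  ungroup()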
## Rows: 191 Columns: 60
## -- Column specification ------------------------------------------------------------------------------------------
## Delimiter: ","
## chr (1): country
## dbl (59): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
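If the reshaping step feels abstract, the following minimal sketch runs the same pipeline on a tiny made-up data frame. The object names toy_wide, toy_long, and toy_yearly and all values are hypothetical and only serve as an illustration; they are not part of the Gapminder data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same wide layout as the Gapminder file:
# one row per country, one column per year (values are made up)
toy_wide <- tibble(
  country = c("A", "B"),
  `1960` = c(100, 300),
  `1961` = c(110, NA)
)

# Wide to long: the year column names become values of a "year" column
toy_long <-
  toy_wide %>%
  pivot_longer(-country, names_to = "year", values_to = "GDP") %>%
  filter(!is.na(GDP))

# One mean value per year, as in the exercise
toy_yearly <-
  toy_long %>%
  group_by(year) %>%
  summarise(GDP_over_all_countries = mean(GDP)) %>%
  ungroup()

toy_yearly
```

Note that year ends up as a character column because it was created from column names. This matters for the plots later on, as ggplot2 treats character variables as discrete.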
Our aim is to analyze how the GDP has developed over time. The nice thing about plots is that we can use the whole range of years and still identify differences between various periods. Our plot of choice for this is a line plot to visualize the data as a time series.
Using ggplot2, plot the Gapminder GDP per capita data as a line plot to display a time series. One important note here: in the aesthetics definition for this plot, you should define a grouping variable with group = 1. Otherwise, ggplot assumes that you want to plot one line for each year.
The geom you need is geom_line.
ggplot(
  data = gapminder_ggplot_input,
  aes(x = year, y = GDP_over_all_countries, group = 1)
) +
  geom_line()
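To see what group = 1 changes, we can inspect the group that ggplot2 assigns to each data point. The sketch below uses a small made-up data frame (toy and its values are hypothetical) and layer_data() from ggplot2 to extract the computed data of each plot.

```r
library(ggplot2)

# Hypothetical toy data with year stored as character, as in the exercise
toy <- data.frame(
  year = c("1960", "1961", "1962"),
  GDP_over_all_countries = c(100, 120, 150)
)

# Without a group aesthetic, every value of the discrete x variable
# forms its own group, so there is nothing to connect with a line
p_default <-
  ggplot(toy, aes(x = year, y = GDP_over_all_countries)) +
  geom_line()

# With group = 1, all points belong to the same group and are connected
p_grouped <-
  ggplot(toy, aes(x = year, y = GDP_over_all_countries, group = 1)) +
  geom_line()

groups_default <- layer_data(p_default)$group
groups_grouped <- layer_data(p_grouped)$group
```

In the first plot, each year gets its own group, and ggplot2 prints a message suggesting that we adjust the group aesthetic. In the second plot, all points share one group.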
Admittedly, this plot alone may not be the best way to identify differences between the periods directly, as we cannot see when our periods start and when they end. Luckily, we can fix this in two relatively straightforward steps. Let’s start with the first one: using different colors for different periods. For this purpose, we need an indicator variable as a grouping variable to use different colors for the line at each period.
Combining mutate() and case_when() lets you create the new variable we need. To get some sensible legend labels later, you should specify the values of the indicator variable as strings.
gapminder_ggplot_input <-
  gapminder_ggplot_input %>%
  mutate(
    period =
      case_when(
        year >= 1960 & year <= 1969 ~ "1960-1969",
        year >= 1970 & year <= 2001 ~ "1970-2001",
        year >= 2002 & year <= 2018 ~ "2002-2018"
      )
  )
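A side note on the comparisons above: year is a character column, so R converts the numbers to strings and compares them as text. This works here because all years have four digits. As a minimal sketch of a more explicit variant, we could convert the years to integers first. The data frame toy and the helper column year_num are hypothetical and only used for illustration.

```r
library(dplyr)

# Hypothetical toy data: years stored as character, as in the exercise
toy <- tibble(year = c("1960", "1969", "1970", "2001", "2002", "2018"))

toy_periods <-
  toy %>%
  mutate(
    # explicit conversion instead of relying on string comparison
    year_num = as.integer(year),
    # case_when() checks the conditions in order; the first match wins
    period = case_when(
      year_num <= 1969 ~ "1960-1969",
      year_num <= 2001 ~ "1970-2001",
      year_num <= 2018 ~ "2002-2018"
    )
  )

# Quick check: how many years fall into each period?
count(toy_periods, period)
```

The original year column stays untouched, so the plotting code that relies on a discrete x-axis would still work.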
After we’re set up with our indicator variable, it’s plotting time again. We can simply reuse our code from before and define a grouping color in the aesthetics definition.
In aes(), you can choose the option color = indicator_variable to define the grouping.
ggplot(
  data = gapminder_ggplot_input,
  aes(
    x = year,
    y = GDP_over_all_countries,
    color = period,
    group = 1
  )
) +
  geom_line()
Now we can see some visual differences between the periods. One last issue, however, is that there are far too many labels on the x-axis. A more sensible axis labeling approach would be to create axis breaks in ten-year steps. NB: The next one is an advanced exercise, as we did not talk about manipulating axes before. If you’re not feeling adventurous, you can just skip this one.
You can modify the x-axis with scale_x_discrete() and set its breaks with the option breaks = breaks_vector. You can check the help file (?scale_x_discrete) for some more information. A helpful additional function to use here is seq() from base R.
ggplot(
  data = gapminder_ggplot_input,
  aes(
    x = year,
    y = GDP_over_all_countries,
    color = period,
    group = 1
  )
) +
  geom_line() +
  scale_x_discrete(
    breaks = seq(
      from = 1960,
      to = 2011,
      by = 10
    )
  )
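If you are unsure what seq() hands over to the breaks option, you can run it on its own. The name breaks_vector below is just the placeholder from the exercise text.

```r
# Ten-year steps starting in 1960; seq() stops at the last value <= to
breaks_vector <- seq(from = 1960, to = 2011, by = 10)
breaks_vector
# [1] 1960 1970 1980 1990 2000 2010

# The same breaks as strings, matching the character type of year
as.character(breaks_vector)
```

Since the next step after 2010 would be 2020, any value for to between 2010 and 2019 produces the same breaks.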