Data Formats

GESIS Workshop: Introduction to Geospatial Techniques for Social Scientists in R

Stefan Jünger, Anne-Kathrin Stroppe, Dennis Abel

2026-04-23

Day Time Title
April 09 10:00-11:30 Introduction
April 09 11:30-11:45 Coffee Break
April 09 11:45-13:00 Data Formats
April 09 13:00-14:00 Lunch Break
April 09 14:00-15:30 Mapping
April 09 15:30-15:45 Coffee Break
April 09 15:45-17:00 Spatial Wrangling
April 10 09:00-10:30 Spatial Wrangling
April 10 10:30-10:45 Coffee Break
April 10 10:45-12:00 Applied Spatial Linking
April 10 12:00-13:00 Lunch Break
April 10 13:00-14:30 Spatial Analysis
April 10 14:30-14:45 Coffee Break
April 10 14:45-16:00 Spatial Econometrics & Outlook

Why care about data types and formats?

There are differences in how spatial information is stored, processed, and visually represented.

  • Different commands for data import and manipulation
  • Spatial linking techniques and analyses partly determined by data format
  • Visualization of data can vary

Vector and raster data

Sources: OpenStreetMap / GEOFABRIK (2018), City of Cologne (2014), and the Statistical Offices of the Federation and the Länder (2016) / Jünger, 2019

1. Vector data

Representing the world in vectors

The surface of the earth is represented by simple geometries and attributes.

Each object is defined by longitude (x) and latitude (y) values.

It could also include z coordinates…

Vector data: geometries

Every real-world feature is one of three types of geometry:

  • Points: discrete location (e.g., a tree)
  • Lines: linear feature (e.g., a river)
  • Polygons: enclosed areas (e.g., a city, country, administrative boundaries)

National Ecological Observatory Network (NEON), cited by Datacarpentry

Vector data: attribute tables

If we have geometries, we do not necessarily have any other information.

We must assign attributes to each geometry to hold additional information \(\rightarrow\) data tables called attribute tables.

  • Each row represents a geometric object, which we can also call observation, feature, or case
  • Each column holds an attribute or, in ‘our’ language, a variable

Vector data: attribute tables

File formats/extensions

  • GeoPackage .gpkg
  • Shapefile .shp
  • GeoJSON .geojson
  • Sometimes, vector data come even in a text format, such as CSV

Welcome to the proprietary reality: shapefiles

Both the geometric information and attribute table could be saved within one file. Rather often, ESRI Shapefiles are used to store vector data. Shapefiles consist of at least three mandatory files with the following extensions:

  • .shp: shape format
  • .shx: shape index format
  • .dbf: attribute format
  • (.prj: CRS/Projection)

You don’t have to remember what they stand for, but you can only properly work with the data if none of those files is missing.

Welcome to simple features

Several packages are out there to wrangle and visualize spatial and, especially, vector data within R. We will use the sf package (“simple features”).


Why?

simple features refers to a formal standard representing spatial geometries and supports interfaces to other programming languages and GIS systems (ISO 19125-1).

Illustration by Allison Horst

Load a shapefile

The first step is, of course, loading the data. We want to import the shapefile for the administrative borders of the German states (Bundesländer) called VG250_LAN.shp in the data folder.

# load library
library(sf)

# load data
german_states <- sf::read_sf("./data/VG250_LAN.shp")

# let's have a look
german_states

Load a shapefile

# load library
library(sf)

# load data
german_states <- sf::read_sf("./data/VG250_LAN.shp")

# let's have a look
german_states
Simple feature collection with 35 features and 23 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 280371.1 ymin: 5235856 xmax: 921292.4 ymax: 6106244
Projected CRS: ETRS89 / UTM zone 32N
# A tibble: 35 × 24
     ADE    GF   BSG ARS   AGS   SDV_ARS   GEN   BEZ     IBZ BEM   NBD   SN_L  SN_R  SN_K  SN_V1 SN_V2 SN_G  FK_S3 NUTS 
   <int> <int> <int> <chr> <chr> <chr>     <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1     2     4     1 01    01    01002000… Schl… Land     20 --    ja    01    0     00    00    00    000   0     DEF  
 2     2     4     1 02    02    02000000… Hamb… Frei…    22 --    ja    02    0     00    00    00    000   0     DE6  
 3     2     4     1 03    03    03241000… Nied… Land     20 --    ja    03    0     00    00    00    000   0     DE9  
 4     2     4     1 04    04    04011000… Brem… Frei…    23 --    ja    04    0     00    00    00    000   0     DE5  
 5     2     4     1 05    05    05111000… Nord… Land     20 --    ja    05    0     00    00    00    000   0     DEA  
 6     2     4     1 06    06    06414000… Hess… Land     20 --    ja    06    0     00    00    00    000   0     DE7  
 7     2     4     1 07    07    07315000… Rhei… Land     20 --    ja    07    0     00    00    00    000   0     DEB  
 8     2     4     1 08    08    08111000… Bade… Land     20 --    ja    08    0     00    00    00    000   0     DE1  
 9     2     4     1 09    09    09162000… Baye… Frei…    21 --    ja    09    0     00    00    00    000   0     DE2  
10     2     4     1 10    10    10041010… Saar… Land     20 --    nein  10    0     00    00    00    000   0     DEC  
# ℹ 25 more rows
# ℹ 5 more variables: ARS_0 <chr>, AGS_0 <chr>, WSK <date>, DEBKG_ID <chr>, geometry <MULTIPOLYGON [m]>

We can already plot it

plot(german_states["GEN"])

Inspect your data: classics

There are no huge differences between the shapefile we just imported and a regular data table.

# head of data table
head(german_states, 2)
Simple feature collection with 2 features and 23 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 426205.6 ymin: 5913462 xmax: 650128.7 ymax: 6101487
Projected CRS: ETRS89 / UTM zone 32N
# A tibble: 2 × 24
    ADE    GF   BSG ARS   AGS   SDV_ARS    GEN   BEZ     IBZ BEM   NBD   SN_L  SN_R  SN_K  SN_V1 SN_V2 SN_G  FK_S3 NUTS 
  <int> <int> <int> <chr> <chr> <chr>      <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     2     4     1 01    01    010020000… Schl… Land     20 --    ja    01    0     00    00    00    000   0     DEF  
2     2     4     1 02    02    020000000… Hamb… Frei…    22 --    ja    02    0     00    00    00    000   0     DE6  
# ℹ 5 more variables: ARS_0 <chr>, AGS_0 <chr>, WSK <date>, DEBKG_ID <chr>, geometry <MULTIPOLYGON [m]>

Inspect your data: spatial features

Besides our general data inspection, we may also want to check the spatial features of our import. This check includes the geometric type (points? lines? polygons?) and the coordinate reference system.

# type of geometry
sf::st_geometry(german_states) 
Geometry set for 35 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 280371.1 ymin: 5235856 xmax: 921292.4 ymax: 6106244
Projected CRS: ETRS89 / UTM zone 32N
First 5 geometries:

Inspect your data: spatial features

Each polygon is defined by several connected points to build an enclosed area. Several polygons in one data frame have the sf type multipolygons. Just as Germany consists of several states, the polygon Germany consists of several smaller polygons. That’s true even for individual German states. Here’s an example of North Rhine-Westphalia:

# extract the simple features column and further inspecting it
german_states$geometry[5] |> 
  dplyr::glimpse()
sfc_MULTIPOLYGON of length 1; first list element: List of 5
 $ :List of 1
  ..$ : num [1:8804, 1:2] 477607 477708 477759 477990 478075 ...
 $ :List of 1
  ..$ : num [1:117, 1:2] 304448 304424 304395 304295 304262 ...
 $ :List of 1
  ..$ : num [1:154, 1:2] 300795 300803 300828 300861 300894 ...
 $ :List of 1
  ..$ : num [1:32, 1:2] 301597 301660 301711 301732 301725 ...
 $ :List of 1
  ..$ : num [1:28, 1:2] 301890 301902 301642 301515 300972 ...
 - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"

Inspect your data: spatial features

Remember: The Coordinate Reference System is critical. A crucial step is to check the CRS of your geospatial data.

# coordinate reference system
sf::st_crs(german_states) 
Coordinate Reference System:
  User input: ETRS89 / UTM zone 32N 
  wkt:
PROJCRS["ETRS89 / UTM zone 32N",
    BASEGEOGCRS["ETRS89",
        ENSEMBLE["European Terrestrial Reference System 1989 ensemble",
            MEMBER["European Terrestrial Reference Frame 1989"],
            MEMBER["European Terrestrial Reference Frame 1990"],
            MEMBER["European Terrestrial Reference Frame 1991"],
            MEMBER["European Terrestrial Reference Frame 1992"],
            MEMBER["European Terrestrial Reference Frame 1993"],
            MEMBER["European Terrestrial Reference Frame 1994"],
            MEMBER["European Terrestrial Reference Frame 1996"],
            MEMBER["European Terrestrial Reference Frame 1997"],
            MEMBER["European Terrestrial Reference Frame 2000"],
            MEMBER["European Terrestrial Reference Frame 2005"],
            MEMBER["European Terrestrial Reference Frame 2014"],
            MEMBER["European Terrestrial Reference Frame 2020"],
            ELLIPSOID["GRS 1980",6378137,298.257222101,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[0.1]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4258]],
    CONVERSION["UTM zone 32N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",9,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Europe between 6°E and 12°E: Austria; Denmark - onshore and offshore; Germany - onshore and offshore; Italy - onshore and offshore; Norway including Svalbard - onshore and offshore; Spain - offshore."],
        BBOX[36.53,6,84.01,12.01]],
    USAGE[
        SCOPE["Pan-European conformal mapping at scales larger than 1:500,000."],
        AREA["Europe between 6°E and 12°E and approximately 36°30'N to 84°N."],
        BBOX[36.53,6,84.01,12.01]],
    ID["EPSG",25832]]

Exercise 2_1: Import Vector Data💪

🖱Click here for the exercise

2. Raster data

Difference to vector data

  • Other data format(s), different file extensions
  • Geometries do not differ within one dataset
  • Other geospatial operations possible
  • Can be way more efficient and straightforward to process
  • It’s like working with simple tabular data

Von Yug, modifications by Cfaerber et al. - Eigenes Werk, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=1183592

Visual difference between vector and raster data in geospatial data

What exactly are raster data?

  • Hold information on (most of the time) evenly shaped grid cells
  • Basically, a simple data table
  • Each cell represents one observation

Metadata

  • Information about geometries is globally stored
  • They are the same for all observations
  • Their location in space is defined by their cell location in the data table
  • Without this information, raster data were simple image files

https://knowyourmeme.com/memes/theyre-the-same-picture

Important metadata

Raster Dimensions \(\rightarrow\) number of columns, rows, and cells

Extent \(\rightarrow\) similar to bounding box in vector data

Resolution \(\rightarrow\) the size of each raster cell

Coordinate reference system \(\rightarrow\) defines where on the earth’s surface the raster layer lies

Setting up a raster dataset is easy

# Defining a simple 4x4 matrix with some sampled numbers
input_data <- 
  sample(1:100, 16) |> 
  matrix(nrow = 4)

input_data
     [,1] [,2] [,3] [,4]
[1,]   53   23   62   21
[2,]   44   58   82    7
[3,]   59   55   27   28
[4,]   88   20    2   46

Setting up a raster dataset is easy

# Converting our 4x4 matrix into a terra raster object
raster_layer <- terra::rast(input_data)

raster_layer
class       : SpatRaster 
size        : 4, 4, 1  (nrow, ncol, nlyr)
resolution  : 1, 1  (x, y)
extent      : 0, 4, 0, 4  (xmin, xmax, ymin, ymax)
coord. ref. :  
source(s)   : memory
name        : lyr.1 
min value   :     2 
max value   :    88 

We can already plot it

# Plotting the terra raster object
terra::plot(raster_layer)

File formats/extensions

  • GeoTIFF tif
  • Gridded data .grd
  • Network common data format .nc
  • Esri grid .asc
  • Sometimes, raster data come even in a text format, such as CSV

In this course, we will only use tif files as they are pretty common. Just be aware that there are different formats out there.

Implementations in R

terra is the most commonly used package for raster data in R.

Some other developments, e.g., in the stars package, also implement an interface to simple features in sf.

The terra package helps to use more elaborate zonal statistics. But the spatstat package is more advanced.

Loading raster tifs (German census data)

# Number of immigrants in Cologne
immigrants_cologne <- terra::rast("./data/immigrants_cologne.tif")

# Number of inhabitants in Cologne
inhabitants_cologne <- terra::rast("./data/inhabitants_cologne.tif")

Loading raster tifs (German census data)

# Number of immigrants in Cologne
immigrants_cologne <- terra::rast("./data/immigrants_cologne.tif")

# Number of inhabitants in Cologne
inhabitants_cologne <- terra::rast("./data/inhabitants_cologne.tif")

immigrants_cologne
class       : SpatRaster 
size        : 290, 266, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094800, 4121400, 3083900, 3112900  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (EPSG:3035) 
source      : immigrants_cologne.tif 
name        : cat_0 
min value   :    -9 
max value   :   516 

Loading raster tifs (German census data)

# Number of immigrants in Cologne
immigrants_cologne <- terra::rast("./data/immigrants_cologne.tif")

# Number of inhabitants in Cologne
inhabitants_cologne <- terra::rast("./data/inhabitants_cologne.tif")

immigrants_cologne 
 
inhabitants_cologne  
class       : SpatRaster 
size        : 290, 266, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094800, 4121400, 3083900, 3112900  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (EPSG:3035) 
source      : inhabitants_cologne.tif 
name        : cat_0 
min value   :    -9 
max value   :   678 

Compare layers by plotting

terra::plot(immigrants_cologne)

terra::plot(inhabitants_cologne)

Base R functionalities

# Define missing values
immigrants_cologne[immigrants_cologne == -9] <- NA
inhabitants_cologne[inhabitants_cologne == -9] <- NA

Base R functionalities

# Define missing values
immigrants_cologne[immigrants_cologne == -9] <- NA
inhabitants_cologne[inhabitants_cologne == -9] <- NA
 
immigrants_cologne
class       : SpatRaster 
size        : 290, 266, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094800, 4121400, 3083900, 3112900  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (EPSG:3035) 
source(s)   : memory
varname     : immigrants_cologne 
name        : cat_0 
min value   :     0 
max value   :   516 

Base R functionalities

# Define missing values
immigrants_cologne[immigrants_cologne == -9] <- NA
inhabitants_cologne[inhabitants_cologne == -9] <- NA

immigrants_cologne

inhabitants_cologne 
class       : SpatRaster 
size        : 290, 266, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094800, 4121400, 3083900, 3112900  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (EPSG:3035) 
source(s)   : memory
varname     : inhabitants_cologne 
name        : cat_0 
min value   :     3 
max value   :   678 

Simple statistics

Working with raster data is straightforward

  • quite speedy
  • yet not as comfortable as working with sf objects

For example, to calculate the mean, we would use the following:

terra::global(immigrants_cologne, fun = "mean", na.rm = TRUE)
          mean
cat_0 25.85978

Note: Using mean() on the whole raster dataset would calculate the mean in each raster cell (in case there are multiple attributes in the dataset).

Get all values as a vector

We can also extract the values of a raster directly as a vector:

all_raster_values <- terra::values(immigrants_cologne)

mean(all_raster_values, na.rm = TRUE)
[1] 25.85978

Nevertheless, although raster data are simple data tables, working with them is a bit different compared to, e.g., simple features.

Combining raster layers to calculate new values

immigrant_rate <-
  immigrants_cologne * 100 / 
  inhabitants_cologne

immigrant_rate
class       : SpatRaster 
size        : 290, 266, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094800, 4121400, 3083900, 3112900  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (EPSG:3035) 
source(s)   : memory
varname     : immigrants_cologne 
name        :    cat_0 
min value   :  0.00000 
max value   : 97.72727 
terra::plot(immigrant_rate)

Exercise 2_2: Basic Raster Operations💪

🖱 Click here for the exercise