Applied Spatial Linking

GESIS Workshop: Introduction to Geospatial Techniques for Social Scientists in R

Stefan Jünger & Dennis Abel

2025-04-10

Now

Day Time Title
April 09 10:00-11:30 Introduction
April 09 11:30-11:45 Coffee Break
April 09 11:45-13:00 Data Formats
April 09 13:00-14:00 Lunch Break
April 09 14:00-15:30 Mapping
April 09 15:30-15:45 Coffee Break
April 09 15:45-17:00 Spatial Wrangling
April 10 09:00-10:30 Spatial Wrangling
April 10 10:30-10:45 Coffee Break
April 10 10:45-12:00 Applied Spatial Linking
April 10 12:00-13:00 Lunch Break
April 10 13:00-14:30 Spatial Analysis
April 10 14:30-14:45 Coffee Break
April 10 14:45-16:00 Spatial Econometrics & Outlook

What are georeferenced data?

Data with a direct spatial reference \(\rightarrow\) geo-coordinates

  • Information about geometries
  • Optional: Content in relation to the geometries

Sources: OpenStreetMap / GEOFABRIK (2018), City of Cologne (2014), and the Statistical Offices of the Federation and the Länder (2016) / Jünger, 2019

Georeferenced survey data

Survey data enriched with geo-coordinates (or other direct spatial references).

With georeferenced survey data, we can analyze interactions between individual behaviors and attitudes and the environment.

An example workflow

From the addresses to analyses with georeferenced survey data, several steps and challenges along the way. We will talk about:

  • Data Protection & Data Access
  • Geocoding
  • Spatial Data Linking
  • An example workflow using the sora package

Data protection

That‘s one of the biggest issues.

  • Explicit spatial references increase the risk of re-identifying anonymized survey respondents
  • Can occur during the processing of data but also during the analysis

Affects all phases of research and data management!

Data availability

Geospatial Data

  • Often de-centralized distributed
  • Fragmented data landscape, at least in Germany

Georeferenced Survey Data

  • Primarily, survey data
  • Depends on documentation
  • Access difficult due to data protection restrictions

https://www.eea.europa.eu/data-and-maps https://datasearch.gesis.org/ https://datasetsearch.research.google.com/

Distribution & re-identification risk

Even without (in)direct spatial references, data may still be sensitive.

  • Geospatial attributes add new information to existing data
  • Maybe part of general data privacy checks, but we may not distribute these data as is

Safe Rooms / Secure Data Centers

  • Control access
  • Checks output

https://www.gesis.org/en/services/processing-and-analyzing-data/guest-research-stays/secure-data-center-sdc

Geocoding

Geocoding is the conversion of indirect spatial references (e.g., addresses) into direct spatial references (e.g., coordinates)

However, conducting this procedure is tricky (not only in R). Many services are either

  • Expensive (at least they cost money or have other restrictions)
  • Probably not data protection-friendly (Hey Google)
  • Or both

Our Approach

We rely on a service offered by the Federal Agency of Cartography and Geodesy (BKG):

  • Online interface and API for online geocoding
  • Offline geocoding possible based on raw data
  • But: Data and service used to be restricted

bkggeocoder

R package bkggeocoder developed at GESIS for (offline) geocoding by Stefan and Jonas Lieth:

New interface in the sora package

We can now also use the sora package to geocode addresses (but thus far, with fewer features than bkggeocoder).

leibniz_addresses <-
  tibble::tribble(
    ~id, ~street, ~house_number, ~zip_code, ~place, ~institute,
    1, "B 2", "1", "68159", "Mannheim", "GESIS",
    2, "Unter Sachsenhausen", "6-8",  "50667", "Köln", "GESIS",
    3, "Kellnerweg", "4", "37077", "Göttingen", "DPZ",
    4, "Reichsstr.", "4-6", "04109",  "Leipzig", "GWZO",
    5, "Schöneckstraße", "6", "79104", "Freiburg", "KIS",
    6, "Albert-Einstein-Straße", "29a", "18059", "Rostock", "LIKAT",
    7, "L7", "1", "68161", "Mannheim", "ZEW",
    8, "Müggelseedamm", "310", "12587", "Berlin", "IGB",
    9, "Campus D2", "2", "66123", "Saarbrücken", "INM",
    10, "Eberswalder Straße", "84", "15374", "Müncheberg (Mark)", "ZALF"
  )

leibniz_addresses
# A tibble: 10 × 6
      id street                 house_number zip_code place             institute
   <dbl> <chr>                  <chr>        <chr>    <chr>             <chr>    
 1     1 B 2                    1            68159    Mannheim          GESIS    
 2     2 Unter Sachsenhausen    6-8          50667    Köln              GESIS    
 3     3 Kellnerweg             4            37077    Göttingen         DPZ      
 4     4 Reichsstr.             4-6          04109    Leipzig           GWZO     
 5     5 Schöneckstraße         6            79104    Freiburg          KIS      
 6     6 Albert-Einstein-Straße 29a          18059    Rostock           LIKAT    
 7     7 L7                     1            68161    Mannheim          ZEW      
 8     8 Müggelseedamm          310          12587    Berlin            IGB      
 9     9 Campus D2              2            66123    Saarbrücken       INM      
10    10 Eberswalder Straße     84           15374    Müncheberg (Mark) ZALF     

Setup and run the Geocoding

# load sora package
library(sora)

# set API key for the session
Sys.setenv(SORA_API_KEY = readLines("sora_key"))

# check if the sora API can be reached
sora_available()
[1] TRUE

Setup and run the Geocoding

# load sora package
library(sora)

# set API key for the session
Sys.setenv(SORA_API_KEY = readLines("sora_key"))

# check if the sora API can be reached
sora_available()

# start the geocoding
leibniz_addresses <-
  sora::sora_geocoder(
    leibniz_addresses
  )
Information from SoRa: 

Setup and run the Geocoding

# load sora package
library(sora)

# set API key for the session
Sys.setenv(SORA_API_KEY = readLines("sora_key"))

# check if the sora API can be reached
sora_available()

# start the geocoding
leibniz_addresses <-
  sora::sora_geocoder(
    leibniz_addresses
  )

# check status
sora::sora_job_status(leibniz_addresses)
Information from SoRa: 
2026-04-22 15:03:58: WAITING ─ The geocoding job is waiting to be processed

Setup and run the Geocoding

# load sora package
library(sora)

# set API key for the session
Sys.setenv(SORA_API_KEY = readLines("sora_key"))

# check if the sora API can be reached
sora_available()

# start the geocoding
leibniz_addresses <-
  sora::sora_geocoder(
    leibniz_addresses
  )

# check status
sora::sora_job_status(leibniz_addresses)
[1] "Waiting for SoRa API to be finished..."
[1] "Waiting for SoRa API to be finished..."
Information from SoRa: 
2026-04-22 15:04:17: SUCCESSFUL ─ Linking was finished successfully

Setup and run the Geocoding

# load sora package
library(sora)

# set API key for the session
Sys.setenv(SORA_API_KEY = readLines("sora_key"))

# check if the sora API can be reached
sora_available()

# start the geocoding
leibniz_addresses <-
  sora::sora_geocoder(
    leibniz_addresses
  )

# check status
sora::sora_job_status(leibniz_addresses)

# pulling the results from the server
leibniz_addresses <- sora::sora_results(leibniz_addresses)

leibniz_addresses
Information from SoRa: 
# A tibble: 10 × 4
   id        y     x score     
   <chr> <dbl> <dbl> <chr>     
 1 1      49.5  8.46 0.999     
 2 2      50.9  6.95 0.9836667 
 3 3      51.6  9.95 0.98488575
 4 4      51.3 12.4  0.9822792 
 5 5      48.0  7.86 0.984953  
 6 6      54.1 12.1  0.99587816
 7 7      49.5  8.48 0.999     
 8 8      52.4 13.6  0.999     
 9 9      49.3  7.04 0.9396661 
10 10     52.5 14.1  0.9485    

Convert To sf Object And Plot

leibniz_addresses_sf <-
  leibniz_addresses |> 
  sf::st_as_sf(coords = c("x", "y"), crs = 4326)

tmaptools::read_osm(
  leibniz_addresses_sf, 
  type = "esri-topo"
) |> 
  terra::rast() |> 
  tm_shape() +
  tm_rgb() +
  tm_shape(leibniz_addresses_sf) +
  tm_dots(size = 2, col = "red")

Data Linking

Linking via common identifiers is most commonly used but comes with challenges (e.g., territorial status and land reforms? Comparable units? Heterogeneity within units?).

Spatial linking methods (examples) I

Lookups

e.g., sf::st_join()

Circles

e.g., sf::st_buffer()

Spatial linking methods (examples) II

Squares

e.g., sf::st_buffer(..., endCapStyle = "SQUARE")

Isochrones

e.g., openrouteservice::ors_isochrones()

Cheatsheet: Spatial Operations

An overview of spatial operations using the sf package can be accessed here.

sf vs. sora

In principle, you are now well-suited to conduct many of these methods on your own, locally using sf and other spatial packages. However, this endeavor can still be complicated to setup, data may not be available, or your work station is simply not able to run your applications. To navigate these issues, we (GESIS, IÖR, SOEP) developed a full-blown spatial linking infrastructure in the SoRa project. You can install the sora package to access its API with this code:

pak::pkg_install(
  "git::https://git.gesis.org/sora-service/sora.git"
)

Cheatsheet: Spatial Linking

An overview of (most) of the spatial linking operations using the sora package can be accessed in the folder sessions.

Fake research question

Say we’re interested in the impact of flooding zones in a neighborhood on individual-level attitudes towards the prepardness of one’s municipality regarding future flooding events.

We plan to conduct a survey which is representative of the population of Germany.

https://imgflip.com/i/9ptcuu

Synthetic georeferenced survey data

We have added a synthetic dataset of 500 geocoordinates in the ./data/ folder (aggregated to 1 sq km centroids). Initially, it was based on a sample of the georeferenced GESIS Panel.

synthetic_survey_geocoordinates <-
  readRDS("./data/synthetic_survey_geocoordinates.rds")

tmaptools::read_osm(
  synthetic_survey_geocoordinates, 
  type = "esri-topo"
) |> 
  terra::rast() |> 
  tm_shape() +
  tm_rgb() +
  tm_shape(synthetic_survey_geocoordinates) +
  tm_dots(col = "red")

Correspondence table

As in any survey that deals with addresses, we need a correspondence table of the distinct identifiers.

correspondence_table <-
  dplyr::bind_cols(
    id_survey = 
      stringi::stri_rand_strings(10000, 10) |>  
      sample(500, replace = FALSE),
    id = synthetic_survey_geocoordinates$id
  )

correspondence_table
# A tibble: 500 × 2
   id_survey     id
   <chr>      <int>
 1 JmD0GGi5Qb   142
 2 05qo5ARXIM   263
 3 vFpzEpQl9g   839
 4 SK1zB3PTlX  1400
 5 JYqj1USKdw  1783
 6 ChQrF9ma5f   705
 7 X50bXxtVOA   392
 8 jnTMPbuXIe   120
 9 iSHwXaSJwu  1612
10 zvL4R2QdGC  1533
# ℹ 490 more rows

Conduct the survey

We ‘ask’ respondents for some standard sociodemographics. But we also include an item from the GESIS Panel on environmental attitudes: “The risk of flooding is increasing in the municipality where I live.” (risk_flood; Range 1, “Do not agree at all”, to 7, “Fully agree”). Since we cannot share the actual data, we created fake data using the faux package.

fake_survey_data <- 
  dplyr::bind_cols(
    id = correspondence_table$id,
    age = 
      pmin(pmax(rnorm(500, mean = 50, sd = 15))) |> 
      round(),
    gender = 
      sample(1:3, 500, replace = TRUE, prob = c(.45, .45, .1)) |> 
      as.factor(),
    education =
      pmin(pmax(rnorm(500, mean = 3, sd = 1))) |> 
      round(),
    income =
      pmin(pmax(rlnorm(500, meanlog = log(2500), sdlog = 0.5))) |> 
      round(),
    risk_flood = secret_variable_we_are_hiding_from_you
  )

fake_survey_data
# A tibble: 500 × 6
      id   age gender education income risk_flood
   <int> <dbl> <fct>      <dbl>  <dbl>      <dbl>
 1   142    33 1              2   1418          4
 2   263    45 1              2   3342          6
 3   839    30 1              3   1319          6
 4  1400    62 3              3   3445          7
 5  1783    26 2              2   2429          4
 6   705    76 1              4   2457          6
 7   392    54 1              3   1574          6
 8   120    46 2              1   1299          1
 9  1612    41 2              3   2418          7
10  1533    65 1              3   2316          3
# ℹ 490 more rows

Hypothesis

Exposure to flooding zones

The higher the exposure to flooding zones the more pessimistic is the opinion about prepardness of municipalities regarding future flooding risks.

Distribution of our ‘survey data’

par(mfrow=c(2, 3))
for (variable in names(fake_survey_data)[-1]) {
  hist(
    as.numeric(fake_survey_data[[variable]]), main = variable, xlab = variable
  )
}

The sora datapicker

…but we can also use the sora datapicker–again either online or directly within R. In the end of this effort, we need a specific ID to feed into the other sora functions (see below). So let’s search for our flooding data in a resolution of 1000 meters and the collection year 2021.

# tabular output of all available spatial datasets
sora::sora_datapicker("spatial")
# A tibble: 9,443 × 14
   data_provider dataset_id      datatype description geometry_type keywords label required_permissions service_provider
   <chr>         <chr>           <chr>    <chr>       <chr>         <chr>    <chr> <chr>                <chr>           
 1 IOER          ioer-monitor-g… numeric  Number of … Raster        ""       1000… Nutzungsbedingungen… IOER            
 2 IOER          ioer-monitor-g… numeric  Number of … Raster        ""       1000… Nutzungsbedingungen… IOER            
 3 IOER          ioer-monitor-g… numeric  Number of … Raster        ""       5000… Nutzungsbedingungen… IOER            
 4 IOER          ioer-monitor-g… numeric  Number of … Vector: Poly… ""       Citi… Nutzungsbedingungen… IOER            
 5 IOER          ioer-monitor-g… numeric  Number of … Vector: Poly… ""       Dist… Nutzungsbedingungen… IOER            
 6 IOER          ioer-monitor-g… numeric  Number of … Vector: Poly… ""       Muni… Nutzungsbedingungen… IOER            
 7 IOER          ioer-monitor-g… numeric  Number of … Vector: Poly… ""       Muni… Nutzungsbedingungen… IOER            
 8 IOER          ioer-monitor-g… numeric  Number of … Vector: Poly… ""       Spat… Nutzungsbedingungen… IOER            
 9 IOER          ioer-monitor-g… numeric  Number of … Vector: Poly… ""       Stat… Nutzungsbedingungen… IOER            
10 IOER          ioer-monitor-g… numeric  Number of … Raster        ""       1000… Nutzungsbedingungen… IOER            
# ℹ 9,433 more rows
# ℹ 5 more variables: spatial_resolution <chr>, time_frame <int>, title <chr>, unit <chr>, url <chr>

The sora datapicker

…but we can also use the sora datapicker–again either online or directly within R. In the end of this effort, we need a specific ID to feed into the other sora functions (see below). So let’s search for our flooding data in a resolution of 1000 meters and the collection year 2021.

# tabular output of all available spatial datasets
sora::sora_datapicker("spatial")

# filtering for the ID we need
filtered_spatial_data <-
  sora::sora_datapicker("spatial") |> 
  dplyr::filter(grepl("flood", title)) |> 
  dplyr::filter(spatial_resolution == "1000m Raster") |> 
  dplyr::filter(time_frame == 2021) 

filtered_spatial_data |> 
  dplyr::select(dataset_id, description)
# A tibble: 3 × 2
  dataset_id                    description                                                                             
  <chr>                         <chr>                                                                                   
1 ioer-monitor-r01rg-2021-1000m Ratio of officially designated flood zones to the total reference area                  
2 ioer-monitor-r05rt-2021-1000m Ratio of built-up settlement space and transportation infrastructure in offically desig…
3 ioer-monitor-r04rt-2021-1000m Area coverage of built-up settlement and transportion area within the flooding area.    

The ID we need is ioer-monitor-r01rg-2021-1000m.

Defining the spatial dataset

All requests to link data in sora boil down to this simple schema:

linking_job <- 
  sora_request(
    dataset,
    link_to,
    method,
    ...
  )

We define our dataset with our geocoordinates (fake_survey_geocoordinates), a spatial dataset link_to (IOER data), and a method we want to use for linking (more on that in a minute). Indeed, there’s more we can specify as indicated by the ....

Setting up our geocoordinates

Let’s start with our geocoordinates. Orginally, sora was designed as an interface to georeferenced data, e.g., from SOEP or GESIS. But sora can also be used for custom datasets, like our fake_survey_geocoordinates. For this purpose, we need the sora::custom_data() function.

custom_data <- sora::sora_custom(synthetic_survey_geocoordinates)

custom_data
<sora_custom>
Using custom data with EPSG code 3035. 

     id       x       y
1   142 4294500 3306500
2   263 4119500 3159500
3   839 4132500 3169500
4  1400 4319500 3397500
5  1783 4582500 3264500
6   705 4225500 3231500
7   392 4127500 3052500
8   120 4204500 3272500
9  1612 4619500 3140500
10 1533 4302500 2840500

Setting up our geospatial data

As we did our research, we know the ID of the geospatial data we are interested in: ioer-monitor-r01rg-2021-1000m. We can use the function sora::sora_spatial() to ‘register’ the data for the request to the sora API.

flooding_zones <- sora::sora_spatial("ioer-monitor-r01rg-2021-1000m")

flooding_zones
<sora_spatial>
Linking to the spatial dataset ioer-monitor-r01rg-2021-1000m. 

Setting up our first linkages

The next step is to define how we want to link both datasets. Let’s start with a 5000 meters circular buffer! It’s a method of aggregation, which is why we specify it in the first step.

linking_method_1 <-
  sora::sora_linking(
    "aggregate_attribute",
    selection_area = "circle",
    radius = 5000,
    output = "mean"
  )

linking_method_1
<sora_linking>
Using method aggregate_attribute with the following parameters:

selection_area: circle
radius: 5000
output: mean 

Doing the analysis

lm(
  risk_flood ~ age + gender + education + income + flood_zones_perc, 
  data = fake_survey_data_linked
) |> 
  sjPlot::plot_model()

Exercise 6: Using SoRa💪

🖱Click here for the exercise