Introduction to Geospatial Techniques for Social Scientists in R

Introduction to Geospatial Techniques for Social Scientists in R
Investigating Spatial Autocorrelation
Stefan Jünger & Anne-Kathrin Stroppe
GESIS Workshop
April 24, 2024
1 / 41

Now
 
    Day 
    Time 
    Title 
  


    April 23 
    10:00-11:30 
    Introduction to GIS 
  

    April 23 
    11:45-13:00 
    Vector Data 
  

    April 23 
    13:00-14:00 
    Lunch Break 
  

    April 23 
    14:00-15:30 
    Mapping 
  

    April 23 
    15:45-17:00 
    Raster Data 
  

    April 24 
    09:00-10:30 
    Advanced Data Import & Processing 
  

    April 24 
    10:45-12:00 
    Applied Data Wrangling & Linking 
  

    April 24 
    12:00-13:00 
    Lunch Break 
  

    April 24 
    13:00-14:30 
    Investigating Spatial Autocorrelation 
  

    April 24 
    14:45-16:00 
    Spatial Econometrics & Outlook 
  

2 / 41

Day	Time	Title
April 23	10:00-11:30	Introduction to GIS
April 23	11:45-13:00	Vector Data
April 23	13:00-14:00	Lunch Break
April 23	14:00-15:30	Mapping
April 23	15:45-17:00	Raster Data
April 24	09:00-10:30	Advanced Data Import & Processing
April 24	10:45-12:00	Applied Data Wrangling & Linking
April 24	12:00-13:00	Lunch Break
April 24	13:00-14:30	Investigating Spatial Autocorrelation
April 24	14:45-16:00	Spatial Econometrics & Outlook

Thus far

We've done some

wrangling,
mapping,
and linking of geospatial data (with georeferenced survey data)

We've seen that geospatial data are

relevant to provide context
- as social scientists, we know that space is important!
nice to look at
- we can tell a story

However, geospatial data can be interesting on their own for social science studies!

3 / 41

Tobler's first law of geography

[E]verything is related to everything else, but near things are more related than distant things (Tobler 1970, p. 236)

This means nearby geographical regions, institutions, or people are more similar to each other or do have a stronger influence on each other.

What we get is an interdependent system.

Tobler, W. R. (1970). A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography, 46, 234–240. https://doi.org/10.2307/143141

4 / 41

Spatial Interdependence or Autocorrelation

Tobler's law is the fundamental principle of doing spatial analysis. We want to know

if observations in our data are spatially interdependent
and how this interdependence can be explained (= data generation process)

5 / 41

Developing a model of connectiveness: the chess board

6 / 41

Developing a model of connectiveness: the chess board

7 / 41

Rook and queen neighborhoods

8 / 41

Rook and queen neighborhoods

9 / 41

It's an interdependent system

10 / 41

Let's do it hands-on: Our 'research' question

Say, we are interested in AfD voting outcomes in relation to ethnic compositions of neighborhoods

combination of far-right voting research with Allport's classic contact theory
we are just doing it in the Urban context of Cologne (again)

11 / 41

Voting districts

voting_districts <-
  sf::st_read("./data/Stimmbezirk.shp") |> 
  sf::st_transform(3035) |> 
  dplyr::transmute(Stimmbezirk = as.numeric(nummer))

## Reading layer `Stimmbezirk' from data source 
##   `C:\Users\mueller2\a_talks_presentations\gesis-workshop-geospatial-techniques-R-2024\data\Stimmbezirk.shp' using driver `ESRI Shapefile'
## Simple feature collection with 543 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 343914.7 ymin: 5632759 xmax: 370674.3 ymax: 5661475
## Projected CRS: ETRS89 / UTM zone 32N

head(voting_districts, 2)

## Simple feature collection with 2 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
##   Stimmbezirk                       geometry
## 1       10205 MULTIPOLYGON (((4105588 309...
## 2       10213 MULTIPOLYGON (((4106748 309...

12 / 41

afd_votes <-
  glue::glue(
    "https://www.stadt-koeln.de/wahlen/bundestagswahl/09-2021/praesentation/\\
    Open-Data-Bundestagswahl476.csv"
  ) |> 
  readr::read_csv2() |>
  dplyr::transmute(Stimmbezirk = `gebiet-nr`, afd_share = (F1 / F) * 100)
head(afd_votes, 2)

## # A tibble: 2 × 2
##   Stimmbezirk afd_share
##         <dbl>     <dbl>
## 1       10101      13.1
## 2       10102      11.4

13 / 41

Simple ID matching to link data

election_results <-
  dplyr::left_join(
    voting_districts,
    afd_votes,
    by = "Stimmbezirk"
  )
head(election_results, 2)

## Simple feature collection with 2 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
##   Stimmbezirk afd_share                       geometry
## 1       10205   9.74026 MULTIPOLYGON (((4105588 309...
## 2       10213  11.76471 MULTIPOLYGON (((4106748 309...

14 / 41

Do vote shares spatially cluster?

tm_shape(election_results) +
  tm_polygons(
    "afd_share",
    palette = "viridis"
  )

15 / 41

Pull in German Census data

immigrants_cologne <-
  z11::z11_get_100m_attribute(STAATSANGE_KURZ_2) |>
  terra::crop(election_results) |>
  terra::mask(election_results)
inhabitants_cologne <-
  z11::z11_get_100m_attribute(Einwohner) |>
  terra::crop(election_results) |>
  terra::mask(election_results)
immigrant_share_cologne <-
  (immigrants_cologne / inhabitants_cologne) * 100

16 / 41

It's raster data

tm_shape(immigrant_share_cologne) +
  tm_raster(palette = "-viridis")

17 / 41

Linking: Let's get geographical

As the voting (vector) data is different to the Census raster data we cannot use simple ID matching like before

we have to rely on spatial linking techniques
we could use terra::extract()
- but as a default it only captures raster cells as a whole and not their spatial fraction
- which is honestly okay for most applications
- but why not try something else?

18 / 41

`exactextractr::exact_extract()`!

election_results <-
  election_results |>
  dplyr::mutate(
    immigrant_share = 
      exactextractr::exact_extract(
        immigrant_share_cologne, election_results, 'mean', progress = FALSE
      ),
    inhabitants = 
      exactextractr::exact_extract(
        inhabitants_cologne, election_results, 'mean', progress = FALSE
      )
  )
head(election_results, 2)

## Simple feature collection with 2 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
##   Stimmbezirk afd_share                       geometry immigrant_share inhabitants
## 1       10205   9.74026 MULTIPOLYGON (((4105588 309...        15.00740    140.1017
## 2       10213  11.76471 MULTIPOLYGON (((4106748 309...        10.66915    107.6079

19 / 41

Voilà

tm_shape(election_results) +
  tm_polygons(
    "immigrant_share", 
    palette = "-viridis"
  )

20 / 41

How to test spatial autocorrelation

We now have to ask

do the spatial units relate to each other?
if yes, in which way?
- only if they are bordering each other? (i.e., Queens or Rooks)
- or also if they are in proximity but not necessarily contiguous?

21 / 41

Let's try Queens neighborhoods

queens_neighborhoods <-
  spdep::poly2nb(
    election_results,
    queen = TRUE
  )
summary(queens_neighborhoods)

## Neighbour list object:
## Number of regions: 543 
## Number of nonzero links: 2978 
## Percentage nonzero weights: 1.010009 
## Average number of links: 5.484346 
## Link number distribution:
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
##   2  12  53 107 136 102  58  30  21  13   4   2   2   1 
## 2 least connected regions:
## 69 196 with 1 link
## 1 most connected region:
## 249 with 14 links

22 / 41

Connected regions

queens_neighborhoods |>
  spdep::nb2lines(
    coords = sf::st_as_sfc(election_results), 
    as_sf = TRUE
  ) |>
  tm_shape() +
  tm_dots() +
  tm_lines()

23 / 41

Can we now start?

Unfortunately, we are yet done with creating the links between neighborhoods. What we receive is, in principle, a huge matrix with connected observations.

##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 1     0    0    0    0    0    0    0    0    0     0
## 2     0    0    0    0    0    0    0    1    1     0
## 3     0    0    0    1    1    0    0    0    1     0
## 4     0    0    1    0    1    1    0    0    0     0
## 5     0    0    1    1    0    1    0    0    0     0
## 6     0    0    0    1    1    0    1    1    0     0
## 7     0    0    0    0    0    1    0    1    1     0
## 8     0    1    0    0    0    1    1    0    1     0
## 9     0    1    1    0    0    0    1    1    0     0
## 10    0    0    0    0    0    0    0    0    0     0

That's nothing we could plug into a statistical model, such as a regression or the like (see next session).

24 / 41

Normalization

Normalization is the process of creating actual spatial weights. There is a huge dispute on how to do it (Neumayer & Plümper, 2016). But nobody questions whether it should be done in the first place since, among others, it restricts the parameter space of the weights.

non-normalized

##   [,1] [,2] [,3] [,4] [,5]
## 1    0    0    0    0    0
## 2    0    0    0    0    0
## 3    0    0    0    1    1
## 4    0    0    1    0    1
## 5    0    0    1    1    0

## [1] 0 0 2 2 2

normalized

##   [,1] [,2]      [,3]  [,4]      [,5]
## 1    0    0 0.0000000 0.000 0.0000000
## 2    0    0 0.0000000 0.000 0.0000000
## 3    0    0 0.0000000 0.125 0.1250000
## 4    0    0 0.3333333 0.000 0.3333333
## 5    0    0 0.2500000 0.250 0.0000000

## [1] 0.0000000 0.0000000 0.2500000 0.6666667 0.5000000

Neumayer, E., & Plümper, T. (2016). W. Political Science Research and Methods, 4(01), 175–193. https://doi.org/10.1017/psrm.2014.40

25 / 41

Row-normalization

One of the disputed but at the same time standard procedures is row-normalization. It divides all individual weights (=connections between spatial units) $w_{ij}$ by the row-wise sum of of all other weights:

$W = \sum_j\frac{w_{ij}}{\sum_jw_{ij}}$

An alternative would be minmax-normalization:

$W = \sum_j\frac{w_{ij} - min(w_{ij})}{max(w_{ij})-min(w_{ij})}$

26 / 41

Apply row-normalization

queens_W <- spdep::nb2listw(queens_neighborhoods, style = "W")
summary(queens_W)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 543 
## Number of nonzero links: 2978 
## Percentage nonzero weights: 1.010009 
## Average number of links: 5.484346 
## Link number distribution:
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
##   2  12  53 107 136 102  58  30  21  13   4   2   2   1 
## 2 least connected regions:
## 69 196 with 1 link
## 1 most connected region:
## 249 with 14 links
## 
## Weights style: W 
## Weights constants summary:
##     n     nn  S0       S1       S2
## W 543 294849 543 211.1469 2261.598

27 / 41

Test of spatial autocorrelation: Moran's I

$I=\frac{N}{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}}\frac{\sum_{i=1}^{N}\sum_{j=1}^Nw_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i=1}^N(x_i-\bar{x})^2}$

Most and foremost, Moran's I makes use of the previously created weights between all spatial units pairs $w_{ij}$ . It weights deviations from an overall mean value of connected pairs according to the strength of the modeled spatial relations. Moran's I can be interpreted as some sort of a correlation coefficient (-1 = perfect negative spatial autocorrelation; +1 = perfect positive spatial autocorrelation).

28 / 41

Moran's I in `spdep`

spdep::moran.test(
  election_results$immigrant_share, 
  listw = queens_W
)

## 
##     Moran I test under randomisation
## 
## data:  election_results$immigrant_share  
## weights: queens_W    
## 
## Moran I statistic standard deviate = 20.428, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5408375504     -0.0018450185      0.0007057415

29 / 41

Test of spatial autocorrelation: Geary's C

Moran's I is a global statistic for spatial autocorrelation. It can produce issues when there are only local clusters of spatial interdependence in the data. An alternative is the use of Geary's C:

$C=\frac{(N-1)\sum_i\sum_jw_{ij}(x_i-x_j)^2}{2\sum_{i=1}^N\sum_{j=1}^Nw_{ij}\sum_i(x_i-\bar{x})^2}$

As you can see, in the numerator, the average value $\bar{x}$ is not as prominent as in Moran's I. Geary's C only produces values between 0 and 2 (value near 0 = positive spatial autocorrelation; 1 = no spatial autocorrelation; values near 2 = negative spatial autocorrelation).

30 / 41

Geary's C in `spdep`

spdep::geary.test(
  election_results$immigrant_share, 
  listw = queens_W
)

## 
##     Geary C test under randomisation
## 
## data:  election_results$immigrant_share 
## weights: queens_W   
## 
## Geary C statistic standard deviate = 17.355, p-value < 2.2e-16
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic       Expectation          Variance 
##       0.442840220       1.000000000       0.001030604

31 / 41

Modern inferface to neighbors: `sfdep` package

The sfdep package provides a more tidyverse-compliant syntax to spatial weights. See:

election_results <-
  election_results |> 
  dplyr::mutate(
    neighbors = sfdep::st_contiguity(election_results), # queen neighborhoods by default
    weights = sfdep::st_weights(neighbors)
  )
head(election_results, 2)

## Simple feature collection with 2 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
##   Stimmbezirk afd_share                       geometry immigrant_share inhabitants                          neighbors
## 1       10205   9.74026 MULTIPOLYGON (((4105588 309...        15.00740    140.1017 16, 18, 22, 25, 458, 459, 460, 462
## 2       10213  11.76471 MULTIPOLYGON (((4106748 309...        10.66915    107.6079                   8, 9, 11, 48, 49
##                                                  weights
## 1 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125
## 2                                0.2, 0.2, 0.2, 0.2, 0.2

32 / 41

Calculating once again Moran's I

library(magrittr)
election_results %$% 
  sfdep::global_moran_test(immigrant_share, neighbors, weights)

## 
##     Moran I test under randomisation
## 
## data:  x  
## weights: listw    
## 
## Moran I statistic standard deviate = 20.428, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5408375504     -0.0018450185      0.0007057415

33 / 41

Calculating once again Geary's C

election_results %$% 
  sfdep::global_c_test(immigrant_share, neighbors, weights)

## 
##     Geary C test under randomisation
## 
## data:  x 
## weights: listw   
## 
## Geary C statistic standard deviate = 17.355, p-value < 2.2e-16
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic       Expectation          Variance 
##       0.442840220       1.000000000       0.001030604

34 / 41

Exercise 2_3_1: Neighborhood Matrices

Exercise

Solution

35 / 41

Measures of local spatial autocorrelation: LISA clusters

The reason why we show you the sfdep package is that it provides nice functions to calculate local measures of spatial autocorrelation. One popular choice are the estimation of Local Indicators of Spatial Autocorrelation (i.e., LISA clusters). In the most straightforward way they can be interpreted as case-specific indicators of spatial autocorrelation:

$I_i=\frac{x_i-\bar{x}}{\frac{\sum_{i-1}^N(x_i-\bar{x})^2}{N}}\sum_{j=1}^Nw_{ij}(x_j-\bar{x})$

36 / 41

LISA clusters in `sfdep`

lisa <- 
  election_results |> 
  dplyr::mutate(
    lisa = sfdep::local_moran(afd_share, neighbors, weights)
  ) |> 
  tidyr::unnest()
head(lisa, 2)

## Simple feature collection with 2 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3095603 xmax: 4105588 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
## # A tibble: 2 × 19
##   Stimmbezirk afd_share                          geometry immigrant_share inhabitants neighbors weights    ii      eii var_ii  z_ii   p_ii
##         <dbl>     <dbl>                <MULTIPOLYGON [m]>           <dbl>       <dbl>     <int>   <dbl> <dbl>    <dbl>  <dbl> <dbl>  <dbl>
## 1       10205      9.74 (((4105588 3096207, 4105571 3095…            15.0        140.        16   0.125  1.13 -0.00889  0.204  2.52 0.0116
## 2       10205      9.74 (((4105588 3096207, 4105571 3095…            15.0        140.        18   0.125  1.13 -0.00889  0.204  2.52 0.0116
## # ℹ 7 more variables: p_ii_sim <dbl>, p_folded_sim <dbl>, skewness <dbl>, kurtosis <dbl>, mean <fct>, median <fct>, pysal <fct>

37 / 41

They are also nice for mapping

tm_shape(lisa) +
  tm_fill("afd_share", midpoint = NA, palette = "viridis")

tm_shape(lisa) +
  tm_fill("ii", midpoint = NA, palette = "viridis")

38 / 41

One last bit: bivariate LISAs

lisa_bivariate <- 
  election_results |> 
  dplyr::mutate(
    lisa = sfdep::local_moran_bv(
      afd_share, 
      immigrant_share, 
      neighbors, 
      weights
      )
  ) |> 
  tidyr::unnest()
tm_shape(lisa_bivariate) +
  tm_fill(
    "Ib", 
    midpoint = NA, 
    palette = "viridis"
  )

39 / 41

Wrap up

You now know how to

model connectedness of spatial units
investigate spatial autocorrelation
- globally
- locally
map it

There's way more, particularly when it comes to

spatial weights (see exercise)
clustering techniques (e.g., Hot Spot Analysis)
autocorrelation with more than one or two variables

Now we know our data are spatially autocorrelated. Let's try to find out why this is the case via some spatial econometrics

40 / 41

stefan.juenger@gesis.org

@StefanJuenger

StefanJuenger

https://stefanjuenger.github.io

41 / 41

Day	Time	Title
April 23	10:00-11:30	Introduction to GIS
April 23	11:45-13:00	Vector Data
April 23	13:00-14:00	Lunch Break
April 23	14:00-15:30	Mapping
April 23	15:45-17:00	Raster Data
April 24	09:00-10:30	Advanced Data Import & Processing
April 24	10:45-12:00	Applied Data Wrangling & Linking
April 24	12:00-13:00	Lunch Break
April 24	13:00-14:30	Investigating Spatial Autocorrelation
April 24	14:45-16:00	Spatial Econometrics & Outlook

Introduction to Geospatial Techniques for Social Scientists in R
Investigating Spatial Autocorrelation
Stefan Jünger & Anne-Kathrin Stroppe
GESIS Workshop
April 24, 2024
1 / 41

Now
 
    Day 
    Time 
    Title 
  


    April 23 
    10:00-11:30 
    Introduction to GIS 
  

    April 23 
    11:45-13:00 
    Vector Data 
  

    April 23 
    13:00-14:00 
    Lunch Break 
  

    April 23 
    14:00-15:30 
    Mapping 
  

    April 23 
    15:45-17:00 
    Raster Data 
  

    April 24 
    09:00-10:30 
    Advanced Data Import & Processing 
  

    April 24 
    10:45-12:00 
    Applied Data Wrangling & Linking 
  

    April 24 
    12:00-13:00 
    Lunch Break 
  

    April 24 
    13:00-14:30 
    Investigating Spatial Autocorrelation 
  

    April 24 
    14:45-16:00 
    Spatial Econometrics & Outlook 
  

2 / 41

Day	Time	Title
April 23	10:00-11:30	Introduction to GIS
April 23	11:45-13:00	Vector Data
April 23	13:00-14:00	Lunch Break
April 23	14:00-15:30	Mapping
April 23	15:45-17:00	Raster Data
April 24	09:00-10:30	Advanced Data Import & Processing
April 24	10:45-12:00	Applied Data Wrangling & Linking
April 24	12:00-13:00	Lunch Break
April 24	13:00-14:30	Investigating Spatial Autocorrelation
April 24	14:45-16:00	Spatial Econometrics & Outlook

Thus far

We've done some

wrangling,
mapping,
and linking of geospatial data (with georeferenced survey data)

We've seen that geospatial data are

relevant to provide context
- as social scientists, we know that space is important!
nice to look at
- we can tell a story

However, geospatial data can be interesting on their own for social science studies!

3 / 41

Tobler's first law of geography

[E]verything is related to everything else, but near things are more related than distant things (Tobler 1970, p. 236)

This means nearby geographical regions, institutions, or people are more similar to each other or do have a stronger influence on each other.

What we get is an interdependent system.

Tobler, W. R. (1970). A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography, 46, 234–240. https://doi.org/10.2307/143141

4 / 41

Spatial Interdependence or Autocorrelation

Tobler's law is the fundamental principle of doing spatial analysis. We want to know

if observations in our data are spatially interdependent
and how this interdependence can be explained (= data generation process)

5 / 41

Developing a model of connectiveness: the chess board

6 / 41

Developing a model of connectiveness: the chess board

7 / 41

Rook and queen neighborhoods

8 / 41

Rook and queen neighborhoods

9 / 41

It's an interdependent system

10 / 41

Let's do it hands-on: Our 'research' question

Say, we are interested in AfD voting outcomes in relation to ethnic compositions of neighborhoods

combination of far-right voting research with Allport's classic contact theory
we are just doing it in the Urban context of Cologne (again)

11 / 41

Voting districts

voting_districts <-
  sf::st_read("./data/Stimmbezirk.shp") |> 
  sf::st_transform(3035) |> 
  dplyr::transmute(Stimmbezirk = as.numeric(nummer))

## Reading layer `Stimmbezirk' from data source 
##   `C:\Users\mueller2\a_talks_presentations\gesis-workshop-geospatial-techniques-R-2024\data\Stimmbezirk.shp' using driver `ESRI Shapefile'
## Simple feature collection with 543 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 343914.7 ymin: 5632759 xmax: 370674.3 ymax: 5661475
## Projected CRS: ETRS89 / UTM zone 32N

head(voting_districts, 2)

## Simple feature collection with 2 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
##   Stimmbezirk                       geometry
## 1       10205 MULTIPOLYGON (((4105588 309...
## 2       10213 MULTIPOLYGON (((4106748 309...

12 / 41

afd_votes <-
  glue::glue(
    "https://www.stadt-koeln.de/wahlen/bundestagswahl/09-2021/praesentation/\\
    Open-Data-Bundestagswahl476.csv"
  ) |> 
  readr::read_csv2() |>
  dplyr::transmute(Stimmbezirk = `gebiet-nr`, afd_share = (F1 / F) * 100)
head(afd_votes, 2)

## # A tibble: 2 × 2
##   Stimmbezirk afd_share
##         <dbl>     <dbl>
## 1       10101      13.1
## 2       10102      11.4

13 / 41

Simple ID matching to link data

election_results <-
  dplyr::left_join(
    voting_districts,
    afd_votes,
    by = "Stimmbezirk"
  )
head(election_results, 2)

## Simple feature collection with 2 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
##   Stimmbezirk afd_share                       geometry
## 1       10205   9.74026 MULTIPOLYGON (((4105588 309...
## 2       10213  11.76471 MULTIPOLYGON (((4106748 309...

14 / 41

Do vote shares spatially cluster?

tm_shape(election_results) +
  tm_polygons(
    "afd_share",
    palette = "viridis"
  )

15 / 41

Pull in German Census data

immigrants_cologne <-
  z11::z11_get_100m_attribute(STAATSANGE_KURZ_2) |>
  terra::crop(election_results) |>
  terra::mask(election_results)
inhabitants_cologne <-
  z11::z11_get_100m_attribute(Einwohner) |>
  terra::crop(election_results) |>
  terra::mask(election_results)
immigrant_share_cologne <-
  (immigrants_cologne / inhabitants_cologne) * 100

16 / 41

It's raster data

tm_shape(immigrant_share_cologne) +
  tm_raster(palette = "-viridis")

17 / 41

Linking: Let's get geographical

As the voting (vector) data is different to the Census raster data we cannot use simple ID matching like before

we have to rely on spatial linking techniques
we could use terra::extract()
- but as a default it only captures raster cells as a whole and not their spatial fraction
- which is honestly okay for most applications
- but why not try something else?

18 / 41

`exactextractr::exact_extract()`!

election_results <-
  election_results |>
  dplyr::mutate(
    immigrant_share = 
      exactextractr::exact_extract(
        immigrant_share_cologne, election_results, 'mean', progress = FALSE
      ),
    inhabitants = 
      exactextractr::exact_extract(
        inhabitants_cologne, election_results, 'mean', progress = FALSE
      )
  )
head(election_results, 2)

## Simple feature collection with 2 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
##   Stimmbezirk afd_share                       geometry immigrant_share inhabitants
## 1       10205   9.74026 MULTIPOLYGON (((4105588 309...        15.00740    140.1017
## 2       10213  11.76471 MULTIPOLYGON (((4106748 309...        10.66915    107.6079

19 / 41

Voilà

tm_shape(election_results) +
  tm_polygons(
    "immigrant_share", 
    palette = "-viridis"
  )

20 / 41

How to test spatial autocorrelation

We now have to ask

do the spatial units relate to each other?
if yes, in which way?
- only if they are bordering each other? (i.e., Queens or Rooks)
- or also if they are in proximity but not necessarily contiguous?

21 / 41

Let's try Queens neighborhoods

queens_neighborhoods <-
  spdep::poly2nb(
    election_results,
    queen = TRUE
  )
summary(queens_neighborhoods)

## Neighbour list object:
## Number of regions: 543 
## Number of nonzero links: 2978 
## Percentage nonzero weights: 1.010009 
## Average number of links: 5.484346 
## Link number distribution:
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
##   2  12  53 107 136 102  58  30  21  13   4   2   2   1 
## 2 least connected regions:
## 69 196 with 1 link
## 1 most connected region:
## 249 with 14 links

22 / 41

Connected regions

queens_neighborhoods |>
  spdep::nb2lines(
    coords = sf::st_as_sfc(election_results), 
    as_sf = TRUE
  ) |>
  tm_shape() +
  tm_dots() +
  tm_lines()

23 / 41

Can we now start?

Unfortunately, we are yet done with creating the links between neighborhoods. What we receive is, in principle, a huge matrix with connected observations.

##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 1     0    0    0    0    0    0    0    0    0     0
## 2     0    0    0    0    0    0    0    1    1     0
## 3     0    0    0    1    1    0    0    0    1     0
## 4     0    0    1    0    1    1    0    0    0     0
## 5     0    0    1    1    0    1    0    0    0     0
## 6     0    0    0    1    1    0    1    1    0     0
## 7     0    0    0    0    0    1    0    1    1     0
## 8     0    1    0    0    0    1    1    0    1     0
## 9     0    1    1    0    0    0    1    1    0     0
## 10    0    0    0    0    0    0    0    0    0     0

That's nothing we could plug into a statistical model, such as a regression or the like (see next session).

24 / 41

Normalization

non-normalized

##   [,1] [,2] [,3] [,4] [,5]
## 1    0    0    0    0    0
## 2    0    0    0    0    0
## 3    0    0    0    1    1
## 4    0    0    1    0    1
## 5    0    0    1    1    0

## [1] 0 0 2 2 2

normalized

##   [,1] [,2]      [,3]  [,4]      [,5]
## 1    0    0 0.0000000 0.000 0.0000000
## 2    0    0 0.0000000 0.000 0.0000000
## 3    0    0 0.0000000 0.125 0.1250000
## 4    0    0 0.3333333 0.000 0.3333333
## 5    0    0 0.2500000 0.250 0.0000000

## [1] 0.0000000 0.0000000 0.2500000 0.6666667 0.5000000

Neumayer, E., & Plümper, T. (2016). W. Political Science Research and Methods, 4(01), 175–193. https://doi.org/10.1017/psrm.2014.40

25 / 41

Row-normalization

$W = \sum_j\frac{w_{ij}}{\sum_jw_{ij}}$

An alternative would be minmax-normalization:

$W = \sum_j\frac{w_{ij} - min(w_{ij})}{max(w_{ij})-min(w_{ij})}$

26 / 41

Apply row-normalization

queens_W <- spdep::nb2listw(queens_neighborhoods, style = "W")
summary(queens_W)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 543 
## Number of nonzero links: 2978 
## Percentage nonzero weights: 1.010009 
## Average number of links: 5.484346 
## Link number distribution:
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
##   2  12  53 107 136 102  58  30  21  13   4   2   2   1 
## 2 least connected regions:
## 69 196 with 1 link
## 1 most connected region:
## 249 with 14 links
## 
## Weights style: W 
## Weights constants summary:
##     n     nn  S0       S1       S2
## W 543 294849 543 211.1469 2261.598

27 / 41

Test of spatial autocorrelation: Moran's I

$I=\frac{N}{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}}\frac{\sum_{i=1}^{N}\sum_{j=1}^Nw_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i=1}^N(x_i-\bar{x})^2}$

28 / 41

Moran's I in `spdep`

spdep::moran.test(
  election_results$immigrant_share, 
  listw = queens_W
)

## 
##     Moran I test under randomisation
## 
## data:  election_results$immigrant_share  
## weights: queens_W    
## 
## Moran I statistic standard deviate = 20.428, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5408375504     -0.0018450185      0.0007057415

29 / 41

Test of spatial autocorrelation: Geary's C

Moran's I is a global statistic for spatial autocorrelation. It can produce issues when there are only local clusters of spatial interdependence in the data. An alternative is the use of Geary's C:

$C=\frac{(N-1)\sum_i\sum_jw_{ij}(x_i-x_j)^2}{2\sum_{i=1}^N\sum_{j=1}^Nw_{ij}\sum_i(x_i-\bar{x})^2}$

30 / 41

Geary's C in `spdep`

spdep::geary.test(
  election_results$immigrant_share, 
  listw = queens_W
)

## 
##     Geary C test under randomisation
## 
## data:  election_results$immigrant_share 
## weights: queens_W   
## 
## Geary C statistic standard deviate = 17.355, p-value < 2.2e-16
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic       Expectation          Variance 
##       0.442840220       1.000000000       0.001030604

31 / 41

Modern inferface to neighbors: `sfdep` package

The sfdep package provides a more tidyverse-compliant syntax to spatial weights. See:

election_results <-
  election_results |> 
  dplyr::mutate(
    neighbors = sfdep::st_contiguity(election_results), # queen neighborhoods by default
    weights = sfdep::st_weights(neighbors)
  )
head(election_results, 2)

## Simple feature collection with 2 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
##   Stimmbezirk afd_share                       geometry immigrant_share inhabitants                          neighbors
## 1       10205   9.74026 MULTIPOLYGON (((4105588 309...        15.00740    140.1017 16, 18, 22, 25, 458, 459, 460, 462
## 2       10213  11.76471 MULTIPOLYGON (((4106748 309...        10.66915    107.6079                   8, 9, 11, 48, 49
##                                                  weights
## 1 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125
## 2                                0.2, 0.2, 0.2, 0.2, 0.2

32 / 41

Calculating once again Moran's I

library(magrittr)
election_results %$% 
  sfdep::global_moran_test(immigrant_share, neighbors, weights)

## 
##     Moran I test under randomisation
## 
## data:  x  
## weights: listw    
## 
## Moran I statistic standard deviate = 20.428, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5408375504     -0.0018450185      0.0007057415

33 / 41

Calculating once again Geary's C

election_results %$% 
  sfdep::global_c_test(immigrant_share, neighbors, weights)

## 
##     Geary C test under randomisation
## 
## data:  x 
## weights: listw   
## 
## Geary C statistic standard deviate = 17.355, p-value < 2.2e-16
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic       Expectation          Variance 
##       0.442840220       1.000000000       0.001030604

34 / 41

Exercise 2_3_1: Neighborhood Matrices

Exercise

Solution

35 / 41

Measures of local spatial autocorrelation: LISA clusters

$I_i=\frac{x_i-\bar{x}}{\frac{\sum_{i-1}^N(x_i-\bar{x})^2}{N}}\sum_{j=1}^Nw_{ij}(x_j-\bar{x})$

36 / 41

LISA clusters in `sfdep`

lisa <- 
  election_results |> 
  dplyr::mutate(
    lisa = sfdep::local_moran(afd_share, neighbors, weights)
  ) |> 
  tidyr::unnest()
head(lisa, 2)

## Simple feature collection with 2 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4104887 ymin: 3095603 xmax: 4105588 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
## # A tibble: 2 × 19
##   Stimmbezirk afd_share                          geometry immigrant_share inhabitants neighbors weights    ii      eii var_ii  z_ii   p_ii
##         <dbl>     <dbl>                <MULTIPOLYGON [m]>           <dbl>       <dbl>     <int>   <dbl> <dbl>    <dbl>  <dbl> <dbl>  <dbl>
## 1       10205      9.74 (((4105588 3096207, 4105571 3095…            15.0        140.        16   0.125  1.13 -0.00889  0.204  2.52 0.0116
## 2       10205      9.74 (((4105588 3096207, 4105571 3095…            15.0        140.        18   0.125  1.13 -0.00889  0.204  2.52 0.0116
## # ℹ 7 more variables: p_ii_sim <dbl>, p_folded_sim <dbl>, skewness <dbl>, kurtosis <dbl>, mean <fct>, median <fct>, pysal <fct>

37 / 41

They are also nice for mapping

tm_shape(lisa) +
  tm_fill("afd_share", midpoint = NA, palette = "viridis")

tm_shape(lisa) +
  tm_fill("ii", midpoint = NA, palette = "viridis")

38 / 41

One last bit: bivariate LISAs

lisa_bivariate <- 
  election_results |> 
  dplyr::mutate(
    lisa = sfdep::local_moran_bv(
      afd_share, 
      immigrant_share, 
      neighbors, 
      weights
      )
  ) |> 
  tidyr::unnest()
tm_shape(lisa_bivariate) +
  tm_fill(
    "Ib", 
    midpoint = NA, 
    palette = "viridis"
  )

39 / 41

Wrap up

You now know how to

model connectedness of spatial units
investigate spatial autocorrelation
- globally
- locally
map it

There's way more, particularly when it comes to

spatial weights (see exercise)
clustering techniques (e.g., Hot Spot Analysis)
autocorrelation with more than one or two variables

Now we know our data are spatially autocorrelated. Let's try to find out why this is the case via some spatial econometrics

40 / 41

stefan.juenger@gesis.org

@StefanJuenger

StefanJuenger

https://stefanjuenger.github.io

41 / 41

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help
o	Tile View: Overview of Slides

Introduction to Geospatial Techniques for Social Scientists in R

Investigating Spatial Autocorrelation

Stefan Jünger & Anne-Kathrin Stroppe

GESIS Workshop

April 24, 2024

Now

Thus far

Tobler's first law of geography

Spatial Interdependence or Autocorrelation

Developing a model of connectiveness: the chess board

Developing a model of connectiveness: the chess board

Rook and queen neighborhoods

Rook and queen neighborhoods

It's an interdependent system

Let's do it hands-on: Our 'research' question

Voting districts

AfD vote share

Simple ID matching to link data

Do vote shares spatially cluster?

Pull in German Census data

It's raster data

Linking: Let's get geographical

exactextractr::exact_extract()!

Voilà

How to test spatial autocorrelation

Let's try Queens neighborhoods

Connected regions

Can we now start?

Normalization

Row-normalization

Apply row-normalization

Test of spatial autocorrelation: Moran's I

Moran's I in spdep

Test of spatial autocorrelation: Geary's C

Geary's C in spdep

Modern inferface to neighbors: sfdep package

Calculating once again Moran's I

Calculating once again Geary's C

Exercise 2_3_1: Neighborhood Matrices

Measures of local spatial autocorrelation: LISA clusters

LISA clusters in sfdep

They are also nice for mapping

One last bit: bivariate LISAs

Wrap up

Now

Help

Introduction to Geospatial Techniques for Social Scientists in R

Introduction to Geospatial Techniques for Social Scientists in R

Investigating Spatial Autocorrelation

Stefan Jünger & Anne-Kathrin Stroppe

GESIS Workshop

April 24, 2024

Now

Thus far

Tobler's first law of geography

Spatial Interdependence or Autocorrelation

Developing a model of connectiveness: the chess board

Developing a model of connectiveness: the chess board

Rook and queen neighborhoods

Rook and queen neighborhoods

It's an interdependent system

Let's do it hands-on: Our 'research' question

Voting districts

AfD vote share

Simple ID matching to link data

Do vote shares spatially cluster?

Pull in German Census data

It's raster data

Linking: Let's get geographical

exactextractr::exact_extract()!

Voilà

How to test spatial autocorrelation

Let's try Queens neighborhoods

Connected regions

Can we now start?

Normalization

Row-normalization

Apply row-normalization

Test of spatial autocorrelation: Moran's I

Moran's I in spdep

`exactextractr::exact_extract()`!

Moran's I in `spdep`

Geary's C in `spdep`

Modern inferface to neighbors: `sfdep` package

LISA clusters in `sfdep`

`exactextractr::exact_extract()`!

Moran's I in `spdep`

Geary's C in `spdep`

Modern inferface to neighbors: `sfdep` package

LISA clusters in `sfdep`