Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide

Introduction to Geospatial Techniques for Social Scientists in R

Investigating Spatial Autocorrelation

Stefan Jünger & Anne-Kathrin Stroppe

GESIS Workshop

April 24, 2024

1 / 41

Now

Day Time Title
April 23 10:00-11:30 Introduction to GIS
April 23 11:45-13:00 Vector Data
April 23 13:00-14:00 Lunch Break
April 23 14:00-15:30 Mapping
April 23 15:45-17:00 Raster Data
April 24 09:00-10:30 Advanced Data Import & Processing
April 24 10:45-12:00 Applied Data Wrangling & Linking
April 24 12:00-13:00 Lunch Break
April 24 13:00-14:30 Investigating Spatial Autocorrelation
April 24 14:45-16:00 Spatial Econometrics & Outlook
2 / 41

Thus far

We've done some

  • wrangling,
  • mapping,
  • and linking of geospatial data (with georeferenced survey data)

We've seen that geospatial data are

  • relevant to provide context
    • as social scientists, we know that space is important!
  • nice to look at
    • we can tell a story

However, geospatial data can be interesting on their own for social science studies!

3 / 41

Tobler's first law of geography

[E]verything is related to everything else, but near things are more related than distant things (Tobler 1970, p. 236)

This means nearby geographical regions, institutions, or people are more similar to each other or do have a stronger influence on each other.

What we get is an interdependent system.

Tobler, W. R. (1970). A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography, 46, 234–240. https://doi.org/10.2307/143141

4 / 41

Spatial Interdependence or Autocorrelation

Tobler's law is the fundamental principle of doing spatial analysis. We want to know

  1. if observations in our data are spatially interdependent
  2. and how this interdependence can be explained (= data generation process)
5 / 41

Developing a model of connectiveness: the chess board

6 / 41

Developing a model of connectiveness: the chess board

7 / 41

Rook and queen neighborhoods

8 / 41

Rook and queen neighborhoods

9 / 41

It's an interdependent system

10 / 41

Let's do it hands-on: Our 'research' question

Say, we are interested in AfD voting outcomes in relation to ethnic compositions of neighborhoods

  • combination of far-right voting research with Allport's classic contact theory
  • we are just doing it in the Urban context of Cologne (again)
11 / 41

Voting districts

voting_districts <-
sf::st_read("./data/Stimmbezirk.shp") |>
sf::st_transform(3035) |>
dplyr::transmute(Stimmbezirk = as.numeric(nummer))
## Reading layer `Stimmbezirk' from data source
## `C:\Users\mueller2\a_talks_presentations\gesis-workshop-geospatial-techniques-R-2024\data\Stimmbezirk.shp' using driver `ESRI Shapefile'
## Simple feature collection with 543 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 343914.7 ymin: 5632759 xmax: 370674.3 ymax: 5661475
## Projected CRS: ETRS89 / UTM zone 32N
head(voting_districts, 2)
## Simple feature collection with 2 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
## Stimmbezirk geometry
## 1 10205 MULTIPOLYGON (((4105588 309...
## 2 10213 MULTIPOLYGON (((4106748 309...
12 / 41

AfD vote share

afd_votes <-
glue::glue(
"https://www.stadt-koeln.de/wahlen/bundestagswahl/09-2021/praesentation/\\
Open-Data-Bundestagswahl476.csv"
) |>
readr::read_csv2() |>
dplyr::transmute(Stimmbezirk = `gebiet-nr`, afd_share = (F1 / F) * 100)
head(afd_votes, 2)
## # A tibble: 2 × 2
## Stimmbezirk afd_share
## <dbl> <dbl>
## 1 10101 13.1
## 2 10102 11.4
13 / 41
election_results <-
dplyr::left_join(
voting_districts,
afd_votes,
by = "Stimmbezirk"
)
head(election_results, 2)
## Simple feature collection with 2 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
## Stimmbezirk afd_share geometry
## 1 10205 9.74026 MULTIPOLYGON (((4105588 309...
## 2 10213 11.76471 MULTIPOLYGON (((4106748 309...
14 / 41

Do vote shares spatially cluster?

tm_shape(election_results) +
tm_polygons(
"afd_share",
palette = "viridis"
)

15 / 41

Pull in German Census data

immigrants_cologne <-
z11::z11_get_100m_attribute(STAATSANGE_KURZ_2) |>
terra::crop(election_results) |>
terra::mask(election_results)
inhabitants_cologne <-
z11::z11_get_100m_attribute(Einwohner) |>
terra::crop(election_results) |>
terra::mask(election_results)
immigrant_share_cologne <-
(immigrants_cologne / inhabitants_cologne) * 100
16 / 41

It's raster data

tm_shape(immigrant_share_cologne) +
tm_raster(palette = "-viridis")

17 / 41

Linking: Let's get geographical

As the voting (vector) data is different to the Census raster data we cannot use simple ID matching like before

  • we have to rely on spatial linking techniques
  • we could use terra::extract()
    • but as a default it only captures raster cells as a whole and not their spatial fraction
    • which is honestly okay for most applications
    • but why not try something else?
18 / 41

exactextractr::exact_extract()!

election_results <-
election_results |>
dplyr::mutate(
immigrant_share =
exactextractr::exact_extract(
immigrant_share_cologne, election_results, 'mean', progress = FALSE
),
inhabitants =
exactextractr::exact_extract(
inhabitants_cologne, election_results, 'mean', progress = FALSE
)
)
head(election_results, 2)
## Simple feature collection with 2 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
## Stimmbezirk afd_share geometry immigrant_share inhabitants
## 1 10205 9.74026 MULTIPOLYGON (((4105588 309... 15.00740 140.1017
## 2 10213 11.76471 MULTIPOLYGON (((4106748 309... 10.66915 107.6079
19 / 41

Voilà

tm_shape(election_results) +
tm_polygons(
"immigrant_share",
palette = "-viridis"
)

20 / 41

How to test spatial autocorrelation

We now have to ask

  • do the spatial units relate to each other?
  • if yes, in which way?
    • only if they are bordering each other? (i.e., Queens or Rooks)
    • or also if they are in proximity but not necessarily contiguous?

21 / 41

Let's try Queens neighborhoods

queens_neighborhoods <-
spdep::poly2nb(
election_results,
queen = TRUE
)
summary(queens_neighborhoods)
## Neighbour list object:
## Number of regions: 543
## Number of nonzero links: 2978
## Percentage nonzero weights: 1.010009
## Average number of links: 5.484346
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 2 12 53 107 136 102 58 30 21 13 4 2 2 1
## 2 least connected regions:
## 69 196 with 1 link
## 1 most connected region:
## 249 with 14 links
22 / 41

Connected regions

queens_neighborhoods |>
spdep::nb2lines(
coords = sf::st_as_sfc(election_results),
as_sf = TRUE
) |>
tm_shape() +
tm_dots() +
tm_lines()

23 / 41

Can we now start?

Unfortunately, we are yet done with creating the links between neighborhoods. What we receive is, in principle, a huge matrix with connected observations.

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 1 1 0
## 3 0 0 0 1 1 0 0 0 1 0
## 4 0 0 1 0 1 1 0 0 0 0
## 5 0 0 1 1 0 1 0 0 0 0
## 6 0 0 0 1 1 0 1 1 0 0
## 7 0 0 0 0 0 1 0 1 1 0
## 8 0 1 0 0 0 1 1 0 1 0
## 9 0 1 1 0 0 0 1 1 0 0
## 10 0 0 0 0 0 0 0 0 0 0

That's nothing we could plug into a statistical model, such as a regression or the like (see next session).

24 / 41

Normalization

Normalization is the process of creating actual spatial weights. There is a huge dispute on how to do it (Neumayer & Plümper, 2016). But nobody questions whether it should be done in the first place since, among others, it restricts the parameter space of the weights.

non-normalized

## [,1] [,2] [,3] [,4] [,5]
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 1 1
## 4 0 0 1 0 1
## 5 0 0 1 1 0
## [1] 0 0 2 2 2

normalized

## [,1] [,2] [,3] [,4] [,5]
## 1 0 0 0.0000000 0.000 0.0000000
## 2 0 0 0.0000000 0.000 0.0000000
## 3 0 0 0.0000000 0.125 0.1250000
## 4 0 0 0.3333333 0.000 0.3333333
## 5 0 0 0.2500000 0.250 0.0000000
## [1] 0.0000000 0.0000000 0.2500000 0.6666667 0.5000000



Neumayer, E., & Plümper, T. (2016). W. Political Science Research and Methods, 4(01), 175–193. https://doi.org/10.1017/psrm.2014.40

25 / 41

Row-normalization

One of the disputed but at the same time standard procedures is row-normalization. It divides all individual weights (=connections between spatial units) wij by the row-wise sum of of all other weights:

W=jwijjwij

An alternative would be minmax-normalization:

W=jwijmin(wij)max(wij)min(wij)

26 / 41

Apply row-normalization

queens_W <- spdep::nb2listw(queens_neighborhoods, style = "W")
summary(queens_W)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 543
## Number of nonzero links: 2978
## Percentage nonzero weights: 1.010009
## Average number of links: 5.484346
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 2 12 53 107 136 102 58 30 21 13 4 2 2 1
## 2 least connected regions:
## 69 196 with 1 link
## 1 most connected region:
## 249 with 14 links
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 543 294849 543 211.1469 2261.598
27 / 41

Test of spatial autocorrelation: Moran's I

I=NNi=1Nj=1wijNi=1Nj=1wij(xiˉx)(xjˉx)Ni=1(xiˉx)2

Most and foremost, Moran's I makes use of the previously created weights between all spatial units pairs wij. It weights deviations from an overall mean value of connected pairs according to the strength of the modeled spatial relations. Moran's I can be interpreted as some sort of a correlation coefficient (-1 = perfect negative spatial autocorrelation; +1 = perfect positive spatial autocorrelation).

28 / 41

Moran's I in spdep

spdep::moran.test(
election_results$immigrant_share,
listw = queens_W
)
##
## Moran I test under randomisation
##
## data: election_results$immigrant_share
## weights: queens_W
##
## Moran I statistic standard deviate = 20.428, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.5408375504 -0.0018450185 0.0007057415
29 / 41

Test of spatial autocorrelation: Geary's C

Moran's I is a global statistic for spatial autocorrelation. It can produce issues when there are only local clusters of spatial interdependence in the data. An alternative is the use of Geary's C:

C=(N1)ijwij(xixj)22Ni=1Nj=1wiji(xiˉx)2

As you can see, in the numerator, the average value ˉx is not as prominent as in Moran's I. Geary's C only produces values between 0 and 2 (value near 0 = positive spatial autocorrelation; 1 = no spatial autocorrelation; values near 2 = negative spatial autocorrelation).

30 / 41

Geary's C in spdep

spdep::geary.test(
election_results$immigrant_share,
listw = queens_W
)
##
## Geary C test under randomisation
##
## data: election_results$immigrant_share
## weights: queens_W
##
## Geary C statistic standard deviate = 17.355, p-value < 2.2e-16
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic Expectation Variance
## 0.442840220 1.000000000 0.001030604
31 / 41

Modern inferface to neighbors: sfdep package

The sfdep package provides a more tidyverse-compliant syntax to spatial weights. See:

election_results <-
election_results |>
dplyr::mutate(
neighbors = sfdep::st_contiguity(election_results), # queen neighborhoods by default
weights = sfdep::st_weights(neighbors)
)
head(election_results, 2)
## Simple feature collection with 2 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 4104887 ymin: 3094313 xmax: 4107089 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
## Stimmbezirk afd_share geometry immigrant_share inhabitants neighbors
## 1 10205 9.74026 MULTIPOLYGON (((4105588 309... 15.00740 140.1017 16, 18, 22, 25, 458, 459, 460, 462
## 2 10213 11.76471 MULTIPOLYGON (((4106748 309... 10.66915 107.6079 8, 9, 11, 48, 49
## weights
## 1 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125
## 2 0.2, 0.2, 0.2, 0.2, 0.2
32 / 41

Calculating once again Moran's I

library(magrittr)
election_results %$%
sfdep::global_moran_test(immigrant_share, neighbors, weights)
##
## Moran I test under randomisation
##
## data: x
## weights: listw
##
## Moran I statistic standard deviate = 20.428, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.5408375504 -0.0018450185 0.0007057415
33 / 41

Calculating once again Geary's C

election_results %$%
sfdep::global_c_test(immigrant_share, neighbors, weights)
##
## Geary C test under randomisation
##
## data: x
## weights: listw
##
## Geary C statistic standard deviate = 17.355, p-value < 2.2e-16
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic Expectation Variance
## 0.442840220 1.000000000 0.001030604
34 / 41

Exercise 2_3_1: Neighborhood Matrices

Exercise

Solution

35 / 41

Measures of local spatial autocorrelation: LISA clusters

The reason why we show you the sfdep package is that it provides nice functions to calculate local measures of spatial autocorrelation. One popular choice are the estimation of Local Indicators of Spatial Autocorrelation (i.e., LISA clusters). In the most straightforward way they can be interpreted as case-specific indicators of spatial autocorrelation:

Ii=xiˉxNi1(xiˉx)2NNj=1wij(xjˉx)

36 / 41

LISA clusters in sfdep

lisa <-
election_results |>
dplyr::mutate(
lisa = sfdep::local_moran(afd_share, neighbors, weights)
) |>
tidyr::unnest()
head(lisa, 2)
## Simple feature collection with 2 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 4104887 ymin: 3095603 xmax: 4105588 ymax: 3096356
## Projected CRS: ETRS89-extended / LAEA Europe
## # A tibble: 2 × 19
## Stimmbezirk afd_share geometry immigrant_share inhabitants neighbors weights ii eii var_ii z_ii p_ii
## <dbl> <dbl> <MULTIPOLYGON [m]> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10205 9.74 (((4105588 3096207, 4105571 3095… 15.0 140. 16 0.125 1.13 -0.00889 0.204 2.52 0.0116
## 2 10205 9.74 (((4105588 3096207, 4105571 3095… 15.0 140. 18 0.125 1.13 -0.00889 0.204 2.52 0.0116
## # ℹ 7 more variables: p_ii_sim <dbl>, p_folded_sim <dbl>, skewness <dbl>, kurtosis <dbl>, mean <fct>, median <fct>, pysal <fct>
37 / 41

They are also nice for mapping

tm_shape(lisa) +
tm_fill("afd_share", midpoint = NA, palette = "viridis")

tm_shape(lisa) +
tm_fill("ii", midpoint = NA, palette = "viridis")

38 / 41

One last bit: bivariate LISAs

lisa_bivariate <-
election_results |>
dplyr::mutate(
lisa = sfdep::local_moran_bv(
afd_share,
immigrant_share,
neighbors,
weights
)
) |>
tidyr::unnest()
tm_shape(lisa_bivariate) +
tm_fill(
"Ib",
midpoint = NA,
palette = "viridis"
)

39 / 41

Wrap up

You now know how to

  • model connectedness of spatial units
  • investigate spatial autocorrelation
    • globally
    • locally
  • map it

There's way more, particularly when it comes to

  • spatial weights (see exercise)
  • clustering techniques (e.g., Hot Spot Analysis)
  • autocorrelation with more than one or two variables

Now we know our data are spatially autocorrelated. Let's try to find out why this is the case via some spatial econometrics

40 / 41

Now

Day Time Title
April 23 10:00-11:30 Introduction to GIS
April 23 11:45-13:00 Vector Data
April 23 13:00-14:00 Lunch Break
April 23 14:00-15:30 Mapping
April 23 15:45-17:00 Raster Data
April 24 09:00-10:30 Advanced Data Import & Processing
April 24 10:45-12:00 Applied Data Wrangling & Linking
April 24 12:00-13:00 Lunch Break
April 24 13:00-14:30 Investigating Spatial Autocorrelation
April 24 14:45-16:00 Spatial Econometrics & Outlook
2 / 41
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
oTile View: Overview of Slides
Esc Back to slideshow