## Now

| Day | Time | Title |
|:----|:-----|:------|
| April 23 | 10:00-11:30 | Introduction to GIS |
| April 23 | 11:45-13:00 | Vector Data |
| April 23 | 13:00-14:00 | Lunch Break |
| April 23 | 14:00-15:30 | Mapping |
| April 23 | 15:45-17:00 | Raster Data |
| April 24 | 09:00-10:30 | Advanced Data Import & Processing |
| April 24 | 10:45-12:00 | Applied Data Wrangling & Linking |
| April 24 | 12:00-13:00 | Lunch Break |
| April 24 | 13:00-14:30 | Investigating Spatial Autocorrelation |
| April 24 | 14:45-16:00 | Spatial Econometrics & Outlook |

## What are spatial econometrics?

Econometrics could be reduced to using statistics to model (complex) theories ...

- it is interesting for causal inference and thinking
- by default, we think about regression analysis

Therefore, spatial econometrics combine spatial analysis and econometrics

- study of why spatial relationships (i.e., autocorrelation) exist
- how spatial autocorrelation affects our outcome of interest

**What is the data generation process?**

---

## Spatial diffusion vs. spatial spillover

There are at least two common mechanisms that we are interested in when doing spatial econometrics

.pull-left[
.small[
**Diffusion**

- `\(y_i\)` affects `\(y_j\)` through `\(w_{ij}\)`
- `\(y_j\)` affects `\(y_i\)` through `\(w_{ji}\)`
- that's a feedback effect
- endogenous by design!
- Examples:
  - pandemic and policy measures to contain the pandemic
  - diffusion of violence in a war
]
]

.pull-right[
.small[
**Spillover**

- `\(x_i\)` affects `\(y_j\)` through `\(w_{ij}\)`
- `\(x_j\)` affects `\(y_i\)` through `\(w_{ji}\)`
- Examples:
  - spillover of economic strength and trade
]
]

---

## Let's have another look at our chessboard

.pull-left[
We have to think about theories and mechanisms and how they translate into spatial effects and the data generation process.

That said, there are tests to check for the specific data generation process at hand, but they should not be applied naively.
]

.pull-right[
.center[
<img src="data:image/png;base64,#../img/queen_interdependent.png" width="3464" style="display: block; margin: auto;" />
]
]

---

## Is it meaningful or just a nuisance?

Space can be important in our analysis in two ways.

- it's meaningful in our theory and we thus interpret it accordingly after estimation
- it can distort our empirical estimates, producing bias, inconsistency, and inefficiency

**We can address both of these perspectives in our analysis with spatial econometric methods.**

---

## Formulas...
models, models, models

Linear Regression:

`$$Y = X\beta + \epsilon$$`

--

Spatial Lag Y / Spatial Autoregressive Model (SAR, Diffusion):

`$$Y = \rho WY + X\beta + \epsilon$$`

--

Spatial Lag X Model (SLX, Spillover):

`$$Y = X\beta + WX\theta + \epsilon$$`

--

Spatial Error Model (SEM):

`$$Y = X\beta + u$$`

`$$u = \lambda Wu + \epsilon$$`

---

## Flavors and extensions

.tinyisher[
Spatial Durbin Model:

`$$Y = \rho WY + X\beta + WX\theta + \epsilon$$`
]

.tinyisher[
Spatial Durbin Error Model:

`$$Y = X\beta + WX\theta + u$$`

`$$u = \lambda Wu + \epsilon$$`
]

.tinyisher[
Combined Spatial Autocorrelation Model:

`$$Y = \rho WY + X\beta + u$$`

`$$u = \lambda Wu + \epsilon$$`
]

.tinyisher[
Manski Model:

`$$Y = \rho WY + WX\theta + X\beta + u$$`

`$$u = \lambda Wu + \epsilon$$`
]

.center[
<img src="data:image/png;base64,#../img/formulas.gif" width="40%" style="display: block; margin: auto;" />

.tinyisher[
Source: [Tenor](
]
]

---

## Intermediate summary

There are a lot of models you could estimate to *explain* spatial autocorrelation.

And there's a vast body of literature on which choice is best for which application. We'd explicitly like to recommend the work of [Tobias Rüttenauer]( for us social scientists. [Here]( are some really nice workshop materials.

**In this session, we will only estimate Spatial Lag Y, Spatial Lag X, and Spatial Error Models.**

---

## 'Research' question and data

We will use the same example as in the previous session. But this time, we will actually test whether one of our spatial regression models helps us investigate the data generation process any further.

We may ask:

1. Do immigrant shares have an effect on AfD voting shares within voting districts?
2. Do immigrant shares have an effect on AfD voting shares between neighborhoods? (=spillover)
3. Do AfD voting shares have an effect on AfD voting shares between neighborhoods? (=diffusion)

It might also be a good idea to control for inhabitant numbers within the voting districts.
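---

## Aside: `\(WX\)` on a tiny chessboard

The `\(W\)` in the formulas can be made concrete on a 3 x 3 chessboard. What follows is only a minimal base-`R` sketch with made-up values; the names `grid`, `queen_matrix`, `W`, and `x` are hypothetical and not part of the workshop data.

```r
# 3 x 3 chessboard: cell coordinates (hypothetical toy data, not the election data)
grid <- expand.grid(col = 1:3, row = 1:3)

# queen contiguity: two cells are neighbors if they touch at an edge or a corner,
# i.e., if their Chebyshev ("maximum") distance equals 1
chebyshev <- as.matrix(dist(grid, method = "maximum"))
queen_matrix <- (chebyshev == 1) * 1

# row-standardization, analogous to style = "W" in spdep::nb2listw()
W <- queen_matrix / rowSums(queen_matrix)

# made-up covariate: 1 in the center cell (cell 5), 0 everywhere else
x <- c(0, 0, 0, 0, 1, 0, 0, 0, 0)

# spatial lag WX: the average of x among each cell's neighbors
spatial_lag_x <- as.vector(W %*% x)
round(spatial_lag_x, 3)
```

The 1 in the center cell shows up as 1/3 in the corner cells and as 1/5 in the edge cells, because corners have three and edges have five queen neighbors. In the Spatial Lag X Model, `\(\theta\)` is the coefficient on such a lagged variable.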
---

## Linear regression

```r
linear_regression <-
  lm(afd_share ~ immigrant_share + inhabitants, data = election_results)

summary(linear_regression)
```

```
##
## Call:
## lm(formula = afd_share ~ immigrant_share + inhabitants, data = election_results)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.010  -3.397  -0.232   2.790  25.032
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     27.737242   0.579582  47.857  < 2e-16 ***
## immigrant_share -0.097675   0.026150  -3.735 0.000207 ***
## inhabitants     -0.079595   0.003812 -20.879  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.843 on 540 degrees of freedom
## Multiple R-squared:  0.4822, Adjusted R-squared:  0.4803
## F-statistic: 251.4 on 2 and 540 DF,  p-value: < 2.2e-16
```

---

## Now we need spatial weights

To estimate a spatial regression, we once again have to construct spatial weights, as in the analysis of spatial autocorrelation. In fact, we'll use the same approach as before.

```r
queen_neighborhoods <-
  spdep::poly2nb(election_results, queen = TRUE)

queen_W <-
  spdep::nb2listw(queen_neighborhoods, style = "W")
```

---

## Spatial Error Model: If we want to control for nuisance

.small[
```r
spatial_error_model <-
  spatialreg::errorsarlm(
    afd_share ~ immigrant_share + inhabitants,
    data = election_results,
    listw = queen_W
  )

summary(spatial_error_model)
```

```
##
## Call:
## spatialreg::errorsarlm(formula = afd_share ~ immigrant_share +
##     inhabitants, data = election_results, listw = queen_W)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -9.60213 -2.38063 -0.40782  1.97417 25.55441
##
## Type: error
## Coefficients: (asymptotic standard errors)
##                   Estimate Std. Error z value  Pr(>|z|)
## (Intercept)     22.8185498  0.9398113 24.2799 < 2.2e-16
## immigrant_share -0.0806095  0.0281025 -2.8684  0.004125
## inhabitants     -0.0337644  0.0045643 -7.3974 1.388e-13
##
## Lambda: 0.75749, LR test value: 216.39, p-value: < 2.22e-16
## Asymptotic standard error: 0.033094
##     z-value: 22.889, p-value: < 2.22e-16
## Wald statistic: 523.9, p-value: < 2.22e-16
##
## Log likelihood: -1517.349 for error model
## ML residual variance (sigma squared): 13.532, (sigma: 3.6785)
## Number of observations: 543
## Number of parameters estimated: 5
## AIC: NA (not available for weighted model), (AIC for lm: 3259.1)
```
]

---

## Spatial Lag X Model: estimating spillovers

.small[
```r
spatial_lag_x_model <-
  spatialreg::lmSLX(
    afd_share ~ immigrant_share + inhabitants,
    data = election_results,
    listw = queen_W
  )

summary(spatial_lag_x_model)
```

```
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
##     data =, weights = weights)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.4243  -3.0311  -0.1935   2.4388  25.0694
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         30.649157   0.671665  45.632  < 2e-16 ***
## immigrant_share     -0.069702   0.034623  -2.013   0.0446 *
## inhabitants         -0.026439   0.005841  -4.526  7.4e-06 ***
## lag.immigrant_share -0.026168   0.048127  -0.544   0.5869
## lag.inhabitants     -0.085389   0.007656 -11.153  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.364 on 538 degrees of freedom
## Multiple R-squared:  0.5811, Adjusted R-squared:  0.578
## F-statistic: 186.6 on 4 and 538 DF,  p-value: < 2.2e-16
```
]

---

## Spatial Lag Y Model: estimating diffusion

.small[
```r
spatial_lag_y_model <-
  spatialreg::lagsarlm(
    afd_share ~ immigrant_share + inhabitants,
    data = election_results,
    listw = queen_W
  )

summary(spatial_lag_y_model)
```

```
##
## Call:
## spatialreg::lagsarlm(formula = afd_share ~ immigrant_share +
##     inhabitants, data = election_results, listw = queen_W)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -10.17786  -2.27359  -0.29956   1.98212  24.26683
##
## Type: lag
## Coefficients: (asymptotic standard errors)
##                   Estimate Std. Error z value  Pr(>|z|)
## (Intercept)     10.2465884  0.9782773 10.4741 < 2.2e-16
## immigrant_share -0.0527021  0.0196904 -2.6765  0.007439
## inhabitants     -0.0330830  0.0034265 -9.6551 < 2.2e-16
##
## Rho: 0.66446, LR test value: 261.11, p-value: < 2.22e-16
## Asymptotic standard error: 0.03489
##     z-value: 19.045, p-value: < 2.22e-16
## Wald statistic: 362.69, p-value: < 2.22e-16
##
## Log likelihood: -1494.985 for lag model
## ML residual variance (sigma squared): 12.992, (sigma: 3.6045)
## Number of observations: 543
## Number of parameters estimated: 5
## AIC: 3000, (AIC for lm: 3259.1)
## LM test for residual autocorrelation
## test value: 21.043, p-value: 4.4919e-06
```
]

---

## Comparison: What's 'better'?
.small[
```r
AIC(spatial_error_model, spatial_lag_x_model, spatial_lag_y_model)
```

```
##                     df      AIC
## spatial_error_model  5 3044.697
## spatial_lag_x_model  6 3147.995
## spatial_lag_y_model  5 2999.971
```

```r
spdep::lm.LMtests(linear_regression, queen_W, test = c("LMerr", "LMlag"))
```

```
##
##  Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = afd_share ~ immigrant_share +
## inhabitants, data = election_results)
## weights: queen_W
##
## LMerr = 198.29, df = 1, p-value < 2.2e-16
##
##
##  Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = afd_share ~ immigrant_share +
## inhabitants, data = election_results)
## weights: queen_W
##
## LMlag = 299.73, df = 1, p-value < 2.2e-16
```
]

Let's stick to our theory, shall we?

---

## Of higher importance: interpretation

Unfortunately, in the case of a Spatial Lag Y Model, the spatial parameter `\(\rho\)` only tells us whether the effect is (statistically) significant or not.

- remember: these models are endogenous by design
- we have effects of `\(y_j\)` on `\(y_i\)` and vice versa
- what a mess

Luckily, there's a method to decompose the spatial effects into direct, indirect and total effects: **estimating impacts**

---

## Impact estimation in `R`

This time, let's start with the Spatial Lag Y Model:

```r
spatialreg::impacts(spatial_lag_y_model, listw = queen_W)
```

```
## Impact measures (lag, exact):
##                      Direct    Indirect       Total
## immigrant_share -0.05948993 -0.09757580 -0.15706572
## inhabitants     -0.03734396 -0.06125182 -0.09859578
```

Compare it to the 'simple' regression output:

```r
coef(spatial_lag_y_model)
```

```
##             rho     (Intercept) immigrant_share     inhabitants
##      0.66445817     10.24658839     -0.05270212     -0.03308301
```

---

## Spatial Lag X impacts

```r
spatialreg::impacts(spatial_lag_x_model, listw = queen_W)
```

```
## Impact measures (SlX, glht):
##                      Direct    Indirect       Total
## immigrant_share -0.06970227 -0.02616764 -0.09586991
## inhabitants     -0.02643886 -0.08538884 -0.11182770
```

Compare it to the 'simple' regression output:

```r
coef(spatial_lag_x_model)
```

```
##         (Intercept)     immigrant_share         inhabitants
##         30.64915652         -0.06970227         -0.02643886
## lag.immigrant_share     lag.inhabitants
##         -0.02616764         -0.08538884
```

---

## If you need p-values and stuff

```r
spatialreg::impacts(spatial_lag_y_model, listw = queen_W, R = 500) %>%
  summary(zstats = TRUE, short = TRUE)
```

```
## Impact measures (lag, exact):
##                      Direct    Indirect       Total
## immigrant_share -0.05948993 -0.09757580 -0.15706572
## inhabitants     -0.03734396 -0.06125182 -0.09859578
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##                      Direct    Indirect       Total
## immigrant_share 0.022390179 0.038273065 0.059658509
## inhabitants     0.003535946 0.008409926 0.009944576
##
## Simulated z-values:
##                     Direct  Indirect      Total
## immigrant_share  -2.661032 -2.575303  -2.650849
## inhabitants     -10.629243 -7.405056 -10.041695
##
## Simulated p-values:
##                 Direct     Indirect   Total
## immigrant_share 0.0077901  0.010015   0.008029
## inhabitants     < 2.22e-16 1.3101e-13 < 2e-16
```

---

class: middle

## Exercise 2_3_2: Spatial Regression

[Exercise](

[Solution](

---

class: middle

## Outlook

---

## This week

<table class="table" style="margin-left: auto; margin: auto;" />

.tinyisher[
Check out [`gganimate`](
]
]

---

## Data Sources

Some more information:

- geospatial data are interdisciplinary
- amount of data feels unlimited
- data providers and data portals are often specific in the area and/or the information they cover

--

Some random examples:

- [Eurostat](
- [European Spatial Data Infrastructure](
- [Johns Hopkins Corona Data Hub and Dashboard](
- [US Census Bureau](
- ...

---

class: middle

## The End

---

class: middle

## Add-on slides: Missings in Spatial Econometrics

---

## What if you have missing values?

Missing values in spatial regression models produce problems similar to those in ordinary regression analysis

- they yield biased estimates
- they reduce statistical power

However, the issue gets a bit more severe as the observations are interdependent

- we are missing out on more information
- even values missing at random might become problematic

**Thus, it might be a good idea to think of methods to address this bias.**

---

## Let's produce a dataset with missing data

.pull-left[
```r
# ~10% missing values (no seed is set, so the drawn rows differ between runs)
missing_index <-
  sample(
    1:nrow(election_results),
    round(nrow(election_results) * .1, 0)
  )

election_results_missing <- election_results
election_results_missing$afd_share[missing_index] <- NA

# list-wise deletion
election_results_missing <- na.omit(election_results_missing)

tm_shape(election_results_missing) +
  tm_fill("afd_share", palette = "viridis")
```
]

.pull-right[
.center[
<img src="data:image/png;base64,#2_4_Spatial_Econometrics_Outlook_files/figure-html/missing-elections-2-1.png" style="display: block; margin: auto;" />
]
]

---

## How does a Spatial Lag Y Model perform?

```r
queen_neighborhoods_missing <-
  spdep::poly2nb(election_results_missing, queen = TRUE)

# zero.policy = TRUE: deleted districts may leave others without neighbors
queen_W_missing <-
  spdep::nb2listw(
    queen_neighborhoods_missing,
    style = "W",
    zero.policy = TRUE
  )

spatial_lag_y_model_missing <-
  spatialreg::lagsarlm(
    afd_share ~ immigrant_share + inhabitants,
    data = election_results_missing,
    listw = queen_W_missing,
    zero.policy = TRUE
  )
```

---

## Model comparison

```r
spatialreg::impacts(spatial_lag_y_model_missing, listw = queen_W_missing)
```

```
## Impact measures (lag, exact):
##                      Direct    Indirect       Total
## immigrant_share -0.05450334 -0.07883944 -0.13334278
## inhabitants     -0.03918632 -0.05668327 -0.09586959
```

```r
spatialreg::impacts(spatial_lag_y_model, listw = queen_W)
```

```
## Impact measures (lag, exact):
##                      Direct    Indirect       Total
## immigrant_share -0.05948993 -0.09757580 -0.15706572
## inhabitants     -0.03734396 -0.06125182 -0.09859578
```

---

## What to do now?

How to deal with missing values in geospatial data depends on the general geometric structure of the data. For points, there are established methods, such as [interpolation]( Often these are, in a sense, ways of aggregating data, which does not help in our case.

I'd say that good old imputation techniques might also help:

- good for multivariate cases
- yet, they are no spatial techniques and cannot create plausible values for spatial relationships
- but imputing spatial relationships would be a matter of contingency anyway

---

## Simplest case of imputation

.pull-left[
```r
# ~10% missing values
missing_index <-
  sample(
    1:nrow(election_results),
    round(nrow(election_results) * .1, 0)
  )

election_results_missing <- election_results
election_results_missing$afd_share[missing_index] <- NA

# impute on the plain data frame, then join the geometries back
election_results_missing <-
  election_results_missing %>%
  sf::st_drop_geometry() %>%
  mice::mice(method = "norm.predict", m = 1) %>%
  mice::complete() %>%
  dplyr::left_join(
    election_results_missing %>%
      dplyr::select(-afd_share, -immigrant_share, -inhabitants)
  ) %>%
  sf::st_as_sf()
```
]

.pull-right[
.center[
```
##
##  iter imp variable
##   1   1  afd_share
##   2   1  afd_share
##   3   1  afd_share
##   4   1  afd_share
##   5   1  afd_share
```

<img src="data:image/png;base64,#2_4_Spatial_Econometrics_Outlook_files/figure-html/impute-2-1.png" style="display: block; margin: auto;" />
]
]

---

## And again run the model

```r
queen_neighborhoods_missing <-
  spdep::poly2nb(election_results_missing, queen = TRUE)

queen_W_missing <-
  spdep::nb2listw(queen_neighborhoods_missing, style = "W")

spatial_lag_y_model_missing <-
  spatialreg::lagsarlm(
    afd_share ~ immigrant_share + inhabitants,
    data = election_results_missing,
    listw = queen_W_missing
  )
```

---

## ...and compare it with the original one

```r
spatialreg::impacts(spatial_lag_y_model_missing, listw = queen_W_missing)
```

```
## Impact measures (lag, exact):
##                      Direct    Indirect      Total
## immigrant_share -0.04834610 -0.06855918 -0.1169053
## inhabitants     -0.04148773 -0.05883338 -0.1003211
```

```r
spatialreg::impacts(spatial_lag_y_model, listw = queen_W)
```

```
## Impact measures (lag, exact):
##                      Direct    Indirect       Total
## immigrant_share -0.05948993 -0.09757580 -0.15706572
## inhabitants     -0.03734396 -0.06125182 -0.09859578
```
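
---

## Aside: checking total impacts by hand

Impact tables like the ones above can be sanity-checked by hand. For a Spatial Lag Y Model with row-standardized weights (`style = "W"`) and no observations without neighbors, the average total impact of a covariate reduces to `\(\beta / (1 - \rho)\)`. The following is a small sketch under these assumptions; the numbers are copied from the `coef(spatial_lag_y_model)` output of the full-data model, and the object names `rho`, `beta`, and `total_impacts` are ours.

```r
# coefficients copied from the coef(spatial_lag_y_model) output (full data)
rho <- 0.66445817
beta <- c(immigrant_share = -0.05270212, inhabitants = -0.03308301)

# average total impact under row-standardized weights: beta / (1 - rho)
total_impacts <- beta / (1 - rho)
round(total_impacts, 4)
```

The results (about -0.1571 and -0.0986) agree with the Total column of `spatialreg::impacts(spatial_lag_y_model, listw = queen_W)`. Splitting them into direct and indirect parts still requires the full matrix `\((I - \rho W)^{-1}\)`, which is what `impacts()` works with.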
--- layout: false class: center background-image: url(data:image/png;base64,#../assets/img/the_end.png) background-size: cover .left-column[ </br> <img src="data:image/png;base64,#../img/Anne.png" width="75%" style="display: block; 