Section 2 Overview

This analysis aims to create a high-resolution map of poverty, income, and literacy for El Salvador. The data on these three development indicators come from the 2017 EHPM household survey. While the survey data are available for only 1,664 segmentos, the smallest (local) administrative units, the goal is to provide a map of the three development indicators for all 12,435 segmentos in the country.

This is made possible by building a statistical model exploiting the relationship between RS data (e.g., lights at night measured from a satellite, precipitation), the EHPM survey data, and the spatial correlation between SDG outcomes across space. Once the relationship between the EHPM survey data and the RS data is modeled in locations where EHPM survey data are available, RS data, which are available for the entire country, are used to predict SDG outcomes across the entire country. Furthermore, as SDG outcomes of two segmentos are more likely to be similar if both segmentos are neighbors than if they are far away from each other, the accuracy of the spatial prediction can be enhanced by explicitly into account the spatial distances and spatial relationships between the segmentos.

The interpolation technique we use in this report is based on Bayesian geospatial methods implemented in the open-source statistical computing environment R (R Core Team 2018). The core of the modeling method is implemented in the R package INLA (Rue and Chopin 2009; Lindgren and Lindström 2011; Martins et al. 2013; F. Lindgren and Rue 2015).

Several other R packages are used in this tutorial. The main ones are dplyr (Wickham et al. 2019) for data handling, plotly (Sievert 2018) for interactive plots, leaflet (Cheng, Karambelkar, and Xie 2018) for interactive maps, raster (Hijmans 2019a) for raster data, rgdal (Bivand, Keitt, and Rowlingson 2019) for vector files, sp (Bivand, Pebesma, and Gomez-Rubio 2013) for coordinates systems adjustments, energy (Rizzo and Szekely 2018) for distance correlation computations, and parallel (R Core Team 2020) for the covariates selection process.

These are the five main steps of the analysis:

  1. Data preprocessing
  2. Covariate preselection
  3. Model fitting
  4. Diagnostic checks
  5. Out-of-sample spatial interpolation to create high-resolution maps

A brief introduction to the INLA method is provided in Section 8. Three books are recommended for those who are interested in more information on the INLA approach:

  • For readers with only a minimal quantitative background: Beginner’s Guide to Spatial, Temporal, and Spatial-Temporal Ecological Data Analysis with R-INLA (Zuur, Ieno, and Saveliev 2017)
  • For those willing to delve deeper into the methodology behind INLA: Spatial and Spatio-Temporal Bayesian Models with R-INLA (Blangiardo and Cameletti 2015)
  • For those interested in an application for public health: Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny(Moraga 2019)

Bakka et al. (Bakka, Rue, and al. 2018) provide a good overview of spatial modeling with INLA, and Steele et al. (Steele, Sundsøy, and Pezzulo 2017) provide a great example of the use of INLA for poverty mapping.

For information about the R-INLA package, please refer to the R-INLA project website.

2.1 The First Look at the Data

A total of 20,609 households were interviewed in the 2017 EHPM. The EHPM data are collected in 1,664 of El Salvador’s 12,435 segmentos. The number of households interviewed per segmento is mapped in Figure 2.1.

Here are the required steps to create the map:

  • Load the required packages: rgdal, dplyr and leaflet.
  • Load the Segmentos shapefile with rgdal and the list of households per segmento.
  • Count the number of households per segmento with dplyr and identify the segmentos where at least one survey recipient exists.
  • Map the results with leaflet.

To begin, we load the packages and initial data with rgdal.

# Load the packages
library(rgdal)
library(dplyr)
library(leaflet)
#source("../utils.R")

# Modify the directory path “dir_data” to where you stored the data
root_dir="~/"
project_dir="data/"

dir_data=paste0(root_dir,project_dir)

# Load the segmento shapefile map (simplified for faster rendering)
segmentos_to_map=readOGR(paste0(dir_data,
                       "spatial/shape/admin/STPLAN_Segmentos_simplified.shp")) 
# Load the households list surveyed in the EHPM per segmento
id_segmento_2017=readxl::read_xlsx(paste0(dir_data,
                                        "tables/Identificador de segmento.xlsx"),
                                 sheet = 2)  

We then compute the number of participants per segmento with dplyr.

id_segmento_2017_df=id_segmento_2017%>% 
  rename(SEG_ID=seg_id)%>% 
  group_by(SEG_ID)%>% # Group data per segmento
  summarize(ehpm_2017_d=1, # 1 if there is a household in the segmento
            ehpm_2017_n=n()) # # Count the number of households per segmento with the function n()
## `summarise()` ungrouping output (override with `.groups` argument)

Before merging the results with the shapefile, we change the identifier of segmentos in the shapefile, “SEG_ID”, into character format. This is to make sure the match between the SEG_ID stored as a character in the household list id_segmento_2017_df and the shapefile map segmentos is correct.

segmentos_to_map@data=segmentos_to_map@data%>%
  mutate(SEG_ID=as.character(SEG_ID))%>% # Turn SEG_ID into character format
  left_join(id_segmento_2017_df, # Merge the household list with the shapefile data
            by="SEG_ID")

We can now map all the segmentos where the EHPM data have been collected with the package leaflet. Note that when running the next command, rendering the 12,435 shapes might take a while.

leaflet(segmentos_to_map) %>% # Leaflet is used to render the map
     addProviderTiles(providers$CartoDB.Positron)%>%
  addPolygons(color = "#444444", # Color of the line of the border of each segmento
              weight = 1,  # Thickness of the line of the border of each segmento
              smoothFactor = 1, # Simplify the shape to speed rendering
              opacity = 0, # Opacity of the segmento border lines
              fillOpacity = 1, # Opacity of the segmento areas
              fillColor = ~colorQuantile("Greens",  # Define the color ramps of polygons 
                                         ehpm_2017_n,na.color = "transparent")(ehpm_2017_n)) 

Figure 2.1: Number of Participants Surveyed per Segmento

Our modeling exercise obtains estimates of income, literacy, and poverty for the segmentos where no EHPM data is available, which are the empty values in Figure 2.1.

References

Bakka, H., H. Rue, and G.-A. Fuglstad et al. 2018. “Spatial Modelling with Inla: A Review,” no. Feb. http://arxiv.org/abs/1802.06350.

Bivand, R., T. Keitt, and B. Rowlingson. 2019. Rgdal: Bindings for the ’Geospatial’ Data Abstraction Library. https://CRAN.R-project.org/package=rgdal.

Bivand, R. S., E. J. Pebesma, and V. Gomez-Rubio. 2013. Applied Spatial Data Analysis with R. Second Edition. Springer. http://www.asdar-book.org/.

Blangiardo, M., and M. Cameletti. 2015. Spatial and Spatio-Temporal Bayesian Models with R-Inla. John Wiley & Sons.

Cheng, J., B. Karambelkar, and Y. Xie. 2018. Leaflet: Create Interactive Web Maps with the Javascript ’Leaflet’ Library. https://CRAN.R-project.org/package=leaflet.

F. Lindgren, and H. Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” Journal of Statistical Software 63 (19): 1–25. http://www.jstatsoft.org/v63/i19/.

Hijmans, R. J. 2019a. Raster: Geographic Data Analysis and Modeling. https://CRAN.R-project.org/package=raster.

Lindgren, H. Rue, F., and J. Lindström. 2011. “An Explicit Link Between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach (with Discussion).” Journal of the Royal Statistical Society, Series B 73 (4): 423–98.

Martins, T. G., D. Simpson, F. Lindgren, and H. Rue. 2013. “Bayesian Computing with INLA: New Features.” Computational Statistics and Data Analysis 67: 68–83.

Moraga, P. 2019. Geospatial Health Data: Modeling and Visualization with R-Inla and Shiny. Chapman & Hall/CRC Biostatistics Series.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

R Core Team. 2020. Package Parallel. Vienna, Austria: R Foundation for Statistical Computing. https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf.

Rizzo, M., and G. Szekely. 2018. Energy: E-Statistics: Multivariate Inference via the Energy of Data. https://CRAN.R-project.org/package=energy.

Rue, S. Martino, H., and N. Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models Using Integrated Nested Laplace Approximations (with Discussion).” Journal of the Royal Statistical Society, Series B 71: 319–92.

Sievert, C. 2018. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. https://plotly-book.cpsievert.me.

Steele, J. E., P. Roe Sundsøy, and C. et al. Pezzulo. 2017. “Mapping Poverty Using Mobile Phone and Satellite Data.” Journal of the Royal Society Interface 14 (127): 20160690.

Wickham, H., R. François, L. Henry, and K. Müller. 2019. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Zuur, A. F., E. N. Ieno, and A. A. Saveliev. 2017. Beginner’s Guide to Spatial, Temporal, and Spatial-Temporal Ecological Data Analysis with R-Inla. Vol. 1. Highland Statistics Ltd.