Section 9 Appendix 2: Data Pre-processing
Data preprocessing can be defined as the work required to turn raw data into the data used for modeling.
The raw data used in this analysis are remote sensing and other geographic information system data, hereafter “RS data,” such as digitized maps of elevation, population density, or public service locations. These data need to be aggregated and linked with the survey containing the data on the development indicators we plan to map.
Data preprocessing is arguably one of the most time-consuming tasks of any data science project. This section documents the path required to convert the raw data into data usable for high-resolution mapping. Here are the main steps:
- Load the RS layers.
- Create summaries for the RS layers that come with a time dimension (e.g., average precipitation over the last 30 years).
- Perform additional calculations as required (e.g., compute the soil degradation index based on soil degradation sub-indicator).
- Compute the distance metrics to public services, businesses, and some natural features.
- Align the coordinate reference systems (CRSs) across layers (geographic or projected).
- Aggregate the RS layers to the map of interest (e.g., the segmento map) with a chosen set of spatial statistics (e.g., average precipitation per segmento).
- Match the RS aggregates with the outcome data (poverty, income, literacy).
- Write the data frame to a
.csv
file for later use.
Readers interested in only the modeling instructions can skip this section entirely.
9.1 Data Sources
The 2017 EHPM household survey, collected by DIGESTYC, is used to compute the three main development indicators: median income, poverty, and literacy.
The EHPM survey data are stored in the file ehpm-2017.csv
on the publicly available DIGESTYC website here. DIGESTYC provided the research team with the segmento identifiers, allowing the RS data to be linked with the EPHM data at the segmento level rather than the canton level, as would have been necessary if using only the publicly available data. The segmento identifier is stored in the file Identificador de segmento.xlsx
.
A suite of RS data is used to determine potential predictors of the three main development indicators. The data sources are listed on the next table (see web-book version of this tutorial for the links).
Names | Type | Source |
---|---|---|
Population count | raster | CIESIN |
Precipitation | raster | Climate Hazards Center |
Lights at night | raster | NOAA |
Integrated Food Security Phase Classification | vector | FEWS NET |
Livelihood zones | vector | FEWS NET |
Altitude | raster | NASA Shuttle Radar Topography Mission (SRTMv003) |
Soil degradation | raster | Trend.Earth |
Buildings | vector | OpenStreetMap |
Points of interest points | vector | OpenStreetMap |
Roads network | vector | OpenStreetMap |
Vegetation greenness (NDVI) | raster | U.S. Geological Survey (USGS) |
Schools | vector | La Direccion General de Estadistica y Censos (DIGESTYC) |
Health centers | csv | La Direccion General de Estadistica y Censos (DIGESTYC) |
Hospitals | csv | La Direccion General de Estadistica y Censos (DIGESTYC) |
Temperature | raster | WorldClim Version2 (Fick and Hijmans 2017) |
Slope | raster | WorldPop Archive global gridded spatial datasets |
Distance to OSM major roads | raster | WorldPop and CIESIN |
Distance to OSM major roads intersections | raster | WorldPop and CIESIN |
Distance to OSM major waterway | raster | WorldPop and CIESIN |
Built settlement | raster | WorldPop and CIESIN |
Land cover | raster | European Space Agency |
9.2 Loading the Survey
We start by loading the segmento shapefile and the EHPM survey data:
- ehpm: This is the EHPM survey data.
- segmento_hh_id: This is the segmento ID matched to the household ID, which allows us to map households at the segmento level.
- segmento shapefile: This is the map of all segmentos.
To create a more easily readable directory path, we create an object with the path to the folder where the data are stored:
9.3 Loading the Vectors
Vector data are polygons (e.g., administrative areas), lines (e.g., roads network), or points (e.g., coordinates of schools). They can be stored in various formats, with the most common being ESRI shapefiles and geojson files.
9.3.1 Loading the Administrative Boundaries
First, we load the administrative maps of the segmentos and departamentos with the command readOGR
from the rgdal
package. Shapefiles are loaded as a SpatialPolygonDataframe
, which is a georeferenced set of polygons to which a data frame is attached.
dir_shape="spatial/shape/"
segmento_sh=readOGR(paste0(dir_data,
dir_shape,
"admin/STPLAN_Segmentos.shp"))
## OGR data source with driver: ESRI Shapefile
## Source: "/home/rstudio/data/spatial/shape/admin/STPLAN_Segmentos.shp", layer: "STPLAN_Segmentos"
## with 12435 features
## It has 17 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/home/rstudio/data/spatial/shape/admin/STPLAN_Departamentos.shp", layer: "STPLAN_Departamentos"
## with 14 features
## It has 6 fields
Thesegmento_sh shape file loaded above, for example, consists of a data frame with 17 fields (variables).
The names of each field can be accessed by running this command:
## [1] "OBJECTID" "DEPTO" "COD_DEP" "MPIO" "COD_MUN"
## [6] "CANTON" "COD_CAN" "COD_ZON_CE" "COD_SEC_CE" "COD_SEG_CE"
## [11] "AREA_ID" "SEG_ID" "AREA_KM2" "SHAPE_Leng" "Shape_Le_1"
## [16] "Shape_Area" "demmean"
The data frame attached to the polygon map can be inspected using the following functions:
## OBJECTID DEPTO COD_DEP MPIO COD_MUN CANTON COD_CAN COD_ZON_CE
## 0 1 AHUACHAPAN 01 ATIQUIZAYA 03 AREA URBANA 00 00
## 1 2 AHUACHAPAN 01 GUAYMANGO 06 EL ESCALON 04 00
## 2 3 AHUACHAPAN 01 AHUACHAPAN 01 AREA URBANA 00 01
## 3 4 AHUACHAPAN 01 AHUACHAPAN 01 AREA URBANA 00 01
## 4 5 AHUACHAPAN 01 AHUACHAPAN 01 AREA URBANA 00 02
## 5 6 AHUACHAPAN 01 AHUACHAPAN 01 AREA URBANA 00 02
## COD_SEC_CE COD_SEG_CE AREA_ID SEG_ID AREA_KM2 SHAPE_Leng Shape_Le_1
## 0 002 1015 U 01031015 0.04352 1261.0632 1261.0217
## 1 005 0016 R 01060016 1.65778 8536.1704 8535.8890
## 2 003 1015 U 01011015 0.07500 1165.9750 1165.9366
## 3 003 1014 U 01011014 0.05010 988.8543 988.8216
## 4 009 1019 U 01011019 0.06910 1135.7582 1135.7209
## 5 009 1042 U 01011042 0.04690 857.8403 857.8122
## Shape_Area demmean
## 0 43526.87 NA
## 1 1657675.21 NA
## 2 74995.98 NA
## 3 50104.03 NA
## 4 69104.02 NA
## 5 46899.53 NA
## OBJECTID DEPTO COD_DEP MPIO
## Min. : 1 Length:12435 Length:12435 Length:12435
## 1st Qu.: 3110 Class :character Class :character Class :character
## Median : 6219 Mode :character Mode :character Mode :character
## Mean : 6219
## 3rd Qu.: 9328
## Max. :12437
##
## COD_MUN CANTON COD_CAN COD_ZON_CE
## Length:12435 Length:12435 Length:12435 Length:12435
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## COD_SEC_CE COD_SEG_CE AREA_ID SEG_ID
## Length:12435 Length:12435 Length:12435 Length:12435
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## AREA_KM2 SHAPE_Leng Shape_Le_1 Shape_Area
## Min. : 0.00000 Min. : 281.4 Min. : 281.4 Min. : 3903
## 1st Qu.: 0.06062 1st Qu.: 1256.3 1st Qu.: 1256.3 1st Qu.: 60567
## Median : 0.27345 Median : 2859.6 Median : 2859.5 Median : 273177
## Mean : 1.67100 Mean : 5494.0 Mean : 5493.8 Mean : 1672961
## 3rd Qu.: 2.08214 3rd Qu.: 8492.5 3rd Qu.: 8492.3 3rd Qu.: 2082009
## Max. :48.36518 Max. :54684.4 Max. :54682.6 Max. :48194678
##
## demmean
## Min. : NA
## 1st Qu.: NA
## Median : NA
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
## NA's :12435
We see that Shape_Le_1
and SHAPE_Leng
are duplicates, while demmean
comprises only null values. These fields are removed using the dplyr
function select
.
To get a country mask, which is the boundary of the country, we dissolve the multiple polygons of the departamentos shapefile into one overall map of the country with the function maptools::unionSpatialPolygons
.
The country mask is shown in Figure 9.1.

Figure 9.1: El Salvador country mask
9.3.1.1 Adjusting the Coordinates System
There are two families of coordinates systems:
- Geographic coordinate systems (GCS): “A system that uses a three-dimensional spherical surface to define locations on the earth.”5.
- Projected coordinate systems (PCS): “A system defined on a flat, two-dimensional surface. Unlike a geographic coordinate system, a projected coordinate system has constant lengths, angles, and areas across the two dimensions.”6
The shapefiles for the segmentos, departmentos, and the country mask have a projected coordinate system.
## CRS arguments:
## +proj=lcc +lat_0=13.783333 +lon_0=-89 +lat_1=13.316666 +lat_2=14.25
## +x_0=500000 +y_0=295809.184 +k_0=0.99996704 +datum=NAD27 +units=m
## +no_defs
The coordinates are expressed in meters:
## min max
## x 377579.6 642687.1
## y 226506.1 369595.5
Since most spatial RS data are based on a GCS, we create a second version of the country mask with one. This is done with the function spTransform
from the sp
library. The use of ::
allows us to call the function without loading the sp
library in the environment.
proj_WGS_84="+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
segmento_sh_wgs=sp::spTransform(segmento_sh,
proj_WGS_84)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum WGS_1984 in CRS definition,
## but +towgs84= values preserved
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum WGS_1984 in CRS definition,
## but +towgs84= values preserved
## min max
## x -90.13200 -87.68380
## y 13.15541 14.45093
The coordinates are now expressed in decimal degrees.
9.3.2 Loading the Shapefiles Containing the Fields Used as Covariates
We now load the shapefiles containing the fields used as covariates, which are the containing variables that will be used as predictors in the model we will fit in the next sections.
We start with the Livelihood zone maps accessed from the Famine Early Warning Systems Network (FEWS NET).
The livelihood zone map is mapped on 9.2 for inspection.
leaflet(LZ_sh)%>%
addTiles()%>%
addPolygons(weight=1,
color = "#444444",
smoothFactor = 1,
fillOpacity = 1,
fillColor = ~colorFactor("Accent", LZNAMEEN)(LZNAMEEN),
popup = ~LZNAMEEN)%>%
addLegend("bottomright",
colors = ~colorFactor("Accent", LZNAMEEN)(LZNAMEEN),
labels = ~LZNAMEEN,
opacity = 1)