Section 9 Appendix 2: Data Pre-processing

Data preprocessing can be defined as the work required to turn raw data into the data used for modeling.

The raw data used in this analysis are remote sensing and other geographic information system data, hereafter “RS data,” such as digitized maps of elevation, population density, or public service locations. These data need to be aggregated and linked with the survey containing the data on the development indicators we plan to map.

Data preprocessing is arguably one of the most time-consuming tasks of any data science project. This section documents the path required to convert the raw data into data usable for high-resolution mapping. Here are the main steps:

  1. Load the RS layers.
  2. Create summaries for the RS layers that come with a time dimension (e.g., average precipitation over the last 30 years).
  3. Perform additional calculations as required (e.g., compute the soil degradation index based on soil degradation sub-indicator).
  4. Compute the distance metrics to public services, businesses, and some natural features.
  5. Align the coordinate reference systems (CRSs) across layers (geographic or projected).
  6. Aggregate the RS layers to the map of interest (e.g., the segmento map) with a chosen set of spatial statistics (e.g., average precipitation per segmento).
  7. Match the RS aggregates with the outcome data (poverty, income, literacy).
  8. Write the data frame to a .csv file for later use.

Readers interested in only the modeling instructions can skip this section entirely.

9.1 Data Sources

The 2017 EHPM household survey, collected by DIGESTYC, is used to compute the three main development indicators: median income, poverty, and literacy.

The EHPM survey data are stored in the file ehpm-2017.csv on the publicly available DIGESTYC website here. DIGESTYC provided the research team with the segmento identifiers, allowing the RS data to be linked with the EPHM data at the segmento level rather than the canton level, as would have been necessary if using only the publicly available data. The segmento identifier is stored in the file Identificador de segmento.xlsx.

A suite of RS data is used to determine potential predictors of the three main development indicators. The data sources are listed on the next table (see web-book version of this tutorial for the links).

Names Type Source
Population count raster CIESIN
Precipitation raster Climate Hazards Center
Lights at night raster NOAA
Integrated Food Security Phase Classification vector FEWS NET
Livelihood zones vector FEWS NET
Altitude raster NASA Shuttle Radar Topography Mission (SRTMv003)
Soil degradation raster Trend.Earth
Buildings vector OpenStreetMap
Points of interest points vector OpenStreetMap
Roads network vector OpenStreetMap
Vegetation greenness (NDVI) raster U.S. Geological Survey (USGS)
Schools vector La Direccion General de Estadistica y Censos (DIGESTYC)
Health centers csv La Direccion General de Estadistica y Censos (DIGESTYC)
Hospitals csv La Direccion General de Estadistica y Censos (DIGESTYC)
Temperature raster WorldClim Version2 (Fick and Hijmans 2017)
Slope raster WorldPop Archive global gridded spatial datasets
Distance to OSM major roads raster WorldPop and CIESIN
Distance to OSM major roads intersections raster WorldPop and CIESIN
Distance to OSM major waterway raster WorldPop and CIESIN
Built settlement raster WorldPop and CIESIN
Land cover raster European Space Agency

9.2 Loading the Survey

We start by loading the segmento shapefile and the EHPM survey data:

  • ehpm: This is the EHPM survey data.
  • segmento_hh_id: This is the segmento ID matched to the household ID, which allows us to map households at the segmento level.
  • segmento shapefile: This is the map of all segmentos.

To create a more easily readable directory path, we create an object with the path to the folder where the data are stored:

# Modify dir_data to where you stored the data
root_dir="~/"
project_dir="data/"

dir_data=paste0(root_dir,project_dir)
ehpm=read.csv(paste0(dir_data,
                     "tables/ehpm-2017.csv")) 
segmento_hh_id=readxl::read_xlsx(paste0(dir_data,
                                        "tables/Identificador de segmento.xlsx"))

9.3 Loading the Vectors

Vector data are polygons (e.g., administrative areas), lines (e.g., roads network), or points (e.g., coordinates of schools). They can be stored in various formats, with the most common being ESRI shapefiles and geojson files.

9.3.1 Loading the Administrative Boundaries

First, we load the administrative maps of the segmentos and departamentos with the command readOGR from the rgdal package. Shapefiles are loaded as a SpatialPolygonDataframe, which is a georeferenced set of polygons to which a data frame is attached.

dir_shape="spatial/shape/"
segmento_sh=readOGR(paste0(dir_data,
                           dir_shape,
                           "admin/STPLAN_Segmentos.shp"))
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/rstudio/data/spatial/shape/admin/STPLAN_Segmentos.shp", layer: "STPLAN_Segmentos"
## with 12435 features
## It has 17 fields
departamentos_sh=readOGR(paste0(dir_data,
                                dir_shape,
                                       "admin/STPLAN_Departamentos.shp"))
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/rstudio/data/spatial/shape/admin/STPLAN_Departamentos.shp", layer: "STPLAN_Departamentos"
## with 14 features
## It has 6 fields

Thesegmento_sh shape file loaded above, for example, consists of a data frame with 17 fields (variables).

The names of each field can be accessed by running this command:

names(segmento_sh)
##  [1] "OBJECTID"   "DEPTO"      "COD_DEP"    "MPIO"       "COD_MUN"   
##  [6] "CANTON"     "COD_CAN"    "COD_ZON_CE" "COD_SEC_CE" "COD_SEG_CE"
## [11] "AREA_ID"    "SEG_ID"     "AREA_KM2"   "SHAPE_Leng" "Shape_Le_1"
## [16] "Shape_Area" "demmean"

The data frame attached to the polygon map can be inspected using the following functions:

head(segmento_sh@data) # Provide the 5 first rows of the data frame segmento_sh@data
##   OBJECTID      DEPTO COD_DEP       MPIO COD_MUN      CANTON COD_CAN COD_ZON_CE
## 0        1 AHUACHAPAN      01 ATIQUIZAYA      03 AREA URBANA      00         00
## 1        2 AHUACHAPAN      01  GUAYMANGO      06  EL ESCALON      04         00
## 2        3 AHUACHAPAN      01 AHUACHAPAN      01 AREA URBANA      00         01
## 3        4 AHUACHAPAN      01 AHUACHAPAN      01 AREA URBANA      00         01
## 4        5 AHUACHAPAN      01 AHUACHAPAN      01 AREA URBANA      00         02
## 5        6 AHUACHAPAN      01 AHUACHAPAN      01 AREA URBANA      00         02
##   COD_SEC_CE COD_SEG_CE AREA_ID   SEG_ID AREA_KM2 SHAPE_Leng Shape_Le_1
## 0        002       1015       U 01031015  0.04352  1261.0632  1261.0217
## 1        005       0016       R 01060016  1.65778  8536.1704  8535.8890
## 2        003       1015       U 01011015  0.07500  1165.9750  1165.9366
## 3        003       1014       U 01011014  0.05010   988.8543   988.8216
## 4        009       1019       U 01011019  0.06910  1135.7582  1135.7209
## 5        009       1042       U 01011042  0.04690   857.8403   857.8122
##   Shape_Area demmean
## 0   43526.87      NA
## 1 1657675.21      NA
## 2   74995.98      NA
## 3   50104.03      NA
## 4   69104.02      NA
## 5   46899.53      NA
summary(segmento_sh@data) # Summarize each field of the data frame segmento_sh@data
##     OBJECTID        DEPTO             COD_DEP              MPIO          
##  Min.   :    1   Length:12435       Length:12435       Length:12435      
##  1st Qu.: 3110   Class :character   Class :character   Class :character  
##  Median : 6219   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 6219                                                           
##  3rd Qu.: 9328                                                           
##  Max.   :12437                                                           
##                                                                          
##    COD_MUN             CANTON            COD_CAN           COD_ZON_CE       
##  Length:12435       Length:12435       Length:12435       Length:12435      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   COD_SEC_CE         COD_SEG_CE          AREA_ID             SEG_ID         
##  Length:12435       Length:12435       Length:12435       Length:12435      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     AREA_KM2          SHAPE_Leng        Shape_Le_1        Shape_Area      
##  Min.   : 0.00000   Min.   :  281.4   Min.   :  281.4   Min.   :    3903  
##  1st Qu.: 0.06062   1st Qu.: 1256.3   1st Qu.: 1256.3   1st Qu.:   60567  
##  Median : 0.27345   Median : 2859.6   Median : 2859.5   Median :  273177  
##  Mean   : 1.67100   Mean   : 5494.0   Mean   : 5493.8   Mean   : 1672961  
##  3rd Qu.: 2.08214   3rd Qu.: 8492.5   3rd Qu.: 8492.3   3rd Qu.: 2082009  
##  Max.   :48.36518   Max.   :54684.4   Max.   :54682.6   Max.   :48194678  
##                                                                           
##     demmean     
##  Min.   : NA    
##  1st Qu.: NA    
##  Median : NA    
##  Mean   :NaN    
##  3rd Qu.: NA    
##  Max.   : NA    
##  NA's   :12435

We see that Shape_Le_1 and SHAPE_Leng are duplicates, while demmean comprises only null values. These fields are removed using the dplyr function select.

segmento_sh@data=segmento_sh@data%>% 
  select(-c(Shape_Le_1,demmean))

To get a country mask, which is the boundary of the country, we dissolve the multiple polygons of the departamentos shapefile into one overall map of the country with the function maptools::unionSpatialPolygons.

SLV_adm0_proj=maptools::unionSpatialPolygons(departamentos_sh,
                                             rep(1, nrow(departamentos_sh)))

The country mask is shown in Figure 9.1.

plot(SLV_adm0_proj)
El Salvador country mask

Figure 9.1: El Salvador country mask

9.3.1.1 Adjusting the Coordinates System

There are two families of coordinates systems:

  • Geographic coordinate systems (GCS): “A system that uses a three-dimensional spherical surface to define locations on the earth.”5.
  • Projected coordinate systems (PCS): “A system defined on a flat, two-dimensional surface. Unlike a geographic coordinate system, a projected coordinate system has constant lengths, angles, and areas across the two dimensions.”6

The shapefiles for the segmentos, departmentos, and the country mask have a projected coordinate system.

SLV_adm0_proj@proj4string
## CRS arguments:
##  +proj=lcc +lat_0=13.783333 +lon_0=-89 +lat_1=13.316666 +lat_2=14.25
## +x_0=500000 +y_0=295809.184 +k_0=0.99996704 +datum=NAD27 +units=m
## +no_defs

The coordinates are expressed in meters:

sp::bbox(SLV_adm0_proj)
##        min      max
## x 377579.6 642687.1
## y 226506.1 369595.5

Since most spatial RS data are based on a GCS, we create a second version of the country mask with one. This is done with the function spTransform from the sp library. The use of :: allows us to call the function without loading the sp library in the environment.

proj_WGS_84="+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
segmento_sh_wgs=sp::spTransform(segmento_sh,
                         proj_WGS_84)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum WGS_1984 in CRS definition,
##  but +towgs84= values preserved
SLV_adm0=sp::spTransform(SLV_adm0_proj,
                         proj_WGS_84)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum WGS_1984 in CRS definition,
##  but +towgs84= values preserved
sp::bbox(SLV_adm0)
##         min       max
## x -90.13200 -87.68380
## y  13.15541  14.45093

The coordinates are now expressed in decimal degrees.

9.3.2 Loading the Shapefiles Containing the Fields Used as Covariates

We now load the shapefiles containing the fields used as covariates, which are the containing variables that will be used as predictors in the model we will fit in the next sections.

We start with the Livelihood zone maps accessed from the Famine Early Warning Systems Network (FEWS NET).

LZ_sh=readOGR(paste0(dir_data,
                     dir_shape,
                     "LZ/SV_LHZ_2010.shp"))

The livelihood zone map is mapped on 9.2 for inspection.

leaflet(LZ_sh)%>%
  addTiles()%>%
  addPolygons(weight=1,
              color = "#444444",
              smoothFactor = 1,
              fillOpacity = 1,
              fillColor = ~colorFactor("Accent", LZNAMEEN)(LZNAMEEN),
              popup = ~LZNAMEEN)%>%
  addLegend("bottomright",
            colors = ~colorFactor("Accent", LZNAMEEN)(LZNAMEEN), 
            labels = ~LZNAMEEN,
            opacity = 1)