Section 7 Discussion
This report introduces high-resolution mapping based on Bayesian geospatial techniques implemented in the open-source statistical computing environment R (R Core Team 2018). The core of the modeling method is implemented in the R package INLA (Rue, Martino, and Chopin 2009; Lindgren, Rue, and Lindström 2011; Martins et al. 2013; Lindgren and Rue 2015). The emphasis is on providing and explaining the R code required to go from the raw data to the final maps.
Three development indicators are mapped for El Salvador: median income, adult literacy, and poverty. The data on these indicators come from the EHPM, which was conducted in 2017 and covered 20,645 households across 1,664 segmentos. A suite of more than 20 open-source RS and GIS data sets is used to derive a list of 86 potential predictors, hereafter covariates. After standardizing the covariates, we apply an unsupervised dimension-reduction process to reduce this number to 50. This limits the effect of multicollinearity in subsequent modeling and reduces the risk of retaining covariates because of chance correlation during covariate selection. Next, we apply a backward stepwise covariate selection process to arrive at a parsimonious model specification. We test several likelihood functions (Gaussian, gamma, and beta) and spatial dependence structures (the SPDE and BYM2 models).
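The preprocessing and selection steps described above can be illustrated with a minimal base R sketch. It is not the pipeline used in this report: the data are simulated, the correlation cutoff of 0.9 and the helper `drop_correlated()` are hypothetical, and an ordinary linear model with AIC-based `step()` stands in for the Bayesian models fitted with INLA.

```r
# Simulated stand-in data: 200 segmentos and 6 covariates, where x6 is
# almost a copy of x1 (all names and values here are hypothetical).
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
X <- cbind(X, x6 = X[, "x1"] + rnorm(n, sd = 0.05))
y <- 1 + 2 * X[, "x1"] - X[, "x3"] + rnorm(n)

# 1. Standardize each covariate to mean 0 and standard deviation 1.
Xs <- scale(X)

# 2. Unsupervised filter (assumed rule): scan the covariates in order and
#    drop any whose absolute correlation with an already-kept covariate
#    exceeds the cutoff. The outcome y is never used at this stage.
drop_correlated <- function(M, cutoff = 0.9) {
  R <- abs(cor(M))
  keep <- character(0)
  for (v in colnames(M)) {
    if (all(R[v, keep] <= cutoff)) keep <- c(keep, v)
  }
  keep
}
kept <- drop_correlated(Xs)

# 3. Backward stepwise selection by AIC, starting from the full model.
dat <- data.frame(y = y, Xs[, kept, drop = FALSE])
full <- lm(y ~ ., data = dat)
final <- step(full, direction = "backward", trace = 0)

print(kept)                # covariates surviving the correlation filter
print(names(coef(final)))  # terms retained by the stepwise search
```

Filtering on correlations among covariates alone, before the outcome is examined, is what makes the first reduction unsupervised; only the stepwise step uses the outcome.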
The best results are obtained with the SPDE approach. At the segmento level, the \(R^2\) on the final test set is 50 percent for income, 43 percent for illiteracy, and only 23 percent for poverty. Once the results are aggregated at the municipio level, the \(R^2\) increases to 84 percent for income, 80 percent for literacy, and 61 percent for poverty.
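The gain in \(R^2\) after aggregation is expected whenever part of the segmento-level error is unstructured noise that averages out within a municipio. The sketch below illustrates this with simulated data only. It assumes that \(R^2\) is computed as the squared Pearson correlation between observed and predicted values and that aggregation is an unweighted mean per municipio; the definitions used in this report may differ, and all variable names are hypothetical.

```r
# Simulated example: 50 municipios with 20 segmentos each (hypothetical sizes).
set.seed(42)
n_muni <- 50
n_seg <- 20
muni <- rep(seq_len(n_muni), each = n_seg)

# A municipio-level signal shared by all segmentos in the same municipio.
signal <- rnorm(n_muni)[muni]

# Observed values carry large segmento-level noise; the predictions are a
# stand-in for model output that tracks the signal with a small error.
observed <- signal + rnorm(n_muni * n_seg)
predicted <- signal + rnorm(n_muni * n_seg, sd = 0.3)

# R^2 taken here as the squared correlation of observed and predicted values.
r2 <- function(obs, pred) cor(obs, pred)^2

r2_segmento <- r2(observed, predicted)
r2_municipio <- r2(tapply(observed, muni, mean),
                   tapply(predicted, muni, mean))

print(c(segmento = r2_segmento, municipio = r2_municipio))
```

In this setup the municipio-level value is higher because averaging 20 segmentos shrinks the noise variance while leaving the shared signal intact.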
We spent limited time exploring interaction terms and nonlinear relationships between covariates and outcome variables, so this is one avenue for improvement. Another would be to develop a spatiotemporal model that takes the time dimension into account. This is possible because the EHPM is collected annually. A spatiotemporal model would permit the monitoring of trends in the socioeconomic and SDG indicators of interest. Lastly, other modeling architectures such as random forests or convolutional neural networks could take advantage of nonlinear effects, potentially increasing the predictive accuracy of the models.
Another direction for improving the granularity of socioeconomic maps lies in the use of mobile phone call-detail records. Covariate data derived from call-detail records have been shown to correlate well with poverty indicators at both the household and aggregate levels (e.g., Blumenstock, Cadamuro, and On 2015; Steele et al. 2017). These records are also highly granular in both space and time, which makes it possible to monitor variation in population well-being at a higher frequency. Negotiating access to call-detail-record data and conducting meaningful analyses with them is technically demanding. However, there are many potential applications for these data (e.g., mobility analysis and disaster response and preparedness), and the marginal cost of each additional project is low once the agreements and the IT system are in place.
Much of the data preprocessing work and some of the analytical steps could be packaged and automated in software. This would further streamline the process of creating these high-resolution maps of development indicators.
First, high-resolution maps of development indicators may allow more accurate targeting of government interventions aimed at reducing poverty or increasing literacy rates among adults. Second, the methods presented here are portable to other socioeconomic and SDG indicators of interest. They could play a key role in monitoring and reporting on SDG achievements. Third, other than the survey's segmento-level data, the methods described here rely on freely available data sources and freely available software. Fourth, the method presented here does not rely on census data, as traditional small-area-estimation methods do, so it can be applied independently of any census round. Fifth, the methods can be readily deployed by analysts with master's-level statistical training. These five points make the method well suited for incorporation into the routine practices of national statistical offices (NSOs).
A possible next step is to present these methods to NSOs to generate discussion on their value, on potential improvements to increase their operational relevance, and on the implications of integrating them into decision-making frameworks. We hope that the relative ease of these methods, the open-source nature of the software and covariate data, and the large number of code examples in this report (and on the web in general) will encourage their adoption by NSOs. Experts from the Flowminder Foundation are available to provide technical training and support to strengthen in-house capacity at NSOs to use these methods. This report could constitute the backbone of a short course of up to five days to equip a first cohort of statisticians with the know-how to confidently use these methods, data, and the R software for high-resolution mapping of development indicators.
Blumenstock, J., G. Cadamuro, and R. On. 2015. “Predicting Poverty and Wealth from Mobile Phone Metadata.” Science 350 (6264): 1073–6.
Lindgren, F., and H. Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” Journal of Statistical Software 63 (19): 1–25. http://www.jstatsoft.org/v63/i19/.
Lindgren, F., H. Rue, and J. Lindström. 2011. “An Explicit Link Between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach (with Discussion).” Journal of the Royal Statistical Society, Series B 73 (4): 423–98.
Martins, T. G., D. Simpson, F. Lindgren, and H. Rue. 2013. “Bayesian Computing with INLA: New Features.” Computational Statistics and Data Analysis 67: 68–83.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rue, H., S. Martino, and N. Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models Using Integrated Nested Laplace Approximations (with Discussion).” Journal of the Royal Statistical Society, Series B 71 (2): 319–92.
Steele, J. E., P. R. Sundsøy, C. Pezzulo, et al. 2017. “Mapping Poverty Using Mobile Phone and Satellite Data.” Journal of the Royal Society Interface 14 (127): 20160690.