Section 4 Fitting a simple model
The goal of this section is to fit a simple model for income to introduce the modified Besag-York-Mollie model (BYM 2) and the stochastic partial differential equation (SPDE) model.
In both cases, the general process is the same:
- Define the spatial dependency.
- Split the model into training and validation sets.
- Specify the formula, expressing the outcome variables in terms of fixed and random spatial effects.
- Fit the model on the training set.
- Validate the model on the validation set.
The fixed effects are the covariates (e.g., population density or light at night). The spatial random effects are spatially correlated random effects used to model the spatial dependency.
Note that it is assumed here that the list of fixed effects is known. The next section presents a backward stepwise covariate selection process to select the list of covariates.
4.1 Loading the Data
We start by loading the covariates data and matching them with the segmento spatial polygon data frame.
rm(list=ls())
library(parallel)
library(INLA)
library(dplyr)
#INLA:::inla.dynload.workaround()
# Modify dir_data to where you stored the data
root_dir="~/"
project_dir="data/"
dir_data=paste0(root_dir,project_dir)
# Load the data ####
ehpm17_predictors=read.csv(paste0(dir_data,
"out/all_covariates_and_outcomes.csv"))
# Correct for .xls misbehavior: The SEG_ID with a leading 0 were shortened (i.e. the leading 0 was eliminated)
ehpm17_predictors=ehpm17_predictors%>%
mutate(SEG_ID=as.character(SEG_ID),
SEG_ID=ifelse(nchar(SEG_ID)==7,
paste0(0,SEG_ID),
SEG_ID))
# Shape
segmento_sh=rgdal::readOGR(paste0(dir_data,
"spatial/shape/admin/STPLAN_Segmentos.shp"))
## OGR data source with driver: ESRI Shapefile
## Source: "/home/rstudio/data/spatial/shape/admin/STPLAN_Segmentos.shp", layer: "STPLAN_Segmentos"
## with 12435 features
## It has 17 fields
4.2 Area Data
4.2.1 Spatial Dependency with Area Data
To model the spatial dependency with data from the area data, we need to inform the model of neighboring segmentos.
The function spdep::poly2nb
creates a neighbors list from the segmento polygon. The list is created in a spatial format suitable for INLA
with the function spdep::nb2INLA
. It is then read as a graph object for INLA
consumption.
# Define neighboring structures ####
segmento_sh_data@data$ID=1:length(segmento_sh_data@data$n_obs)
segmento.nb=spdep::poly2nb(segmento_sh_data)
spdep::nb2INLA(paste0(dir_data,
"out/SEG.graph"),
segmento.nb)
SEG.adj=paste0(paste0(dir_data,
"out/SEG.graph"),
sep="")
Segmento.Inla.nb <- INLA::inla.read.graph(paste0(dir_data,
"out/SEG.graph"))
Figure 4.1 illustrates the results with data at the department level.
## OGR data source with driver: ESRI Shapefile
## Source: "/home/rstudio/data/spatial/shape/admin/STPLAN_Departamentos.shp", layer: "STPLAN_Departamentos"
## with 14 features
## It has 6 fields
4.2.2 Store the Data for Modeling
The data are stored in the data frame data_model
.
# Identify EHPM 17 segmentos to fit the model ####
segmento_sh_data_model=segmento_sh_data
non_na_index=which(is.na(segmento_sh_data_model$n_obs)==F)
ehpm17_predictors_nona=segmento_sh_data_model@data[non_na_index,]
# Covariates #####
cov_candidates_selected_df=read.table(paste0(dir_data,
"out/distance_corr_var/selected_all.txt"),
header =T)
cov_candidates_selected_table=cov_candidates_selected_df
names(cov_candidates_selected_table)="Candidates"
# knitr::kable(
# cov_candidates_selected_table,
# caption = 'Candidate covariates for income',
# booktabs = TRUE
# )
covariates=ehpm17_predictors_nona[,as.character(cov_candidates_selected_table$Candidates)]
# Store data for the model ####
data_model=data.frame(ingpe=ehpm17_predictors_nona$ingpe,
intercept=array(1,dim(covariates)[1]), # intercept for INLA
ID=ehpm17_predictors_nona$ID, # ID to link data to the neibhouring structure
covariates)
4.2.3 Split the Sample into 80-20 Training and Validation Sets
The observations in the dataset are split into 80-20 training and validation sets. The model will be fitted on the training set, and income will be predicted on the validation sets. The predicted values will be compared with the observed values of the validation set to assess the model’s goodness of fit.
# Take a sample of segmentos for training, validation, and test ####
set.seed(1234)
spec = c(train = .8, validate = .2)
g = sample(cut(
seq(nrow(data_model)),
nrow(data_model)*cumsum(c(0,spec)),
labels = names(spec)
))
index_val=which(g=="validate")
index_train=which(g=="train")
data_model$pred=data_model$ingpe
data_model$pred[c(index_val)] <- NA # Set the validation predictions NA: The values will be predicted by the model and compared with observed values
4.2.4 Formula Specifying the Relationship between the Outcome Variable and the Fixed and Spatial Random Effects
Next, we write a formula to predict income with a subset of the covariates and chose to use the following covariates:
- Binary indicator for rural and urban segmentos
- Median lights at night
- Population density
- Average slope in the segmentos
- Average precipitation
- Distance to public services
- Distance to businesses
- Distance to roads
# Write the test formula ####
formula_test =pred~
AREA_ID+lights_med+pop_dens+slope+chirps_ev_2017+dist2pubamen+dist2road+
f(ID,
model = "bym2",
graph = Segmento.Inla.nb,
hyper=list(
prec = list(prior = "pc.prec", param = c(0.1, 0.0001)),
phi = list(prior = "pc", param = c(0.5, 0.5))),
scale.model = TRUE,
constr = T,
adjust.for.con.comp = T)
The variable pred
is the output variable we aim to model (e.g., income).
The selected covariates are added linearly. Remember that they have been standardized in the previous section to avoid any issues linked to scale. These covariates are called fixed effects.
The most complex aspect of the formula is the spatial dependence structure specified in f()
.
The ID
is the segmento identifiers linking the graph object Segmento.Inla.nb
, where the neighboring structure is defined, with the data frame where the covariates and output data are stored. The model
parameter is for the specification of the spatial correlation model that should be used. The bym2
is used because it is flexible and allows for a relatively intuitive specification of spatial random effects.
The flexibility comes from the fact that bym2
allows for the error term to be composed of two elements: (1) a pure random noise, and (2) a spatially correlated noise. The penalized complexity priors phi
allows the importance of each effect to be influenced (if phi
=0, then we have pure noise, if phi
=1, we have only a spatially correlated random effect).
The prec
parameter is the precision parameters, or how much the spatially correlated noise is allowed to vary spatially. If the value is high, then the spatially correlated noise is allowed to vary a lot across space and vice versa. By decreasing the penalized complexity priors on the precision parameter, we decrease the risk of overfitting: less of the spatial variation left unexplained by the covariates is attributed to the spatial random effects, and, as a result, the model is better able to generalize to areas where it was not trained.
Here, we have chosen to set the penalized complexity prior on precision as \(Pr(\sigma>1)=0.0001\) instead of the default \(Pr(\sigma>1)=0.01\). Indeed, we were facing overfitting issues with the default priors: the difference between the goodness of fit in the training and validation sets was very large (about 20 percent in the \(R^{2}\)).
4.2.5 Fit the Model on the Training Set
The model is fitted with the inla
command. We have to specify the likelihood distribution of the response variable.
The family
argument defines the type of likelihood function for the response variable. As the distribution of income is strictly positive and right skewed, a gamma distribution is a good option.
The data
argument specifies the data on which the model is fitted, which in our case is the data_model
data frame.
As we use a gamma likelihood function, the predicted values need to be back transformed to the original scale by using the exponential function. The control.predictor
automatizes the process once the option list(link=1,compute=T)
is specified.
We use a simple integration strategy to compute the posterior marginals of the model parameters by specifying control.inla =list(int.strategy = "eb")
, where eb
stands for “empirical Bayes.” In the empirical Bayes approach, we use only one “integration point equal to the posterior mode of the hyperparameters” (Moraga 2019). This speeds up the estimation of the model.
# Fit the model on the training set ####
start_time=Sys.time() # Time the start of the estimation
bym.res=inla(formula=formula_test,
family = "gamma",
data = data_model,
control.predictor=list(link=1,compute=T),
control.inla =list(int.strategy = "eb"))
end_time=Sys.time() # Time at the end of the estimation
duration=end_time-start_time
print(duration)
## Time difference of 54.15509 secs
It took 3.7 minutes to estimate the model. With the default integration strategy, the model takes 13.5 minutes to be estimated. No significant difference was found between the integration strategies.
4.2.6 Validate the Model on the Validation Set
The model’s goodness of fit can now be investigated on the validation set. We start by extracting the fitted values, which are stored in the bym.res
object under summary.fitted.values
. The inla
model yields a distribution of predicted income for each segmento. Various statistics summarizing this prediction are also available for each segmento:
- The mean fitted value
- The median fitted value
- The 2.5 and 97.5 percentiles of the fitted values
- The standard deviation of the fitted value distribution
Below, we select the mean fitted value, that is, the mean prediction value for each segmento.
We can now compute the root mean square error (RMSE) and the pseudo \(R^{2}\) as shown below.2
RMSE=function(set,outcome,data,fit){
res = data[set,outcome]-fit[set]
RMSE_val <- sqrt(mean(res^2,na.rm=T))
return(RMSE_val)
}
pseudo_r2=function(set,outcome,data,fit){
res = data[set,outcome]-fit[set]
RRes=sum((res)^2,na.rm = T)
RRtot=sum((data[set,outcome]-mean(fit[set],na.rm=T))^2,na.rm = T)
pseudo_r2_val=1-RRes/RRtot
return(pseudo_r2_val)
}
RMSE_val=RMSE(index_val,"ingpe",data_model,M_fit)
RMSE_train=RMSE(index_train,"ingpe",data_model,M_fit)
r2_val=pseudo_r2(index_val,"ingpe",data_model,M_fit)
r2_train=pseudo_r2(index_train,"ingpe",data_model,M_fit)
a <- list(
x = 600,
y = 100,
text = paste("R2 training set:",round(r2_train*100),"%",
"\nR2 validation set:",round(r2_val*100),"%",
"\n",
"\nRMSE training set:",round(RMSE_train),"USD",
"\nRMSE validation set:",round(RMSE_val),"USD"),
xref = "x",
yref = "y",
showarrow = F)
Figure 4.2 shows the predicted and observed income values for the training and validation sets. The model appears to do a relatively good job, although it has difficulty capturing the segmentos with an income above 400.
An \(R^{2}\) of 44 percent is relatively satisfactory. It means that 44 percent of the income variation in the validation set is explained by the model fitted on the training set. However, the fact that the \(R^{2}\) is at 69 percent, close to twice as large as the \(R^{2}\) in the validation set, indicates overfitting.
The RMSE is USD 66 in the validation set and USD 40 in the training set. The RMSE is the standard deviation of the residuals, the difference between the fitted and observed values.
The results are stored in a list and saved in a .RData
file for later consumption.
income_bym2_naive=list("outcome"="ingpe",
"spat_dep"="bym2",
"cov_select"="naive",
"formula"=formula_test,
"family"="gamma",
"data"=data_model,
"fit"=M_fit,
"index_val"=index_val,
"index_train"=index_train,
"r2_val"=r2_val,
"r2_train"=r2_train,
"RMSE_val"=RMSE_val,
"RMSE_train"=RMSE_train)
save(income_bym2_naive,
file=paste0(dir_data,
"out/results/income_bym2_naive.RData"))
4.3 Point-Referenced Data
4.3.1 Spatial Dependency with Point-Referenced Data
The spatial dependence is modeled here with an SPDE approach. The following steps are required:
1 Create the mesh. 2 Create the SPDE. *3 Create the matrix of weights.
4.3.1.1 Create the Mesh
To create the mesh, we start by identifying the segmentos for which there are EHPM data. Second, we create a spatial polygon of the country’s boundaries by dissolving the shapefile of the departamentos with the function unionSpatialPolygons
. Third, we transform this spatial polygon into a list that the INLA
library can process with the function inla.sp2segment
. Lastly, we use the helper function inla.mesh.create.helper
to create the mesh, using the boundary and coordinates of the segmentos’ centroids (i.e., middle points) as inputs.
# Create the mesh ####
segmento_sh_subset=subset(segmento_sh_data_model,
is.na(segmento_sh_data_model@data$ingpe)==F) # Identify where there is EHPM data
coords=coordinates(segmento_sh_subset) # Collect the coordinates of the segmentos with EHPM data
SVborder <- maptools::unionSpatialPolygons(departamentos_sh,
rep(1, nrow(departamentos_sh))) # Create one polygon using the boundaries of the countries
SV.bdry <- inla.sp2segment(SVborder) # Create an inla boundary object
SV.mesh <- inla.mesh.create.helper(
boundary=SV.bdry,
points=coords,
offset=c(2500, 25000),
max.edge=c(20000, 50000),
min.angle=c(25, 25),
cutoff=8000,
plot.delay=NULL)
plot(SV.mesh,main="",asp=1)
The offset
parameters of inla.mesh.create.helper
define the inner and outer distances. For instance, if we change the outer offset parameter to 250 kilometers instead of the original 25 kilometers, the result is plotted in Figure 4.4.
SV.meshL <- inla.mesh.create.helper(
boundary=SV.bdry,
points=coords,
offset=c(2500, 250000), # we increased from 25000 to 250000
max.edge=c(20000, 50000),
min.angle=c(25, 25),
cutoff=8000,
plot.delay=NULL)
plot(SV.meshL,main="",asp=1)
When choosing the outer offset parameter, the aim is to avoid the presence of boundary effects in the subsequent estimation of the SPDE, so it is best to allow for a sufficient distance as shown in Figure 4.3. However, Figure 4.4 is clearly too large and would slow the estimation process.
The inner offset parameter has no effect here, as it is superseded by the max.edge
, min.angle
, and cutoff
parameters. The max.edge
parameter defines the largest-allowed triangle-edge length. The min.angle
parameter defines the smallest-allowed triangle angle, and the parameter cutoff
defines the minimum-allowed distance between points.
Next, we illustrate the impact of modifying the max.edge
from 20 kilometers to 5 kilometers.
SV.meshL <- inla.mesh.create.helper(
boundary=SV.bdry,
points=coords,
offset=c(2500, 25000),
max.edge=c(20000, 5000), # Decrease from 50,000 to 5,000
min.angle=c(25, 25),
cutoff=8000,
plot.delay=NULL)
plot(SV.meshL,main="",asp=1)
Figure 4.5 shows the results. The risk of such a tight mesh is that it could lead to overfitting. Furthermore, it is computationally more expensive.
4.3.1.2 Derive the SPDE
Next, the SPDE is derived from the mesh using the inla.spde2.matern
function, and the spatial index is derived with the function inla.spde.make.index
.
# Create the SPDE ####
SV.spde <- inla.spde2.matern(mesh=SV.mesh,alpha=2)
s.index <- inla.spde.make.index(name="spatial.field", n.spde=SV.spde$n.spde)
The s.index
allows the mesh to link to the matrix of spatial weights that will be defined below.
#### Create the matrix of weights
The matrix of weights is created with the INLA
function inla.spde.make.A
.
4.3.2 Store the Data for Modeling in a Stack
We use the inla.stack
function to conveniently store the data in one object.
# Stack ####
covariates_list=c("AREA_ID","lights_med","pop_dens","slope","chirps_ev_2017","dist2pubamen","dist2road")
stack.train <- inla.stack(data = list(pred=segmento_sh_subset@data$ingpe[index_train]), # This output variable (income)
A = list(A.train, 1,1),
effects = list(s.index,
Intercept=1:length(index_train),
segmento_sh_subset@data[index_train,c(covariates_list)]),
tag="train") # The tag allows selected statistics for the desired sample to be extracted later
stack.val <- inla.stack(data = list(pred=NA), # Set it to NA
A = list(A.val, 1,1),
effects = list(s.index,
Intercept=1:length(index_val),
segmento_sh_subset@data[index_val,c(covariates_list)]),
tag="val")
# Combine both stacks
join.stack <- inla.stack(stack.train, stack.val)
4.3.3 Specifying the Relationship between the Outcome Variable and the Fixed and Spatial Random Effects
We use the same set of fixed effects.
# Write the test formula ####
formula_test =pred~ -1+Intercept+
AREA_ID+lights_med+pop_dens+slope+chirps_ev_2017+dist2pubamen+dist2road+
f(spatial.field, model=SV.spde)
The main difference from the BYM 2 specification is that the spatial random effects are specified by the f()
function, which takes two arguments: spatial.field
is the name of the spatial index s.index
, and the model
is the SPDE defined in SV.spde.
4.3.4 Fit the Model on the Training Set
The same inla
command used to fit the areal model is used to fit the SPDE model. The only difference is that the data
argument takes the stack data defined above as an input.
# Fit the model####
start_time=Sys.time()
spde_res=inla(formula_test, # Formula
data=inla.stack.data(join.stack),
family="gamma", # Likelihood of the data
control.predictor=list(A=inla.stack.A(join.stack),
link=1,compute=T))
end_time=Sys.time()
print(end_time-start_time)
## Time difference of 10.57809 secs
The model is much faster to fit: it took only 19 seconds (against 3.5 minutes for the BYM 2 model).
4.3.5 Validate the Model on the Validation Set
We can now investigate the model’s goodness of fit. To extract the fitted values for the training and validation sets from the income_spde_test
object, we use the inla.stack.index
function. Again, we extract only the mean predicted value for each segmento.
# Extract the fitted values
index_inla_train = inla.stack.index(join.stack,"train")$data
index_inla_val = inla.stack.index(join.stack,"val")$data
results.train=spde_res$summary.fitted$mean[index_inla_train]
results.val=spde_res$summary.fitted$mean[index_inla_val]
M_fit_spde=array(NA,length(M_fit))
M_fit_spde[index_train]=results.train
M_fit_spde[index_val]=results.val
r2_train_spde=pseudo_r2(index_train,"ingpe",segmento_sh_subset@data,M_fit_spde)
r2_val_spde=pseudo_r2(index_val,"ingpe",segmento_sh_subset@data,M_fit_spde)
rmse_train_spde=RMSE(index_train,"ingpe",segmento_sh_subset@data,M_fit_spde)
rmse_val_spde=RMSE(index_val,"ingpe",segmento_sh_subset@data,M_fit_spde)
a <- list(
x = 600,
y = 100,
text = paste("R2 training set:",round(r2_train_spde*100),"%",
"\nR2 validation set:",round(r2_val_spde*100),"%",
"\n",
"\nRMSE training set:",round(rmse_train_spde),"USD",
"\nRMSE validation set:",round(rmse_val_spde),"USD"),
xref = "x",
yref = "y",
showarrow = F)
Figure 4.6 shows the observed values against the predicted values for the training and validation sets. There are no major differences in the results obtained with the modified BYM model shown in Figure 4.2 except that there does not appear to be overfitting here, as the \(R^{2}\) are similar in the training and validation sets.
Next, we save the results for later use.
income_spde_naive=list("outcome"="ingpe",
"spat_dep"="spde",
"cov_select"="naive",
"formula"=formula_test,
"family"="gamma",
"data"=join.stack,
"fit"=M_fit_spde,
"index_val"=index_val,
"index_train"=index_train,
"r2_val"=r2_val_spde,
"r2_train"=r2_train_spde,
"RMSE_val"=rmse_val_spde,
"RMSE_train"=rmse_train_spde)
save(income_spde_naive,
file=paste0(dir_data,
"out/results/income_spde_naive.RData"))
Lastly, Figure 4.7 shows the results when using a Gaussian likelihood and log transforming the income data. There is a minor improvement to the goodness of fit compared to the gamma model.
# Write the test formula ####
formula_test =log(pred)~ -1+Intercept+
AREA_ID+lights_med+pop_dens+slope+chirps_ev_2017+dist2pubamen+dist2road+
f(spatial.field, model=SV.spde)
# Fit the model####
spde_res=inla(formula_test, # Formula
data=inla.stack.data(join.stack),
family="gaussian", # Likelihood of the data
control.predictor=list(A=inla.stack.A(join.stack),
link=1,compute=T))
# Extract the fitted values
index_inla_train = inla.stack.index(join.stack,"train")$data
index_inla_val = inla.stack.index(join.stack,"val")$data
results.train=spde_res$summary.fitted$mean[index_inla_train]
results.val=spde_res$summary.fitted$mean[index_inla_val]
M_fit_spde=array(NA,length(M_fit))
M_fit_spde[index_train]=results.train
M_fit_spde[index_val]=results.val
# Compute the goodness-of-fit statistics
segmento_sh_subset@data$ln_ingpe=log(segmento_sh_subset@data$ingpe)
r2_train_spde=pseudo_r2(index_train,"ln_ingpe",segmento_sh_subset@data,M_fit_spde)
r2_val_spde=pseudo_r2(index_val,"ln_ingpe",segmento_sh_subset@data,M_fit_spde)
rmse_train_spde=RMSE(index_train,"ln_ingpe",segmento_sh_subset@data,M_fit_spde)
rmse_val_spde=RMSE(index_val,"ln_ingpe",segmento_sh_subset@data,M_fit_spde)
a <- list(
x = 4,
y = 6,
text = paste("R2 training set:",round(r2_train_spde*100),"%",
"\nR2 validation set:",round(r2_val_spde*100),"%",
"\n",
"\nRMSE training set:",round(rmse_train_spde,1),"log USD",
"\nRMSE validation set:",round(rmse_val_spde,1),"log USD"),
xref = "x",
yref = "y",
showarrow = F)
# Save the results
income_spde_naive_gaussian=list("outcome"="ingpe",
"spat_dep"="spde",
"cov_select"="naive",
"formula"=formula_test,
"family"="gaussian",
"data"=join.stack,
"fit"=M_fit_spde,
"index_val"=index_val,
"index_train"=index_train,
"r2_val"=r2_val_spde,
"r2_train"=r2_train_spde,
"RMSE_val"=rmse_val_spde,
"RMSE_train"=rmse_train_spde)
save(income_spde_naive_gaussian,
file=paste0(dir_data,
"out/results/income_spde_naive_gaussian.RData"))
4.4 K-fold Cross-Validation
The split of the observations between the training and validation sets has an important impact on the model’s goodness of fit. Indeed, the model could fit the training set observations very well but the validations set observations so poorly as to fit them purely by chance.
K-fold cross-validation is a way to assess the results’ robustness to various splits of the data. In K-fold cross-validation, the data a split in K folds, and each fold is used only once for validation purposes and \(K-1\) times for training purposes. At the end of the validation process, one can assess how the goodness-of-fit statistics vary and the stastistics’ average value. A typical value for K is 10.
The 10 folds are created as follows:
set.seed(1)
spec = c(val1 = .1, val2 = .1, val3 = .1,
val4 = .1, val5 = .1, val6 = .1,
val7 = .1, val8 = .1, val9 = .1,
val10 = .1)
g = sample(cut(
seq(nrow(data_model)),
nrow(data_model)*cumsum(c(0,spec)),
labels = names(spec)
))
We then fit the model 10 times, using each fold as a validation set once.
r2_train_spde_a=r2_val_spde_a=rmse_train_spde_a=rmse_val_spde_a=c() # Prepare empty array to store goodness-of-fit statistics
for(k in 1:10){
# Define the index
index_val=which(g==paste0("val",k))
index_train=which(g!=paste0("val",k))
# Define the weights
A.train <- inla.spde.make.A(mesh=SV.mesh, loc=coords[index_train,])
A.val <- inla.spde.make.A(mesh=SV.mesh, loc=coords[index_val,])
# Build the stack
covariates_list=c("AREA_ID","lights_med","pop_dens","slope","chirps_ev_2017","dist2pubamen","dist2road")
stack.train <- inla.stack(data = list(pred=segmento_sh_subset@data$ingpe[index_train]), # This output variable (income)
A = list(A.train, 1,1),
effects = list(s.index,
Intercept=1:length(index_train),
segmento_sh_subset@data[index_train,c(covariates_list)]),
tag="train") # The tag allows selected statistics for the desired sample to be extracted later
stack.val <- inla.stack(data = list(pred=NA), # Set it to NA
A = list(A.val, 1,1),
effects = list(s.index,
Intercept=1:length(index_val),
segmento_sh_subset@data[index_val,c(covariates_list)]),
tag="val")
join.stack <- inla.stack(stack.train, stack.val)
# Write the test formula ####
formula_test =pred~ -1+Intercept+
AREA_ID+lights_med+pop_dens+slope+chirps_ev_2017+dist2pubamen+dist2road+
f(spatial.field, model=SV.spde)
# Fit the model
spde_res=inla(formula_test, # formula
data=inla.stack.data(join.stack),
family="gamma", # likelihood of the data
control.predictor=list(A=inla.stack.A(join.stack),
link=1,compute=T))
end_time=Sys.time()
# Extract the fitted values
index_inla_train = inla.stack.index(join.stack,"train")$data
index_inla_val = inla.stack.index(join.stack,"val")$data
results.train=spde_res$summary.fitted$mean[index_inla_train]
results.val=spde_res$summary.fitted$mean[index_inla_val]
M_fit_spde=array(NA,length(M_fit))
M_fit_spde[index_train]=results.train
M_fit_spde[index_val]=results.val
# Compute goodness-of-fit statistics
r2_train_spde=pseudo_r2(index_train,"ingpe",segmento_sh_subset@data,M_fit_spde)
r2_val_spde=pseudo_r2(index_val,"ingpe",segmento_sh_subset@data,M_fit_spde)
rmse_train_spde=RMSE(index_train,"ingpe",segmento_sh_subset@data,M_fit_spde)
rmse_val_spde=RMSE(index_val,"ingpe",segmento_sh_subset@data,M_fit_spde)
# Store goodness-of-fit statistics in an array
r2_train_spde_a=c(r2_train_spde_a,r2_train_spde)
r2_val_spde_a=c(r2_val_spde_a,r2_val_spde)
rmse_train_spde_a=c(rmse_train_spde_a,rmse_train_spde)
rmse_val_spde_a=c(rmse_val_spde_a,rmse_val_spde)
# Print status
# Print(paste(k, "folds done"))
}
print(paste("Average R2 in the validation set:", round(mean(r2_val_spde_a)*100),"%",
"\nAverage R2 in the training set:", round(mean(r2_train_spde_a)*100),"%"))
## [1] "Average R2 in the validation set: 44 % \nAverage R2 in the training set: 48 %"
The average \(R^{2}\) of the validation and training sets (respectively 41 percent and 47 percent) are in line with the results obtained with the original split.
References
Moraga, P. 2019. Geospatial Health Data: Modeling and Visualization with R-Inla and Shiny. Chapman & Hall/CRC Biostatistics Series.
We refer to the pseudo \(R^{2}\) as the \(R^{2}\) in the report. The pseudo \(R^{2}\) is preferred to the normal \(R^{2}\) (squared correlation between the fitted and observed values), as it allows for better comparison with nonlinear models.↩︎