Section 5 Stepwise Covariates Selection

This section describes a backward stepwise covariates selection process, which reduces the number of covariates from the 50 potential ones identified in section 3 to about 10 or fewer. Here is an overview of the process:

  1. Split the sample into 60-20-20 training, validation, and test sets.
  2. Write the test formula without one of the 50 candidate covariates.
  3. Fit the model on the training set.
  4. Validate the model on the validation set and compute the goodness-of-fit statistics.
  5. Repeat steps two to four once for each covariate (50 times).
  6. Drop the covariate whose removal yields the best fit.
  7. Repeat steps two to six until only one covariate is left.
  8. Inspect the results to identify a parsimonious specification (typically below 10 covariates) yielding a good fit on the validation set.
  9. Test the final model on the test set.

The backward stepwise covariates selection process is computationally expensive. Fitting one modified BYM model takes about 3.5 minutes on a standard PC.3 Carrying out steps 1 to 5 takes about 175 minutes. Steps 6 and 7 take about 74 hours. To speed up the process, we write a function to parallelize step 5 across CPU cores. As the number of cores might be relatively limited on a PC (typically up to eight), an option is to carry out the process on the cloud, where an instance with more cores can be rented. For instance, the computing time can be reduced to below two hours with 64 cores.
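
As a back-of-the-envelope check of these figures (a sketch, not part of the original code), the sequential computing time can be recovered from the number of model fits:

# Approximate the sequential computing time ####
fit_time_min=3.5 # Approximate minutes per model fit on a standard PC
n_cov=50 # Number of candidate covariates
n_cov*fit_time_min # Steps 1 to 5 (one fit per candidate covariate): about 175 minutes
sum(n_cov:2)*fit_time_min/60 # Entire backward pass (50 + 49 + ... + 2 fits): roughly 74 hours
# With many cores, the fits within each round run concurrently, so the wall time is driven mainly by the number of rounds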

The backward stepwise covariates selection process is illustrated below with the income model.

5.1 Step 1: Split the Sample into 60-20-20 Training, Validation, and Test Sets

The observations in the dataset are split into 60-20-20 training, validation, and test sets. The model will be fitted on the training set, and income will be predicted on the training and validation sets. Predicted values in the validation set will be compared with the observed values to assess the model’s goodness of fit.

Once the covariate selection process is complete and a given specification has been chosen (steps 1 to 8 above), predictions are made on the final test set to assess the final model’s goodness of fit.

# Sample segmentos for training, validation, and testing  ####
set.seed(1234)
spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(
  seq(nrow(data_model)), 
  nrow(data_model)*cumsum(c(0,spec)),
  labels = names(spec)
))
index_val=which(g=="validate")
index_train=which(g=="train")
index_test=which(g=="test")

data_model$pred=data_model$ingpe
data_model$pred[c(index_val,index_test)] <- NA # Set the validation and test predictions to NA: these values will be predicted by the model and compared with the observed values 

mod_data_jack = data_model # Store the data into a new data frame for model fitting      
mod_data_jack=mod_data_jack[c(index_val,index_train),] # Keep only the training and validation sets; the test set is set aside for the final validation
index_val_jack=which(is.na(mod_data_jack$pred))
index_train_jack=which(is.na(mod_data_jack$pred)==F)
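
As a quick sanity check (not part of the original workflow), the split shares and the sizes of the sets used for fitting can be inspected:

# Check the split ####
round(prop.table(table(g)),2) # Shares of the training, validation, and test sets (about 0.6, 0.2, and 0.2)
length(index_train_jack) # Observations used to fit the model
length(index_val_jack) # Observations held out for validation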

5.2 Steps 2 to 4: Create a Function

To make reading the code easier, we create a function INLA_steps_2_4 that implements steps 2 to 4 of the covariate selection process. The steps in INLA_steps_2_4 are

  1. Write the test formula without one of the candidate covariates.
  2. Fit the model on the training set.
  3. Validate the model on the validation set.
  4. Compute the goodness-of-fit statistics.

The INLA_steps_2_4 function accepts the following arguments: the covariates formulation, the index of the covariate to be removed, the index of the validation set, the index of the training set, the data frame for the model, a string defining the outcome variable (e.g., “ingpe”), and a string defining the likelihood function (e.g., a Gaussian, gamma, or beta distribution).

# INLA_steps_2_4 function #####
INLA_steps_2_4=function(covariates_formulation, # The formulation of each covariate
                        ix, # The index of the covariate to be removed
                        index_val, # The index of the validation set
                        index_train, # The index of the training set
                        mod_data_jack, # The data for the model
                        outcome,   # The string defining the outcome variable, such as ingpe
                        family){   # The string defining the likelihood 
  
  # Formula
  candidate_covariates <- covariates_formulation 
  
  formula_test <- reformulate(c("-1", "intercept",
                                paste(candidate_covariates[-ix],collapse="+"),
                                'f(ID,model = "bym2",
                                hyper=list(
                                prec = list(prior = "pc.prec", param = c(0.1, 0.0001)),
                                phi  = list(prior = "pc", param = c(0.5, 0.5))),
                                graph = Segmento.Inla.nb,
                                scale.model = TRUE,
                                constr = T,
                                adjust.for.con.comp = T)'),
                              "pred") # Default formula (the Gaussian case below rescales the outcome)
  
  if(family=="gaussian"){
    formula_test <- reformulate(c("-1", "intercept",
                                  paste(candidate_covariates[-ix],collapse="+"),
                                  'f(ID,model = "bym2",
                                hyper=list(
                                prec = list(prior = "pc.prec", param = c(0.1, 0.0001)),
                                phi  = list(prior = "pc", param = c(0.5, 0.5))),
                                graph = Segmento.Inla.nb,
                                scale.model = TRUE,
                                constr = T,
                                adjust.for.con.comp = T)'),
                                "pred/10") # Rescale to allow the Newton-Raphson optimizer to converge
  }
  # Fit the model
  if(family%in%c("beta","zeroinflatedbinomial1")){
    bym.res=INLA::inla(formula=formula_test, 
                       family = family, 
                       data = mod_data_jack,
                       Ntrials = mod_data_jack$n_obs,
                       control.predictor=list(link=1,compute=T),
                       control.compute=list(dic=T, cpo=F),
                       control.inla =list(int.strategy = "eb")
    )
  }else{
    
    bym.res=INLA::inla(formula=formula_test, 
                       family = family, 
                       data = mod_data_jack,
                       control.predictor=list(link=1,compute=T),
                       control.compute=list(dic=T, cpo=F),
                       control.inla =list(int.strategy = "eb") 
    )
  }
  # Extract the fitted values
  M_fit=bym.res$summary.fitted.values[,"mean"]
  
  if(family%in%c("gamma")){
    M_fit=exp(M_fit) # Back transform to level
  }
  if(family%in%c("gaussian")){
    M_fit=M_fit*10 # Back transform to level
  }
  
  # The function for RMSE 
  RMSE=function(set,outcome,data){
    res = data[set,outcome]-M_fit[set]
    RMSE_val <- sqrt(mean(res^2,na.rm=T)) 
    return(RMSE_val)  
  }
  # The function for pseudo_r2
  pseudo_r2=function(set,outcome,data){
    res =  data[set,outcome]-M_fit[set]
    RRes=sum((res)^2,na.rm = T)
    RRtot=sum((data[set,outcome]-mean(M_fit[set],na.rm=T))^2,na.rm = T)
    pseudo_r2_val=1-RRes/RRtot
    return(pseudo_r2_val)  
  }
  
  # RMSE: RMSE function defined above
  RMSE_val=RMSE(index_val,outcome,mod_data_jack)
  RMSE_train=RMSE(index_train,outcome,mod_data_jack)
  
  # R2: Pseudo_r2 function defined above
  r2_val=pseudo_r2(index_val,outcome,mod_data_jack)
  r2_train=pseudo_r2(index_train,outcome,mod_data_jack)
  
  # Store the results
  results_list=list("cov_i"=ix,
                    "cov_name"=candidate_covariates[ix],
                    "formula"=formula_test,
                    # "fitted_values"=M_fit,
                    # "index_val"=index_val,
                    # "index_train"=index_train,
                    # "index_train_val"=index_train_val,
                    "RMSE_val"=RMSE_val,
                    "RMSE_train"=RMSE_train,
                    "r2_val"=r2_val,
                    "r2_train"=r2_train,
                    "outcome"=outcome,
                    "family"=family)
  rm(bym.res)
  return(results_list)
}

We can now test the function.

# Test the function #####
test_fct=INLA_steps_2_4(covariates_formulation, # The formulation of each covariate
                        1, # The index of the covariate to be removed
                        index_val_jack, # The index of the validation set
                        index_train_jack, # The index of the training set
                        mod_data_jack, # The data for the model
                        "ingpe", # The outcome
                        "gaussian") # The likelihood

The results can then be accessed using the $ sign. For instance, we can look at the \(R^2\):

print(paste("R-squared in the validation set:",test_fct$r2_val,
            "R-squared in the training set:",test_fct$r2_train))
## [1] "R-squared in the validation set: 0.412553293719591 R-squared in the training set: 0.511180089748186"
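
The other goodness-of-fit statistics stored in the returned list are retrieved in the same way, for instance the RMSE:

print(paste("RMSE in the validation set:",round(test_fct$RMSE_val,2),
            "RMSE in the training set:",round(test_fct$RMSE_train,2)))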

5.3 Steps 5 to 7: Run a Parallel Loop

We now launch the entire backward selection process with a loop. To speed up the process, we parallelize the estimation of the individual models across the cores: each core is in charge of estimating one model. In the first iteration of the loop, we start with a list of \(50\) candidate covariates, and \(50\) models are estimated. Each model contains all covariates except one covariate \(j\), where \(j\) differs across models and \(j=1, ..., 50\). Once the \(50\) models have been estimated, the results are collected and the best-performing model, say model \(k\), is identified. The corresponding covariate \(k\) is removed from the list of candidate covariates, and the loop proceeds to the next iteration with a list of \(50-1=49\) candidate covariates.

# Loop all covariates until only one is left #####
candidate_covariates=covariates_formulation # Start again with the full list of candidate covariates
n2drop=length(covariates_formulation)-1 
start_t=Sys.time()
results_jacknife=list() # Create a list to store all the results of the jackknife selection process
for(n_cov_to_drop in 1:n2drop){  # Repeat the step n2drop times, until only one covariate is left
  
  set.seed(101)
  
  INLA_steps_2_4_par=function(cov_n){  # Fit the model by dropping one covariate after the other
    model_fct=INLA_steps_2_4(candidate_covariates, # The formulation of each covariate
                             cov_n, # The index of the covariate to be removed
                             index_val_jack, # The index of the validation set
                             index_train_jack, # The index of the training set
                             mod_data_jack,  # The data for the model
                             "ingpe", # The outcome
                             "gaussian") # The likelihood
    cat("\014") # Clear the console
    return(model_fct)
  }
  cov_n<-1:length(candidate_covariates)
  n_cores<-detectCores()
  cl=makeCluster(n_cores)
  start_T<-Sys.time()
  clusterExport(cl=cl,
                varlist=c("candidate_covariates",
                          "INLA_steps_2_4",
                          "index_val_jack",
                          "index_train_jack",
                          "mod_data_jack",
                          "Segmento.Inla.nb"))
  results_list<-clusterApply(cl,cov_n,INLA_steps_2_4_par)
  end_T<-Sys.time()
  cat("\014") # Clear the console
  print(end_T-start_T)
  stopCluster(cl)
  gc()
  # Store all the results for a quality check
  results_jacknife[[n_cov_to_drop]]=results_list 
  
  # Drop the covariate that least affects the goodness of fit 
  
  rmse_val=lapply(results_list,function(x) unlist(x$RMSE_val))
  rmse_val_min=which.min(rmse_val)
  
  covariates_to_be_removed=results_list[[rmse_val_min]]$cov_name
  
  candidate_covariates=candidate_covariates[-which(candidate_covariates==covariates_to_be_removed)]
  rm(results_list)
  print(paste(length(candidate_covariates),"covariates remaining"))
  
}
end_t=Sys.time()
duration=end_t-start_t
print(duration) # 1.84 hours

The process took slightly less than 2 hours using 64 cores on the cloud.
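
Because rerunning the loop is costly, it is worth saving its output before moving on. Below is a minimal sketch, assuming dir_data points to the workspace folder loaded in the next step; the workspace saved in the original analysis may contain additional objects.

# Save the selection results ####
save(results_jacknife, # The list of results inspected in the next step
     file=paste0(dir_data,"workspace/income_bym2_gaussian.RData"))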

5.4 Step 8: Inspect the Results and Select a Specification

We now inspect the results of the selection process, looking at the goodness of fit of the best-performing model at each covariate selection step, with the R-squared and the RMSE as goodness-of-fit statistics.

# INCOME BYM-2 ####
load(paste0(dir_data,"workspace/income_bym2_gaussian.RData"))

# Visualize the R-squared  ####
# Extract the R-squared
r2_val_max=r2_train_max=r2_val_min=r2_train_min=c()
for(k in 1:length(results_jacknife)){
  r2_val=lapply(results_jacknife[[k]],function(x) unlist(x$r2_val))
  r2_train=lapply(results_jacknife[[k]],function(x) unlist(x$r2_train))
  
  # Get the maximum r2
  r2_val_max_index=which.max(r2_val)
  r2_val_max_i=r2_val[[r2_val_max_index]]
  r2_train_max_i=r2_train[[r2_val_max_index]]
  
  r2_val_max=c(r2_val_max,r2_val_max_i)
  r2_train_max=c(r2_train_max,r2_train_max_i)
  
  # Get the minimum r2
  r2_val_min_index=which.min(r2_val)
  r2_val_min_i=r2_val[[r2_val_min_index]]
  r2_train_min_i=r2_train[[r2_val_min_index]]
  
  r2_val_min=c(r2_val_min,r2_val_min_i)
  r2_train_min=c(r2_train_min,r2_train_min_i)
}

# Visualize the r2
data2plot=data.frame(set=c(rep("val",length(r2_val_max)),rep("train",length(r2_val_max))),
                     index=c(rev(1:length(r2_val_max)),rev(1:length(r2_val_max))),
                     index_jack=c(1:length(r2_val_max),1:length(r2_val_max)),
                     r2_max=c(unlist(r2_val_max),unlist(r2_train_max)),
                     r2_min=c(unlist(r2_val_min),unlist(r2_train_min)))

plotly::plot_ly(data=data2plot,
                x=~index_jack,
                y=~r2_max,
                name = ~set,
                type="scatter",
                mode="lines")
# Visualize the RMSE ####
# Extract the RMSE
RMSE_val_max=RMSE_train_max=RMSE_val_min=RMSE_train_min=c()
for(k in 1:length(results_jacknife)){
  RMSE_val=lapply(results_jacknife[[k]],function(x) unlist(x$RMSE_val))
  RMSE_train=lapply(results_jacknife[[k]],function(x) unlist(x$RMSE_train))
  
  # Get the maximum RMSE
  RMSE_val_max_index=which.max(RMSE_val)
  RMSE_val_max_i=RMSE_val[[RMSE_val_max_index]]
  RMSE_train_max_i=RMSE_train[[RMSE_val_max_index]]
  
  RMSE_val_max=c(RMSE_val_max,RMSE_val_max_i)
  RMSE_train_max=c(RMSE_train_max,RMSE_train_max_i)
  
  # Get the minimum RMSE
  RMSE_val_min_index=which.min(RMSE_val)
  RMSE_val_min_i=RMSE_val[[RMSE_val_min_index]]
  RMSE_train_min_i=RMSE_train[[RMSE_val_min_index]]
  
  RMSE_val_min=c(RMSE_val_min,RMSE_val_min_i)
  RMSE_train_min=c(RMSE_train_min,RMSE_train_min_i)
}

# Visualize the RMSE
data2plot=data.frame(set=c(rep("val",length(RMSE_val_max)),rep("train",length(RMSE_val_max))),
                     index=c(rev(1:length(RMSE_val_max)),rev(1:length(RMSE_val_max))),
                     RMSE_max=c(unlist(RMSE_val_max),unlist(RMSE_train_max)),
                     RMSE_min=c(unlist(RMSE_val_min),unlist(RMSE_train_min)))

plotly::plot_ly(data=data2plot,
                x=~index,
                y=~RMSE_min,
                name = ~set,
                type="scatter",
                mode="lines")

Based on the results above, the 31st iteration appears to provide a good balance between the sparsity of the formulation and the goodness of fit.

# Identify the specification ####
step_chosen=31  
r2_val=lapply(results_jacknife[[step_chosen]],function(x) unlist(x$r2_val))
r2_val_max_index=which.max(r2_val)

formula_selected=results_jacknife[[step_chosen]][[r2_val_max_index]]$formula
formula_selected
## pred/10 ~ -1 + intercept + chirps_ev_med + lights_med + dist2coast_r + 
##     pop_dens + lc_tree + dens_all_roads + dens_secondary + dens_bus + 
##     LZ_CentralIndusServ + f(ID, model = "bym2", hyper = list(prec = list(prior = "pc.prec", 
##     param = c(0.1, 0.0001)), phi = list(prior = "pc", param = c(0.5, 
##     0.5))), graph = Segmento.Inla.nb, scale.model = TRUE, constr = T, 
##     adjust.for.con.comp = T)
## <environment: 0x5632bcc57820>

Here are the covariates that were selected:

  • chirps_ev_med: median precipitation
  • lights_med: median lights over the year
  • dist2coast_r: distance to coast
  • pop_dens: population density
  • lc_tree: percentage of trees
  • dens_all_roads: road density (all road categories)
  • dens_secondary: secondary road density
  • dens_bus: bus-lane density
  • LZ_CentralIndusServ: central industry and services livelihood zone
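
As a quick cross-check (not part of the original workflow), the terms of the selected formula, including the f() spatial component, can also be listed programmatically:

# List the terms of the selected formula ####
labels(terms(formula_selected))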

5.5 Step 9: Conduct a Final Test

Lastly, we perform the final goodness-of-fit test on the test set. As a reminder, the test set contains the 20 percent of the data that we set aside before starting the selection process, meaning that the model has not yet seen these observations.

We start by building the data frame for estimating the final model with the training and test sets only.

# Build a data frame for the final test ####
data_test=data_model
data_test$pred=data_test$ingpe # Create prediction variables equal to ingpe, the outcome variable
data_test$pred[c(index_test, index_val)] <- NA # Set the test prediction to NA: the variables will be predicted by the model and compared with observed 
data_test=data_test[c(index_test,index_train),]

index_train_bis=which(is.na(data_test$pred)==F)
index_test_bis=which(is.na(data_test$pred))
a <- list( # Annotation for Figure 5.1; it uses the goodness-of-fit statistics computed once the final model is fitted below
  x = 600,
  y = 100,
  text = paste("R2 training set:",round(r2_train*100),"%",
               "\nR2 test set:",round(r2_test*100),"%",
               "\n",
               "\nRMSE training set:",round(RMSE_train),"USD",
               "\nRMSE test set:",round(RMSE_test),"USD"),
  
  xref = "x",
  yref = "y",
  showarrow = F)

The estimation is done with the usual inla function, following the same steps as in INLA_steps_2_4.
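
Below is a minimal sketch of this final step rather than the original code: it fits the selected formula on data_test, back-transforms the Gaussian fitted values, and computes the pseudo R-squared and RMSE that feed the annotation a and Figure 5.1 (object names such as bym.final are illustrative).

# Fit the selected model and compute the final goodness of fit ####
bym.final=INLA::inla(formula=formula_selected,
                     family="gaussian",
                     data=data_test,
                     control.predictor=list(link=1,compute=T),
                     control.compute=list(dic=T, cpo=F),
                     control.inla=list(int.strategy = "eb"))

M_fit=bym.final$summary.fitted.values[,"mean"]*10 # Back transform to level (the Gaussian outcome is pred/10)

# Goodness-of-fit statistics on the training and test sets
res_train=data_test[index_train_bis,"ingpe"]-M_fit[index_train_bis]
res_test=data_test[index_test_bis,"ingpe"]-M_fit[index_test_bis]
RMSE_train=sqrt(mean(res_train^2,na.rm=T))
RMSE_test=sqrt(mean(res_test^2,na.rm=T))
r2_train=1-sum(res_train^2,na.rm=T)/
  sum((data_test[index_train_bis,"ingpe"]-mean(M_fit[index_train_bis],na.rm=T))^2,na.rm=T)
r2_test=1-sum(res_test^2,na.rm=T)/
  sum((data_test[index_test_bis,"ingpe"]-mean(M_fit[index_test_bis],na.rm=T))^2,na.rm=T)

# Observed versus predicted income in the test set; the annotation a defined above can now be evaluated
fig=plotly::plot_ly(x=data_test[index_test_bis,"ingpe"],
                    y=M_fit[index_test_bis],
                    type="scatter",
                    mode="markers")
plotly::layout(fig,
               annotations=a,
               xaxis=list(title="Observed income"),
               yaxis=list(title="Predicted income"))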

Figure 5.1: Observed versus Predicted Income Values, BYM 2 Model with Covariates Selected with the Backward Stepwise Selection Process


  1. Processor: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, 16.0 GB RAM.