How to find the variables that contribute most significantly to a response variable

"Selecting the most important predictor variables that explains the major part of variance of the response variable can be key to identify and build high performing models. These techniques are powerful tools that can help reveal the large sediments of gold in your data.

Data Preparation

To illustrate the various methods, we will use the ‘Ozone’ data from the ‘mlbench’ package, except for the information value method, which applies only to binary categorical response variables and is therefore demonstrated on a different data set.
# Data Preparation
library(mlbench)
data(Ozone, package="mlbench")
inputData <- Ozone
names(inputData) <- c("Month", "Day_of_month", "Day_of_week", "ozone_reading", "pressure_height", "Wind_speed", "Humidity", "Temperature_Sandburg", "Temperature_ElMonte", "Inversion_base_height", "Pressure_gradient", "Inversion_temperature", "Visibility") # assign names

Impute missing values using k-Nearest Neighbours

### Impute missing values
library(DMwR) # knnImputation() is also available in the newer DMwR2 package
inputData <- knnImputation(inputData) # impute all NAs with k-nearest neighbours
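anyNA(inputData) # quick sanity check (not in the original): should be FALSE after imputation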

Segregate continuous and categorical variables

### Segregate all continuous and categorical variables
# Place all continuous vars in inputData_cont
inputData_cont <- inputData[, c("pressure_height", "Wind_speed", "Humidity", "Temperature_Sandburg", "Temperature_ElMonte", "Inversion_base_height", "Pressure_gradient", "Inversion_temperature", "Visibility")]
# Place all categorical variables in inputData_cat
inputData_cat <- inputData[, c("Month", "Day_of_month", "Day_of_week")]
# create the response data frame
inputData_response <- data.frame(ozone_reading=inputData[, "ozone_reading"]) # response variable as a dataframe
response_name <- "ozone_reading" # name of response variable
response <- inputData[, response_name] # response variable as a vector

1. Random Forest Method

Random forests can be very effective at finding the set of predictors that best explains the variance in the response variable.

library(party)
cf1 <- cforest(ozone_reading ~ . , data= inputData, control=cforest_unbiased(mtry=2,ntree=50)) # fit the random forest
varimp(cf1) # get variable importance, based on mean decrease in accuracy
varimp(cf1, conditional=TRUE) # conditional=TRUE adjusts for correlations between predictors
varimpAUC(cf1) # AUC-based importance, more robust towards class imbalance (binary classification only)

## Based on mean decrease in accuracy
# Month                   3.07240463
# Day_of_month            0.05763824
# Day_of_week             0.23705288
# pressure_height         7.07962903
# Wind_speed              0.18803550
# Humidity                3.48513051
# Temperature_Sandburg   11.15242224
# Temperature_ElMonte    13.76939819
# Inversion_base_height   3.97817073
# Pressure_gradient       2.34617215
# Inversion_temperature   7.88156535
# Visibility              2.43741341

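A possible follow-up, not from the original article but a minimal sketch: a common rule of thumb from Strobl et al. is to keep only the variables whose importance score exceeds the absolute value of the lowest negative score.

# Shortlist predictors whose importance exceeds |most negative score|
imp <- varimp(cf1)
threshold <- abs(min(imp[imp < 0], 0)) # falls back to 0 when no score is negative
sort(imp[imp > threshold], decreasing=TRUE)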

2. Relative Importance

Using calc.relimp from the ‘relaimpo’ package, the relative importance of the variables fed into an lm() model can be determined as relative shares (proportions that sum to 1).

library(relaimpo)
lmMod <- lm(ozone_reading ~ . , data = inputData) # fit lm() model
relImportance <- calc.relimp(lmMod, type = "lmg", rela = TRUE) # relative importance, scaled so the shares sum to 1
sort(relImportance$lmg, decreasing=TRUE) # relative importance

## Shares sum to 1
# Temperature_ElMonte     0.184722438
# Temperature_Sandburg    0.164540381
# Month                   0.163371978
# Inversion_temperature   0.137890248
# pressure_height         0.087594494
# Inversion_base_height   0.083696664
# Humidity                0.068573808
# Visibility              0.039202230
# Day_of_month            0.031248599
# Pressure_gradient       0.026557629
# Day_of_week             0.008371262
# Wind_speed              0.004230269

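To judge how stable these shares are, the ‘relaimpo’ package can also bootstrap them. A minimal sketch (the number of bootstrap runs, b=100, is an arbitrary choice here):

# Bootstrap the lmg shares and print confidence intervals
bootResult <- boot.relimp(lmMod, b=100, type="lmg", rela=TRUE)
booteval.relimp(bootResult) # bootstrap confidence intervals for the relative importances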

3. MARS (earth package)

The earth package implements variable importance based on generalized cross-validation (GCV), the number of model subsets in which the variable occurs (nsubsets), and the residual sum of squares (RSS).

library(earth)
marsModel <- earth(ozone_reading ~ ., data=inputData) # build model
ev <- evimp(marsModel) # estimate variable importance
plot(ev)

 # ev
#                      nsubsets   gcv    rss
# Temperature_ElMonte        21 100.0  100.0
# Pressure_gradient          20  42.7   47.1
# pressure_height            18  30.7   36.3
# Month9                     17  26.8   32.9
# Month5                     16  22.6   29.3
# Month4                     15  20.6   27.4
# Month3                     14  18.7   25.6
# Visibility                 13  15.4   23.0
# Month6                     11  12.6   20.1
# Day_of_month7               9  10.9   17.9
# Month2                      9  10.5   17.7
# Temperature_Sandburg        9  10.5   17.7
# Day_of_month21              6   6.7   13.5
# Day_of_month23              4   3.7   10.4
# Wind_speed                  2   3.5    7.6
# Month11                     1   2.8    5.5
Figure: MARS variable selection plot

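The row names of the evimp() result hold the variables (including individual factor levels) in decreasing order of importance, so shortlisting is a one-liner:

# Extract variable names in importance order
importantVars <- rownames(ev)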

4. Stepwise Regression

If you have a large number of predictors (> 15), split inputData into chunks of 10 predictors, with each chunk also holding the response variable.

base.mod <- lm(ozone_reading ~ 1 , data= inputData) # base intercept only model
all.mod <- lm(ozone_reading ~ . , data= inputData) # full model with all predictors
stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 1, steps = 1000) # perform step-wise algorithm
shortlistedVars <- names(coef(stepMod)) # get the shortlisted variables
shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"] # remove the intercept

Because stepwise selection here runs on a linear regression model, the output may include individual levels of categorical variables, as seen in this case.

 # Selected Variables
# [1] "Temperature_Sandburg" "Month2"               "Month3"               "Month4"              
# [5] "Month5"               "Month6"               "Month7"               "Month8"              
# [9] "Month9"               "Month10"              "Month11"              "Month12"             
# [13] "Temperature_ElMonte"  "Humidity"             "Pressure_gradient"    "Visibility"          
# [17] "Wind_speed"           "pressure_height"

If you have a large number of predictor variables (100+), the above code may need to be placed in a loop that runs stepwise on sequential chunks of predictors, accumulating the shortlisted variables at the end of each iteration for further analysis; a sketch of such a loop follows below. This can be a very effective method if you want to (i) be highly selective about discarding valuable predictor variables, or (ii) build multiple models on the response variable.

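A hedged sketch of such a loop, not from the original article (the chunk size of 10 and the variable names are illustrative):

# Run stepwise selection on sequential chunks of ~10 predictors
predictors <- setdiff(names(inputData), response_name)
chunks <- split(predictors, ceiling(seq_along(predictors) / 10))
shortlisted <- character(0)
for (chunk in chunks) {
  chunkData <- inputData[, c(chunk, response_name)]
  base.mod <- lm(ozone_reading ~ 1, data=chunkData)
  all.mod <- lm(ozone_reading ~ ., data=chunkData)
  stepMod <- step(base.mod, scope=list(lower=base.mod, upper=all.mod), direction="both", trace=0, steps=1000)
  shortlisted <- unique(c(shortlisted, names(coef(stepMod))[-1])) # accumulate, dropping the intercept
}
shortlisted # combined shortlist across chunks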

5. Boruta

The ‘Boruta’ method can be used to decide if a variable is important or not.

library(Boruta)
# Decide if a variable is important or not using Boruta
boruta_output <- Boruta(ozone_reading ~ ., data=na.omit(inputData), doTrace=2) # perform Boruta search

# Confirmed 9 attributes: Humidity, Inversion_base_height, Inversion_temperature, Month, Pressure_gradient and 4 more.
# Rejected 3 attributes: Day_of_month, Day_of_week, Wind_speed.

boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables

# Confirmed variables
# [1] "Month"                 "ozone_reading"         "pressure_height"       "Humidity"             
# [5] "Temperature_Sandburg"  "Temperature_ElMonte"   "Inversion_base_height" "Pressure_gradient"    
# [9] "Inversion_temperature" "Visibility"

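To resolve any Tentative attributes and visualize the result, the Boruta package provides helpers; a small sketch:

# Force a decision on tentative attributes and plot per-attribute importance box plots
final_boruta <- TentativeRoughFix(boruta_output)
plot(final_boruta, cex.axis=0.7, las=2, xlab="", main="Variable importance")
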
6. Information value and Weight of evidence

The information value approach works for binary response variables, so it is illustrated here with the german_data credit data set that ships with the ‘woe’ package.

library(devtools)
install_github("tomasgreif/riv") # developer versions from GitHub
install_github("tomasgreif/woe")
library(woe)
library(riv)
iv_df <- iv.mult(german_data, y="gb", summary=TRUE, verbose=TRUE) # information value summary, one row per variable
iv <- iv.mult(german_data, y="gb", summary=FALSE, verbose=TRUE) # detailed WOE tables, needed for iv.replace.woe() below

# iv_df
#                     Variable InformationValue Bins ZeroBins    Strength
# 1                  ca_status      0.666011503    4        0 Very strong
# 2             credit_history      0.293233547    5        0      Strong
# 3                   duration      0.259146834    5        0      Strong
# 4              credit_amount      0.207970035    5        0      Strong
# 5                    savings      0.196009557    5        0     Average
# 6                    purpose      0.169195066   10        0     Average
# 7                        age      0.125210683    5        0     Average
# 8                   property      0.112638262    4        0     Average
# 9   present_employment_since      0.086433631    5        0        Weak
# 10                   housing      0.083293434    3        0        Weak
# 11         other_installment      0.057614542    3        0        Weak
# 12                status_sex      0.044670678    5        1        Weak
# 13            foreign_worker      0.043877412    2        0        Weak
# 14             other_debtors      0.032019322    3        0        Weak
# 15   installment_rate_income      0.023858552    2        0        Weak
# 16          existing_credits      0.010083557    2        0   Very weak
# 17                       job      0.008762766    4        0   Very weak
# 18                 telephone      0.006377605    2        0   Very weak
# 19 liable_maintenance_people      0.000000000    1        0   Very weak
# 20   present_residence_since      0.000000000    1        0   Very weak

Plot the information value summary

# Plot information value summary
iv.plot.summary(iv_df)

Figure: Variable selection using information value

Calculate weight of evidence variables

german_data_iv <- iv.replace.woe(german_data, iv, verbose=TRUE) # add WOE variables to the original data frame

The newly created WOE variables can alternatively be used in place of the original factor variables, as sketched below.

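For instance, a logistic regression could be fit on the WOE-recoded columns only. A hedged sketch (it assumes iv.replace.woe() appends the recoded columns with a "_woe" suffix and that gb is the binary response):

# Model on the WOE-recoded columns in place of the original factors
woe_cols <- grep("_woe$", names(german_data_iv), value=TRUE)
logitMod <- glm(gb ~ ., data=german_data_iv[, c(woe_cols, "gb")], family=binomial)
summary(logitMod)
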
Summary

These are strategies to filter down the independent (predictor) variables that best explain the dependent (response) variable, implemented in R with code you can run quickly. Plug these code templates in to apply the logic to your own data almost instantly.