Advanced Linear Regression: A Case Study

It is possible to build multiple regression models for just one set of response and predictor variables. When building models manually, it can be a herculean task to arrive at even one valid, statistically significant regression model, especially when you are new to the data or the problem. It is all the more frustrating to later discover multicollinearity, or that your model does not perform equally well when cross-validated on random samples, or that it has poor prediction accuracy on test data, or worse.

What if we had the flexibility to build all the statistically valid candidate models, see their prediction accuracies, cross-validate them on random samples, compare all the important diagnostic parameters in one place, and finally pick the one that best suits your case? The details that follow attempt to do exactly that.

Since R offers the muscle for both statistical analysis and algorithmic approaches, we combine the best of both worlds to find, build, analyse and choose the best linear regression models. The following framework attempts to strike a balance between a robust algorithmic approach and an investigative one.

At the end of this, you will have all the best models in a list, along with a host of diagnostic parameters, cross-validation results and accuracy measures, all in one final regression output.

If you need to learn the fundamentals, the basics of regression modelling will offer more details on this subject. Or, if you are relatively new to the concept, check out the numeric example for simple linear regression before going ahead with this.

Problem description

The ‘Ozone’ data from the mlbench package contains Los Angeles ozone pollution data collected in 1976. It is available as a data frame with 366 rows and 13 variables that may help explain the ozone reading in the region. It is now up to us to find which of the available variables would suit best to predict the ozone reading with maximum accuracy. The objective of this analysis is to accurately predict the ‘daily maximum one-hour-average ozone reading’ using linear regression models.
We ensure this by making an 80:20 split of the data into development and validation samples, and using only the 80% sample for building (training) the linear models.

The models thus built are used to predict the ozone reading on the 20% validation sample. With the actual ozone readings from the validation sample and the model predictions thus computed, we calculate the prediction accuracy and use it as one of the important parameters in deciding the optimal model for this problem.

Setting Up: Load the packages and prepare input data

Load the packages

### Initialize libraries------------------------------------
library(e1071) # for skewness
library(PerformanceAnalytics) # for correlogram (chart.Correlation)
library(Hmisc) # for missing value treatment (impute)
library(corrplot) # for correlogram (corrplot)
library(party) # selecting best variables (cforest)
library(Boruta) # deciding if a variable is important (Boruta)
library(caret) # for boxcox transformation (BoxCoxTrans)
library(car) # for vif
library(DMwR) # for knnImputation
library(mlbench) # for Ozone Data
library(DAAG) # for cross validation
library(relaimpo) # for finding relative importance of predictors in lm mod

Prepare input data

### Create Input Datasets ----------------------------------
data(Ozone, package="mlbench") # initialize the data
inputData <- Ozone # data from mlbench
# assign names
names(inputData) <- c("Month", "Day_of_month", "Day_of_week", "ozone_reading", "pressure_height", "Wind_speed", "Humidity", "Temperature_Sandburg", "Temperature_ElMonte", "Inversion_base_height", "Pressure_gradient", "Inversion_temperature", "Visibility")
# Segregate all continuous and categorical variables
# Place all continuous vars in inputData_cont
inputData_cont <- inputData[, c("pressure_height", "Wind_speed", "Humidity", "Temperature_Sandburg", "Temperature_ElMonte", "Inversion_base_height", "Pressure_gradient", "Inversion_temperature", "Visibility")]
# Place all categorical variables in inputData_cat
inputData_cat <- inputData[, c("Month", "Day_of_month", "Day_of_week")]
# create the response data frame
inputData_response <- data.frame(ozone_reading=inputData[, "ozone_reading"])
response_name <- "ozone_reading" # name of response variable
response <- inputData[, response_name] # response variable as a vector

Ozone data

Creating derived variables

Before moving on to exploratory analysis, it is a good idea to derive as many new variables from your predictors as you think would be relevant for your dependent variable. Why? Because we want to capture all the nuances that might be an explanatory cause for the dependent variable. Why create them so early? Because you can perform the analyses that follow on these variables as well.

Ideas for creating derived variables
Combination: The new variable is a combination (+/*) of a few of the original raw variables.
Aggregation: Aggregate the variables to some level. E.g.: sales in the last 12 months, average salary of a certain employment class, etc.
Transformation: Transform the variable into its log, cube, etc. More discussion on the Box-Cox transformation follows.
Dummy variables: Like seasonality, holiday indicators, month, week indicator, day of week, etc.
Indicators: Relevant economic indicators from publicly available (or purchasable) data sources.
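For instance, a hypothetical derived variable for the Ozone data (purely illustrative, and not used in the analysis below) could be the difference between the two temperature readings, capturing the temperature gradient between the stations:

# hypothetical derived variable: temperature difference between the two stations
inputData$Temp_diff <- inputData$Temperature_Sandburg - inputData$Temperature_ElMonte
inputData_cont$Temp_diff <- inputData$Temp_diff # also add it to the continuous set so it flows through the analysis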

Here is a quick challenge: can you come up with a new derived variable of your own for the ‘Ozone’ dataset? If you succeed, attach that variable to your input data and proceed with the analysis. It will be interesting to see, at the end of this analysis, whether you can build a better predicting model than what we have achieved here.

Exploratory analysis

### Exploratory analysis -----------------------------------------------
# Generate plots: Density, Scatter, Box plots
# Set up your working directory here, to a location where you want to store plots.
for (k in names(inputData_cont)) {
png(file=paste(k, "_dens_scatter_box", ".png", sep=""), width=900, height=550)
x <- as.numeric(inputData_cont[, k])
Skewness <- round(skewness(x, na.rm=T), 2) # calc skewness
dens <- density(x, na.rm=T) # density func
par(mfrow=c(1, 3)) # setup plot-area in 3 columns
# Density plot
plot(dens, type="l", col="red", ylab="Frequency", xlab = k, main = paste(k, ": Density Plot"), sub=paste("Skewness: ", Skewness))
polygon(dens, col="red")
# scatterplot
plot(x, response, col="blue", ylab="Response", xlab = k, main = paste(k, ": Scatterplot"), pch=20)
abline(lm(response ~ x), col="red") # add the least-squares line
# boxplot
boxplot(x, main=paste(k, ": Box Plot"), sub=paste("Outliers: ", paste(boxplot.stats(x)$out, collapse=" ")))
dev.off() # close the png device
}

The above code creates the density plot, scatter plot and box plot, all within one frame, for every variable in the input data and stores them in your working directory. It is preferable for the density plot to have a bell-shaped curve aligned to the centre of the chart area. The scatterplot reveals whether there is an underlying linear relationship between the predictor and the response, while the box plot reveals any outliers in the variable (in which case, dots will be present beyond the whiskers above and below the IQR box).

Inversion temperature: density, scatter and box plots

Outlier analysis

Method 1: If you choose to remove the outliers (not recommended)
x <- x[!x %in% boxplot.stats(x)$out] # NOT run!

Method 2: Replace outliers with NAs, to be filled in later during missing value treatment.
replace_outlier_with_missing <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...) # get 25th and 75th %iles
H <- 1.5 * IQR(x, na.rm = na.rm) # outlier limit threshold
y <- x
y[x < (qnt[1] - H)] <- NA # replace values below lower bound
y[x > (qnt[2] + H)] <- NA # replace values above upper bound
y # returns treated variable
}
inputData_cont <- as.data.frame(sapply(inputData_cont, replace_outlier_with_missing)) # sets outliers to NA; keep as data frame

Re-assemble inputdata with the outlier corrected part
inputData <- cbind(inputData_response, inputData_cont, inputData_cat) # column bind the response, continuous and categorical predictors

Missing value treatment

Approaches for handling missing values

Eliminating the variable: If a particular variable has more than 30% missing values, it is advisable to consider removing it altogether.
Eliminating the observations: If missing values are scattered throughout your dataset and you have a good number of observations in your sample, you may choose to eliminate the observations with missing values. If doing this loses more than 30% of the sample size, you probably need to look for a particular variable that is contributing the most missing values.
Imputation: Alternatively, you may impute the missing values with one of these placeholders.

Imputation approaches
Mean: Replace with mean
Median: Replace with median
Midrange: Replace with mid-range (max + min)/2
Mode: Most frequent value
Interpolation based: Interpolate based on a function for the variable.
Based on distribution: Identify the distribution of the variable, and estimate using the distribution parameters.
Predict the missing values: Treat the variable with missing values as the dependent variable in a regression model – see Method 2 below.
k-Nearest Neighbours: Find the observations whose characteristics are closest to the observation with the missing value, using the k-Nearest Neighbours algorithm – see Method 3 below.
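Whichever approach you choose, it helps to first quantify how much is missing and where. A quick check (illustrative, not part of the main flow):

sort(colSums(is.na(inputData)), decreasing=TRUE) # missing value count per column
1 - nrow(na.omit(inputData))/nrow(inputData) # fraction of rows lost if all incomplete observations are dropped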

Apply the missing value treatment

Method 1: Replace missing values with mean (NOT Run)
inputdata_cont_matrix <- sapply(inputData_cont, FUN=function(x){impute(x, mean)}) # impute missing values with the mean using 'impute' func from 'Hmisc' pkg
inputData_cont <- as.data.frame(inputdata_cont_matrix) # store as dataframe

Method 2: Predict the missing values by modelling the variable as the dependent variable in a regression-based approach
Replace all missing values with the mean, except for the variable being treated. Then build a model with that variable as the response, while the remaining variables are used as predictors. This method is discussed in Julian J. Faraway's book, p. 158.
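A minimal sketch of this idea, assuming ‘Humidity’ is the variable being treated and the other continuous predictors have already been mean-filled as described above (hypothetical, not run in this analysis):

miss_rows <- is.na(inputData_cont$Humidity) # rows where Humidity is missing
imp_mod <- lm(Humidity ~ ., data=inputData_cont[!miss_rows, ]) # model Humidity on the remaining predictors
inputData_cont$Humidity[miss_rows] <- predict(imp_mod, newdata=inputData_cont[miss_rows, ]) # fill with predictions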

Method 3: k-Nearest neighbor imputation method – Applied!
# Impute all except the response variable.
if (anyNA(inputData)) {
inputData[, !names(inputData) %in% response_name] <- knnImputation(inputData[, !names(inputData) %in% response_name]) # missing value treatment
}

Correlation analysis and Summary

The correlogram is a graphical representation of the correlation amongst all combinations of response and predictor variables. The investigator may choose to pick variables that have a high correlation with the response, based on a pre-determined cutoff. But we do not do that filtering here, since a variable with relatively lower correlation can still prove significant and valuable in regression models by capturing unexplained variation, even if it is of ‘lower’ importance.
par (mfrow=c(1,1))
inputData_response_cont_noNA <- na.omit(cbind(inputData_response, inputData_cont)) # remove rows with missing values
chart.Correlation(inputData_response_cont_noNA, histogram=TRUE, pch=19) # plot correlogram
corrplot(cor(inputData_response_cont_noNA), method="circle", is.corr=TRUE)
# Get and store the summary files
sapply(inputData, summary) # get summary statistics for all columns

Correlogram – blue circles show positive correlation, red circles negative. The bigger the circle, the higher the correlation.


Correlogram with density, scatter plot and significant correlation values.

Box Cox Transformation for Continuous variables

Upon performing outlier treatment, you can expect the skewness of a variable to improve. But not always. Sometimes a Box-Cox transformation of the response variable is preferable, to reduce the skewness and make it closer to a normal distribution. While the variable gets transformed, the ranking order of its values remains the same after the transformation.
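For reference, the Box-Cox family of transformations for a positive-valued variable y is:

y(λ) = (y^λ − 1) / λ, for λ ≠ 0
y(λ) = log(y), for λ = 0

where the λ that makes the transformed variable closest to normal is estimated from the data; this is what BoxCoxTrans from the caret package does in the code further below.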

Why does a highly skewed non-normal response variable matter?

Because, for the long-tailed region of the response, there is a larger variance. This also means more predictors are expected to show a stronger effect when explaining the larger-variance part of the response variable, thereby creating a bias. This can affect all the estimates, resulting in spurious models. It can also be argued that high skewness is a consequence of heteroscedasticity (non-constant variance).

Therefore, if the skewness does not improve after treating the outliers, consider performing a box-cox transformation especially when k-fold cross-validation reveals inconsistent results. Note that, it is not mandatory to apply box-cox transformation on all variables that are skewed. It is up to the best judgement of the investigator to decide if box-cox transformation is indeed needed on a variable.

Nevertheless, just so you know how a Box-Cox transform is done, below is the procedure. As a default, a Box-Cox transformation of the response variable is sufficient, but here I have chosen to apply it to variables that have an absolute skewness > 1 even after outlier corrections are applied.

skewness(inputData_response$ozone_reading, na.rm=T) # 0.878864
# Transforming the predictor variables
boxcoxTransformedVars <- character(0) # variables that underwent box-cox transformation collect here
for (colname in colnames(inputData_cont)) {
x <- inputData_cont[, colname]
if (abs(skewness(x, na.rm=T)) > 1) { # check for high skewness
boxcoxMod <- BoxCoxTrans(x) # estimates the lambda value and stores it in a model
boxcoxTransformedVars <- c(boxcoxTransformedVars, colname)
inputData_cont[, colname] <- predict(boxcoxMod, x) # compute the transformed variable
}
}

Checking for significance: Continuous Variables

For each continuous predictor variable, we build a simple regression model against the response. If the p-value of the predictor is less than a predetermined significance level (0.1 here), we choose not to reject the variable and therefore store it in ‘significantPreds_cont’. This step is done purely to qualify meaningful variables that may later help us build better models.

# Selecting significant variables (continuous) - checking p-values
significantPreds_cont <- character() # initialise output. Significant preds will accrue here
for (predName in names(inputData_cont)) {
pred <- inputData[, predName]
mod <- lm(response ~ pred) # build linear model with only the current predictor
p_value <- summary(mod)$coefficients[2, 4] # capture the predictor's p-value
if (p_value < 0.1 & p_value > 0) { # check for significance
significantPreds_cont <- c(significantPreds_cont, predName) # if selected, add the predictor name to the main output
}
}
inputData_cont_signif <- inputData[, names(inputData) %in% significantPreds_cont] # filter selected predictors

Checking for significance: Categorical Variables

Since the dependent variable here can be considered an ordinal count variable, we can use the chi-squared test to determine whether a categorical predictor is significant in explaining it. Had our dependent variable been a continuous variable, you would use the same method we just saw for continuous predictors (i.e. fit a simple regression model with the categorical variable of interest as the predictor and check the p-value).
So, after we perform the chi-squared test (see code below), if the p-value turns out to be less than the predetermined significance level, we qualify the categorical variable and store it in ‘significantPreds_cat’ for the further steps.
# Selecting significant variables (categorical) - chi-squared test
significantPreds_cat <- character() # initialise output. Significant preds will accrue here
for (predName in names(inputData_cat)) {
pred <- inputData[, predName]
chi_sq <- invisible(suppressWarnings(chisq.test(table(response, pred)))) # perform chi-squared test
p_value <- chi_sq$p.value # capture p-value
if (p_value < 0.1 & p_value > 0) { # check for significance
significantPreds_cat <- c(significantPreds_cat, predName)
}
}
inputData_cat_signif <- as.data.frame(inputData[, names(inputData) %in% significantPreds_cat]) # filter selected predictors
colnames(inputData_cat_signif) <- significantPreds_cat # assign column names

Put together the input data with only significant variables
# combine the response, continuous significant predictors, categorical significant predictors
inputData_signif <- cbind(response, inputData_cont_signif, inputData_cat_signif)

Feature selection

Part 1: Boruta: Decide if a variable is important or not

# Decide if a variable is important or not using Boruta
boruta_output <- Boruta(response ~ ., data=na.omit(inputData_signif), doTrace=2) # perform Boruta search
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables

Part 2: Estimate important variables based on mean drop in accuracy – using cforest

cforest is one of many techniques we can use to find the most important variables. You might want to try other techniques, especially when your data size becomes larger. For smaller data, cforest is convenient, especially because you can pass categorical variables to it as well.
# Selecting best variables (variable selection using cforest)
cf1 <- cforest(response ~ . , data= na.omit(inputData_signif[, names(inputData_signif) %in% c(boruta_signif, "response")]), control=cforest_unbiased(mtry=2,ntree=50)) # fit the random forest
impVars <- sort(varimp(cf1), decreasing=T) # get variable importance
# Pick the top 7. Adjust this to your needs; more variables take more computational time.
impVars <- if (length(impVars) > 7) {
impVars[1:7] # keep the 7 most important
} else {
impVars # keep all
}

Part 3: Removing multicollinearity

Remove multicollinearity by iteratively eliminating the variable with the maximum VIF until no VIF exceeds 4. Since each level of a categorical variable is treated as a separate variable when used in a regression model, we do not include categoricals in the VIF filtering.
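For reference, the VIF of the j-th predictor is computed from the R-squared obtained by regressing that predictor on all the other predictors:

VIF_j = 1 / (1 − R²_j)

So the threshold of 4 used below corresponds to R²_j = 0.75, i.e. 75% of a predictor's variance being explained by the other predictors.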
vars_considered_for_VIF_Filters <- names(impVars)[!names(impVars) %in% names(inputData_cat_signif)] # exclude categoricals
fullForm <- (paste ("response ~ ", paste (vars_considered_for_VIF_Filters, collapse=" + "), sep="")) # model formula
fullMod <- lm(as.formula(fullForm), data=inputData_signif) # build linear model
all_vifs <- try(vif(fullMod), silent=TRUE) # get all vifs
# if vif of any of the variables in model is greater than threshold (4), drop the max VIF variable and build model again, until all variables have VIF less than threshold.
if (class(all_vifs) != "try-error") {
all_vifs <- as.data.frame(all_vifs) # cast as a dataframe
# while vif() did not error out, there are more than 2 rows and the max VIF exceeds 4
while (class(all_vifs)[1] != "try-error" && nrow(all_vifs) > 2 && max(all_vifs[, 1]) > 4) {
remove_var <- rownames(all_vifs)[which(all_vifs[, 1] == max(all_vifs[, 1]))] # find the max VIF variable
impVars <- impVars[!names(impVars) %in% remove_var] # remove it
fullForm <- paste("response ~ ", paste(names(impVars), collapse=" + "), sep="") # new formula
fullMod <- lm(as.formula(fullForm), data=inputData_signif) # re-build model with new formula
all_vifs <- try(as.data.frame(vif(fullMod)), silent=TRUE) # get all vifs again
}
}
vif_filtered_variables <- names(fullMod$model)[!names(fullMod$model) %in% "response"]

Now, the following variables have been filtered as the most suitable for explaining the ozone reading. By adjusting various settings such as the VIF level, the p-value significance level, or including/dropping certain steps, you may end up with a different set of variables based on the business need. Despite the variable selection procedures, if business or real-world logic suggests that a variable the algorithm did not pick is influential, you should probably try to include it when building the models in the next step. So here are our current best ones:

  • Temperature_Sandburg
  • pressure_height
  • Inversion_base_height
  • Humidity
  • Month (tentative)

Create all combinations of selected variables that will go into models as predictors

Let's build all combinations of the selected predictor variables, for building models of all sizes. If you wish to limit the maximum size of a model, set it in the max_model_size variable.
# create all combinations of predictors
max_model_size <- length(vif_filtered_variables)
combos_matrix <- matrix(ncol=max_model_size) # initialise final output
for (n in 1:max_model_size) {
combs_mat <- t(combn(vif_filtered_variables, n)) # all combinations of n variables
nrow_na <- nrow(combs_mat) # rows of NA padding needed
ncol_na <- max_model_size - ncol(combs_mat) # columns of NA padding needed
na_mat <- matrix(rep(NA, nrow_na*ncol_na), nrow=nrow_na, ncol=ncol_na)
out <- cbind(combs_mat, na_mat) # pad so every row has max_model_size columns
combos_matrix <- rbind(combos_matrix, out)
}
combos_matrix <- combos_matrix[-1, ] # remove the first row, which is all NA

Create Development and validation samples

## Split training and test data (development and validation)
trainingIndex <- sample(1:nrow(inputData_signif), 0.8*nrow(inputData_signif)) # random sample rows for training data
inputData_signif_training <- na.omit(inputData_signif[trainingIndex, ]) # remove NA's in response variable
inputData_signif_test <- na.omit(inputData_signif[-trainingIndex, ]) # omitting NA's, since we need to calculate prediction accuracy
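Because the split is random, the exact numbers you get will vary from run to run. If you want reproducible samples, set a seed before sampling; the seed value below is just an example:

set.seed(100) # any fixed value works; chosen arbitrarily for reproducibility
trainingIndex <- sample(1:nrow(inputData_signif), 0.8*nrow(inputData_signif))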

Building all the linear models and diagnostics

Now that we have all the possible predictor combinations, we can expect to generate several good models from this mix. In the code that follows, you first build the model and capture its summary in a variable (currSumm). Then the diagnostic parameters such as R-squared, p-value, AIC, etc. are captured.
After that, the influential observations and accuracy measures are calculated on the training and test datasets. Finally, we perform k-fold cross-validation and capture the mean squared error for the model. We calculate all these metrics for every predictor combination in combos_matrix; they accrue in the Final_Output dataframe, which is exported to the Final_Regression_Output.csv file. Apart from this, the model summary, VIF and predictions from each of the models are also printed to the console.
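For reference, the two accuracy measures computed below are, for actuals a and predictions p:

MAPE = mean(|a − p| / |a|) × 100
Min-Max Accuracy = mean(min(a, p) / max(a, p))

MAPE penalises large relative errors, while min-max accuracy approaches 100% as the predictions approach the actuals.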

Final_Output <- data.frame() # initialise the final output dataframe
for (rownum in 1:nrow(combos_matrix)){
## Build model for current formula from combos_matrix-
preds <- na.omit(combos_matrix[rownum, ]) # get the predictor names
form <- paste ("response ~ ", paste (preds, collapse=" + "), sep="") # model formula
currMod <- lm(as.formula(form), data=inputData_signif_training) # build the linear model
currSumm <- summary(currMod) # model summary
## Diagnostic parameters-
f_statistic <- currSumm$fstatistic[1] # fstatistic
f <- currSumm$fstatistic # parameters for model p-value calc
model_p <- pf(f[1], f[2], f[3], lower=FALSE) # model p-Value
r_sq <- currSumm$r.squared # R-Squared
adj_r_sq <- currSumm$adj.r.squared # adj R-Squared
aic <- AIC(currMod) # AIC
bic <- BIC(currMod) # BIC
## Get Influential Observations-
cutoff <- 4/(nrow(inputData_signif_training)-length(currMod$coefficients)-2) # cooks.distance cutoff
influential_obs_rownums <- paste(which(cooks.distance(currMod) > cutoff), collapse=" ") # capture influential observations
## Calculate accuracy measures on Development sample-
training_mape <- mean((abs(currMod$residuals))/abs(na.omit(inputData_signif_training$response))) * 100 # MAPE
actual_fitted <- data.frame(actuals=inputData_signif_training$response, fitted=currMod$fitted)
min_vals <- apply(actual_fitted, 1, min)
max_vals <- apply(actual_fitted, 1, max)
training_min_max <- min_vals/max_vals
training_min_max_accuracy <- mean(training_min_max)
## Calculate accuracy measures on Validation sample-
predicteds <- predict(currMod, inputData_signif_test) # predict for test data
predicted_mape <- mean((abs(inputData_signif_test$response - predicteds)/abs(inputData_signif_test$response))) * 100 # calculate predicted mape
actual_predicteds <- data.frame(actuals=inputData_signif_test$response, predicted=predicteds)
min_vals <- apply(actual_predicteds, 1, min)
max_vals <- apply(actual_predicteds, 1, max)
predicted_min_max <- min_vals/max_vals
predicted_min_max_accuracy <- mean(min_vals/max_vals) # min-max accuracy on test data
## Perform k-Fold cross-validation-
par(mar=c(2,0,4,0)) # set margins
cvResults <- suppressWarnings(CVlm(df=na.omit(inputData_signif), form.lm=currMod, m=5, dots=FALSE, seed=29, legend.pos="topleft", printit=FALSE));
mean_squared_error <- attr(cvResults, 'ms')
## Collect all stats for Final Output-
currOutput <- data.frame(formula=form, R_Sq=r_sq, Adj_RSq=adj_r_sq, AIC=aic, BIC=bic, Model.pValue= model_p, F.Statistic=f[1], training.mape=training_mape, predicted.mape=predicted_mape, training.minmax.accuracy = training_min_max_accuracy, predicted.minmax.accuracy=predicted_min_max_accuracy, influential.obs = influential_obs_rownums, k_fold_ms=mean_squared_error)
# Final Output
Final_Output <- rbind(Final_Output, currOutput)
## Print outputs (call sink("All_Models_In_Combos.txt") beforehand if you want these accumulated in a file)
names(actual_fitted) <- c("actuals", "predicted")
actuals_predicted_all <- rbind(actual_fitted, actual_predicteds)
print (currSumm)
print (vif(currMod))
print (actuals_predicteds_predictors <- cbind(actuals_predicted_all, na.omit(inputData_signif)[, preds]))
}
write.csv(Final_Output, "Final_Regression_Output.csv", row.names=F) # Export once all models are done

Final_Regression_Output.csv is the final output file containing diagnostic parameters for all models. In the table below, you will find a sample row of Final_Regression_Output for one of the models we just built using the above logic.

Sample Row in Final_Regression_Output.csv
Parameter Value
formula: response ~ Temperature_Sandburg + Inversion_base_height + Humidity
r.squared: 65.19%
adj.r.squared: 64.83%
AIC: 1728.90
BIC: 1747.23
Model.pValue: 5.27E-65
F.Statistic: 177.95
training.mape: 49.91%
predicted.mape: 45.71%
training.minmax.accuracy: 68.36%
predicted.minmax.accuracy: 73.14%
influential.obs (row nums): 8 20 78 90 104 105 127 166 182 195 221 240 243
k_fold_ms (cross-validation mean squared error): 21.37

Select the best model and make final prediction

Once we have the models, diagnostics and performance accuracies, it is up to the investigator to pick the best model that suits the specific business problem. A general convention to go by when picking good regression model is as follows:

Model selection guidelines
AIC / BIC: As low as possible
MAPE: As low as possible
Min-Max Accuracy: As high as possible, closer to 100%
k-Fold cross-validation mean squared error: The lower the better; also check the consistency of the folds in the cv-plot.
R-Sq and Adj R-Sq: The higher the better.

All things being equal, if you have two models with approximately equal predictive accuracy, always favour the simpler one.
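As a sketch of how you might shortlist candidates from Final_Output (using the column names created above), you could sort on the validation metrics:

# top 10 models by validation min-max accuracy, ties broken by lower cross-validation MSE
head(Final_Output[order(-Final_Output$predicted.minmax.accuracy, Final_Output$k_fold_ms), ], 10)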

Based on the results in Final_Output, we can infer that the model in row number 13 has fared well on almost every parameter, especially with respect to AIC, predicted min-max accuracy and mean squared error in k-fold cross-validation. So we choose it as the optimal model and print the outputs.

How to interpret the cross validation chart?

Since we are performing a 5-fold cross-validation (m=5 in the CVlm function), you will notice 5 colours and symbols in the chart, one for each fold. You will also see 5 dashed lines, which are the lines of best fit for the respective folds. What happens during cross-validation is that your data is randomly split into m (=5) folds. Holding one of the 5 parts out as test data, a regression model is built on the remaining 4 parts as training data. The model is then used to predict on the held-out fold. This is done for each of the 5 parts, and the actuals (bigger symbols in the chart) and predictions (smaller symbols close to the lines) are plotted on the same chart along with the lines of best fit.
In a really good model, all these lines will be parallel to each other and the bigger symbols will lie close to the lines, indicating high prediction accuracy. If either of these conditions is not satisfied, you can imagine the implications of using that model.

Cross validation chart

Now let's pick the best model in row 13 from Final_Regression_Output.csv and generate the predictions and stats.

rownum <- 13 # row number of the best model from Final_Regression_Output.csv
preds <- na.omit(combos_matrix[rownum, ]) # get the predictor names
form <- paste("response ~ ", paste(preds, collapse=" + "), sep="") # model formula
currMod <- lm(as.formula(form), data=inputData_signif_training) # build linear model
currSumm <- summary(currMod) # model summary
vif(currMod) # variance inflation factor
# Calculate relative importance
rel_importance <- try(calc.relimp(currMod, rela = T), silent=TRUE)
if (class(rel_importance) != "try-error") print(rel_importance$lmg) # print relative importance
predicted <- predict(currMod, inputData_signif_test) # predict on test data
actuals <- inputData_signif_test$response # actuals on test data
actuals_predicted_test <- data.frame(actuals=actuals, predicted=predicted)
actuals_predicted_test$min_max_accuracy <- apply(actuals_predicted_test, 1, min)/apply(actuals_predicted_test, 1, max) # calc min_max accuracy
print (actuals_predicted_test)
influence.measures(currMod) # get the influential obs

# Call:
#   lm(formula = as.formula(form), data = inputData_signif_training)
# Residuals:
#   Min     1Q Median     3Q    Max 
# -13.66  -3.15  -0.24   2.88  14.39 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)           -1.08e+01   1.78e+00   -6.08  3.8e-09 ***
#   Temperature_Sandburg   3.30e-01   2.31e-02   14.26  <2e-16 ***
#   Inversion_base_height -8.85e-04   1.85e-04   -4.78  2.9e-06 ***
#   Humidity               7.46e-02   1.55e-02    4.82  2.4e-06 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 4.77 on 285 degrees of freedom
# Multiple R-squared:  0.652,  Adjusted R-squared:  0.648 
# F-statistic:  178 on 3 and 285 DF,  p-value: < 2e-16
# Temperature_Sandburg Inversion_base_height              Humidity 
#                 1.47                  1.38                  1.15 

Get influential observations

# Get Influential Observations
cutoff <- 4/(nrow(inputData_signif_training)-length(currMod$coefficients)-2)
plot (currMod, which=4, cook.levels=cutoff)
abline (h=cutoff, lty=2, col="red")
influencePlot (currMod) # Influence plot for best model fits

Cook's distance – influential observations plot
