Robust Regression

"Robust regression can be used in any situation in which OLS regression can be applied. It yields better accuracies over OLS and is particularly resourceful when there are no compelling reasons to exclude outliers in your data."

Robust regression can be implemented using the rlm() function in the MASS package. Outliers can be down-weighted using the Huber, bisquare, or Hampel weighting methods.

How To Use Robust Regression

library(MASS)
olsModel <- lm(response ~ pred1 + pred2, data = inputData)  # OLS regression
robModel <- rlm(response ~ pred1 + pred2, psi = psi.huber, data = inputData)  # robust regression
summary(robModel)  # model summary
predict(robModel, testData)  # apply model to predict response on test data

The psi argument in rlm(), which determines the method used to re-weight outliers, can take the values psi.huber, psi.hampel, and psi.bisquare.
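For instance, a minimal sketch fitting the same hypothetical model as above with each weight function (per the MASS documentation, tuning constants such as psi.huber's k can be passed through rlm()'s ... argument):

library(MASS)
hubModel  <- rlm(response ~ pred1 + pred2, data = inputData, psi = psi.huber)     # Huber (the default)
hampModel <- rlm(response ~ pred1 + pred2, data = inputData, psi = psi.hampel)    # Hampel
bisqModel <- rlm(response ~ pred1 + pred2, data = inputData, psi = psi.bisquare)  # Tukey bisquare
hubModel2 <- rlm(response ~ pred1 + pred2, data = inputData, psi = psi.huber, k = 2)  # larger k = gentler down-weighting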

View Outlier Observations

cDist <- cooks.distance(olsModel)  # Cook's distance for each observation
n <- nrow(inputData)  # number of rows
resids <- stdres(olsModel)  # standardized residuals (from MASS)
allObs <- cbind(inputData, cDist, resids)
allObs[cDist > 4/n, ]  # show outlier observations

Example Problem

Let's see how robust regression fares against an equivalent lm() model with respect to prediction accuracy. For this analysis we will use the Boston dataset built into the MASS package, predicting the median value of owner-occupied homes (medv). We will train the models on 80% of the data, picked randomly, and then test how well they predict on the remaining 20%. We will build both the regular OLS model (lm()) and the corresponding robust regression model (rlm()).

Step 1: Split the dataset into training and test sets.
For this exercise, we will create the training and test datasets in an 80:20 ratio.
library(MASS)
inputData <- Boston  # plug in the input data
num_rows <- nrow(inputData)  # number of rows
set.seed(100)  # set seed to reproduce results
training_rows <- sample(1:num_rows, 0.8 * num_rows)  # row indices of training data
training_data <- inputData[training_rows, ]  # make training data (404 rows)
test_data <- inputData[-training_rows, ]  # make test data (102 rows)

Step 2: Build the equivalent lm() and rlm() models on the training data.
lmMod <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat, data = training_data)  # make lm model
robMod <- rlm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat, data = training_data)  # make robust model

For reference, a summary() of this OLS formula fitted on the full Boston data looks like this:

Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + ptratio + 
    black + lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.803  -2.832  -0.625   1.454  27.766 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29.507997   4.872538   6.056 2.76e-09 ***
crim         -0.061174   0.030377  -2.014 0.044567 *  
zn            0.042032   0.013422   3.131 0.001842 ** 
chas          3.029924   0.868349   3.489 0.000527 ***
nox         -16.088513   3.232702  -4.977 8.93e-07 ***
rm            4.149667   0.407685  10.179  < 2e-16 ***
dis          -1.431665   0.188603  -7.591 1.59e-13 ***
ptratio      -0.838640   0.117342  -7.147 3.19e-12 ***
black         0.008292   0.002688   3.084 0.002153 ** 
lstat        -0.525004   0.048351 -10.858  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.833 on 496 degrees of freedom
Multiple R-squared:  0.7288,	Adjusted R-squared:  0.7239 
F-statistic: 148.1 on 9 and 496 DF,  p-value: < 2.2e-16

Step 3: Use the fitted models to predict on the test data.
lm_Predicted <- predict(lmMod, test_data)
rob_Predicted <- predict(robMod, test_data)

# Calculate prediction accuracy: mean ratio of min(predicted, actual) to max(predicted, actual) per row
lm_actuals_pred <- cbind(lm_Predicted, test_data$medv)
rob_actuals_pred <- cbind(rob_Predicted, test_data$medv)
mean(apply(lm_actuals_pred, 1, min) / apply(lm_actuals_pred, 1, max))  # 85.48%
mean(apply(rob_actuals_pred, 1, min) / apply(rob_actuals_pred, 1, max))  # 85.98%
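The ratio above is the so-called min-max accuracy, where values closer to 1 (100%) mean the predictions track the actuals closely. As a cross-check, you could also compare the two models on mean absolute percentage error (MAPE); a minimal sketch (the mape helper below is our own, not part of MASS):

mape <- function(actual, predicted) mean(abs((actual - predicted) / actual))  # lower is better
mape(test_data$medv, lm_Predicted)   # MAPE of the OLS model
mape(test_data$medv, rob_Predicted)  # MAPE of the robust model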

As you can see, the rlm() model improves accuracy only marginally here, which could be because Boston is a fairly clean, well-studied dataset. Real-world data often contains many more outliers, in which case the improvement in prediction accuracy from rlm() can be more pronounced.
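To see this effect for yourself, you can artificially corrupt a few training responses and refit both models. The sketch below (the 20-row sample and the 5x multiplier are arbitrary choices for illustration) should typically show the lm() fit degrading far more than the rlm() fit:

set.seed(101)
corrupted <- training_data  # copy of the training data to corrupt
out_rows <- sample(nrow(corrupted), 20)  # pick 20 rows at random
corrupted$medv[out_rows] <- corrupted$medv[out_rows] * 5  # inflate their responses to create gross outliers
lmBad  <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat, data = corrupted)
robBad <- rlm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat, data = corrupted, maxit = 50)  # allow extra IWLS iterations
mean(abs(test_data$medv - predict(lmBad, test_data)))   # mean absolute error, OLS fit on corrupted data
mean(abs(test_data$medv - predict(robBad, test_data)))  # mean absolute error, robust fit on corrupted data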

Optional diagnostics: Get the outlier observations

Cook's distance is computed for every data point to reveal the outliers. Here, an observation whose Cook's distance exceeds 4 divided by the number of observations is treated as an outlier.
cDist <- cooks.distance(lmMod)  # Cook's distance for each training observation
resids <- stdres(lmMod)  # standardized residuals (from MASS)
allObs <- cbind(training_data, cDist, resids)
allObs[cDist > 4/nrow(training_data), ]  # show outlier observations
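rlm() also provides a diagnostic of its own: the final IWLS weights are stored in the fitted object's w component, and values well below 1 mark the observations the robust fit down-weighted. A quick look, using the robMod fit from Step 2:

wts <- robMod$w  # final weights from the robust fit (1 = full weight)
head(sort(wts), 10)  # the ten most heavily down-weighted observations
training_data[wts < 0.5, ]  # rows whose influence was substantially reduced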
