# Robust Regression

`"Robust regression can be used in any situation where OLS regression can be applied. It typically yields better estimates than OLS when the data contain outliers, and it is particularly useful when there is no compelling reason to exclude those outliers from your data."`

Robust regression can be implemented using the rlm() function in the MASS package. Outliers can be down-weighted using the 'huber', 'bisquare', or 'hampel' weighting methods.

## How To Use Robust Regression

```
library(MASS)

olsModel <- lm(response ~ pred1 + pred2, data = inputData)                    # OLS regression
robModel <- rlm(response ~ pred1 + pred2, psi = psi.huber, data = inputData)  # robust regression
summary(robModel)             # model summary
predict(robModel, testData)   # apply model to predict response on test data
```

The psi argument of rlm(), which determines the function used to re-weight outliers, can take the values psi.huber, psi.hampel, or psi.bisquare.
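To see how the choice of weighting function affects the fit, the same model can be estimated with each of the three psi options and the coefficients compared side by side. This sketch uses the built-in stackloss dataset purely for illustration; any data frame with a numeric response would do:

```r
library(MASS)  # provides rlm() and the psi.* weighting functions

# Fit the same model with each of the three weighting functions
fit_huber    <- rlm(stack.loss ~ ., data = stackloss, psi = psi.huber)
fit_hampel   <- rlm(stack.loss ~ ., data = stackloss, psi = psi.hampel)
fit_bisquare <- rlm(stack.loss ~ ., data = stackloss, psi = psi.bisquare)

# Compare the coefficient estimates side by side
round(cbind(huber    = coef(fit_huber),
            hampel   = coef(fit_hampel),
            bisquare = coef(fit_bisquare)), 4)
```

The coefficients usually differ only slightly on clean data; larger differences between the columns are a hint that influential outliers are present.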

## View Outlier Observations

```
cDist <- cooks.distance(olsModel)   # Cook's distance for each observation
n <- nrow(inputData)                # number of rows
resids <- stdres(olsModel)          # standardized residuals (stdres is in MASS)
allObs <- cbind(inputData, cDist, resids)
allObs[cDist > 4/n, ]               # show outlier observations
```

## Example Problem

Let's see how robust regression fares against an equivalent lm() model with respect to prediction accuracy. For this analysis, we will use the built-in Boston dataset from the MASS package. We will predict the median value of owner-occupied homes (medv) by training the model on a randomly picked 80% of the data, then test how well the model predicts on the remaining 20%. We will build both a regular OLS model (lm()) and the corresponding robust regression model (rlm()).

Step 1: Split the dataset for training and testing.
For this exercise, we will create the training and test datasets in an 80:20 ratio.
```
library(MASS)
inputData <- Boston          # plug in the input data
num_rows <- nrow(inputData)  # number of rows
set.seed(100)                # set seed to reproduce the split
training_rows <- sample(1:num_rows, 0.8 * num_rows)  # row indices of training data
training_data <- inputData[training_rows, ]   # training data (404 rows)
test_data <- inputData[-training_rows, ]      # test data (102 rows)
```

Step 2: Build the equivalent lm() and rlm() models.
```
lmMod <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat,
            data = training_data)   # make lm model
robMod <- rlm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat,
              data = training_data) # make robust model
summary(lmMod)
```

```
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + ptratio +
black + lstat, data = Boston)

Residuals:
Min      1Q  Median      3Q     Max
-15.803  -2.832  -0.625   1.454  27.766

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.507997   4.872538   6.056 2.76e-09 ***
crim         -0.061174   0.030377  -2.014 0.044567 *
zn            0.042032   0.013422   3.131 0.001842 **
chas          3.029924   0.868349   3.489 0.000527 ***
nox         -16.088513   3.232702  -4.977 8.93e-07 ***
rm            4.149667   0.407685  10.179  < 2e-16 ***
dis          -1.431665   0.188603  -7.591 1.59e-13 ***
ptratio      -0.838640   0.117342  -7.147 3.19e-12 ***
black         0.008292   0.002688   3.084 0.002153 **
lstat        -0.525004   0.048351 -10.858  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.833 on 496 degrees of freedom
Multiple R-squared:  0.7288,	Adjusted R-squared:  0.7239
F-statistic: 148.1 on 9 and 496 DF,  p-value: < 2.2e-16
```
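The summary of an rlm() fit looks much like the one above, so it helps to inspect how the robust model treated individual observations. An rlm object stores its final iteratively-reweighted-least-squares weights in the `w` component; a minimal sketch (the object name robFit is arbitrary, and the model is refit on the full Boston data just for illustration):

```r
library(MASS)

# Refit the robust model with the same formula as above
robFit <- rlm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat,
              data = Boston)

# Final IRLS weights, one per observation: values near 1 mean the point was
# treated much as OLS would treat it; small values mark the observations
# that rlm() down-weighted as outliers.
wts <- robFit$w
head(sort(wts), 10)   # the 10 most heavily down-weighted observations
sum(wts < 0.5)        # observations that lost more than half their weight
```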

Step 3: Use the built models to predict on test data.
```
lm_Predicted <- predict(lmMod, test_data)    # predict with the OLS model
rob_Predicted <- predict(robMod, test_data)  # predict with the robust model
```

```
# Calculate accuracy (mean min/max ratio of actuals vs predicted)
lm_actuals_pred <- cbind(lm_Predicted, test_data$medv)
rob_actuals_pred <- cbind(rob_Predicted, test_data$medv)
mean(apply(lm_actuals_pred, 1, min) / apply(lm_actuals_pred, 1, max))    # 85.48%
mean(apply(rob_actuals_pred, 1, min) / apply(rob_actuals_pred, 1, max))  # 85.98%
```
As you can see, the rlm() model improves accuracy only marginally, which could be because Boston is a fairly clean, 'standard' dataset. In real-world scenarios you can expect the data to contain many more outliers, in which case the improvement in prediction accuracy from rlm() can be more pronounced.
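The min/max ratio above is one way to score predictions; mean absolute percentage error (MAPE) is another common check. The sketch below recreates the 80:20 split and both models from the earlier steps (the names lmFit, robFit, and the helper mape are illustrative, not part of any package):

```r
library(MASS)

# Recreate the 80:20 split from Step 1
inputData <- Boston
set.seed(100)
training_rows <- sample(1:nrow(inputData), 0.8 * nrow(inputData))
training_data <- inputData[training_rows, ]
test_data     <- inputData[-training_rows, ]

# Rebuild the two models from Step 2
form   <- medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat
lmFit  <- lm(form, data = training_data)
robFit <- rlm(form, data = training_data)

# Mean absolute percentage error: lower is better
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual))
mape(test_data$medv, predict(lmFit, test_data))
mape(test_data$medv, predict(robFit, test_data))
```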

### Optional diagnostics: Get the outlier observations

Cook's distance is computed for every observation used to fit the model, to reveal the outliers. Observations with a Cook's distance greater than 4/n, where n is the number of fitted observations, are flagged as outliers.
```
cDist <- cooks.distance(lmMod)     # Cook's distance for each fitted observation
resids <- stdres(lmMod)            # standardized residuals (stdres is in MASS)
allObs <- cbind(model.frame(lmMod), cDist, resids)  # attach to the data the model saw
allObs[cDist > 4/length(cDist), ]  # show outlier observations
```
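Cook's distances are often easiest to scan visually. A minimal sketch that plots them against the 4/n cutoff, refitting the model on the full Boston data so it runs standalone (the name lmFit is arbitrary):

```r
library(MASS)

inputData <- Boston
lmFit <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat,
            data = inputData)

cDist <- cooks.distance(lmFit)
n <- length(cDist)

# One vertical line per observation; lines rising above the dashed
# 4/n cutoff mark the candidate outliers
plot(cDist, type = "h", ylab = "Cook's distance", main = "Influential observations")
abline(h = 4 / n, col = "red", lty = 2)
```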