Homoscedasticity, the absence of heteroscedasticity, is one of the assumptions of linear regression: the variance of the residuals of the fitted model should not change as the fitted values increase. In this post we will learn how to test for the presence of heteroscedasticity and, if it is found, what measures can be taken to correct it.

Let's begin by building a simple linear regression model that explains *‘dist’* from the built-in *‘cars’* dataset. The predictor in this case is *‘speed’*, and the resulting linear model object is tested for heteroscedasticity using the *‘bptest()’* function from the *‘lmtest’* package. Along with this, we also use *‘gvlma()’*, which tests whether the assumptions of linear regression hold for an *‘lm’* object.

### Build the initial model and test for heteroscedasticity

```r
# initialize the packages
library(e1071)
library(caret)   # for box-cox transformation
library(lmtest)  # bptest for testing heteroscedasticity
library(gvlma)   # global validation of linear model assumptions
```

Let’s now build the model and perform bptest on the object created.

```r
lmMod <- lm(dist ~ speed, data=cars)  # initial model
bptest(lmMod)  # Breusch-Pagan test
```

```
	studentized Breusch-Pagan test

data:  lmMod
BP = 3.2149, df = 1, p-value = 0.07297
```
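For intuition about what *bptest()* computes, the studentized (Koenker) Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the predictor and compute n·R². This is a sketch using only base R, not a replacement for the *lmtest* implementation:

```r
# studentized (Koenker) Breusch-Pagan statistic by hand, base R only
lmMod <- lm(dist ~ speed, data = cars)
u2   <- residuals(lmMod)^2                      # squared residuals
aux  <- lm(u2 ~ speed, data = cars)             # auxiliary regression
BP   <- nrow(cars) * summary(aux)$r.squared     # n * R-squared
pval <- pchisq(BP, df = 1, lower.tail = FALSE)  # chi-squared with 1 df
round(c(BP = BP, p.value = pval), 5)            # matches the bptest() output above
```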

The null hypothesis of the Breusch-Pagan test is that the residual variance is constant (homoscedasticity); a p-value below 0.05 would reject it in favour of heteroscedasticity. Here the p-value of 0.073 is borderline: we cannot reject the null at the 5% level, but the evidence is weak enough to warrant a closer look. This can be cross-checked by running a global validation of linear model assumptions (gvlma) on the *lm* object.

```r
gvlma(lmMod)  # validate if the assumptions of linear regression hold true
```

```
Call:
 gvlma(x = lmMod) 

                    Value  p-value                   Decision
Global Stat        15.801 0.003298 Assumptions NOT satisfied!
Skewness            6.528 0.010621 Assumptions NOT satisfied!
Kurtosis            1.661 0.197449    Assumptions acceptable.
Link Function       2.329 0.126998    Assumptions acceptable.
Heteroscedasticity  5.283 0.021530 Assumptions NOT satisfied!
```

### Graphical interpretation

As suspected, the heteroscedasticity condition is NOT satisfied. Now let's plot the model diagnostics and see how it looks as well.

```r
par(mfrow=c(2,2))  # 4 charts in 1 panel
plot(lmMod)
```

The graph in the top-left of the panel reveals a bit of a *‘pattern’* in the variance: the curved smoothing line suggests the residuals are not completely random. So the errors are not so *‘innocent’*. They appear to know something (if only a little) about the dependent variable, so much so that the residuals themselves could be used as a predictor to model the *‘dist’* variable. In other words, the residuals are not completely random and could contain some information (read: variance) that could in turn be used to predict the dependent variable. In fact, re-using the residuals this way is one of the methods we could use to overcome heteroscedasticity.

### How to correct for presence of heteroscedasticity?

- **By re-building the model with the created error term** (*lmMod$residuals*) **as one of the predictors**
- **By transforming the dependent variable**

#### Using box-cox transformation

For this example, we will proceed with the second method to amend the model and predict *‘dist’*. The transformation we will use in this case is the Box-Cox transformation.

Once the optimal ‘lambda’ value is found, we will use it to transform the dependent variable and build a model using the newly transformed variable as dependent.

```r
distBCMod <- BoxCoxTrans(cars$dist)
print(distBCMod)
```

```
Box-Cox Transformation

50 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00   26.00   36.00   42.98   56.00  120.00 

Largest/Smallest: 60 
Sample Skewness: 0.759 

Estimated Lambda: 0.5 
```

As seen in the last line, the estimated value of lambda is 0.5. So how is this lambda helpful, and how is the transformation actually applied? Simple: each data point in the dependent variable is raised to the power of ‘lambda’, then 1 is subtracted from that number, and finally the result is divided by ‘lambda’ itself.

**transformed var = (y^lambda - 1) / lambda**
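As a quick sanity check, the formula can be applied by hand with the estimated lambda of 0.5. This is a base-R sketch; for *dist = 2*, the transformed value comes out to about 0.828:

```r
# apply the Box-Cox formula manually with the estimated lambda
lambda <- 0.5
y      <- cars$dist
manual <- (y^lambda - 1) / lambda  # (y^0.5 - 1) / 0.5
head(round(manual, 7))             # first value is 0.8284271, for dist = 2
```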

But we don’t have to do this manually; it can be done using the *‘BoxCoxTrans’* function from the *caret* package, as you just saw. Now let's append this new transformed variable to the *cars* data and re-build the model.

```r
cars <- cbind(cars, dist_new=predict(distBCMod, cars$dist))  # append the transformed variable to cars
head(cars)  # view the top 6 rows
```

```
  speed dist  dist_new
1     4    2 0.8284271
2     4   10 4.3245553
3     7    4 2.0000000
4     7   22 7.3808315
5     8   16 6.0000000
6     9   10 4.3245553
```

### Build the new regression model and test for heteroscedasticity

We will use the *dist_new* variable above to build the regression model.

```r
lmMod_bc <- lm(dist_new ~ speed, data=cars)
bptest(lmMod_bc)
```

```
	studentized Breusch-Pagan test

data:  lmMod_bc
BP = 0.011192, df = 1, p-value = 0.9157
```

```r
gvlma(lmMod_bc)  # checking assumptions
```

```
Call:
lm(formula = dist_new ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
     0.5541       0.6448  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = lmMod_bc) 

                     Value p-value                Decision
Global Stat        3.03592 0.55183 Assumptions acceptable.
Skewness           2.81284 0.09351 Assumptions acceptable.
Kurtosis           0.04915 0.82455 Assumptions acceptable.
Link Function      0.09269 0.76078 Assumptions acceptable.
Heteroscedasticity 0.08124 0.77563 Assumptions acceptable.
```

Both from the above output and from the *bptest()* result (p = 0.9157), we can verify that the heteroscedasticity problem is effectively resolved. Let's verify graphically as well.

```r
plot(lmMod_bc)
```

In the top-left plot, the residuals now appear random, with a much flatter smoothing line. This implies that the variance of the residuals no longer changes as the fitted values increase, i.e. the heteroscedasticity is removed. Case resolved!

NOTE: An important point to note is that the current model is built on a transformed dependent variable. Since this new variable is on a different scale compared to the original *‘dist’*, predictions need to be back-transformed by inverting the Box-Cox transformation, i.e. *“by multiplying by lambda, adding 1, and then raising the result to the power of 1/lambda”*. I will leave it as a fun exercise to try this out on your own 🙂
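As a hint for that exercise, here is a minimal sketch of the inverse transformation applied to the transformed variable itself (assuming lambda = 0.5 as estimated above); the round trip recovers the original values:

```r
lambda <- 0.5
y      <- cars$dist
t      <- (y^lambda - 1) / lambda        # forward Box-Cox
y_back <- (t * lambda + 1)^(1 / lambda)  # inverse: multiply by lambda, add 1, raise to 1/lambda
all.equal(y, y_back)                     # TRUE: the original 'dist' values are recovered
```

Applying the same inverse to the model's fitted values or predictions brings them back onto the original *‘dist’* scale.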