"When there are too many variables in your data, it is sometimes preferable to reduce them to a few principal components that explain a major portion of the variation in the data. These principal components can then be used for further model building, clustering, etc. Read on for the concept and a practical implementation of principal components."

### What is PCA and why is it used?

The actual information in data is the total variation it contains. In Principal Component Analysis (PCA), a large number of variables is reduced to a few principal components that explain most of the variability in the data. This is desirable because it reduces noise and redundancy and enables quicker computations for data modelling, since the data volume can potentially shrink many fold.

The original identity of the predictor variables is masked in the process, as each original variable contributes (though not equally) to the making of the ‘condensed’ variables, a.k.a. principal components. By convention, they are labelled PC1, PC2, etc., with PC1 explaining the maximum variability, PC2 the next most, and so on.

Therefore, to predict the response variable from new predictor data, the principal components must be derived first before being fed to the model. If your motivation is solely to improve the prediction accuracy of the dependent variable, and you care less about having a formula composed of the original predictors like we have in linear regression, then PCA can be a highly valuable and powerful tool.
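For instance, projecting new predictor data onto previously derived components might look like the following sketch. It uses `prcomp(..., scale. = TRUE)` so that `predict` re-applies the training means and standard deviations to new data (a slight variation on the manual scaling used later in this article); the Boston data stands in for real training data, and the "new" observations are just its first two rows.

```r
library(MASS)                             # for the Boston dataset used below
pca <- prcomp(Boston, scale. = TRUE)      # derive PCs from the training data
newObs <- Boston[1:2, ]                   # pretend these rows are new observations
newPCs <- predict(pca, newdata = newObs)  # project new data onto the stored PCs
newPCs[, 1:3]                             # leading components, ready for modelling
```

Because the "new" rows here are the first two training rows, their projections coincide with the first two rows of `pca$x`.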

### How to apply PCA?

Let's study this with an example.

```r
data(Boston, package = "MASS")

# initialize data
inputData <- Boston                         # plug in your data here
scaledData <- data.frame(scale(inputData))  # standardise all variables to the same scale
```

```
# Partial output of input data: Boston
#      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
# 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
# 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
# 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
# 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
```

```
# Partial output of scaled data
#    crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio black lstat
# 1 -0.42  0.28 -1.29 -0.27 -0.14  0.41 -0.12  0.14 -0.98 -0.67   -1.46  0.44 -1.07
# 2 -0.42 -0.49 -0.59 -0.27 -0.74  0.19  0.37  0.56 -0.87 -0.99   -0.30  0.44 -0.49
# 3 -0.42 -0.49 -0.59 -0.27 -0.74  1.28 -0.27  0.56 -0.87 -0.99   -0.30  0.40 -1.21
# 4 -0.42 -0.49 -1.31 -0.27 -0.83  1.02 -0.81  1.08 -0.75 -1.11    0.11  0.42 -1.36
```

```r
# Let's apply PCA
pcaDone <- prcomp(scaledData)  # PCA
summary(pcaDone)               # summary of explained variance
pcaDone$sdev                   # standard deviations of the PCs
sum((pcaDone$sdev)^2)          # total variance: each scaled variable contributes a variance of 1
```

```
# Partial output of summary(pcaDone)
# Importance of components:
#                           PC1    PC2     PC3     PC4     PC5     PC6
# Standard deviation     2.5585 1.2843 1.16142 0.94156 0.92244 0.81241
# Proportion of Variance 0.4676 0.1178 0.09635 0.06332 0.06078 0.04714
# Cumulative Proportion  0.4676 0.5854 0.68174 0.74507 0.80585 0.85299
```

### Deciding Number Of Principal Components

A common way to choose the number of principal components is to apply *Kaiser’s criterion*. According to this rule, only those components with variance above 1 are retained.

```r
screeplot(pcaDone, type = "lines")           # scree plot of component variances

library(qcc)                                 # for pareto.chart()
variances <- pcaDone$sdev^2                  # variances of the PCs
pareto.chart(variances, ylab = "Variances")  # plot pareto chart
```
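Kaiser’s rule can also be applied programmatically by filtering the component variances directly. A small sketch (the setup from above is repeated so it runs standalone); per the summary output, three components have variance above 1 for this data:

```r
library(MASS)                          # setup as above
pcaDone <- prcomp(scale(Boston))
retained <- which(pcaDone$sdev^2 > 1)  # Kaiser's criterion: keep variance > 1
colnames(pcaDone$x)[retained]          # "PC1" "PC2" "PC3" for the Boston data
reducedData <- pcaDone$x[, retained]   # component scores to carry forward
dim(reducedData)                       # 506 rows x 3 retained components
```

`reducedData` can then be used as the condensed predictor set for further modelling.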

### Understanding Loadings Of Principal Components

Loadings can be construed as the coefficients used to calculate the principal components: each principal component is a linear combination of the scaled variables, weighted by their respective loadings. The loadings can be accessed via the `rotation` element of the `prcomp` output.

```r
pcaDone$rotation       # loadings of all principal components
pcaDone$rotation[, 1]  # loadings of the 1st principal component

# Each principal component score is the sum of the loadings of each variable
# multiplied by the corresponding scaled data
Loadings <- pcaDone$rotation[, 1]            # loadings of the first PC
rowOneData <- scaledData[1, ]                # row one of the data PCA was applied on
calculatedPCA <- sum(Loadings * rowOneData)  # reconstruct the score PC[1, 1]
all.equal(calculatedPCA, pcaDone$x[1, 1])    # TRUE
```

```
# Partial output of PCA loadings: round(pcaDone$rotation[, 1], 2)
#  crim    zn indus  chas   nox    rm   age
#  0.24 -0.25  0.33 -0.01  0.33 -0.20  0.30
```

### Getting the Most Important Variables from Principal Components

It is possible to identify the variables that contribute most to explaining the data. Since the top 2 PCs usually account for a major portion of the variance in the data, we can compute which variables contribute most to these PCs. This can be done with the ‘factoextra’ package available on GitHub; the ‘devtools’ package comes in handy for installing it. Below is how you can install the ‘factoextra’ package and compute the PCs using the *PCA* function from the ‘FactoMineR’ package.

**Note:** The term ‘important variables’ does not mean that these variables best explain the dependent variable. It simply means these variables explain the maximum variance in the data as a whole. Therefore, this method should NOT be used as a variable selection method, but rather as a technique to identify the best of the best variables *after* variable subset selection is done.

```r
library(devtools)
install_github("kassambara/factoextra")
library(factoextra)
library(FactoMineR)

res.pca <- PCA(inputData, graph = FALSE)
print(res.pca)
```

```
#    Name                Description
# 1  "$eig"              "eigenvalues"
# 2  "$var"              "results for the variables"
# 3  "$var$coord"        "coord. for the variables"
# 4  "$var$cor"          "correlations variables - dimensions"
# 5  "$var$cos2"         "cos2 for the variables"
# 6  "$var$contrib"      "contributions of the variables"
# 7  "$ind"              "results for the individuals"
# 8  "$ind$coord"        "coord. for the individuals"
# 9  "$ind$cos2"         "cos2 for the individuals"
# 10 "$ind$contrib"      "contributions of the individuals"
# 11 "$call"             "summary statistics"
# 12 "$call$centre"      "mean of the variables"
# 13 "$call$ecart.type"  "standard error of the variables"
# 14 "$call$row.w"       "weights for the individuals"
# 15 "$call$col.w"       "weights for the variables"
```

```r
eigenvalues <- res.pca$eig  # get eigenvalues
head(eigenvalues[, 1:2])
```

```
#        eigenvalue percentage of variance
# comp 1  6.5459896              46.757068
# comp 2  1.6495319              11.782371
# comp 3  1.3489059               9.635042
# comp 4  0.8865399               6.332428
# comp 5  0.8508994               6.077853
# comp 6  0.6600108               4.714363
```
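As a sanity check on the output above, the percentage column is simply each eigenvalue divided by the total variance, times 100. A small sketch (the FactoMineR call is repeated on the Boston data so it runs standalone):

```r
library(FactoMineR)
library(MASS)

res.pca <- PCA(Boston, graph = FALSE)
eig <- res.pca$eig
# percentage of variance = eigenvalue / sum(eigenvalues) * 100;
# with scaled data the eigenvalues sum to the number of variables (14 here)
all.equal(eig[, 2], eig[, 1] / sum(eig[, 1]) * 100)  # TRUE
```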

```r
fviz_screeplot(res.pca, ncp = 10)  # % variance explained by each PC
fviz_pca_var(res.pca)              # variables factor map
```

### Which variables contributed the most to maximum variance in data?

```r
# Contribution of variables to PC1 and PC2
fviz_pca_contrib(res.pca, choice = "var", axes = 1:2)  # set axes = 1 to consider only PC1
```

### Which individual observations contributed the most to variance?

Below we plot the top 20 observations, since it is not meaningful to display all the observations on one plot. However, you can customise this by choosing how many observations to plot from the *individual_contributions$data* variable.

```r
# Contribution of individual observations to PC1 and PC2
# (mirrors the variables plot above, with choice = "ind")
individual_contributions <- fviz_pca_contrib(res.pca, choice = "ind", axes = 1:2)

plot(head(individual_contributions$data[, 2], 20), type = "b", pch = 20, xaxt = "n",
     ylab = "Perc % Contribution", xlab = "Observation Row Number",
     main = "Top 20 - Contribution of individuals on PC1 & PC2")
axis(1, at = 1:20, labels = head(individual_contributions$data[, 1], 20))
```