Principal Component Analysis: How to reduce redundancy and noise in data?

"When there are too many variables in your data, sometimes, reducing them to few principal components that could explain a major portion of variation in data is preferred. These principal components can be used for further model building, clustering etc. Read below for the concept and practical implementation of principal components.."

What is PCA and why is it used?

The actual information in data is the total variation it contains. In Principal Component Analysis (PCA), a large number of variables is reduced to a few principal components that together explain the maximum variability in the data. This can be desirable because it reduces noise and redundancy and enables quicker computations for data modelling, since the data volume can potentially shrink many-fold.

The original identity of the predictor variables is masked in the process, as each of the original variables contributes (though not equally) to the making of the ‘condensed variables’, a.k.a. principal components. By convention, they are denoted PC1, PC2, etc., with PC1 explaining the maximum variability, PC2 less than PC1, and so on.

Therefore, to predict the response variable from new predictor data, the principal components must first be derived for the new data before being fed to the model. If your motivation is solely to improve the prediction accuracy of the dependent variable and you care less about having a formula composed of the original predictors, as in linear regression, then PCA can be a highly valuable and powerful tool.
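If you only need this projection step, the predict() method for prcomp objects in base R does it for you. A minimal sketch, assuming pcaDone is the prcomp fit built in the next section and newData is a hypothetical data frame with the same columns as the original predictors, scaled using the training data's means and standard deviations:

# Sketch: project new observations onto previously derived PCs
# 'newData' is hypothetical and must be scaled with the training data's parameters
newScores <- predict(pcaDone, newdata = newData)
head(newScores[, 1:2])  # the new rows expressed as PC1 and PC2 scores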

How to apply PCA?

Let's study this with an example.

data(Boston, package="MASS")  # load the Boston housing data
inputData <- Boston  # plug in your data here
scaledData <- data.frame(scale(inputData))  # standardise all variables to the same scale

# Partial Output Of Input Data: Boston
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
# Partial Output of Scaled Data
   crim    zn indus  chas   nox   rm   age  dis   rad   tax ptratio black lstat
1 -0.42  0.28 -1.29 -0.27 -0.14 0.41 -0.12 0.14 -0.98 -0.67   -1.46  0.44 -1.07
2 -0.42 -0.49 -0.59 -0.27 -0.74 0.19  0.37 0.56 -0.87 -0.99   -0.30  0.44 -0.49
3 -0.42 -0.49 -0.59 -0.27 -0.74 1.28 -0.27 0.56 -0.87 -0.99   -0.30  0.40 -1.21
4 -0.42 -0.49 -1.31 -0.27 -0.83 1.02 -0.81 1.08 -0.75 -1.11    0.11  0.42 -1.36
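As a quick sanity check on the standardisation, every column of scaledData should now have (near) zero mean and unit standard deviation:

round(colMeans(scaledData), 2)  # all zeros after centring
round(apply(scaledData, 2, sd), 2)  # all ones after scaling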

# Let's apply PCA
pcaDone <- prcomp(scaledData)  # PCA
summary(pcaDone)  # variance explained by each PC
pcaDone$sdev  # standard deviations of the PCs
sum((pcaDone$sdev)^2)  # total variance is 14: each scaled variable contributes a variance of 1

# Partial output of summary (pcaDone)
Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     
Standard deviation     2.5585 1.2843 1.16142 0.94156 0.92244 0.81241 
Proportion of Variance 0.4676 0.1178 0.09635 0.06332 0.06078 0.04714 
Cumulative Proportion  0.4676 0.5854 0.68174 0.74507 0.80585 0.85299
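The 'Proportion of Variance' and 'Cumulative Proportion' rows can be reproduced by hand from the standard deviations, which is a handy check of what summary() reports:

propVar <- pcaDone$sdev^2 / sum(pcaDone$sdev^2)  # each PC's share of total variance
round(propVar[1:6], 4)  # matches the 'Proportion of Variance' row above
round(cumsum(propVar)[1:6], 4)  # matches the 'Cumulative Proportion' row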

Deciding Number Of Principal Components

A common way to choose the number of principal components is to apply Kaiser's criterion: only those components with variance (eigenvalue) above 1 are retained.
screeplot(pcaDone, type="lines")  # scree plot of PC variances
library(qcc)  # provides pareto.chart
variances <- pcaDone$sdev^2  # variances (eigenvalues) of the PCs
pareto.chart(variances, ylab="Variances")  # pareto chart of the variances
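Kaiser's criterion itself can be checked in one line, since the variance (eigenvalue) of each component is the square of its standard deviation:

sum(pcaDone$sdev^2 > 1)  # number of PCs retained under Kaiser's criterion: 3 here
which(pcaDone$sdev^2 > 1)  # PC1, PC2 and PC3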

Screeplot of Variances

Pareto Chart of Cumulative Variances Explained (%)

Understanding Loadings Of Principal Components

Loadings can be construed as the coefficients used to calculate the principal components: each principal component is a linear combination of the scaled variables, weighted by its loadings. In a prcomp object, the loadings are stored in the ‘rotation’ element.
pcaDone$rotation  # loadings of all principal components
pcaDone$rotation[, 1]  # loadings of the 1st principal component

# A principal component score is the sum of each variable's loading times its scaled value
Loadings <- pcaDone$rotation[, 1]  # loadings of the first PC
rowOneData <- scaledData[1, ]  # first row of the data PCA was applied to
calculatedPCA <- sum(Loadings * rowOneData)  # calculate the score PC[1, 1]
all.equal(calculatedPCA, pcaDone$x[1, 1])  # TRUE

# Partial output of PCA loadings: round (pcaDone$rotation[, 1], 2) 
crim      zn   indus    chas     nox      rm     age     
0.24   -0.25    0.33   -0.01    0.33   -0.20    0.30
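The same check works for all rows and components at once. Because scaledData is already centred, the full score matrix is simply the data matrix multiplied by the rotation (loadings) matrix:

# All PC scores at once: (centred) data matrix times the loadings matrix
allScores <- as.matrix(scaledData) %*% pcaDone$rotation
all.equal(allScores, pcaDone$x, check.attributes = FALSE)  # TRUE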

Getting the Most Important Variables from Principal Components

It is possible to identify the variables that contribute most to explaining the variance in the data. Since the top 2 PCs usually contribute a major portion of the variance, we can compute which variables contribute most to these PCs. This can be done with the ‘factoextra’ package available on GitHub; the ‘devtools’ package comes in handy for installing it. Below is how you can install the ‘factoextra’ package and compute the PCs using the PCA function from the ‘FactoMineR’ package.

Note: The term ‘important variables’ does not mean that these variables best explain the dependent variable. It simply means that they explain the maximum variance in the data as a whole. Therefore this method should NOT be used for variable selection; rather, use it to identify the strongest of the candidate variables after variable subset selection is done.

library(devtools)
install_github("kassambara/factoextra")  # install factoextra from GitHub
library("factoextra")
library("FactoMineR")
res.pca <- PCA(inputData, graph = FALSE)  # scale.unit is TRUE by default
print(res.pca)

    Name               Description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"

eigenvalues <- res.pca$eig # get eigenvalues
head(eigenvalues[, 1:2])

#        eigenvalue percentage of variance
# comp 1  6.5459896              46.757068
# comp 2  1.6495319              11.782371
# comp 3  1.3489059               9.635042
# comp 4  0.8865399               6.332428
# comp 5  0.8508994               6.077853
# comp 6  0.6600108               4.714363

fviz_screeplot (res.pca, ncp=10) # % variance explained by each PC
fviz_pca_var(res.pca) # Variables factor map

Percentage of Variation Explained by PCs

Variables Factor Map – Correlations with PC1 and PC2
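For a scaled PCA, the coordinates plotted on the factor map are the correlations between each variable and the first two dimensions; the same numbers can be pulled straight from the PCA object:

round(res.pca$var$cor[, 1:2], 2)  # correlations of the variables with Dim.1 and Dim.2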

Which variables contributed the most to the variance in the data?

# Contribution of variables to PC1 and PC2
fviz_pca_contrib(res.pca, choice = "var", axes = 1:2)  # set axes = 1 to consider only PC1
# Note: in current versions of factoextra this function is called fviz_contrib()

Contribution of Variables on PC1 and PC2
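If you want the numbers behind this chart, the contribution percentages are stored in the PCA object. Combining several axes requires weighting each axis by its eigenvalue (which the fviz function handles for you), so as a simple sketch we sort the contributions to the first dimension only:

sort(res.pca$var$contrib[, 1], decreasing = TRUE)  # % contribution of each variable to Dim.1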

Which individual observations contributed the most to the variance?

Below we plot the top 20 observations, since displaying all of them on one plot is not meaningful. You can customise this by choosing how many observations to plot from the individual_contributions$data variable, which we create below.

# Contribution of individual observations to PC1 and PC2
individual_contributions <- fviz_pca_contrib(res.pca, choice = "ind", axes = 1:2)  # store the plot object
contribs <- individual_contributions$data  # observation names and % contributions
contribs <- contribs[order(contribs[, 2], decreasing = TRUE), ]  # sort so head() gives the top contributors
plot(head(contribs[, 2], 20), type="b", pch=20, xaxt="n", ylab="Perc % Contribution", xlab="Observation Row Number", main="Top 20 - Contribution of individuals on PC1 & PC2")
axis(1, at=1:20, labels=head(contribs[, 1], 20))

Contribution of Individual Observations on PC1 and PC2
