Cluster Analysis: How to identify inherent groups within data?

"Clustering can be used to identify and group similar objects within your data. Sometimes it can be used to predict which group a new object will fall in. R has many facilities for clustering analysis. The primary packages to get started are cluster and fpc."

Partition Based

K Means Clustering

K-Means partitions your data into a fixed number of clusters, and it does so regardless of whether the data really contains that many inherent groups: it simply finds the 'n' groups that maximize between-cluster dissimilarity. The input can be a numeric vector, or a numeric matrix / data.frame (2 columns are convenient for plotting).

clus <- kmeans(inputData, n)  # apply k-means with n clusters
plot(inputData, col = clus$cluster) # 'col' colors each point by its cluster number
points(clus$centers, col = 1:n, pch = 8, cex = 2) # plot the centroid of each cluster
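A minimal runnable sketch of the kmeans() workflow above, using the petal measurements from the built-in iris dataset in place of 'inputData':

```r
set.seed(100)  # kmeans starts from random centers, so fix the seed for reproducibility
inputData <- iris[, c("Petal.Length", "Petal.Width")]
n <- 3                                             # ask for 3 clusters
clus <- kmeans(inputData, n)                       # apply k-means
clus$size                                          # number of points in each cluster
plot(inputData, col = clus$cluster)                # color points by assigned cluster
points(clus$centers, col = 1:n, pch = 8, cex = 2)  # mark the centroids
```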

Partitioning Around Medoids (PAM)

PAM is similar to k-means but more robust to outliers, because it uses actual observations (medoids) as cluster centers. It is implemented as pam() in the cluster package. The pamk() function in the fpc package additionally helps determine the optimal number of clusters.

pamClus <- pam(inputData, n)  # apply PAM with n clusters
plot(pamClus) # display cluster plot and silhouette plot
summary(pamClus) # display cluster diagnostics
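A short sketch of pam() on the iris petal measurements; the cluster package ships with base R:

```r
library(cluster)
pamClus <- pam(iris[, c("Petal.Length", "Petal.Width")], 3)  # PAM with 3 clusters
pamClus$medoids            # actual observations chosen as cluster centers
table(pamClus$clustering)  # cluster sizes
```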

How To Find The Optimum Number Of Clusters?

The summary(pamClus) call above reports the silhouette width for each data point, the average silhouette width of each cluster, and the average for the whole dataset.

Silhouette width measures how similar a point is to its own cluster compared to the neighboring clusters; it ranges from -1 to 1, and a higher average silhouette width indicates better-separated clusters. So the number of clusters with the highest average silhouette width is preferred. The pamk() function in the fpc package tests this over a range of cluster counts.
pamkClus <- pamk(x, krange = 2:n, criterion="multiasw", ns=2, critout=TRUE) # n is the max number of clusters to try; silhouette criteria need at least 2 clusters
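A hedged sketch of using pamk() (the fpc package must be installed) with the default average-silhouette-width criterion to pick the number of clusters on the iris petal measurements:

```r
library(fpc)
x <- iris[, c("Petal.Length", "Petal.Width")]
pamkClus <- pamk(x, krange = 2:5)  # try 2 to 5 clusters; best by avg silhouette width
pamkClus$nc                        # the chosen number of clusters
pamkClus$pamobject$medoids         # medoids of the winning clustering
```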

Clustering Large Applications (CLARA): Clustering Large Datasets

clara() from the cluster package uses a sampling approach to cluster large datasets. It provides a silhouette plot and related information to help determine the optimal number of clusters.

claraClus <- clara(x, 2, samples=50) # cluster into 2 groups, drawing 50 samples
summary(claraClus)  # summary
plot(claraClus)  # plot
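A runnable sketch of clara() on a larger simulated dataset (two well-separated Gaussian blobs), where sampling keeps the clustering fast:

```r
library(cluster)
set.seed(1)
x <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),   # blob around (0, 0)
           matrix(rnorm(2000, mean = 5), ncol = 2))   # blob around (5, 5)
claraClus <- clara(x, 2, samples = 50)  # cluster 2000 points into 2 groups
claraClus$medoids                       # one representative point per cluster
table(claraClus$clustering)             # cluster sizes
```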

Advanced Clustering

Fuzzy Analysis Clustering (FANNY)

If you are uncertain about which cluster certain data points belong to and want a soft (fuzzy) assignment, fanny() from the cluster package is a good choice: it returns a membership coefficient for each point in every cluster.

fannyClus <- fanny(inputData, n) # inputData can be a numeric vector, matrix, or data.frame; n is the number of clusters
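A short sketch showing the membership matrix that makes fanny() "fuzzy", again on the iris petal measurements:

```r
library(cluster)
fannyClus <- fanny(iris[, c("Petal.Length", "Petal.Width")], 3)
head(fannyClus$membership)   # each row gives one point's degree of belonging; rows sum to 1
fannyClus$clustering[1:10]   # hard assignment = cluster with the highest membership
```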

Hierarchical Clustering

hclust() is used to understand the hierarchical dissimilarity relationship between data points. Plotting a hclust object generates a dendrogram (tree-like structure).

How To Measure Dissimilarity Between Data Points?

Use dist() to compute the dissimilarity matrix, showing the distance between each pair of observations. The distances can be computed using any of the following method options: "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski".

distObject <- dist(inputData, method = "euclidean") # inputData is a numeric matrix or data.frame
hClus <- hclust(distObject)  # apply hierarchical clustering
hCutClus <- cutree(hClus, n) # cut the tree into n clusters
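An end-to-end runnable sketch of the steps above, on a small random sample of iris so the dendrogram labels stay readable:

```r
set.seed(1)
samp <- iris[sample(nrow(iris), 30), 1:4]            # 30 rows, 4 numeric columns
distObject <- dist(samp, method = "euclidean")       # pairwise distances
hClus <- hclust(distObject)                          # hierarchical clustering
plot(hClus)                                          # draw the dendrogram
hCutClus <- cutree(hClus, 3)                         # cut the tree into 3 groups
table(hCutClus)                                      # group sizes
```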

Figure: Dendrogram from hierarchical clustering

Density Based Clustering

dbscan() from the fpc package is ideal for clusters of arbitrary shape; it marks low-density points as noise and can be used to predict the cluster for new data points.

dbClus <- dbscan(x, eps) # eps is the reachability distance; increase eps to include more points in clusters. x is a numeric matrix with 2 columns.
plot(dbClus, x)  # plot the clusters
predict(dbClus, x, x2) # determine which cluster each observation in x2 belongs to
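A hedged sketch of dbscan() (the fpc package is assumed to be installed) on two noisy concentric rings, a shape that k-means would split poorly:

```r
library(fpc)
set.seed(1)
theta <- runif(300, 0, 2 * pi)
x <- rbind(cbind(cos(theta), sin(theta)),          # inner ring, radius 1
           cbind(3 * cos(theta), 3 * sin(theta)))  # outer ring, radius 3
x <- x + matrix(rnorm(1200, sd = 0.1), ncol = 2)   # add a little noise
dbClus <- dbscan(x, eps = 0.4)  # points within eps of each other join a cluster
table(dbClus$cluster)           # cluster 0 holds the noise points, if any
plot(dbClus, x)
```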

Figure: Density-based clustering


Clustering On Categorical Data

Use Principal Components Analysis (PCA) from the FactoMineR package if your data is predominantly continuous. If you have categorical variables in your data, indicate their column positions in the 'quali.sup' argument of the PCA() call.

pcaDat <- PCA(inputData, scale.unit = TRUE, ncp = Inf, graph = FALSE, quanti.sup = c(3,4,5,..), quali.sup = c(6,7,8,..), ind.sup = c(1,2,3,..)) # compute principal components; the supplementary arguments take column/row positions
result <- HCPC(pcaDat) # hierarchical clustering on principal components
unclass(result) # print all the results

If the data is in the form of contingency tables, use CA(). The HMFA() and FAMD() functions offer facilities for factor analysis of mixed data.

Figures: Hierarchical clustering of principal components; factor map
