"Clustering can be used to identify and group similar objects within your data. Sometimes it can be used to predict which group a new object will fall in. R has many facilities for clustering analysis. The primary packages to get started are cluster and fpc."

**Partition Based**

### K Means Clustering

K-means divides your data into a fixed number of clusters, and it does so regardless of whether the data really has that many groups inherently: it simply finds the 'n' groups whose members are most similar to their own cluster and most dissimilar to the others. The input data can be either a numeric vector or a matrix / data.frame of numeric columns.

```r
clus <- kmeans(inputData, n)  # apply k-means with n clusters
plot(inputData, col = clus$cluster)  # color the points by their cluster assignment
points(clus$centers, col = 1:n, pch = 8, cex = 2)  # mark the centroid of each cluster
```
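As a concrete, runnable sketch using R's built-in iris data (3 clusters and the seed are assumptions for illustration):

```r
# k-means on two numeric columns of the built-in iris data; 3 clusters is an assumption
set.seed(100)  # kmeans starts from random centers, so fix the seed for reproducibility
irisClus <- kmeans(iris[, c("Petal.Length", "Petal.Width")], 3)
table(irisClus$cluster, iris$Species)  # cross-tabulate clusters against the known species
```

Since k-means always returns the requested number of clusters, cross-tabulating against a known grouping (here Species) is one way to sanity-check the result.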

### Partitioning Around Medoids (PAM)

**PAM** is similar to k-means, but it is more robust to outliers. It can be implemented using **pam()** from the **cluster** package. The **pamk()** facility in the **fpc** package additionally helps to figure out the optimum number of clusters.

```r
pamClus <- pam(inputData, n)  # partition around medoids, n clusters
plot(pamClus)     # display cluster plot and silhouette plot
summary(pamClus)  # display cluster diagnostics
```
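A minimal runnable sketch, again on the built-in iris data (3 clusters is an assumption):

```r
library(cluster)
pamClus <- pam(iris[, 1:4], 3)  # partition the four numeric columns around 3 medoids
pamClus$medoids  # each cluster is represented by an actual observation (its medoid)
```

Unlike a k-means centroid, a medoid is a real data point, which is what makes PAM less sensitive to outliers.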

### How To Find The Optimum Number Of Clusters?

The summary(pamClus) calculated above from *pam()* gives the **silhouette width** for each data point, the average silhouette width for each cluster, and the average for the whole dataset.

**Silhouette width** measures how well each point sits within its own cluster compared to the next-closest cluster. A higher average silhouette width indicates better-separated clusters, so it can be used to determine the optimal number of clusters. The **pamk()** function in the **fpc** package tries this out for multiple numbers of clusters.

```r
library(fpc)
pamkClus <- pamk(x, krange = 1:n, criterion = "multiasw", ns = 2, critout = TRUE)  # n is the max number of clusters to test out
```
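For example, with the default "asw" (average silhouette width) criterion on the iris measurements (the krange of 2:5 is an arbitrary choice for illustration):

```r
library(fpc)
pamkClus <- pamk(iris[, 1:4], krange = 2:5, criterion = "asw")
pamkClus$nc  # the number of clusters with the highest average silhouette width
```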

### Clustering Large Applications (CLARA): Clustering Large Datasets

**clara()** from the cluster package uses a sampling approach to cluster large datasets. It provides a silhouette plot and information to **determine the optimum number of clusters**.

```r
claraClus <- clara(x, 2, samples = 50)  # cluster into 2 groups, drawing 50 subsamples
summary(claraClus)  # summary
plot(claraClus)     # plot
```
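A self-contained sketch on simulated data (the two-group structure and all parameter values are assumptions for illustration):

```r
library(cluster)
set.seed(100)
x <- rbind(matrix(rnorm(2000, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(2000, mean = 4, sd = 0.5), ncol = 2))  # two well-separated groups
claraClus <- clara(x, 2, samples = 50)  # clara clusters repeated subsamples, keeping the best result
table(claraClus$clustering)  # cluster sizes
```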

**Advanced Clustering**

### Fuzzy Analysis Clustering (FANNY)

If you are uncertain about which cluster certain data points should be assigned to, fanny() from the cluster package is a strong option: it performs fuzzy clustering, giving each point a degree of membership in every cluster.

```r
fannyClus <- fanny(inputData, n)  # inputData: numeric vector or data.frame with 2 columns; n: number of clusters
summary(fannyClus)
plot(fannyClus)
```
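For instance, on the iris measurements (3 clusters assumed), the membership matrix shows how strongly each point belongs to each cluster:

```r
library(cluster)
fannyClus <- fanny(iris[, 1:4], 3)
head(fannyClus$membership)  # each row sums to 1: the point's degree of membership in each cluster
fannyClus$clustering[1:6]   # the hard assignment is simply the highest-membership cluster
```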

### Hierarchical Clustering

hclust() is used to understand the hierarchical dissimilarity relationship between data points. Plotting a hclust object generates a dendrogram (tree-like structure).

#### How To Measure Dissimilarity Between Data Points?

Use dist() to compute the dissimilarity matrix, showing the distance between each pair of observations. The distances can be computed with any of the following method options: *"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"*.

`distObject <- dist(inputData, method = "euclidean") `

*# input data is numerical matrix or dataframe.*

hClus <- hclust(distObject) *# apply hierarchical clustering*

hCutClus <- cuttree(hClus, n) *# n is the number of clusters*

plot(hClus)
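A runnable sketch on the built-in USArrests data (the dataset, the "complete" linkage and k = 4 are assumptions for illustration):

```r
d <- dist(USArrests, method = "euclidean")  # pairwise distances between the 50 states
hc <- hclust(d, method = "complete")        # complete-linkage hierarchical clustering
plot(hc)                                    # dendrogram with state names as labels
groups <- cutree(hc, k = 4)                 # cut the dendrogram into 4 clusters
table(groups)                               # cluster sizes
```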

**Density Based Clustering**

**dbscan()** from the fpc package is ideal for finding arbitrarily shaped clusters, and the fitted object can be used to predict the cluster for new data points.

```r
dbClus <- dbscan(x, eps)  # eps is the reachability distance; increase eps to include more points in clusters. x is a matrix with 2 columns.
plot(dbClus, x)
predict(dbClus, x, x2)  # determine which cluster each observation in x2 belongs to
```
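A self-contained sketch on simulated data (eps = 0.5 and the blob layout are assumptions; in practice eps has to be tuned to your data's density):

```r
library(fpc)
set.seed(100)
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(200, mean = 2, sd = 0.3), ncol = 2))  # two dense blobs
dbClus <- dbscan(x, eps = 0.5)
table(dbClus$cluster)  # cluster 0, if present, holds the points classified as noise
```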

**Clustering On Categorical Data**

### FactoMineR

Use Principal Components Analysis (PCA) if your data is predominantly continuous. If you have categorical variables in your data, pass their column positions to the 'quali.sup' argument of the PCA() call so they are treated as supplementary.

```r
library(FactoMineR)
pcaDat <- PCA(inputData, scale.unit = TRUE, ncp = Inf, graph = FALSE,
              quanti.sup = c(3,4,5,..), quali.sup = c(6,7,8,..), ind.sup = c(1,2,3,..))  # principal components analysis
result <- HCPC(pcaDat)  # hierarchical clustering on principal components
unclass(result)  # prints all the results
```

If the data is in the form of contingency tables, use CA(). The HMFA() and FAMD() functions offer facilities for factor analysis on mixed data.
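As a runnable sketch on the iris data, treating Species (column 5) as a supplementary categorical variable (nb.clust = 3 is an assumption; HCPC can also choose the number of clusters itself):

```r
library(FactoMineR)
pcaDat <- PCA(iris, quali.sup = 5, graph = FALSE)    # PCA on the 4 numeric columns; Species kept supplementary
result <- HCPC(pcaDat, nb.clust = 3, graph = FALSE)  # hierarchical clustering on the principal components
head(result$data.clust)  # the original data with an extra 'clust' column
```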