Support vector machines (SVM) are a relatively recent and advanced machine learning technique, originally conceived for solving binary classification problems. They are now widely used to address multi-class non-linear classification as well as regression problems. Read on to learn how SVM works and how to implement it in R.

**How does SVM work?**

Here’s what SVM does in simple terms:

Assuming your data points belong to two classes, SVM attempts to find the optimal line (hyperplane) that maximises the distance (margin) between the closest points of these classes. Sometimes points near the boundary may end up on the wrong side of the hyperplane and overlap; in that case, those points are penalised (weighted down) to lower their influence on the solution.

The ‘support vectors’ in this case are the data points that lie on or within the margin of separation; these points alone determine the position of the separating hyperplane.
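For instance, here is a minimal sketch (on a hypothetical toy dataset, not the *cats* data used later) that fits a linear SVM with the *e1071* package and inspects which points were picked as support vectors:

library(e1071)

set.seed(1)

x <- matrix(rnorm(40), ncol = 2) *# 20 toy points with 2 features*

y <- factor(rep(c("A", "B"), each = 10)) *# two classes*

x[y == "B", ] <- x[y == "B", ] + 2 *# shift class B so the classes are roughly separable*

toyfit <- svm(x, y, kernel = "linear", cost = 10, scale = FALSE) *# linear SVM on the toy data*

toyfit$index *# row numbers of the support vectors*

toyfit$SV *# the support vectors themselves*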

**What happens if a separating line (linear hyperplane) cannot be determined?**

The data points are projected onto a higher-dimensional space, where they may become linearly separable. This is usually framed and solved as a constrained optimisation problem that aims to maximise the margin between the two classes.
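For instance, a radial kernel can separate classes that no straight line can. Here is a small sketch with a hypothetical ring-shaped dataset (two classes arranged in concentric rings):

set.seed(2)

r <- c(runif(50, 0, 1), runif(50, 2, 3)) *# inner and outer radii*

theta <- runif(100, 0, 2 * pi)

ringData <- data.frame(x1 = r * cos(theta), x2 = r * sin(theta), response = factor(rep(c("inner", "outer"), each = 50)))

ringfit <- svm(response ~ ., data = ringData, kernel = "radial", cost = 10) *# radial kernel handles the non-linear boundary*

mean(ringData$response != predict(ringfit)) *# training error should be close to zero*

plot(ringfit, ringData) *# curved region of separation*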

**What if my data has more than two classes?**

SVM still views the problem as binary classification, except this time multiple binary SVMs are fitted, one for each pair of classes, and their predictions are combined by voting until all participating classes are differentiated.
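For example, *svm()* from the *e1071* package handles this automatically when the response factor has more than two levels. A quick sketch with the built-in *iris* data (three species):

data(iris)

irisfit <- svm(Species ~ ., data = iris, kernel = "radial") *# three-class response; pairwise SVMs fitted internally*

table(iris$Species, predict(irisfit)) *# confusion table across the three classes*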

**Example Problem**

Let's see how to implement a binary classifier using SVM with the *cats* dataset from the MASS package. In this example you will try to predict the *sex* of a cat using the *body weight* and *heart weight* variables. We will drop 20% of the data points from this dataset and keep them aside to test the accuracy of the model (built on the remaining 80% of the data).

*# Setup*

library(e1071)

data(cats, package="MASS")

inputData <- data.frame(cats[, c (2,3)], response = as.factor(cats$Sex)) *# response as factor*

**Linear SVM**

The key parameters passed to *svm()* are *kernel*, *cost* and *gamma*. *Kernel* is the type of kernel function used by the SVM, which could be linear, polynomial, radial or sigmoid. *Cost* is the cost of constraint violation, and *gamma* is a parameter used by all kernels except the linear one. There is also a *type* parameter that determines whether the model is used for regression, classification or novelty detection. This need not be set explicitly, as *svm()* will detect it automatically based on whether the response variable is a factor or a continuous variable. So for classification problems, be sure to cast your response variable as a factor.
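Here is a quick sketch of that auto-detection on hypothetical toy variables: the same call yields C-classification when the response is a factor and eps-regression when it is numeric.

set.seed(3)

xm <- matrix(rnorm(60), ncol = 2) *# 30 toy points with 2 features*

print(svm(xm, factor(rep(c("a", "b"), 15)))) *# SVM-Type: C-classification*

print(svm(xm, rnorm(30))) *# SVM-Type: eps-regression*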

*# linear SVM*

svmfit <- svm(response ~ ., data = inputData, kernel = "linear", cost = 10, scale = FALSE) *# linear svm, scaling turned OFF*

print(svmfit)

plot(svmfit, inputData)

compareTable <- table (inputData$response, predict(svmfit)) *# tabulate*

mean(inputData$response != predict(svmfit)) *# 19.44% misclassification error*

**Radial SVM**

The radial basis function, a popular kernel function, can be used by setting the *kernel* parameter to *"radial"*. When a radial kernel is used, the resulting decision boundary need not be a straight line anymore. A curved region of separation is defined to demarcate the classes, often leading to higher accuracy within the training data.

*# radial SVM*

svmfit <- svm(response ~ ., data = inputData, kernel = "radial", cost = 10, scale = FALSE) *# radial svm, scaling turned OFF*

print(svmfit)

plot(svmfit, inputData)

compareTable <- table (inputData$response, predict(svmfit)) *# tabulate*

mean(inputData$response != predict(svmfit)) *# 18.75% misclassification error*

**Finding the optimal parameters**

You can find the optimal parameters for *svm()* using the *tune.svm()* function, which searches over the supplied ranges of parameter values using 10-fold cross-validation.

*### Tuning*

* # Prepare training and test data*

set.seed(100) *# for reproducing results*

rowIndices <- 1 : nrow(inputData) *# prepare row indices*

sampleSize <- 0.8 * length(rowIndices) *# training sample size*

trainingRows <- sample (rowIndices, sampleSize) *# random sampling*

trainingData <- inputData[trainingRows, ] *# training data*

testData <- inputData[-trainingRows, ] *# test data*

tuned <- tune.svm(response ~., data = trainingData, gamma = 10^(-6:-1), cost = 10^(1:2)) *# tune*

summary (tuned) *# to select best gamma and cost*

# Parameter tuning of 'svm':
#
# - sampling method: 10-fold cross validation
#
# - best parameters:
#   gamma cost
#   0.001  100
#
# - best performance: 0.26
#
# - Detailed performance results:
#    gamma cost error dispersion
# 1  1e-06   10  0.36 0.09660918
# 2  1e-05   10  0.36 0.09660918
# 3  1e-04   10  0.36 0.09660918
# 4  1e-03   10  0.36 0.09660918
# 5  1e-02   10  0.27 0.20027759
# 6  1e-01   10  0.27 0.14944341
# 7  1e-06  100  0.36 0.09660918
# 8  1e-05  100  0.36 0.09660918
# 9  1e-04  100  0.36 0.09660918
# 10 1e-03  100  0.26 0.18378732
# 11 1e-02  100  0.26 0.17763883
# 12 1e-01  100  0.26 0.15055453

It turns out that a cost of 100 and a gamma of 0.001 yield the lowest cross-validation error. Let's fit a radial SVM with these parameters.

svmfit <- svm(response ~ ., data = trainingData, kernel = "radial", cost = 100, gamma = 0.001, scale = FALSE) *# radial svm, scaling turned OFF*

print(svmfit)

plot(svmfit, trainingData)

compareTable <- table (testData$response, predict(svmfit, testData)) *# comparison table*

mean(testData$response != predict(svmfit, testData)) *# 13.79% misclassification error*

#      F  M
#   F  6  3
#   M  1 19
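As a side note, the object returned by *tune.svm()* also keeps the refitted best model (assuming the default *tune.control()* settings), so you can use it directly instead of refitting by hand. It may differ slightly from the manual fit above, which turned scaling off.

bestFit <- tuned$best.model *# best model found during tuning*

mean(testData$response != predict(bestFit, testData)) *# test error of the tuned model*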

**The Grid Plot**

A two-coloured grid plot makes it visually clear which regions of the plot are assigned to which response class by the SVM classifier. In the example below, the data points are plotted over such a grid, and the support vectors are marked with tilted squares around the corresponding data points. Obviously, there are many constraint violations in this case, marked by points crossing over the boundary, but these are weighted down by the SVM internally.

*# Grid Plot*

n_points_in_grid = 60 *# num grid points in a line*

x_axis_range <- range (inputData[, 2]) *# range of X axis*

y_axis_range <- range (inputData[, 1]) *# range of Y axis*

X_grid_points <- seq (from=x_axis_range[1], to=x_axis_range[2], length=n_points_in_grid) *# grid points along x-axis*

Y_grid_points <- seq (from=y_axis_range[1], to=y_axis_range[2], length=n_points_in_grid) *# grid points along y-axis*

all_grid_points <- expand.grid (X_grid_points, Y_grid_points) *# generate all grid points*

names (all_grid_points) <- c("Hwt", "Bwt") *# rename*

all_points_predicted <- predict(svmfit, all_grid_points) *# predict the class for every grid point*

color_array <- c("red", "blue")[as.numeric(all_points_predicted)] *# colors for all grid points based on predictions*

plot (all_grid_points, col=color_array, pch=20, cex=0.25) *# plot all grid points*

points (x=trainingData$Hwt, y=trainingData$Bwt, col=c("red", "blue")[as.numeric(trainingData$response)], pch=19) *# plot data points*

points (trainingData[svmfit$index, c (2, 1)], pch=5, cex=2) *# plot support vectors*