# Support Vector Machines

The support vector machine (SVM) is a relatively new and advanced machine learning technique, originally conceived for solving binary classification problems. It is now widely used to address multi-class, non-linear classification as well as regression problems. Read on to learn how SVM works and how to implement it in R.

### How does SVM work?

Here’s what SVM does in simple terms:

Assuming that your data points belong to two classes, SVM attempts to find the optimal line (hyperplane) that maximises the margin, i.e. the distance between the hyperplane and the closest points from each class. Sometimes boundary points cross over to the wrong side of the hyperplane and overlap, in which case these points are weighted down to lower their influence.

The ‘support vectors’ are the data points that lie on the margin of separation; these points alone determine the position of the hyperplane.

### What happens if a separating line (linear hyperplane) cannot be determined?

The data points are projected onto a higher-dimensional space where they may become linearly separable. This is usually framed and solved as a constrained optimisation problem that aims to maximise the margin between the two classes.
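The projection idea can be sketched with a made-up one-dimensional example: points labelled by whether they lie near zero cannot be separated by a single threshold on the line, but adding a squared term as a second feature makes them linearly separable. (The data and the cut-off 0.5 below are illustrative assumptions, not from the cats example.)

```r
library(e1071)

# Hypothetical 1-D data: "inner" points near zero, "outer" points further out.
# No single threshold on x separates the classes.
set.seed(1)
x <- runif(40, -1, 1)
y <- factor(ifelse(abs(x) < 0.5, "inner", "outer"))

# Project each point x to (x, x^2): in this 2-D space a horizontal
# line near x^2 = 0.25 separates the two classes.
projected <- data.frame(x = x, x2 = x^2)

fit <- svm(y ~ ., data = projected, kernel = "linear", cost = 10)
mean(y != predict(fit))  # low training error once the data is projected
```

Kernels such as the radial basis function perform an equivalent projection implicitly, so you rarely construct these extra features by hand.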

### What if my data has more than two classes?

SVM will still view the problem as binary classification, except this time multiple binary SVMs are fitted for pairs of classes against each other (one-against-one) until all participating classes are differentiated.
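This can be seen with the built-in three-class iris data (an illustrative dataset, not part of the cats example below): e1071's svm() fits one binary SVM per pair of classes and assigns each point by voting over those pairwise classifiers.

```r
library(e1071)
data(iris)

# 3 classes -> choose(3, 2) = 3 pairwise binary SVMs fitted under the hood
fit <- svm(Species ~ ., data = iris, kernel = "radial")

table(iris$Species, predict(fit))          # per-class confusion table
mean(iris$Species != predict(fit))         # small training misclassification rate
```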

### Example Problem

Let's see how to implement a binary classifier using SVM with the cats dataset from the MASS package. In this example you will try to predict the sex of a cat using the body weight and heart weight variables. We will set aside 20% of the data points to test the accuracy of the model built on the remaining 80%.

```
# Setup
library(e1071)
data(cats, package = "MASS")
inputData <- data.frame(cats[, c(2, 3)], response = as.factor(cats$Sex)) # response as factor
```

### Linear SVM

The key parameters passed to svm() are kernel, cost and gamma. kernel selects the type of SVM, which can be linear, polynomial, radial or sigmoid. cost is the penalty for constraint violations, and gamma is a kernel parameter used by all kernels except the linear one. There is also a type parameter that determines whether the model is used for regression, classification or novelty detection. This need not be set explicitly, as svm() will detect it based on whether the response variable is a factor or a continuous variable. So for classification problems, be sure to cast your response variable as a factor.

```
# linear SVM
svmfit <- svm(response ~ ., data = inputData, kernel = "linear", cost = 10, scale = FALSE) # linear svm, scaling turned OFF
print(svmfit)
plot(svmfit, inputData)
compareTable <- table(inputData$response, predict(svmfit))  # tabulate
mean(inputData$response != predict(svmfit)) # 19.44% misclassification error
```
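For a linear kernel, the separating hyperplane itself can be recovered from the fitted object. This is a sketch relying on e1071 storing the support vectors in `$SV` and their weights in `$coefs`, with the decision rule sign(w · x − rho):

```r
library(e1071)
data(cats, package = "MASS")
inputData <- data.frame(cats[, c(2, 3)], response = as.factor(cats$Sex))

svmfit <- svm(response ~ ., data = inputData, kernel = "linear", cost = 10, scale = FALSE)

# The normal vector w is a weighted sum of the support vectors;
# together with rho it defines the hyperplane w . x - rho = 0.
w <- t(svmfit$coefs) %*% svmfit$SV  # one weight per input variable (Bwt, Hwt)
b <- -svmfit$rho
w
b
```

Only the support vectors contribute to w, which is why the remaining data points can be discarded after training without changing the decision boundary.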

The radial basis function, a popular kernel, can be used by setting the kernel parameter to "radial". When a radial kernel is used, the resulting decision boundary need not be a line any more. A curved region of separation is usually defined to demarcate the classes, often leading to higher accuracy on the training data.

``` svmfit <- svm(response ~ ., data = inputData, kernel = "radial", cost = 10, scale = FALSE) # radial svm, scaling turned OFF print(svmfit) plot(svmfit, inputData) compareTable <- table (inputData\$response, predict(svmfit))  # tabulate mean(inputData\$response != predict(svmfit)) # 18.75% misclassification error```

### Finding the optimal parameters

You can find the optimal parameters for svm() using the tune.svm() function, which runs a grid search with cross validation.

```
### Tuning
# Prepare training and test data
set.seed(100) # for reproducing results
rowIndices <- 1:nrow(inputData) # prepare row indices
sampleSize <- 0.8 * length(rowIndices) # training sample size
trainingRows <- sample(rowIndices, sampleSize) # random sampling
trainingData <- inputData[trainingRows, ] # training data
testData <- inputData[-trainingRows, ] # test data
tuned <- tune.svm(response ~ ., data = trainingData, gamma = 10^(-6:-1), cost = 10^(1:2)) # tune
summary(tuned) # to select best gamma and cost
```

```# Parameter tuning of 'svm':
#   - sampling method: 10-fold cross validation
#
# - best parameters:
#   gamma cost
# 0.001  100
#
# - best performance: 0.26
#
# - Detailed performance results:
#   gamma cost error dispersion
# 1  1e-06   10  0.36 0.09660918
# 2  1e-05   10  0.36 0.09660918
# 3  1e-04   10  0.36 0.09660918
# 4  1e-03   10  0.36 0.09660918
# 5  1e-02   10  0.27 0.20027759
# 6  1e-01   10  0.27 0.14944341
# 7  1e-06  100  0.36 0.09660918
# 8  1e-05  100  0.36 0.09660918
# 9  1e-04  100  0.36 0.09660918
# 10 1e-03  100  0.26 0.18378732
# 11 1e-02  100  0.26 0.17763883
# 12 1e-01  100  0.26 0.15055453```

It turns out a cost value of 100 and a gamma value of 0.001 yield the least error. Let's fit a radial SVM with these parameters.

```
svmfit <- svm(response ~ ., data = trainingData, kernel = "radial", cost = 100, gamma = 0.001, scale = FALSE) # radial svm, scaling turned OFF
print(svmfit)
plot(svmfit, trainingData)
compareTable <- table(testData$response, predict(svmfit, testData))  # comparison table
mean(testData$response != predict(svmfit, testData)) # 13.79% misclassification error
```

```     F  M
  F  6  3
  M  1 19```

### The Grid Plot

A two-coloured grid plot makes it visually clear which regions of the plot the SVM classifier assigns to each class of the response. In the example below, the data points are plotted against such a grid and the support vectors are marked by a tilted square around them. Obviously, there are many constraint violations in this case, marked by the boundary cross-overs, but these are weighted down by the SVM internally.

```
# Grid Plot
n_points_in_grid <- 60 # num grid points in a line
x_axis_range <- range(inputData[, 2]) # range of X axis
y_axis_range <- range(inputData[, 1]) # range of Y axis
X_grid_points <- seq(from = x_axis_range[1], to = x_axis_range[2], length = n_points_in_grid) # grid points along x-axis
Y_grid_points <- seq(from = y_axis_range[1], to = y_axis_range[2], length = n_points_in_grid) # grid points along y-axis
all_grid_points <- expand.grid(X_grid_points, Y_grid_points) # generate all grid points
names(all_grid_points) <- c("Hwt", "Bwt") # rename
all_points_predicted <- predict(svmfit, all_grid_points) # predict for all points in grid
color_array <- c("red", "blue")[as.numeric(all_points_predicted)] # colors for all points based on predictions
plot(all_grid_points, col = color_array, pch = 20, cex = 0.25) # plot all grid points
points(x = trainingData$Hwt, y = trainingData$Bwt, col = c("red", "blue")[as.numeric(trainingData$response)], pch = 19) # plot data points
points(trainingData[svmfit$index, c(2, 1)], pch = 5, cex = 2) # plot support vectors
```