Support Vector Machines

Support vector machines (SVMs) are a relatively new and advanced machine learning technique, originally conceived for solving binary classification problems. They are now widely used for multi-class and non-linear classification as well as regression problems. Read on to learn how SVM works and how to implement it in R.

How does SVM work?

Here’s what SVM does in simple terms:

Assuming that your data points belong to 2 classes, SVM attempts to find the optimal line (hyperplane) that maximises the distance (margin) between the closest points of these classes. Sometimes points near the boundary fall on the wrong side of the hyperplane and the classes overlap. In that case the violations are tolerated but penalised, with the size of the penalty set by a cost parameter.

The ‘support vectors’ in this case are the data points that lie on the margin of separation (or, when the classes overlap, inside it). These points alone determine where the hyperplane is placed.
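
The description above corresponds to the standard soft-margin formulation, sketched here for reference. The symbols w, b, ξ and C are notation introduced for illustration and do not appear in the original text; C plays the role of the cost parameter discussed later.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,n
\end{aligned}
```

Here each label y_i is coded as -1 or +1, the margin width is 2/‖w‖, and ξ_i is greater than zero only for points that fall inside the margin or on the wrong side of the hyperplane.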

[Figure: Support vector machines]

What happens if a separating line (linear hyperplane) cannot be determined?

The data points are projected into a higher-dimensional space where they may become linearly separable. In practice this projection is done implicitly through a kernel function, and the whole problem is framed and solved as a constrained optimization problem that aims to maximise the margin between the two classes.
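
A toy illustration of the projection idea, using base R only. The data and the explicit feature map below are made up for illustration; svm() does not build such a map explicitly but works through kernel functions instead.

```r
# Toy 1-D data: the "inner" class sits between the two groups of the "outer"
# class, so no single threshold on x can separate them.
x <- c(-3, -2, -1, 0, 1, 2, 3)
y <- factor(ifelse(abs(x) > 1.5, "outer", "inner"))

# Hypothetical explicit feature map (illustration only): x -> (x, x^2)
phi <- cbind(x1 = x, x2 = x^2)

# In the lifted space the horizontal line x2 = 2.25 separates the classes.
predicted <- factor(ifelse(phi[, "x2"] > 2.25, "outer", "inner"),
                    levels = levels(y))
all(predicted == y)
```

Plotting phi with the points coloured by y shows the two classes split by a straight line in two dimensions, even though they were not separable in one.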

What if my data has more than two classes?

SVM will still view the problem as binary classification, except that this time multiple SVMs are fitted, one for each pair of classes, until all participating classes are differentiated.
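
To make the pairwise idea concrete, the sketch below lists the binary problems that a one-against-one scheme would fit (the approach documented for svm() in e1071). The class labels are hypothetical and no model is fitted here.

```r
# Hypothetical class labels, for illustration only
classes <- c("A", "B", "C", "D")

# One binary SVM per unordered pair of classes (one-against-one)
pairs <- t(combn(classes, 2))
colnames(pairs) <- c("class_1", "class_2")
pairs

# Number of binary classifiers: k * (k - 1) / 2
k <- length(classes)
n_classifiers <- k * (k - 1) / 2
n_classifiers == nrow(pairs)
```

With a scheme like this, a new observation is scored by every pairwise classifier and assigned to the class that collects the most votes.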

Example Problem

Let's see how to implement a binary classifier using SVM with the cats dataset from the MASS package. In this example you will try to predict the sex of a cat using the body weight and heart weight variables. Later, when tuning the model, we will set aside 20% of the data points to test the accuracy of the model (built on the remaining 80% of the data).

# Setup
library(e1071) # provides svm() and tune.svm()
data(cats, package = "MASS")
inputData <- data.frame(cats[, c(2, 3)], response = as.factor(cats$Sex)) # Bwt, Hwt and the response as factor

Linear SVM

The key parameters passed to svm() are kernel, cost and gamma. kernel is the type of kernel function, which can be linear, polynomial, radial or sigmoid. cost is the penalty applied to constraint violations, and gamma is a parameter used by all kernels except linear. There is also a type parameter that determines whether the model is used for regression, classification or novelty detection. This need not be set explicitly, as svm() detects it from the class of the response variable (factor or continuous). So for classification problems, be sure to cast your response variable as a factor.
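
A small sketch of the auto-detection described above, assuming the e1071 and MASS packages are installed. The formulas are chosen only for illustration: the same svm() call fits a classifier when the response is a factor and a regression when it is numeric.

```r
library(e1071)               # provides svm()
data(cats, package = "MASS")

# Factor response: svm() fits a classifier and predict() returns a factor
clf <- svm(Sex ~ Bwt + Hwt, data = cats, kernel = "linear", cost = 10)
class(predict(clf, cats))

# Numeric response: svm() fits a regression and predict() returns numbers
reg <- svm(Hwt ~ Bwt, data = cats, kernel = "linear", cost = 10)
class(predict(reg, cats))
```

Printing either fitted object also reports which SVM type was chosen.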

# linear SVM
svmfit <- svm(response ~ ., data = inputData, kernel = "linear", cost = 10, scale = FALSE) # linear svm, scaling turned OFF
plot(svmfit, inputData)
compareTable <- table (inputData$response, predict(svmfit))  # tabulate
mean(inputData$response != predict(svmfit)) # 19.44% misclassification error

[Figure: Linear SVM plot]

Radial SVM

The radial basis function, a popular kernel function, can be used by setting the kernel parameter to “radial”. When a ‘radial’ kernel is used, the resulting decision boundary need not be a line anymore. A curved region of separation is usually defined to demarcate the classes, often leading to higher accuracy on the training data.
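
For intuition about what gamma does, here is a base R sketch of the radial kernel in the form documented for e1071, exp(-gamma * |u - v|^2). The helper rbf_kernel() and the two data points are hypothetical and only serve the illustration.

```r
# Radial basis kernel: k(u, v) = exp(-gamma * ||u - v||^2)
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Two made-up observations described by (Bwt, Hwt)
u <- c(2.0, 7.0)
v <- c(3.0, 12.0)

rbf_kernel(u, u, gamma = 0.1)    # identical points always give 1
rbf_kernel(u, v, gamma = 0.001)  # small gamma: distant points still look similar
rbf_kernel(u, v, gamma = 1)      # large gamma: similarity decays quickly
```

A larger gamma makes each training point influence only its close neighbourhood, which tends to produce a more wiggly boundary.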

# radial SVM
svmfit <- svm(response ~ ., data = inputData, kernel = "radial", cost = 10, scale = FALSE) # radial svm, scaling turned OFF
plot(svmfit, inputData)
compareTable <- table (inputData$response, predict(svmfit))  # tabulate
mean(inputData$response != predict(svmfit)) # 18.75% misclassification error

[Figure: Radial SVM plot]

Finding the optimal parameters

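You can find the optimal parameters for svm() using the tune.svm() function, which runs a grid search over the supplied parameter values and scores each combination by cross validation.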

### Tuning
# Prepare training and test data
set.seed(100) # for reproducing results
rowIndices <- 1 : nrow(inputData) # prepare row indices
sampleSize <- floor(0.8 * length(rowIndices)) # training sample size, as a whole number of rows
trainingRows <- sample (rowIndices, sampleSize) # random sampling
trainingData <- inputData[trainingRows, ] # training data
testData <- inputData[-trainingRows, ] # test data
tuned <- tune.svm(response ~., data = trainingData, gamma = 10^(-6:-1), cost = 10^(1:2)) # tune
summary (tuned) # to select best gamma and cost

# Parameter tuning of 'svm':   
#   - sampling method: 10-fold cross validation 
# - best parameters:
#   gamma cost
# 0.001  100
# - best performance: 0.26 
# - Detailed performance results:
#   gamma cost error dispersion
# 1  1e-06   10  0.36 0.09660918
# 2  1e-05   10  0.36 0.09660918
# 3  1e-04   10  0.36 0.09660918
# 4  1e-03   10  0.36 0.09660918
# 5  1e-02   10  0.27 0.20027759
# 6  1e-01   10  0.27 0.14944341
# 7  1e-06  100  0.36 0.09660918
# 8  1e-05  100  0.36 0.09660918
# 9  1e-04  100  0.36 0.09660918
# 10 1e-03  100  0.26 0.18378732
# 11 1e-02  100  0.26 0.17763883
# 12 1e-01  100  0.26 0.15055453
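
The same information can be read programmatically from the object returned by tune.svm(), rather than off the printed summary. The sketch below assumes e1071 and MASS are installed and, to stay self-contained, tunes on the full cats data instead of the training split, so its numbers will differ from the output above.

```r
library(e1071)
data(cats, package = "MASS")

set.seed(100)  # the cross-validation folds are drawn at random
tuned <- tune.svm(Sex ~ Bwt + Hwt, data = cats,
                  gamma = 10^(-6:-1), cost = 10^(1:2))

tuned$best.parameters        # one-row data frame with the chosen gamma and cost
tuned$best.performance       # lowest cross-validated error
bestFit <- tuned$best.model  # svm refitted with the best parameters
class(bestFit)
```

Using best.model directly avoids retyping the selected values by hand when fitting the final model.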

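It turns out that a cost of 100 and a gamma of 0.001 yield the least error (gamma values of 0.01 and 0.1 tie at the same error for this cost). Note that tune.svm() was run above with the default scaling, while the fit below sets scale = FALSE, so the tuned values may not carry over exactly. Let's fit a radial SVM with these parameters.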

svmfit <- svm(response ~ ., data = trainingData, kernel = "radial", cost = 100, gamma = 0.001, scale = FALSE) # radial svm, scaling turned OFF
plot(svmfit, trainingData)
compareTable <- table(testData$response, predict(svmfit, testData)) # comparison table (rows: actual, columns: predicted)
mean(testData$response != predict(svmfit, testData)) # 13.79% misclassification error
compareTable # print the comparison table

     F  M
  F  6  3
  M  1 19
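
As a quick check, the misclassification error can be recomputed from the comparison table itself. The sketch below retypes the counts from the output above into a matrix, using base R only.

```r
# Comparison table copied from the output above
# (rows: actual sex, columns: predicted sex)
compareTable <- matrix(c(6, 1, 3, 19), nrow = 2,
                       dimnames = list(actual = c("F", "M"),
                                       predicted = c("F", "M")))

correct <- sum(diag(compareTable))        # 6 + 19 = 25 correct predictions
total <- sum(compareTable)                # 29 test rows
misclassification <- 1 - correct / total  # 4 / 29
round(100 * misclassification, 2)         # percentage, matches the 13.79% above
```

The off-diagonal cells show where the errors come from: 3 females predicted as male and 1 male predicted as female.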

The Grid Plot

A 2-coloured grid plot makes it visually clear which regions of the plot are assigned to which class of the response by the SVM classifier. In the example below, the data points are plotted against such a grid and the support vectors are marked by a tilted square around the data points. In this case there are many constraint violations, visible as points that cross the boundary, but these are penalised by the SVM internally.

# Grid Plot
n_points_in_grid <- 60 # num grid points in a line
x_axis_range <- range (inputData[, 2]) # range of X axis
y_axis_range <- range (inputData[, 1]) # range of Y axis
X_grid_points <- seq (from=x_axis_range[1], to=x_axis_range[2], length=n_points_in_grid) # grid points along x-axis
Y_grid_points <- seq (from=y_axis_range[1], to=y_axis_range[2], length=n_points_in_grid) # grid points along y-axis
all_grid_points <- expand.grid (X_grid_points, Y_grid_points) # generate all grid points
names (all_grid_points) <- c("Hwt", "Bwt") # rename
all_points_predicted <- predict(svmfit, all_grid_points) # predict for all points in grid
color_array <- c("red", "blue")[as.numeric(all_points_predicted)] # colors for all points based on predictions
plot (all_grid_points, col=color_array, pch=20, cex=0.25) # plot all grid points
points (x=trainingData$Hwt, y=trainingData$Bwt, col=c("red", "blue")[as.numeric(trainingData$response)], pch=19) # plot data points
points (trainingData[svmfit$index, c (2, 1)], pch=5, cex=2) # plot support vectors

[Figure: SVM grid plot]
