The support vector machine (SVM) is one of the most popular and effective classification algorithms. Like nearest centroid, SVM is a linear classifier. Recall we were able to turn linear regression into non-linear regression by explicitly adding more variables using non-linear variable transformations (e.g. polynomial terms). Similarly, we can turn a linear classifier into a non-linear classifier by adding non-linear terms.
There is a way of automatically adding non-linear variables into a linear classifier by doing something called the kernel trick.
This lecture will cover

- the maximal margin (hard margin) classifier
- soft margin SVM and its tuning parameter \(C\)
- fitting SVM in R with the e1071 package
- kernels and kernel SVM
- tuning models with the caret package
- other classification metrics beyond misclassification error
Hard margin SVM is useful to help understand soft margin SVM, but is not commonly used (although sometimes hard margin SVM works well for very high-dimensional data). These notes present the intuition behind soft margin SVM and point to reading for the mathematical details. Classification accuracy is an important metric, but there are other classification metrics you should be aware of.
library(e1071) # implements SVM
library(kernlab) # implements kernel SVM
# for model tuning
library(caret)
library(mvtnorm)
library(tidyverse)
# some helper functions I wrote
source('synthetic_distributions.R')
source('svm_fun.R')
The point of this lecture is to introduce the intuitive ideas behind SVM and kernels and show you how to use them in R. The math underlying these algorithms is explained well in the resources listed below.
For more details see

- ISLR chapter 9
- Andrew Ng's CS229 notes on SVM
The maximal margin (MM) classifier (also known as the hard margin support vector machine) assumes the two classes of data are linearly separable i.e. we can draw a line where every data point in the first class is on one side and every point in the other class is on the other side of the line.
Recall a linear classifier makes predictions by separating the data space into two regions by a hyperplane. When the data are two dimensional, as above, this means a linear classifier draws a line in the plane.
When the data are separable, a reasonable aspiration is to have the separating hyperplane lie between the two classes i.e. one class is entirely on one side of the hyperplane and the other class is entirely on the other side. We call the distance between the hyperplane and the nearest data point the margin.
The MM hyperplane is allergic to data points; MM seeks to find the hyperplane that is furthest away from the nearest data point. In other words MM finds the hyperplane that maximizes the margin.
In the plot above the MM separating hyperplane is shown as the solid line. The red arrows show the margin width – the distance between the separating hyperplane and the nearest point. The dashed lines are the so called marginal hyperplanes. These marginal hyperplanes are parallel to the separating hyperplane. The highlighted points are called support vectors and are the points that are closest to the separating hyperplane (all three are equidistant to the separating hyperplane and lie on the marginal hyperplanes).
The support vectors play an important role in understanding MM and SVM in general. The MM normal vector and intercept depend only on the support vectors – none of the other data points. In contrast, nearest centroid depends on all the data points.
We can write the maximal margin classifier as an optimization problem: of all hyperplanes that separate the two classes, find the one that maximizes the margin. All of the aforementioned resources discuss how to write this as a math problem then solve the math problem.
Linear separability is a strong assumption that is typically not true for real data sets; often the two classes cannot be separated by a hyperplane.
Soft margin support vector machine (which we will call just SVM) is an adaptation of the maximal margin classifier that is cool with data points lying on the wrong side of the separating hyperplane.
Keep the following intuition in mind for this section. A good linear classifier should aim to put points

- on the correct side of the separating hyperplane, and
- as far from the separating hyperplane as possible.
See ISLR chapter 9 for details about SVM.
Maximizing the margin is about putting points as far on the correct side of the separating hyperplane as possible. Soft margin SVM allows some “bad” points to be on the wrong side of the separating hyperplane, but penalizes these bad points. SVM then wants to put the remaining “good” points as far on the correct side of the margin as possible.
SVM thus has two competing objectives: maximize the margin, but penalize disobedient points. SVM has a devil on one shoulder and an angel on the other. Like many machine learning algorithms, SVM tries to strike a balance between two competing objectives.
In order to strike this balance between the two competing objectives, SVM comes with a tuning parameter, \(C >0\). This tuning parameter, \(C\), controls how much SVM cares about the “bad” points. When \(C\) is large SVM tries very hard to put every training point on the correct side of the separating hyperplane. When \(C\) is small SVM has more chill about points being on the wrong side of the separating hyperplane.
All of the above mentioned resources discuss how to write this as a math problem then solve the math problem.
The plots below show the SVM fit for three different values of C (C=100, .01, .0001). The solid line shows the SVM separating hyperplane. The dashed lines show the SVM marginal hyperplanes which play an important role in the inner workings of SVM (these do not have the same interpretation as the marginal hyperplanes for the MM classifier).
Notice that as \(C\) decreases more points are misclassified. This is another example of the bias-variance tradeoff (recall chapter 2 from ISLR). Large values of \(C\) make the classifier try really hard to not misclassify anyone – the so called low bias, high variance regime (leads to overfitting). Small values of \(C\) are more ok with misclassified points – high bias, low variance. As with many things in life the key is to find the right balance.
The best value of \(C\) is typically selected using cross-validation.
Unlike linear regression, or nearest centroid, SVM cannot be solved in closed form (i.e. there is no simple formula using the data to get the normal vector and intercept). Fitting SVM requires solving a quadratic program which you could probably do after taking one course in optimization. Luckily for us, people have already coded up SVM.
We will use the e1071 package to implement SVM. The kernlab package also implements SVM in R, but for some reason I prefer e1071
…
First let’s sample some training data and some test data (same as the data shown above).
# this function comes from the synthetic_distributions.R script sourced above
train <- two_class_guasssian_meatballs(n_pos=200, n_neg=200,
                                       mu_pos=c(1,0), mu_neg=c(-1,0),
                                       sigma_pos=diag(2), sigma_neg=diag(2),
                                       seed=103)
train
## # A tibble: 400 × 3
## x1 x2 y
## <dbl> <dbl> <fctr>
## 1 2.15074931 -0.4792060 1
## 2 -0.08786829 1.6166332 1
## 3 1.76506913 -0.1514896 1
## 4 0.61865671 1.2622023 1
## 5 1.07803836 -0.5458768 1
## 6 0.85736662 0.3856852 1
## 7 1.75866283 0.3809878 1
## 8 2.21017318 -2.4745634 1
## 9 0.96406375 0.3872157 1
## 10 -0.60946315 1.7316120 1
## # ... with 390 more rows
test <- two_class_guasssian_meatballs(n_pos=1000, n_neg=1000,
                                      mu_pos=c(1,0), mu_neg=c(-1,0),
                                      sigma_pos=diag(2), sigma_neg=diag(2),
                                      seed=634)
Before fitting SVM we have to decide on the value of the tuning parameter to use – let’s stick with \(C = 10\) for now. The svm() function is from the e1071 package. The function might look intimidating – there are an annoying number of arguments being set, which are discussed below. Pay attention to the three lines with comments next to them – these are the important lines.
# fit SVM
svmfit <- svm(y ~ ., # R's formula notation
              data=train, # data frame to use
              cost=10, # set the tuning parameter
              scale=FALSE,
              type='C-classification',
              shrinking=FALSE,
              kernel='linear')
Main arguments

- data=train says fit SVM using the data stored in the train data frame. The svm() function uses R’s formula notation. Recall from linear regression y ~ . means fit y on all the rest of the columns of data. We could have equivalently used y ~ x1 + x2.
- cost=10 fixes the tuning parameter \(C\) to 10. The tuning parameter \(C\) is also sometimes called a cost parameter.
- shrinking=FALSE: I’m not sure what this does, but I don’t want anything extra to happen so I told it to stop.

Other arguments

Check out the documentation to read more about the arguments (i.e. run ?svm). The svm() function can do a lot of stuff, which is why it has so many arguments.

- scale=FALSE says please do not center and scale our data. svm() applies some preprocessing to the data by default. While preprocessing (e.g. center and scale) is often a good thing to do, I strongly disagree with making this the default behavior.
- type='C-classification' tells svm() to do classification. It turns out SVM can be used to do other things than classification (see http://kernelsvm.tripod.com/).
- kernel='linear' says do linear SVM. The svm() function can do kernel SVM (discussed below).
Now that we have fit SVM let’s see what we have. Use the names() function to see what the svmfit object has stored.
names(svmfit)
## [1] "call" "type" "kernel"
## [4] "cost" "degree" "gamma"
## [7] "coef0" "nu" "epsilon"
## [10] "sparse" "scaled" "x.scale"
## [13] "y.scale" "nclasses" "levels"
## [16] "tot.nSV" "nSV" "labels"
## [19] "SV" "index" "rho"
## [22] "compprob" "probA" "probB"
## [25] "sigma" "coefs" "na.action"
## [28] "fitted" "decision.values" "terms"
You can read about these in the ?svm documentation. One value that might be of interest is the predictions for the training points (also called the fitted values).
svmfit$fitted[1:5]
## 1 2 3 4 5
## 1 -1 1 1 1
## Levels: -1 1
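Since we fit a linear SVM with scale=FALSE, the fitted object also contains everything needed to reconstruct the separating hyperplane. Here is a rough sketch of recovering the normal vector and intercept from the support vectors (e1071 handles the sign convention for the two class labels internally):

# sketch: recover the linear SVM normal vector and intercept from svmfit
# (this recipe applies to kernel='linear' with scale=FALSE)
w <- t(svmfit$coefs) %*% svmfit$SV   # normal vector (a 1 x 2 matrix here)
b <- -svmfit$rho                     # intercept
# the decision value for a new point x is w %*% x + b; its sign gives the predicted side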
To use SVM to make predictions on new data we can use the predict function, i.e.
# this is equivalent to svmfit$fitted
train_predictions <- predict(svmfit, newdata = train)
train_predictions[1:5]
## 1 2 3 4 5
## 1 -1 1 1 1
## Levels: -1 1
Ok let’s see how SVM did on the training data
train %>%
mutate(y_pred= train_predictions) %>%
summarise(error = mean(y != y_pred))
## # A tibble: 1 × 1
## error
## <dbl>
## 1 0.1575
And how about the test set?
test %>%
mutate(y_pred = predict(svmfit, newdata = test)) %>%
summarise(error = mean(y != y_pred))
## # A tibble: 1 × 1
## error
## <dbl>
## 1 0.162
In reality we would have first done cross-validation to select \(C\).
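As a sketch of what that looks like, the tune() function from e1071 will run cross-validation over a grid of cost values (10-fold CV by default; the grid below is just illustrative):

# select the cost parameter C by cross-validation with e1071::tune()
cv_svm <- tune(svm, y ~ ., data = train,
               kernel = 'linear',
               type = 'C-classification',
               scale = FALSE,
               ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))

summary(cv_svm)                 # CV error for each value of cost
best_svm <- cv_svm$best.model   # SVM refit on all the training data with the best cost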
The fact that there are good, open source implementations of SVM is something you should take a minute to appreciate. The e1071 package is written in R and does not do any heavy lifting; it calls the LIBSVM package, which is written in C. C has some benefits over R – it is typically faster and allows for better memory handling. C, however, is harder to write code in.
Many popular machine learning algorithms are coded in a lower level language like C. They then wrapped in a higher level language like R or Python so they can be used with minimal headache.
This means some kind soul (more likely a wretched grad student) took the time to code up SVM, get it working correctly and quickly, and package it so the rest of us can use it.
This saves you, the user, a lot of time and money. A machine learning PhD student could probably do all of this themselves, but it would likely take them a couple weeks to get the code working correctly and quickly. You could also hire someone to do this, but it would be pricey.
These open source software implementations mean that instead of spending weeks and/or lots of money, you have access to a quality implementation of SVM in a matter of seconds for free. I would guess open source software is a big, unsung driver behind the explosion of big data.
Like all things in life, there are tradeoffs to using open source software.
The documentation for open source software can be poor (again no financial incentive to make it clear).
Approximating complex patterns with simple patterns is a very powerful idea in mathematics. Often this comes down to approximating a non-linear thing with a linear thing (curves are hard, lines are easy!). Linear regression is a very effective regression algorithm even though many relationships are not in fact linear. Similarly, linear classifiers (nearest centroid, LDA, SVM, etc) are very effective classifiers even though many classification problems are not linear. Sometimes, however, the linear approximation is not good enough and we need a more sophisticated pattern.
Recall we were able to turn linear regression into non-linear regression by adding transformations of the original variables. For example, instead of linear regression of \(y\) on \(x\) (y ~ x) we added polynomial \(x\) terms (e.g. y ~ x + x^2 + x^3). The resulting model is linear in the added variables, but non-linear in the original variables. We could have added any function of \(x\) we wanted e.g. \(e^{4.3 x}, \sin(8.3 x), \max(x, 1),\) etc. This idea also works for classification.
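As a small reminder of what adding polynomial terms looks like in R (using a hypothetical data frame df with columns x and y, not part of this lecture's data):

# sketch: polynomial terms in a linear regression
# (df is a hypothetical data frame with columns x and y)
lm_linear <- lm(y ~ x, data = df)
lm_cubic  <- lm(y ~ x + I(x^2) + I(x^3), data = df)  # I() protects ^ inside a formula
# equivalently: lm(y ~ poly(x, 3, raw = TRUE), data = df)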
We can use non-linear variable transformations to turn a linear classifier into a non-linear classifier.
# some training data
train <- gmm_distribution2d(n_neg=200, n_pos=201, mean_seed=238, data_seed=1232)
# test grid
test_grid <- expand.grid(x1 = seq(-5, 5, length = 100),
                         x2 = seq(-5, 5, length = 100)) %>%
    as_tibble()
# fit SVM
svm_linear <- svm(y ~ .,
                  data=train,
                  scale=FALSE,
                  type='C-classification',
                  shrinking=FALSE,
                  kernel='linear',
                  cost=10)
grid_predictions <- predict(svm_linear, newdata = test_grid)
Now let’s manually add some polynomial terms.
# add polynomial terms to the data
train_poly <- train %>%
    mutate(x1_sq = x1^2, x1x2 = x1*x2, x2_sq = x2^2)

test_grid_poly <- test_grid %>%
    mutate(x1_sq = x1^2, x1x2 = x1*x2, x2_sq = x2^2)
# fit SVM
svm_poly <- svm(y ~ .,
                data=train_poly,
                scale=FALSE,
                type='C-classification',
                shrinking=FALSE,
                kernel='linear',
                cost=10)
grid_poly_predictions <- predict(svm_poly, newdata = test_grid_poly)
# The above is equivalent to using R's formula notation, as long as we wrap the
# squared terms in I() so that ^ is not interpreted as a formula operator
# svm_poly <- svm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2,
#                 data=train,
#                 scale=FALSE,
#                 type='C-classification',
#                 shrinking=FALSE,
#                 kernel='linear',
#                 cost=10)
#
# grid_poly_predictions <- predict(svm_poly, newdata = test_grid)
Why stop at degree two polynomials – why not add 3, 4, … 412, etc? Two issues come up when we add more and more non-linear variables:

- overfitting: the more flexible the classifier, the more it can chase noise in the training data
- computation: the number of variables (and therefore the cost of fitting the model) grows rapidly
We can use cross-validation to (attempt to) solve the problem of overfitting. There is not an immediately obvious way to fix the computational problem. Fortunately, there is a mathematical trick that can significantly reduce the computational cost.
See section 9.3.2 from ISLR for the details about Kernel methods.
Kernel methods (also known as the kernel trick) reformulate the mathematical problem of fitting a linear classifier to many non-linear variables into an easier to solve problem. This may not sound like much, but it makes more sophisticated algorithms usable in many settings where they would otherwise not be feasible. The key idea is that fitting SVM depends on the data only through a measure of similarity (or distance) between pairs of data points, and a kernel computes this similarity directly without ever explicitly constructing the non-linear variables.
This means that in order to fit SVM we only need to compute the distance between each pair of data points. There are two important consequences:

- non-linear versions of SVM become much cheaper to compute
- SVM can be applied to any kind of data for which we can define a similarity (kernel) between pairs of objects
The first point is the focus of this lecture: kernels = easier to compute non-linear version of SVM.
The second point means we can apply SVM to non-standard data such as strings, graphs and images (https://en.wikipedia.org/wiki/Digital_image_processing). For example, there are ways of measuring how similar two strings are (called string kernels). If we have a bunch of strings (e.g. Yelp restaurant reviews) and a bunch of labels (e.g. good/bad) we can use an SVM with a string kernel to build a string classifier. This idea can be applied in many settings such as protein classification (DNA can be viewed as a long string). Other kernel examples include image kernels and graph kernels.
Suppose we have \(n\) data points \(x_1, \dots, x_n\) and \(d\) variables (i.e. \(x_i \in \mathbb{R}^d\)). A kernel is a function that takes two data vectors and computes a measure of distance (or equivalently a measure of similarity) between the two points. For example, the degree \(m\) polynomial kernel is defined as
\[K(a, b) = (a^T b + 1)^m\]
where \(a, b \in \mathbb{R}^d\) are two vectors and \(a^Tb\) is the dot product between the two vectors. Sometimes the polynomial kernel is defined with another parameter \(c \in \mathbb{R}\) that replaces the 1 i.e. \((a^T b + c)^m\). Or even a third parameter \(\gamma \in \mathbb{R}\) that scales the dot product i.e. \((\gamma a^T b + c)^m\).
Using a degree \(m\) polynomial kernel on the original data set is equivalent to adding all polynomial transformations up to degree \(m\) of the original data. It’s not obvious this is true, but the wikipedia page on polynomial kernels illustrates this fact with a degree two example.
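Here is a small numerical sketch of that equivalence for the degree 2 polynomial kernel in two dimensions (the feature map phi below is the standard one from that example):

# degree m polynomial kernel, computed in the original space
poly_kernel <- function(a, b, m) (sum(a * b) + 1)^m

# explicit degree 2 feature map for x = (x1, x2)
phi <- function(x) c(x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2],
                     sqrt(2) * x[1], sqrt(2) * x[2], 1)

a <- c(1, 2)
b <- c(-3, 0.5)

poly_kernel(a, b, m = 2)   # kernel evaluated in the original 2d space
sum(phi(a) * phi(b))       # dot product in the transformed 6d space -- same number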
This brief section requires some knowledge of computational complexity (e.g. big O notation) to appreciate. You can skip this section if you want.
Why does this help? Suppose we have \(d\) variables. Suppose we add a ton of transformed variables into the model and now have \(D\) variables in the model where \(d < D\). For example, if we add all quadratic transformations of our variables we will have \(D = \frac{d(d+1)}{2}\) variables in the model. Computing the distance between two data points in the transformed space requires \(O(D)=O(d^2)\) computations (e.g. a dot product in \(\mathbb{R}^D\)). However, computing the polynomial above requires \(O(d)\) computations (e.g. a dot product in \(\mathbb{R}^d\)).
For a quadratic kernel we go from \(O(d^2)\) to \(O(d)\) to compute distances between data points. For larger degree polynomials \(D\) is an even worse function of \(d\) (not hard to work out exactly what it is). However, computing the polynomial kernel is always an \(O(d)\) operation.
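To get a feel for how quickly \(D\) grows: the number of monomials of degree at most \(m\) in \(d\) variables (counting the constant and lower-degree terms too) is choose(d + m, m).

# sketch: how many polynomial features of degree <= m there are in d variables
n_poly_features <- function(d, m) choose(d + m, m)

n_poly_features(d = 10, m = 2)    # 66
n_poly_features(d = 10, m = 5)    # 3003
n_poly_features(d = 100, m = 5)   # 96560646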
If you want to learn more about how the SVM optimization problem is solved with kernels, read sections 8 and 9 from Andrew Ng’s notes and the Sequential Minimal Optimization (SMO) paper.
Thanks to the e1071 package it’s easy to use kernel SVMs in R. Below we fit SVM with a degree 2 polynomial kernel.
# fit SVM
svm_kern2 <- svm(y ~ .,
                 data=train,
                 cost=10,
                 kernel='polynomial', # use a polynomial kernel
                 degree=2,            # degree two polynomial
                 gamma=1,             # other kernel parameter
                 coef0=1,             # other kernel parameter
                 scale=FALSE,
                 type='C-classification',
                 shrinking=FALSE)
kern2_predictions <- predict(svm_kern2, newdata = test_grid)
The above plot should look the same as the previous plot when we explicitly added degree two variable transformations into the data. What happens when we try a degree 100 polynomial kernel?
There are many kernel functions you might use. Probably the most popular kernel is the radial basis (aka Gaussian kernel) which you will try on the homework. Different kernels correspond to different explicit variable transformations.
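For reference, here is a sketch of what a radial basis kernel fit looks like with e1071 (the gamma and cost values below are arbitrary illustrative choices, not tuned):

# fit SVM with a radial basis (Gaussian) kernel
svm_rbf <- svm(y ~ .,
               data=train,
               kernel='radial', # radial basis kernel
               gamma=1,         # kernel bandwidth parameter (illustrative value)
               cost=10,
               scale=FALSE,
               type='C-classification',
               shrinking=FALSE)

rbf_predictions <- predict(svm_rbf, newdata = test_grid)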
So how do you select the parameters for the kernel? The basic polynomial kernel has one parameter (the degree), but it can have up to three parameters (adding \(\gamma\) and the constant \(c\) from above). You still need to select the cost parameter \(C\). Typically, cross-validation (specifically grid-search cross-validation) is used to pick all of the parameters. In particular:

- pick a grid of candidate values for each tuning parameter
- compute the cross-validation error for every combination of values in the grid
- select the combination with the best cross-validation performance
This should feel somewhat clunky – because it is. There are other fancier ways of selecting many tuning parameters (e.g. Bayesian optimization), but cross-validation is used fairly often.
The code to do this will start to get a little uglier. Luckily someone made an R package that tunes many machine learning models.
The caret package is built to make fitting models much easier (caret stands for Classification And REgression Training). The package even comes with a book on how to use it (if you want to learn more about the package I’d start with chapter 5). caret works as an interface to functions that have already been implemented in other packages (e.g. e1071::svm()).
Before going forward spend a minute outlining how you would code up cross-validation for the polynomial kernel SVM described above with two tuning parameters (e.g. what goes into the three loops, what variables do you have to create ahead of time, etc).
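One possible outline looks roughly like the sketch below (a sketch only: the fold count, svm() settings and parameter grids mirror the caret setup used next, while the fold assignment and seed are illustrative).

# sketch: grid-search cross-validation for polynomial kernel SVM by hand
cost_values   <- c(.01, .1, 1, 10, 100)
degree_values <- c(1, 5, 10, 20)
n_folds <- 5

# randomly assign each training point to one of the folds
set.seed(3443)  # arbitrary seed, just for reproducibility
folds <- sample(rep(1:n_folds, length.out = nrow(train)))

# one row per combination of tuning parameters
cv_results <- expand.grid(cost = cost_values, degree = degree_values)
cv_results$error <- NA

for (i in 1:nrow(cv_results)) {
    fold_errors <- rep(NA, n_folds)

    for (k in 1:n_folds) {
        cv_train <- train[folds != k, ]  # train on all folds but the k-th
        cv_test  <- train[folds == k, ]  # evaluate on the held-out fold

        fit <- svm(y ~ ., data = cv_train,
                   cost = cv_results$cost[i],
                   kernel = 'polynomial',
                   degree = cv_results$degree[i],
                   gamma = 1, coef0 = 1,
                   scale = FALSE, type = 'C-classification',
                   shrinking = FALSE)

        fold_errors[k] <- mean(predict(fit, newdata = cv_test) != cv_test$y)
    }

    # average the misclassification error over the folds
    cv_results$error[i] <- mean(fold_errors)
}

# the tuning parameter combination with the smallest cross-validation error
cv_results[which.min(cv_results$error), ]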
The code below uses the caret package to do this procedure much more succinctly.
# break the data frame up into separate x and y data
train_x <- train %>% select(-y)
train_y <- train$y
The two caret functions to pay attention to are trainControl() and train().
# specify tuning procedure
trControl <- trainControl(method = "cv", # perform cross validation
                          number = 5)    # use 5 folds
# the values of tuning parameters to look over in cross validation
# C: cost parameter
# degree: polynomial degree
# scale: another polynomial kernel parameter -- we don't care about it today
tune_grid <- expand.grid(C=c(.01, .1, 1, 10, 100),
                         degree=c(1, 5, 10, 20),
                         scale=1)
# fit the SVM model
tuned_svm <- train(x=train_x,
                   y=train_y,
                   method = "svmPoly",     # polynomial kernel SVM (via the kernlab package)
                   tuneGrid = tune_grid,   # tuning parameters to look at
                   trControl = trControl,  # tuning procedure defined above
                   metric='Accuracy')      # what classification metric to use
tuned_svm
## Support Vector Machines with Polynomial Kernel
##
## 401 samples
## 2 predictor
## 2 classes: '-1', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 320, 321, 321, 321, 321
## Resampling results across tuning parameters:
##
## C degree Accuracy Kappa
## 1e-02 1 0.9252160 0.8504163
## 1e-02 5 0.9476235 0.8952559
## 1e-02 10 0.9401235 0.8802318
## 1e-02 20 0.9376235 0.8752318
## 1e-01 1 0.9301543 0.8602973
## 1e-01 5 0.9326543 0.8652823
## 1e-01 10 0.9376852 0.8753659
## 1e-01 20 0.9251852 0.8503659
## 1e+00 1 0.9301852 0.8603659
## 1e+00 5 0.9401235 0.8802318
## 1e+00 10 0.9376852 0.8753478
## 1e+00 20 0.9177160 0.8354374
## 1e+01 1 0.9276852 0.8553659
## 1e+01 5 0.9525926 0.9051784
## 1e+01 10 0.9202160 0.8403952
## 1e+01 20 0.9152160 0.8304374
## 1e+02 1 0.9276852 0.8553659
## 1e+02 5 0.9575926 0.9151784
## 1e+02 10 0.9078086 0.8156098
## 1e+02 20 0.9127469 0.8254878
##
## Tuning parameter 'scale' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 5, scale = 1 and C = 100.
The trainControl object from the caret package sets the tuning procedure. In our case we are using 5 fold cross-validation.

tune_grid is the grid of tuning parameters we want to search over. Warning: the names of the columns of the tune_grid data frame are important; they need to match the tuning parameter names of the underlying function. For example, svmPoly calls kernlab::ksvm, so the column names need to match that function’s tuning parameter names (in this case: C, degree, scale).
You can find more information about the tuning parameters in this section of the book.
Main arguments to train

- method = "svmPoly" says use SVM with a polynomial kernel. caret then uses the ksvm() function from the kernlab package.
- tuneGrid = tune_grid tells train what tuning parameters to search over (defined above).
- trControl = trControl sets the tuning procedure (defined above).
- metric='Accuracy' tells caret to use the cross-validation accuracy to pick the optimal tuning parameters (this is equivalent to using the error rate).
The train function returns a fully trained model that also has some metadata about the tuning procedure (namely the average accuracy for each value of the tuning parameters). For example, to see the optimal values of the tuning parameters that the tuning procedure chose:
tuned_svm$bestTune
## degree scale C
## 18 5 1 100
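The full table of cross-validation results (one row per parameter combination in tune_grid, matching the summary printed above) is also stored in the fitted object:

# average CV accuracy for every tuning parameter combination
head(tuned_svm$results)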
We can now use the predict function to get predictions out of tuned_svm.
test_grid_pred <- predict(tuned_svm, newdata = test_grid)
See this post for some more example code using the caret package.
Why caret?

The caret package implements over 200 machine learning models from many different R packages. These models tend to follow similar patterns; for example, most models have:

- training data (x variables and y labels)
- one or more tuning parameters
- a fitting function and a predict function for new data
There are many ways of selecting the tuning parameters for a model that don’t depend on the model itself (for example cross-validation). The caret package does a couple of things:

- it gives a unified interface to models that are called in different ways: e1071::svm is called like svm(y ~ ., train), while class::knn(train=train_x, cl=train_y, ...) wants you to split the data up into separate x and y objects
- it has many bells and whistles already implemented (different flavors of CV, bootstrapping, different error metrics, etc)

Basically, the caret package makes use of modularity and abstraction, which makes your code more likely to be correct, faster to write, and faster to run.
As always, there are tradeoffs. For example, caret might do things you don’t want or expect. It also encourages you to think of model tuning as a black box. You might get a better model if you look more deeply at the tuning process (for example, plotting the train and cross-validation error curves).
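For example, caret can plot the cross-validation accuracy as a function of the tuning parameters, which is a quick way to look inside the tuning process:

# profile of CV accuracy over the tuning parameter grid
plot(tuned_svm)
# or the ggplot2 version
ggplot(tuned_svm)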
The obvious way of measuring how well a classifier is doing is the misclassification error rate i.e. \(1 - \frac{\text{number classified correctly}}{\text{total}}\). There are a number of other measures of classifier performance. For example, in some applications the classes are very unbalanced e.g. 99.9% of the data are in one class and 0.1% of the data are in the other class. The naive classifier that assigns every point to the larger class would look like a great classifier as judged by misclassification error, even though it never identifies the rare class.
Here are some other standard classification error metrics:

- the confusion matrix (counts of true/false positives and true/false negatives)
- precision and recall (recall is also called sensitivity)
- specificity
- the F1 score
- ROC curves and the area under the ROC curve (AUC)
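Many of these are computed in one shot by caret’s confusionMatrix() function. A sketch using the linear SVM and test set from earlier (this assumes svmfit and test are still in the workspace):

# confusion matrix and related metrics for the linear SVM's test predictions
test_pred <- predict(svmfit, newdata = test)
confusionMatrix(data = test_pred, reference = test$y)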