The support vector machine (SVM) is one of the most popular and effective classification algorithms. Like nearest centroid, SVM is a linear classifier. Recall we were able to turn linear regression into non-linear regression by explicitly adding more variables using non-linear variable transformations (e.g. polynomial terms). Similarly, we can turn a linear classifier into a non-linear classifier by adding non-linear terms.

There is a way of automatically adding non-linear variables into a linear classifier by doing something called the kernel trick.

This lecture will cover

  • maximal margin classifier (aka hard margin SVM)
  • soft margin SVM
  • Kernel SVM
  • other classification metrics than just misclassification error

Hard margin SVM is useful to help understand soft margin SVM, but is not commonly used (although sometimes hard margin SVM works well for very high-dimensional data). These notes present the intuition behind soft margin SVM and point to reading for the mathematical details. Classification accuracy is an important metric, but there are other classification metrics you should be aware of.

library(e1071) # implements SVM
library(kernlab) # implements kernel SVM

# for model tuning
library(caret)


library(mvtnorm)
library(tidyverse)

# some helper functions I wrote
source('synthetic_distributions.R')
source('svm_fun.R')

Takeaway and Resources

The point of this lecture is to introduce the intuitive ideas behind SVM and kernels and show you how to use them in R. The math underlying these algorithms is explained well in the resources listed below.

For more details see

Maximal Margin

The maximal margin (MM) classifier (also known as the hard margin support vector machine) assumes the two classes of data are linearly separable, i.e. we can draw a line such that every data point in the first class is on one side and every point in the other class is on the other side of the line.

Recall a linear classifier makes predictions by separating the data space into two regions by a hyperplane. When the data are two dimensional, as above, this means a linear classifier draws a line in the plane.

When the data are separable, a reasonable aspiration is to have the separating hyperplane lie between the two classes, i.e. one class is entirely on one side of the hyperplane and the other class is entirely on the other side. Call the distance between the hyperplane and the nearest data point the margin.

The MM hyperplane is allergic to data points; MM seeks to find the hyperplane that is furthest away from the nearest data point. In other words MM finds the hyperplane that maximizes the margin.

In the plot above the MM separating hyperplane is shown as the solid line. The red arrows show the margin width – the distance between the separating hyperplane and the nearest point. The dashed lines are the so called marginal hyperplanes. These marginal hyperplanes are parallel to the separating hyperplane. The highlighted points are called support vectors and are the points that are closest to the separating hyperplane (all three are equidistant to the separating hyperplane and lie on the marginal hyperplanes).

The support vectors play an important role in understanding MM and SVM in general. The MM normal vector and intercept depend only on the support vectors – none of the other data points. In contrast, nearest centroid depends on all the data points.

We can write the maximal margin classifier as an optimization problem: of all hyperplanes that separate the two classes, find the one that maximizes the margin. All of the aforementioned resources discuss how to write this as a math problem then solve the math problem.
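To give a flavor of what that math problem looks like, here is a sketch of one standard formulation (roughly following the notation of ISLR chapter 9, with labels \(y_i \in \{-1, +1\}\), normal vector \(w\) and intercept \(b\)):

\[\max_{w, b, M} \; M \quad \text{subject to} \quad \|w\| = 1 \quad \text{and} \quad y_i(w^T x_i + b) \geq M \; \text{ for } i = 1, \dots, n.\]

The constraint \(\|w\| = 1\) makes \(y_i(w^T x_i + b)\) the signed distance from point \(i\) to the hyperplane, so maximizing \(M\) maximizes the margin.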

Soft Margin Support Vector Machine

Linear separability is a strong assumption that is typically not true for real data sets; often the two classes cannot be separated by a hyperplane.

Soft margin support vector machine (which we will call just SVM) is an adaptation of the maximal margin classifier that is cool with data points lying on the wrong side of the separating hyperplane.

Keep the following intuition in mind for this section. A good linear classifier should aim to put points

  • on the correct side of the separating hyperplane far away from the separating hyperplane
  • on the wrong side of the separating hyperplane close to the separating hyperplane

Soft margin SVM (words)

See ISLR chapter 9 for details about SVM.

Maximizing the margin is about putting points as far on the correct side of the separating hyperplane as possible. Soft margin SVM allows some “bad” points to be on the wrong side of the separating hyperplane, but penalizes these bad points. SVM then wants to put the remaining “good” points as far on the correct side of the margin as possible.

SVM thus has two competing objectives: maximize the margin, but penalize disobedient points. SVM has a devil on one shoulder and an angel on the other. Like many machine learning algorithms, SVM tries to strike a balance between two competing objectives.

In order to strike this balance between the two competing objectives, SVM comes with a tuning parameter, \(C >0\). This tuning parameter, \(C\), controls how much SVM cares about the “bad” points. When \(C\) is large SVM tries very hard to put every training point on the correct side of the separating hyperplane. When \(C\) is small SVM has more chill about points being on the wrong side of the separating hyperplane.

All of the above mentioned resources discuss how to write this as a math problem then solve the math problem.
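As a sketch, one common way to write the soft margin problem (the “cost” formulation used by LIBSVM/e1071, which matches the convention above that large \(C\) means strict) is

\[\min_{w, b, \epsilon} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \epsilon_i \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0,\]

where the slack variable \(\epsilon_i\) measures how badly point \(i\) violates its marginal hyperplane (\(\epsilon_i = 0\) for obedient points).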

Tuning SVM

The plots below show the SVM fit for three different values of C (C=100, .01, .0001). The solid line shows the SVM separating hyperplane. The dashed lines show the SVM marginal hyperplanes which play an important role in the inner workings of SVM (these do not have the same interpretation as the marginal hyperplanes for the MM classifier).

Notice that as \(C\) decreases more points are misclassified. This is another example of the bias-variance tradeoff (recall chapter 2 from ISLR). Large values of \(C\) make the classifier try really hard to not misclassify anyone – the so called low bias, high variance regime (leads to overfitting). Small values of \(C\) are more ok with misclassified points – high bias, low variance. As with many things in life the key is to find the right balance.

The best value of \(C\) is typically selected using cross-validation.

SVM code

Unlike linear regression, or nearest centroid, SVM cannot be solved in closed form (i.e. there is no simple formula using the data to get the normal vector and intercept). Fitting SVM requires solving a quadratic program which you could probably do after taking one course in optimization. Luckily for us, people have already coded up SVM.

We will use the e1071 package to implement SVM. The kernlab package also implements SVM in R, but for some reason I prefer e1071.

First let’s sample some training data and some test data (same as the data shown above).

# this function comes from the synthetic_distributions.R script sourced above
train <- two_class_guasssian_meatballs(n_pos=200, n_neg=200,
                                       mu_pos=c(1,0), mu_neg=c(-1,0),
                                       sigma_pos=diag(2), sigma_neg=diag(2),
                                       seed=103)

train
## # A tibble: 400 × 3
##             x1         x2      y
##          <dbl>      <dbl> <fctr>
## 1   2.15074931 -0.4792060      1
## 2  -0.08786829  1.6166332      1
## 3   1.76506913 -0.1514896      1
## 4   0.61865671  1.2622023      1
## 5   1.07803836 -0.5458768      1
## 6   0.85736662  0.3856852      1
## 7   1.75866283  0.3809878      1
## 8   2.21017318 -2.4745634      1
## 9   0.96406375  0.3872157      1
## 10 -0.60946315  1.7316120      1
## # ... with 390 more rows
test <- two_class_guasssian_meatballs(n_pos=1000, n_neg=1000,
                                       mu_pos=c(1,0), mu_neg=c(-1,0),
                                       sigma_pos=diag(2), sigma_neg=diag(2),
                                       seed=634)

Before fitting SVM we have to decide on the value of the tuning parameter to use – let’s stick with \(C = 10\) for now. The svm() function is from the e1071 package. The function call might look intimidating – there are an annoying number of arguments being set, which are discussed below. Pay attention to the three lines with comments next to them – these are the important lines.

# fit SVM
svmfit <- svm(y ~ ., # R's formula notation
              data=train, # data frame to use
              cost=10, # set the tuning parameter
              scale=FALSE,
              type='C-classification',
              shrinking=FALSE,
              kernel='linear') 

Main arguments

  • data=train says fit SVM using the data stored in the train data frame.

  • The svm() function uses R’s formula notation. Recall from linear regression y ~ . means fit y on all the rest of the columns of data. We could have equivalently used y ~ x1 + x2.

  • cost = 10 fixes the tuning parameter \(C\) to 10. The tuning parameter \(C\) is also sometimes called a cost parameter.

  • shrinking=FALSE I’m not sure what this does, but I don’t want anything extra to happen so I told it to stop.


Other arguments Check out the documentation to read more about the arguments (i.e. run ?svm). The svm() function can do a lot of stuff which is why it has so many arguments.

  • scale = FALSE says please do not center and scale our data. svm() applies some preprocessing to the data by default. While preprocessing (e.g. center and scale) is often a good thing to do, I strongly disagree with making this the default behavior.

  • type='C-classification' tells svm() to do classification. It turns out [SVM can be used to do other things than classification](http://kernelsvm.tripod.com/).

  • kernel='linear' says do linear SVM. The svm() function can do kernel SVM (discussed below).

Now that we have fit SVM let’s see what we have. Use the names() function to see what the svmfit object has stored.

names(svmfit)
##  [1] "call"            "type"            "kernel"         
##  [4] "cost"            "degree"          "gamma"          
##  [7] "coef0"           "nu"              "epsilon"        
## [10] "sparse"          "scaled"          "x.scale"        
## [13] "y.scale"         "nclasses"        "levels"         
## [16] "tot.nSV"         "nSV"             "labels"         
## [19] "SV"              "index"           "rho"            
## [22] "compprob"        "probA"           "probB"          
## [25] "sigma"           "coefs"           "na.action"      
## [28] "fitted"          "decision.values" "terms"

You can read about these in the ?svm documentation. One value that might be of interest is the predictions for the training points, i.e. the fitted values.

svmfit$fitted[1:5]
##  1  2  3  4  5 
##  1 -1  1  1  1 
## Levels: -1 1
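Two other entries worth a look are SV and index, which store the support vectors themselves and their row numbers in the training data (recall only these points determine the fit):

# the support vectors and where they live in the training data
dim(svmfit$SV)       # one row per support vector
head(svmfit$index)   # row numbers of the support vectors in train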

To use SVM to make predictions on new data we can use the predict function, i.e.

# this is equivalent to svmfit$fitted
train_predictions <- predict(svmfit, newdata = train)
train_predictions[1:5] 
##  1  2  3  4  5 
##  1 -1  1  1  1 
## Levels: -1 1

Ok let’s see how SVM did on the training data

train %>% 
    mutate(y_pred= train_predictions) %>% 
    summarise(error = mean(y != y_pred))
## # A tibble: 1 × 1
##    error
##    <dbl>
## 1 0.1575

And how about the test set?

test %>% 
    mutate(y_pred = predict(svmfit, newdata = test)) %>% 
    summarise(error = mean(y != y_pred))
## # A tibble: 1 × 1
##   error
##   <dbl>
## 1 0.162

In reality we would have first done cross-validation to select \(C\).
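As a sketch of what that might look like, e1071 ships a tune() helper that does k-fold cross-validation (10 folds by default) over a grid of tuning parameter values; the exact grid below is just an illustration.

# cross-validate the cost parameter C for linear SVM
tune_out <- tune(svm, y ~ ., data = train,
                 kernel = 'linear',
                 type = 'C-classification',
                 scale = FALSE,
                 shrinking = FALSE,
                 ranges = list(cost = 10^seq(-3, 3, by = 1)))

summary(tune_out)   # cross-validation error for each value of C
tune_out$best.model # the SVM refit on all the training data with the best C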

Software tangent

The fact that there are good, open source implementations of SVM is something you should take a minute to appreciate. The e1071 package is written in R and does not do any heavy lifting; it calls the LIBSVM package which is written in C. C has some benefits over R – it is typically faster and allows for better memory handling. C, however, is harder to write code in.

Many popular machine learning algorithms are coded in a lower level language like C. They are then wrapped in a higher level language like R or Python so they can be used with minimal headache.

This means some kind soul (more likely a wretched grad student) took the time to code up

  1. a good implementation of SVM in C and then release it to the public
  2. a package in R (and Python and many other languages) so that we data scientists don’t have to learn C to use the aforementioned C implementation of SVM

This saves you, the user, a lot of time and money. A machine learning PhD student could probably do all of this themselves, but it would likely take them a couple weeks to get the code working correctly and quickly. You could also hire someone to do this, but it would be pricey.

These open source software implementations mean that instead of spending weeks and/or lots of money, you have access to a quality implementation of SVM in a matter of seconds for free. I would guess open source software is a big, unsung driver behind the explosion of big data.

Like all things in life, there are tradeoffs to using open source software.

  • There can be bugs in open source software – no one has a financial incentive to thoroughly test the code.
    • Of course there can be bugs in professional software. Also the more people use a piece of software, the more likely a bug is to be caught.
  • The documentation for open source software can be poor (again no financial incentive to make it clear).

  • You don’t have control over design choices.
    • Your favorite SVM package has bells and whistles 1 - 5, but you want bell and whistle number 6? You’re SOL since you didn’t write the source code.

Non-linear classifiers

Approximating complex patterns with simple patterns is a very powerful idea in mathematics. Often this comes down to approximating a non-linear thing with a linear thing (curves are hard, lines are easy!). Linear regression is a very effective regression algorithm even though many relationships are not in fact linear. Similarly, linear classifiers (nearest centroid, LDA, SVM, etc) are very effective classifiers even though many classification problems are not linear. Sometimes, however, the linear approximation is not good enough and we need a more sophisticated pattern.

Explicit variable transformation

Recall we were able to turn linear regression into non-linear regression by adding transformations of the original variables. For example, instead of linear regression of \(y\) on \(x\) (y ~ x) we added polynomial \(x\) terms (e.g. y ~ x + x^2 + x^3). The resulting model is linear in the added variables, but non-linear in the original variables. We could have added any function of \(x\) we wanted e.g. \(e^{4.3 x}, \sin(8.3 x), \max(x, 1),\) etc. This idea also works for classification.
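In R this kind of explicit transformation looks something like the following (a sketch with a hypothetical data frame df containing columns y and x; the I() wrapper protects the arithmetic inside the formula).

# linear regression with explicit polynomial terms (hypothetical data frame df)
fit_poly <- lm(y ~ x + I(x^2) + I(x^3), data = df)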

We can use non-linear variable transformations to turn a linear classifier into a non-linear classifier.

# some training data
train <- gmm_distribution2d(n_neg=200, n_pos=201, mean_seed=238, data_seed=1232)

# test grid
test_grid <- expand.grid(x1 = seq(-5, 5, length = 100),
                         x2 = seq(-5, 5, length = 100)) %>% 
            as_tibble()

# fit SVM
svm_linear <- svm(y ~ ., 
                  data=train,
                  scale=FALSE,
                  type='C-classification',
                  shrinking=FALSE,
                  kernel='linear', 
                  cost=10)

grid_predictions <- predict(svm_linear, newdata = test_grid)

Now let’s manually add some polynomial terms.

# add polynomial terms to the training data
train_poly <- train %>% 
                mutate(x1_sq = x1^2, x1x2 = x1*x2, x2_sq = x2^2)


test_grid_poly <- test_grid %>% 
                    mutate(x1_sq = x1^2, x1x2 = x1*x2, x2_sq = x2^2)


# fit SVM
svm_poly <- svm(y ~ ., 
                  data=train_poly,
                  scale=FALSE,
                  type='C-classification',
                  shrinking=FALSE,
                  kernel='linear', 
                  cost=10)

grid_poly_predictions <- predict(svm_poly, newdata = test_grid_poly)


# The above is equivalent to using R's formula notation
# (note the I() wrappers and x1:x2 -- in a formula, x1^2 on its own is not
# interpreted as squaring)
# svm_poly <- svm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, 
#                   data=train,
#                   scale=FALSE,
#                   type='C-classification',
#                   shrinking=FALSE,
#                   kernel='linear', 
#                   cost=10)
# 
# grid_poly_predictions <- predict(svm_poly, newdata = test_grid)

Why stop at degree two polynomials – why not add 3, 4, … 412, etc? Two issues come up when we add more and more non-linear variables

  • overfitting: if we add enough non-linear terms we might overfit the data
  • computational cost: the more variables we add, the more computational costs we will incur (e.g. time + memory). If we add enough variables the computer may run out of memory and/or it might take until the end of the universe for the algorithm that fits SVM to stop.

We can use cross-validation to (attempt) to solve the problem of overfitting. There is not an immediately obvious way to fix the computational problem. Fortunately, there is a mathematical trick that can significantly reduce the computational cost.

Kernel (implicit variable transformation)

See section 9.3.2 from ISLR for the details about Kernel methods.

Kernel methods (also known as the kernel trick) reformulate the mathematical problem of fitting a linear classifier to many non-linear variables into an easier to solve problem. This may not sound like much, but it makes more sophisticated algorithms usable in many settings where they would otherwise not be feasible. The key idea is that

  • many algorithms (such as SVM) rely only on the distance between each pair of data points.

This means that in order to fit SVM we only need to compute the distance between each pair of data points. There are two important consequences

  1. If we can compute the distance between points cheaply then we can fit SVM more quickly.
  2. If we have the ability to compute a “distance” between pairs of objects then we can use SVM.

The first point is the focus of this lecture: kernels = easier to compute non-linear version of SVM.

The second point means we can apply SVM to non-standard data such as strings, graphs and [images](https://en.wikipedia.org/wiki/Digital_image_processing). For example, there are ways of measuring how similar two strings are (called string kernels). If we have a bunch of strings (e.g. Yelp restaurant reviews) and a bunch of labels (e.g. good/bad) we can use an SVM with a string kernel to build a string classifier. This idea can be applied in many settings such as protein classification (DNA can be viewed as a long string). Other kernel examples include image kernels and graph kernels.

Polynomial kernel

Suppose we have \(n\) data points \(x_1, \dots, x_n\) and \(d\) variables (i.e. \(x_i \in \mathbb{R}^d\)). A kernel is a function that takes two data vectors and computes a measure of distance (or equivalently a measure of similarity) between the two points. For example, the degree \(m\) polynomial kernel is defined as

\[K(a, b) = (a^T b + 1)^m\]

where \(a, b \in \mathbb{R}^d\) are two vectors and \(a^Tb\) is the dot product between the two vectors. Sometimes the polynomial kernel is defined with another parameter \(c \in \mathbb{R}\) that replaces the 1 i.e. \((a^T b + c)^m\). Or even a third parameter \(\gamma \in \mathbb{R}\) that scales the dot product i.e. \((\gamma a^T b + c)^m\).

Using a degree \(m\) polynomial kernel on the original data set is equivalent to adding all polynomial transformations up to degree \(m\) of the original data. It’s not obvious this is true, but the Wikipedia page on polynomial kernels illustrates this fact with a degree two example.
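Here is a small numerical sketch of that fact for two variables and degree two (the \(\sqrt{2}\) factors come from expanding the square):

# the degree 2 polynomial kernel, computed directly in the original space
a <- c(1, 2)
b <- c(3, -1)
k_direct <- (sum(a * b) + 1)^2

# the same number, via an explicit degree 2 feature expansion
phi <- function(x) c(1, sqrt(2)*x[1], sqrt(2)*x[2], x[1]^2, x[2]^2, sqrt(2)*x[1]*x[2])
k_explicit <- sum(phi(a) * phi(b))

c(k_direct, k_explicit) # identical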

Computational complexity

This brief section requires some knowledge of computational complexity (e.g. big O notation) to appreciate. You can skip this section if you want.

Why does this help? Suppose we have \(d\) variables. Suppose we add a ton of transformed variables into the model and now have \(D\) variables in the model where \(d < D\). For example, if we add all quadratic transformations of our variables we will have \(D = \frac{d(d+1)}{2}\) variables in the model. Computing the distance between two data points in the transformed space requires \(O(D)=O(d^2)\) computations (e.g. a dot product in \(\mathbb{R}^D\)). However, computing the polynomial above requires \(O(d)\) computations (e.g. a dot product in \(\mathbb{R}^d\)).

For a quadratic kernel we go from \(O(d^2)\) to \(O(d)\) to compute distances between data points. For larger degree polynomials \(D\) is an even worse function of \(d\) (not hard to work out exactly what it is). However, computing the polynomial kernel is always an order \(O(d)\) operations.
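As a concrete sketch (not how LIBSVM organizes the computation internally, but the same idea), the whole \(n \times n\) kernel matrix only ever touches the original \(d\) variables:

# n x n polynomial kernel matrix from an n x d data matrix X
poly_kernel_matrix <- function(X, m) {
    (X %*% t(X) + 1)^m # entry (i, j) is K(x_i, x_j)
}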

If you want to learn more about how the SVM optimization problem is solved with kernels, read sections 8 and 9 from Andrew Ng’s notes and the Sequential Minimal Optimization (SMO) paper.

Kernel SVM in R

Thanks to the e1071 package it’s easy to use Kernel SVMs in R. Below we fit SVM with a degree 2 polynomial kernel.

# fit SVM
svm_kern2 <- svm(y ~ ., 
                  data=train,
                  cost=10,
                  kernel='polynomial', # use a polynomial kernel
                  degree = 2, # degree two polynomial
                  gamma=1, # other kernel parameters
                  coef0 =1, # other kernel parameters
                  scale=FALSE,
                  type='C-classification',
                  shrinking=FALSE)

kern2_predictions <- predict(svm_kern2, newdata = test_grid)

The above plot should look the same as the previous plot when we explicitly added degree two variable transformations into the data. What happens when we try a degree 100 polynomial kernel?

There are many kernel functions you might use. Probably the most popular kernel is the radial basis (aka Gaussian kernel) which you will try on the homework. Different kernels correspond to different explicit variable transformations.
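For reference, the radial basis kernel is usually written as

\[K(a, b) = \exp\left(-\gamma \|a - b\|^2\right)\]

where \(\gamma > 0\) is a tuning parameter (this matches the gamma argument of svm() when kernel='radial').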

So how do you select the parameters for the kernel? The basic polynomial kernel has one parameter (the degree), though it can have up to three (the degree, \(\gamma\) and the constant \(c\)). On top of the kernel parameters you still need to select the cost parameter \(C\). Typically, cross-validation (specifically grid-search cross-validation) is used to pick all of the parameters. In particular

  1. Select a sequence of \(C\) values (e.g. \(C = 1e-5, 1e-4, \dots, 1e5\)).
  2. Select a sequence of degrees (e.g. \(d = 1, 2, 5, 10, 20, 50, 100\)).
  3. For each pair of \(C, d\) values (think of a grid) use cross-validation to estimate the test error (cross-validation over one tuning parameter has 2 for loops; adding a second tuning parameter makes it 3 for loops).
  4. Select the pair of \(C, d\) values that gives the best cross-validation error.

This should feel somewhat clunky – because it is (see the by-hand sketch below). There are other fancier ways of selecting many tuning parameters (e.g. Bayesian optimization), but cross-validation is used fairly often.
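Here is a rough by-hand sketch of the three nested loops described above (grid values chosen arbitrarily; assumes the train data frame from earlier):

# hand-rolled grid-search cross-validation for (C, degree)
C_values <- 10^seq(-2, 2, by = 1)
degree_values <- c(1, 2, 5, 10)
n_folds <- 5

folds <- sample(rep(1:n_folds, length.out = nrow(train)))
cv_error <- array(NA, dim = c(length(C_values), length(degree_values), n_folds))

for (i in seq_along(C_values)) {
    for (j in seq_along(degree_values)) {
        for (k in 1:n_folds) {
            fit <- svm(y ~ ., data = train[folds != k, ],
                       kernel = 'polynomial', degree = degree_values[j],
                       coef0 = 1, gamma = 1, cost = C_values[i],
                       scale = FALSE, type = 'C-classification', shrinking = FALSE)
            pred <- predict(fit, newdata = train[folds == k, ])
            cv_error[i, j, k] <- mean(pred != train$y[folds == k])
        }
    }
}

# average over folds, then pick the (C, degree) pair with the smallest error
apply(cv_error, c(1, 2), mean)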

The code to do this will start to get a little uglier. Luckily someone made an R package that tunes many machine learning models.

Tuning SVM with the caret package

The caret package is built to make fitting models much easier (caret stands for Classification And REgression Training). The package even comes with a book on how to use it (if you want to learn more about the package I’d start with chapter 5).

caret works as an interface to functions that have already been implemented in other packages (e.g. e1071::svm()).

Before going forward spend a minute outlining how you would code up cross-validation for the polynomial kernel SVM described above with two tuning parameters (e.g. what goes into the three loops, what variables do you have to create ahead of time, etc).

The code below uses the caret package to do this procedure much more succinctly.

# break the data frame up into separate x and y data
train_x <- train %>% select(-y)
train_y <- train$y

The two caret functions to pay attention to are trainControl() and train().

# specify tuning procedure
trControl <- trainControl(method = "cv", # perform cross validation
                          number = 5) # use 5 folds

# the values of tuning parameters to look over in cross validation
    # C: cost parameter
    # degree: polynomial degree
    # scale: another polynomial kernel parameter -- we don't care about it today
tune_grid <- expand.grid(C=c(.01, .1, 1, 10, 100),
                         degree=c(1, 5, 10, 20),
                         scale=1)

# fit the SVM model
tuned_svm <- train(x=train_x,
                   y=train_y,
                   method = "svmPoly", # SVM with a polynomial kernel (uses kernlab)
                   tuneGrid = tune_grid, # tuning parameters to look at
                   trControl = trControl, # tuning precedure defined above
                   metric='Accuracy') # what classification metric to use

tuned_svm
## Support Vector Machines with Polynomial Kernel 
## 
## 401 samples
##   2 predictor
##   2 classes: '-1', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 320, 321, 321, 321, 321 
## Resampling results across tuning parameters:
## 
##   C      degree  Accuracy   Kappa    
##   1e-02   1      0.9252160  0.8504163
##   1e-02   5      0.9476235  0.8952559
##   1e-02  10      0.9401235  0.8802318
##   1e-02  20      0.9376235  0.8752318
##   1e-01   1      0.9301543  0.8602973
##   1e-01   5      0.9326543  0.8652823
##   1e-01  10      0.9376852  0.8753659
##   1e-01  20      0.9251852  0.8503659
##   1e+00   1      0.9301852  0.8603659
##   1e+00   5      0.9401235  0.8802318
##   1e+00  10      0.9376852  0.8753478
##   1e+00  20      0.9177160  0.8354374
##   1e+01   1      0.9276852  0.8553659
##   1e+01   5      0.9525926  0.9051784
##   1e+01  10      0.9202160  0.8403952
##   1e+01  20      0.9152160  0.8304374
##   1e+02   1      0.9276852  0.8553659
##   1e+02   5      0.9575926  0.9151784
##   1e+02  10      0.9078086  0.8156098
##   1e+02  20      0.9127469  0.8254878
## 
## Tuning parameter 'scale' was held constant at a value of 1
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were degree = 5, scale = 1 and C = 100.

The trainControl() function from the caret package sets the tuning procedure. In our case we are using 5-fold cross-validation.

tune_grid is the grid of tuning parameters we want to search over. Warning: the names of the columns of the tune_grid data frame are important; they need to match the tuning parameter names of the underlying model. For example, svmPoly uses kernlab::ksvm, so the column names need to match the tuning parameters caret exposes for that model (in this case: C, degree, scale).

You can find more information about the tuning parameters in this section of the book.

Main arguments to train

  • method = "svmPoly" says use SVM with a polynomial kernel. caret then uses the ksvm() function from the kernlab package.

  • tuneGrid = tune_grid tells train what tuning parameters to search over (defined above)

  • trControl = trControl sets the tuning procedure (defined above)

  • metric='Accuracy' tells caret to use the cross-validation accuracy to pick the optimal tuning parameters (this is equivalent to using the error rate).

The train function returns a fully trained model along with some metadata about the tuning procedure (namely the average accuracy for each value of the tuning parameters). For example, to see the optimal values of the tuning parameters that the tuning procedure chose:

tuned_svm$bestTune
##    degree scale   C
## 18      5     1 100

We can now use the predict function to get predictions out of tuned_svm

test_grid_pred <- predict(tuned_svm, newdata = test_grid)

See this post for some more example code using the caret package.

Why caret?

The caret package implements over 200 machine learning models from many different R packages. These models tend to follow similar patterns, for example most models have:

  • x data
  • y data
  • tuning parameters.

There are many ways of selecting the tuning parameters for a model that don’t depend on the model (for example cross-validation). The caret package does a couple things

  • standardizes the functions (sometimes called a “uniform API”)
    • for example, e1071::svm is called like svm(y ~ ., train) while class::knn(train = train_x, test = test_x, cl = train_y, ...) wants you to split the data up into separate x and y objects
  • functionalizes common procedures
    • instead of you having to hand code cross-validation for every model you want to use you can just use one function from caret
    • the caret package has many bells and whistles already implemented (different flavors of CV, bootstrapping, different error metrics, etc)

Basically the caret package makes use of modularity and abstraction, which makes your code more likely to be correct, faster to write, and faster to run.

As always, there are tradeoffs. For example, caret might do things you don’t want or expect. It also encourages you to think of model tuning as a black box. You might get a better model if you look more deeply at the tuning process (for example, plotting the train and cross-validation error curves).

Classification metrics

The obvious way of measuring how well a classifier is doing is the misclassification error rate, i.e. \(1 - \frac{\text{number classified correctly}}{\text{total}}\). There are a number of other measures of classifier performance. For example, in some applications the classes are very unbalanced, e.g. 99.9% of the data are in one class and 0.1% of the data are in the other class. The naive classifier that assigns every point to the larger class would look like a great classifier as judged by misclassification error, even though it never identifies the rare class.
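To make that concrete, here is a tiny sketch with made-up labels:

# 999 points in the big class, 1 in the rare class
y <- factor(c(rep('big', 999), 'small'))
y_pred <- factor(rep('big', 1000), levels = levels(y))

mean(y != y_pred) # 0.001 -- looks great, but the rare class is never found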

Here are some other standard classification error metrics

Other references

CS109’s classification lab