A semi-technical look at Support Vector Machines

Support Vector Machines (SVM) are supervised learning models that separate data with hyperplanes. This analysis attempts to bridge the gap between intuition and rigor by covering a bit of the linear algebra behind SVM. To keep the presentation simple and short, we are going to stick with SVM in 2D, with our hyperplane (a line) passing through the origin. This way, we can create some nice visuals and be less rigorous about the math.

The intuition behind SVM is that we separate data with a hyperplane, as illustrated on the left. Here, we will review some linear algebra, define SVM and its error function, and lastly bridge the two with a semi-rigorous analysis.



Linear algebra review

First, let's consider a few equations. Eq (1) and (2) are just the definitions of our vectors: each component represents a magnitude in the direction of the corresponding basis vector. We define the vector dot product, written <u,v>, as Eq (3).

The norm (or modulus) of a vector in R^2 is defined by Eq (4). It represents the length of the vector.
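As a small illustration of the dot product and the norm, here is a minimal Python sketch. The helper names `dot` and `norm` and the sample vectors are chosen here for illustration only, and vectors are assumed to be plain (x, y) pairs.

```python
import math

def dot(u, v):
    """Dot product <u, v> of two 2D vectors given as (x, y) pairs."""
    return u[0] * v[0] + u[1] * v[1]

def norm(u):
    """Length of a 2D vector, computed as sqrt(<u, u>)."""
    return math.sqrt(dot(u, u))

# Sample vectors (illustrative values only).
u = (3.0, 4.0)
v = (1.0, 2.0)

print(dot(u, v))  # 3*1 + 4*2 = 11.0
print(norm(u))    # sqrt(9 + 16) = 5.0
```

Note that the norm is just the square root of a vector's dot product with itself, which is why the two definitions sit so close together.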


The projection of V onto U is defined as Eq (5), and is related to the vector dot product by Eq (6).

Intuitively, we can think of it as follows. First, drop a perpendicular from V onto U.

Then, the projection is the (signed) distance from the origin to the point where our perpendicular lands. In other words, it's the component of V in the direction of U.
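The relation between the projection and the dot product can be checked numerically. The sketch below assumes the projection is the signed length p = <u,v> / ||u||, so that <u,v> = p * ||u||. The function name `scalar_projection` and the sample vectors are illustrative, not taken from the equations above.

```python
import math

def dot(u, v):
    """Dot product of two 2D vectors given as (x, y) pairs."""
    return u[0] * v[0] + u[1] * v[1]

def norm(u):
    """Length of a 2D vector."""
    return math.sqrt(dot(u, u))

def scalar_projection(v, u):
    """Signed length of the component of v in the direction of u."""
    return dot(u, v) / norm(u)

u = (2.0, 0.0)   # points along the x-axis
v = (3.0, 4.0)
w = (-1.0, 5.0)  # leans away from u, so its projection is negative

p = scalar_projection(v, u)
print(p)                        # 3.0: the x-component of v
print(p * norm(u), dot(u, v))   # both 6.0
print(scalar_projection(w, u))  # -1.0
```

The negative value for `w` is the point to notice: the projection carries a sign, which is what later lets it tell the two sides of the hyperplane apart.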




Support Vector Machines

Now, we can talk about SVM. The goal is to maximize the margin between the hyperplane and the two known classes. Let's consider an arbitrary hyperplane separating two classes of data.

And, let's consider one of the training examples in particular.

We define θ as the vector perpendicular to the hyperplane, and the error function as the norm squared of θ. Let us also draw the vector representing our training example.


So the error function becomes (note that we add a factor of 1/2 to simplify the subsequent computations)


And the projection becomes


Using the equations from before, we can rearrange the conditions on the error function
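Written out, the steps above can take the following form. This is a sketch under stated assumptions rather than the exact original notation: θ is taken to be the vector perpendicular to the hyperplane, p^(i) denotes the projection of the i-th training example x^(i) onto θ, and the label convention y^(i) in {0, 1} is assumed here since the text does not fix one.

```latex
\begin{align}
  \min_{\theta}\; & \tfrac{1}{2}\,\lVert\theta\rVert^{2}
      = \tfrac{1}{2}\left(\theta_1^{2} + \theta_2^{2}\right) \\
  \langle \theta, x^{(i)} \rangle &= p^{(i)}\,\lVert\theta\rVert \\
  p^{(i)}\,\lVert\theta\rVert &\ge 1 \quad \text{if } y^{(i)} = 1 \\
  p^{(i)}\,\lVert\theta\rVert &\le -1 \quad \text{if } y^{(i)} = 0
\end{align}
```

The second line is the projection identity from the linear algebra review with u = θ and v = x^(i). The last two lines are the conditions on the error function, rewritten as a product of the projection and the norm of θ.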

And it's that last line that is really exciting! We can express the conditions on the error function as the product of two values that we can calculate: the projection of x onto θ, and the norm of θ. Now, we of course want to minimize the norm of θ, since the error function is built from it. But we also want the product to be large (at least 1). So, we have to maximize the projection of x onto θ. And the best way to do that is to point θ in the direction that maximizes the margin between the training examples and the hyperplane!
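To make this trade-off concrete, the following Python sketch compares two candidate directions for θ on a tiny made-up data set. For each direction it finds the smallest norm of θ that still satisfies the conditions, under the assumption that the product of projection and norm must be at least 1 for one class and at most -1 for the other. The function name `min_norm_for_direction` and the data points are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two 2D vectors given as (x, y) pairs."""
    return u[0] * v[0] + u[1] * v[1]

def min_norm_for_direction(direction, positives, negatives):
    """Smallest ||theta|| along `direction` such that p * ||theta|| >= 1
    for every positive example and <= -1 for every negative example.
    Returns None if the direction does not separate the two classes."""
    length = math.sqrt(dot(direction, direction))
    unit = (direction[0] / length, direction[1] / length)
    # Projections onto the unit direction; negatives are sign-flipped,
    # so a correctly separated example always yields a positive margin.
    margins = [dot(unit, x) for x in positives]
    margins += [-dot(unit, x) for x in negatives]
    smallest = min(margins)
    if smallest <= 0:
        return None
    # The example with the smallest projection dictates the norm.
    return 1.0 / smallest

# Made-up training examples, one class on each side of the origin.
positives = [(2.0, 2.0), (3.0, 1.0)]
negatives = [(-2.0, -2.0), (-1.0, -3.0)]

good = min_norm_for_direction((1.0, 1.0), positives, negatives)
poor = min_norm_for_direction((1.0, 0.0), positives, negatives)
print(good, poor)  # the diagonal direction needs the smaller norm
```

With these points, the direction (1, 1) gives every example a larger projection than the direction (1, 0) gives the closest example, so a shorter θ is enough to satisfy the conditions. That is the large-margin preference in miniature.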


To summarize, we did a quick linear algebra review, defined the SVM error function, and then applied linear algebra to show that we can minimize the error by maximizing the margin.

Keep in mind that this analysis was only for 2D data, with our hyperplane passing through the origin. However, with more rigor, the intuition would still hold in the general case.