8. Support Vector Machines: Linearly Separable and Non-Separable Case, Discussion of the Solution
Key Concepts: Discriminant Function, Decision Boundary, and Margin
- Discriminant Function: The discriminant function is responsible for assigning input feature vectors to target classes. In a binary classification scenario, it assigns each vector to either class 1 or class 2.
- Decision Boundary: The decision boundary is a hypersurface that divides the space into two or more regions, each corresponding to a different class. The discriminant function defines this boundary: a linear discriminant function creates a hyperplane as a decision boundary, while a non-linear discriminant function can produce a more complex shape. The side of the boundary on which a new data point lies determines its classification.
- Margin: In the context of classifiers such as SVM, the margin is the distance between the decision boundary and the nearest data points from each class (known as support vectors). SVM aims to maximize this margin. A larger margin implies higher confidence in the decision boundary, since it lies as far as possible from the nearest examples of each class, which often improves the classifier's generalization, i.e., its performance on unseen data.
Here's how these concepts connect
- The discriminant function is used to define the decision boundary: points for which f(x) = 0.
- The margin is defined in relation to this decision boundary: it is the distance to the nearest points on either side of the boundary.
- In SVM, the parameters of the discriminant function are chosen to maximize this margin, helping the model to generalize better.
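To make the geometry concrete, here is a minimal numpy sketch, with made-up values for w and w_0, that evaluates a linear discriminant f(x) = w^T x + w_0 (defined formally in the next section) and computes the signed distance of a point from the decision boundary f(x) = 0; the absolute value of this distance is what the margin measures for the points nearest the boundary.

```python
import numpy as np

# Hypothetical linear discriminant f(x) = w^T x + w_0 in a 2-D feature space.
w = np.array([2.0, -1.0])
w0 = 0.5

def signed_distance(x, w, w0):
    """Signed Euclidean distance of x from the hyperplane f(x) = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

x_new = np.array([1.0, 3.0])
print(f"f(x) = {w @ x_new + w0:.2f}")                      # sign gives the predicted class
print(f"distance = {signed_distance(x_new, w, w0):.2f}")   # |distance| to the boundary
```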
Discriminant Function
In classification tasks, a discriminant function maps an input vector to one of several predetermined classes. Specifically, this function takes an input vector—typically a feature vector encapsulating different characteristics or attributes of a certain item or event—and assigns it to one of K classes. By doing this, the discriminant function helps to divide the feature space into decision regions, with each region corresponding to a specific class.
Linear discriminant functions, as used in linear classifiers (e.g., linear regression, logistic regression, and support vector machines), are among the simplest forms of discriminant functions. They take the form:

f(x) = w^T x + w_0

In this formula:
- x is the feature vector,
- w is the weight vector, which determines the orientation of the decision boundary,
- w^T is the transpose of w, and
- w_0 is the bias term, which shifts the position of the decision boundary.
This function delineates a hyperplane that separates the input space. In a binary classification task, points where f(x) > 0 might be designated to class 1, whereas points where f(x) < 0 could be classified as class 2. Learning algorithms (like SVM or logistic regression) seek to identify the optimal parameters w and w_0 that allow the discriminant function to classify new inputs accurately based on the training data.
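As a minimal illustration (not tied to any particular library), the sketch below evaluates this discriminant function for a batch of feature vectors and assigns classes by the sign of f(x); the weights are invented for the example, whereas in practice they are learned from training data.

```python
import numpy as np

def linear_discriminant(X, w, w0):
    """Evaluate f(x) = w^T x + w_0 for each row of X."""
    return X @ w + w0

def classify(X, w, w0):
    """Assign class 1 where f(x) > 0 and class 2 where f(x) < 0."""
    return np.where(linear_discriminant(X, w, w0) > 0, 1, 2)

# Illustrative parameters; a learning algorithm would fit these to training data.
w, w0 = np.array([1.0, 2.0]), -1.0
X = np.array([[0.0, 1.0], [2.0, -3.0]])
print(classify(X, w, w0))  # -> [1 2]
```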
In cases where the data is not linearly separable or the relationship between features and classes is complex, non-linear discriminant functions can be employed. These often involve transforming the feature space with a function \phi(x) to create a higher-dimensional space where data can be separated more easily. Kernel methods, such as SVM with the kernel trick, are examples of this approach.
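As one brief illustration of this approach, the sketch below (assuming scikit-learn is available; the dataset is synthetic) fits an SVM with an RBF kernel to concentric-circle data that no straight line can separate in the original space:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where a separating hyperplane exists (the kernel trick).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```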
Large Margin Principle
The aim of the large margin principle is to learn the unknown parameters w and w_0. Assuming that the training points in the feature space \phi(x_1), \ldots, \phi(x_N) are linearly separable, the solution chosen is the one with the largest margin, where the margin is the smallest distance between the decision boundary and any of the samples. The principle also requires every point to lie on the correct side of the decision boundary, i.e., Y_i f(x_i) > 0. The corresponding optimization problem is formulated as follows:
- \min_{w,w_0} \frac{1}{2}||w||^2: This is the objective function we aim to minimize. The margin is inversely proportional to the norm of w, so by minimizing ||w||^2 we effectively maximize the margin between the two classes.
- Y_i(w^T \phi(x_i) + w_0) \geq 1 \ \ \ \text{for all } i: This constraint ensures that all data points are classified correctly: each data point x_i must fall on the correct side of the decision boundary, which means the sign of w^T \phi(x_i) + w_0 agrees with the label Y_i. Requiring the product to be at least 1, rather than merely positive, is a normalization condition that fixes the scale of w and w_0; under it, the distance of the closest points to the boundary is 1/||w||, which is why minimizing ||w|| maximizes the margin.
The overall aim is to find the values of w and w_0 that satisfy the constraints and minimize the objective function. This results in a hyperplane that maximizes the margin between the classes, providing the best possible separation, assuming that the data is linearly separable.
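The hard-margin problem above is a quadratic program, so it can be written almost verbatim for a generic convex solver. The sketch below shows one way to do this, assuming the cvxpy library and a small, hand-made linearly separable dataset, with \phi taken to be the identity map:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data with labels Y_i in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
w0 = cp.Variable()

# Hard-margin SVM: minimize (1/2)||w||^2  s.t.  Y_i (w^T x_i + w_0) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " w0 =", w0.value)
print("margin width =", 2 / np.linalg.norm(w.value))
```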
Support Vector Machines (SVMs)
When the data is not linearly separable, SVM relaxes the hard constraints and introduces a penalty proportional to how far a point lies on the wrong side of its margin. This is done through the introduction of slack variables \xi_i \geq 0 for i = 1, \ldots, N. The behavior of these slack variables is as follows (see the numeric sketch after this list):
- If \xi_i = 0, then the point lies on or outside the margin on the correct side, i.e., Y_i f(x_i) \geq 1
- If 0 < \xi_i \leq 1, then the point lies inside the margin but still on the correct side of the decision boundary, i.e., 0 \leq Y_i f(x_i) < 1
- If \xi_i > 1, then the point lies on the wrong side of the decision boundary and is misclassified, i.e., Y_i f(x_i) < 0
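At the optimum, each slack variable equals the hinge loss of its point, \xi_i = \max(0, 1 - Y_i f(x_i)). The numpy sketch below, with made-up parameters and points, computes these values for one point outside the margin, one inside it, and one misclassified:

```python
import numpy as np

def slack_values(X, y, w, w0):
    """xi_i = max(0, 1 - y_i * f(x_i)): 0 outside the margin, in (0, 1] inside
    the margin on the correct side, and > 1 for misclassified points."""
    f = X @ w + w0
    return np.maximum(0.0, 1.0 - y * f)

# Illustrative boundary x_1 = 0 and three points along the x_1 axis.
w, w0 = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, 1.0])
print(slack_values(X, y, w, w0))  # -> [0.  0.5 2. ]
```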
The optimization problem incorporating these slack variables is:
- \min_{w,w_0,\xi} \frac{1}{2}||w||^2 + C\sum_{i=1}^{N}\xi_i: This is the objective function we wish to minimize. The term \frac{1}{2}||w||^2 maximizes the margin (since the margin is inversely proportional to the norm of w), while the penalty term C\sum_{i=1}^{N}\xi_i penalizes misclassifications and other violations of the margin. The parameter C controls the trade-off between maximizing the margin and minimizing classification errors: a higher value of C places more emphasis on minimizing errors, while a lower value of C emphasizes maximizing the margin.
- \xi_i \geq 0: This constraint ensures that the slack variables are non-negative. The slack variables \xi_i are introduced to allow some degree of misclassification, especially when the data is not linearly separable. If \xi_i = 0, the i-th instance is correctly classified and falls outside the margin; if 0 < \xi_i \leq 1, the instance lies within the margin; if \xi_i > 1, the instance is misclassified.
- Y_i(w^T \phi(x_i) + w_0) \geq 1 - \xi_i: This constraint ensures that each instance either satisfies the margin condition or violates it by no more than \xi_i. It expresses the condition each instance must fulfill with respect to the decision boundary and the margin.
The overall goal is to find the values of w, w_0, and \xi_i that satisfy these constraints and minimize the objective function. This will result in a hyperplane (and corresponding margin) that best separates the classes in the feature space, given the allowance for some misclassification defined by the slack variables.
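In practice the soft-margin problem is rarely solved by hand. As a sketch of how it might look with scikit-learn's SVC (the overlapping synthetic data and the particular values of C are illustrative), the example below shows the trade-off controlled by C and the sparsity noted in the summary:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs: not linearly separable, so slack is required.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.1, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")

# Smaller C tolerates more margin violations (wider margin, more support vectors);
# larger C penalizes violations more heavily.
```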
Summary:
- Linear discriminant functions partition the feature space into half-spaces, one decision region per class.
- The large margin principle selects the solution with the greatest margin.
- Support Vector Machines (SVMs) are a binary classification technique that separates the classes with a maximum-margin hyperplane.
- SVM incorporates a regularization parameter C and slack variables \xi_i to manage non-linearly separable data.
- The resulting SVM solution is sparse: the decision function depends only on the support vectors, i.e., the training points that lie on or violate the margin.