
3. Principles and Algorithms for Anomaly Detection

Anomaly detection is a crucial aspect of data analysis, aiming to identify patterns or occurrences that deviate significantly from expected behavior. These deviations are often indicative of systemic issues, fraud, or rare events of interest.

Types of Anomalies

Anomalies can typically be categorized into three types:

  1. Point Anomalies: An individual data instance is anomalous with respect to the rest of the data.

  2. Contextual Anomalies: A data instance is anomalous only in a specific context (but not otherwise); for example, a daily temperature of 30 °C is normal in summer but anomalous in midwinter.

  3. Collective Anomalies: A collection of related data instances is anomalous with respect to the entire dataset, even though the individual instances may not be anomalous on their own.

Anomaly Detection Techniques

Different techniques are suitable for different types of data, and these include:

  1. Statistical Anomaly Detection: Statistical techniques are typically used when the data is univariate and approximately Gaussian. These methods model the statistical properties of the normal data and then use a test statistic to derive an anomaly score (see the first sketch after this list).

  2. Distance-Based Anomaly Detection: These techniques are suitable for both univariate and multivariate datasets. They rely on proximity: objects that lie far from their neighbors are more likely to be anomalies.

  3. Clustering-Based Anomaly Detection: Clustering is a popular technique for anomaly detection. The underlying assumption is that normal data instances belong to clusters in the dataset, while anomalies belong to no cluster or only to small, sparse clusters (see the second sketch after this list).
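
The statistical approach in item 1 can be illustrated with a minimal sketch, assuming univariate, roughly Gaussian data; the generated sample and the 3-standard-deviation cutoff are illustrative choices rather than part of any prescribed method.

```python
import numpy as np

rng = np.random.default_rng(0)
# Mostly "normal" univariate data with a few injected outliers.
x = np.concatenate([rng.normal(loc=10.0, scale=2.0, size=500),
                    [25.0, -4.0, 31.0]])

# Fit a Gaussian to the data and score each point by its z-score.
mu, sigma = x.mean(), x.std()
z = np.abs(x - mu) / sigma

# Flag points more than 3 standard deviations from the mean
# (the cutoff of 3 is an illustrative choice, not a universal rule).
anomalies = x[z > 3]
print(anomalies)
```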

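One way to operationalize the clustering-based idea in item 3 is a density-based clusterer such as scikit-learn's DBSCAN, which assigns the noise label (-1) to points that belong to no cluster; the synthetic data and the eps/min_samples values below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense 2-D clusters plus a few scattered points far from both.
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(200, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(200, 2))
outliers = np.array([[2.5, 8.0], [-3.0, 6.0], [8.0, -2.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

# DBSCAN marks points that fall in no dense region as noise (-1);
# here those noise points are treated as anomalies. eps and
# min_samples would normally be tuned to the data.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(X[labels == -1])
```
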
Key Algorithms

Several algorithms exist to handle anomaly detection:

In lecture

  1. Threshold-based & Top-n Anomaly Detection: These straightforward approaches define a threshold or limit on the measurement of interest. Data instances that exceed the threshold (threshold-based) or that rank among the n most extreme instances (top-n) are flagged as anomalies (both rules appear in the first sketch after this list).

  2. Mahalanobis Distance: The Mahalanobis distance measures the distance of a point from a distribution. Because it accounts for the distribution's mean and covariance, it identifies anomalies more reliably than simpler measures such as Euclidean distance when features are correlated or on different scales. It is effectively a measure of how many standard deviations the point x is from the mean of the distribution (the first sketch after this list applies this formula).

    1. M(x, \mu) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}
    2. where \mu is the mean and \Sigma is the covariance matrix of the distribution
  3. k-Nearest Neighbors (kNN): kNN can be adapted for anomaly detection by using the distance to the k-th nearest neighbor as the anomaly score. Data instances with larger kNN distances are considered more anomalous (see the second sketch after this list).
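
A minimal NumPy sketch of items 1 and 2: it estimates the mean and covariance of the data, applies the Mahalanobis formula above, and then flags anomalies both with a fixed threshold and with a top-n rule. The synthetic data, the cutoff of 3, and n = 3 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D "normal" data plus a few off-distribution points.
cov_true = np.array([[2.0, 1.2], [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=500)
X = np.vstack([X, [[6.0, -4.0], [-5.0, 5.0], [7.0, 7.0]]])

# Estimate the center and covariance of the data, then apply
# M(x, mu) = sqrt((x - mu)^T Sigma^{-1} (x - mu)) to every point.
mu = X.mean(axis=0)
sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
m_dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, sigma_inv, diff))

# Threshold-based rule: flag everything more than 3 "standard
# deviations" out (the cutoff of 3 is an illustrative choice).
flagged_by_threshold = np.where(m_dist > 3.0)[0]

# Top-n rule: flag the n most distant points regardless of cutoff.
n = 3
flagged_top_n = np.argsort(m_dist)[-n:]

print(flagged_by_threshold, flagged_top_n)
```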

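A sketch of item 3 using scikit-learn's NearestNeighbors, with the distance to the k-th nearest neighbor as the anomaly score; k = 5 and the synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(300, 2)),      # dense "normal" cloud
               [[6.0, 6.0], [-5.0, 7.0]]])     # two isolated points

k = 5
# Query k+1 neighbours because each point's nearest neighbour in the
# fitted set is itself (at distance zero).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Anomaly score = distance to the k-th nearest (other) neighbour;
# larger scores mean the point sits in a sparser region.
scores = distances[:, k]
top_anomalies = np.argsort(scores)[-2:]
print(top_anomalies, scores[top_anomalies])
```
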
Other

  1. Support Vector Machine-Based Anomaly Detection (One-Class SVM): This approach suits high-dimensional datasets. The model is trained in an unsupervised way on data assumed to be mostly normal and learns a boundary around it; instances falling outside that boundary are treated as anomalies (see the first sketch after this list).

  2. Isolation Forest: This algorithm is designed specifically for anomaly detection. It works by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum of that feature. The rationale is that anomalies are easier to isolate: only a few random splits are needed to separate them from the normal observations, so they end up at shallow depths in the resulting trees (also shown in the first sketch after this list).

  3. Autoencoders: Autoencoders, especially deep architectures, can be effective for anomaly detection. The idea is that an autoencoder trained on normal data learns to reconstruct it accurately while struggling to reconstruct anomalous data, so anomalies produce a high reconstruction error (see the second sketch after this list).
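
Items 1 and 2 are both available in scikit-learn; the sketch below shows one plausible way to use them, with nu and contamination set to illustrative values rather than tuned parameters.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))                    # mostly normal data
X_test = np.vstack([rng.normal(size=(20, 4)),          # more normal points
                    rng.uniform(-6, 6, size=(5, 4))])  # likely anomalies

# One-Class SVM: learns a boundary around the training data;
# nu roughly bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

# Isolation Forest: isolates points with random axis-aligned splits;
# contamination sets the expected share of anomalies (illustrative here).
iforest = IsolationForest(contamination=0.05, random_state=0).fit(X_train)

# Both estimators label inliers +1 and anomalies -1, and also expose
# a continuous score via decision_function / score_samples.
print("one-class SVM labels   :", ocsvm.predict(X_test))
print("isolation forest labels:", iforest.predict(X_test))
print("isolation forest scores:", iforest.decision_function(X_test))
```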

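A minimal autoencoder sketch for item 3, assuming PyTorch is available; the layer sizes, training length, and the 95th-percentile threshold on reconstruction error are all illustrative assumptions.

```python
import numpy as np
import torch
from torch import nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Train only on "normal" data so the autoencoder learns to reconstruct it.
X_train = torch.tensor(rng.normal(size=(1000, 8)), dtype=torch.float32)
X_test = torch.tensor(
    np.vstack([rng.normal(size=(20, 8)),            # normal-looking points
               rng.uniform(-8, 8, size=(5, 8))]),   # likely anomalies
    dtype=torch.float32)

# Small fully connected autoencoder: 8 -> 3 -> 8 (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(8, 3), nn.ReLU(),   # encoder
    nn.Linear(3, 8),              # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(200):              # short, illustrative training loop
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), X_train)
    loss.backward()
    optimizer.step()

# Anomaly score = per-sample reconstruction error.
with torch.no_grad():
    test_errors = ((model(X_test) - X_test) ** 2).mean(dim=1)
    train_errors = ((model(X_train) - X_train) ** 2).mean(dim=1)

# Flag points whose error exceeds e.g. the 95th percentile of training errors.
threshold = torch.quantile(train_errors, 0.95)
print(torch.nonzero(test_errors > threshold).flatten())
```
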
Outputs of Anomaly Detection

The outputs of anomaly detection algorithms can generally be classified into two categories:

  1. Label-based: This is akin to classification, where each data instance is assigned an 'anomaly' or 'non-anomaly' label.

  2. Score-based: Each data instance is assigned a score representing its 'degree of being an anomaly'; higher scores indicate stronger anomalies (a small sketch follows this list).
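
A tiny sketch of how the two output types relate: a score-based output can be converted into labels by choosing a cutoff. The simulated scores and the 99th-percentile cutoff below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.exponential(scale=1.0, size=1000)   # pretend anomaly scores

# Convert scores into labels by picking a cutoff; the 99th percentile
# implicitly declares about 1% of instances anomalous.
cutoff = np.quantile(scores, 0.99)
labels = (scores > cutoff).astype(int)           # 1 = anomaly, 0 = normal
print(labels.sum(), "instances labelled as anomalies")
```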

Evaluation of Anomaly Detection

Evaluating the performance of an anomaly detection system is crucial. When ground-truth labels are available, several metrics are commonly used (a sketch follows the list):

  1. Precision: This measures the accuracy of detected anomalies - in other words, the percentage of correctly identified anomalies among all detected anomalies.

  2. Recall (Sensitivity): This measures the completeness of detected anomalies - the percentage of all actual anomalies that were detected.

  3. F1-Score: The harmonic mean of Precision and Recall. This is useful when you want to balance Precision and Recall.

  4. Area Under the ROC Curve (AUC-ROC): The ROC curve plots the true positive rate (Recall) against the false positive rate at various threshold settings. The AUC is the area under this curve and summarizes the model's performance across all possible thresholds.
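
Assuming ground-truth labels are available, all four metrics can be computed with scikit-learn; the labels, scores, and the 0.5 decision threshold below are made up purely for illustration.

```python
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, roc_auc_score)

# Made-up ground truth (1 = anomaly) and detector outputs.
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.1, 0.3, 0.2, 0.9, 0.4, 0.7, 0.2, 0.1, 0.4, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]   # threshold at 0.5

print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
# AUC-ROC uses the raw scores, so it is threshold-independent.
print("AUC-ROC  :", roc_auc_score(y_true, y_score))
```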

Challenges in Anomaly Detection

Anomaly detection has its own set of challenges. They are often compounded by the nature of the data, which can be high-dimensional and noisy. Here are a few common challenges:

  1. Definition of Normal: In many contexts, it's challenging to define what constitutes "normal" behavior. For anomaly detection to be effective, a clear baseline or threshold needs to be established.

  2. Imbalanced Data: Anomalies are by nature rare events. This can result in highly imbalanced data, with many normal instances and very few anomalies. This imbalance can make it difficult for algorithms to accurately identify anomalies.

  3. Evolving Anomalies: Over time, what constitutes an anomaly might change. This means that an anomaly detection system must be adaptable and able to learn from new data.