Machine Learning | Anomaly detection

Using probability and statistics

May 6, 2019

In this article I will explain how a basic anomaly detection algorithm works. For educational purposes I will use a simple training dataset of m examples, each with n features.
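In symbols, such a dataset can be written as follows. The superscript (i) for the example index and the subscript j for the feature index are notational assumptions made here for illustration:

```latex
\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}, \qquad
x^{(i)} = \left(x^{(i)}_1, \dots, x^{(i)}_n\right) \in \mathbb{R}^n
```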

We assume here that the features x_1, …, x_n are uncorrelated. In a future post I will show how to deal with correlated features using multivariate Gaussian distributions, which involve matrix inverses. I will also give an example application in Python.

We can create a probabilistic model that fits our training data. In this simple example we can model each feature with a Gaussian (or normal) distribution, because real-world measurements are often approximately normally distributed.
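As a minimal sketch of what "using a Gaussian" means in practice, the function below evaluates the normal density for one feature. The name `gaussian_pdf` and its argument names are assumptions chosen for this illustration:

```python
import math


def gaussian_pdf(x, mu, var):
    """Density at x of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# The density is highest at the mean and drops quickly away from it.
print(gaussian_pdf(0.0, mu=0.0, var=1.0))
print(gaussian_pdf(3.0, mu=0.0, var=1.0))
```

A value three standard deviations from the mean gets a much smaller density than the mean itself, which is the property the anomaly detector relies on.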

To do so, we need two parameters per feature: the mean and the variance of that feature over the training dataset. For each feature, the mean (mu) is the average of its m training values, and the variance (sigma squared) is the average squared deviation from that mean.
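Written out, and assuming the notation x^(i)_j for the value of feature j in training example i (a convention adopted here for illustration), the two estimates are:

```latex
\mu_j = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}_j,
\qquad
\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)}_j - \mu_j \right)^2,
\qquad j = 1, \dots, n
```

Both sums divide by m, which matches the algorithm given at the end of the article.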

Now that we have a model that fits the training data, we can use it to estimate how probable a new, unseen test point is. Our goal is to classify new data points with a low value of p(x test) as anomalous. Because the features are treated as independent, p(x test) is the product of each feature’s probability density given our model’s parameters.
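As a formula, with mu_j and sigma_j squared the mean and variance estimated for feature j, and writing the j-th feature of the test point as x_{test,j} (a notational assumption), the product reads:

```latex
p(x_{\text{test}})
= \prod_{j=1}^{n} p\!\left(x_{\text{test},j};\, \mu_j, \sigma_j^2\right)
= \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^2}}
  \exp\!\left( -\frac{\left(x_{\text{test},j} - \mu_j\right)^2}{2\sigma_j^2} \right)
```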

By choosing a threshold epsilon we can make the classifier more or less strict: a new data point is labeled anomalous (y=1) when p(x test) is below epsilon, and normal (y=0) otherwise.
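The decision rule can be stated compactly, with epsilon denoting the threshold:

```latex
y =
\begin{cases}
1 & \text{if } p(x_{\text{test}}) < \varepsilon \quad \text{(anomalous)} \\
0 & \text{if } p(x_{\text{test}}) \ge \varepsilon \quad \text{(normal)}
\end{cases}
```

A larger epsilon flags more points as anomalous, and a smaller one flags fewer.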

So our algorithm looks like this:

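import numpy as np

def p(x, mu, var):  # Gaussian density of x for mean mu and variance var
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def is_anomalous(x_train, x_test, epsilon):
    # x_train is a NumPy array of shape (n, m): x_train[j] holds the m values of feature j
    mu_model = x_train.mean(axis=1)
    var_model = ((x_train - mu_model[:, None]) ** 2).mean(axis=1)
    # Product of the per-feature densities of the test point
    p_test = np.prod(p(x_test, mu_model, var_model))
    return p_test < epsilon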
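To see the algorithm run end to end, here is a small self-contained sketch on synthetic data. The data, the function names (`gaussian_pdf`, `fit`, `is_anomalous`) and the value of `epsilon` are all assumptions chosen for illustration. In a real application the threshold would have to be tuned.

```python
import numpy as np


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return np.exp(-((x - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def fit(x_train):
    """Per-feature mean and variance; x_train[j] holds the m values of feature j."""
    mu = x_train.mean(axis=1)
    var = ((x_train - mu[:, None]) ** 2).mean(axis=1)  # divides by m
    return mu, var


def is_anomalous(x_test, mu, var, epsilon):
    """True when the product of per-feature densities falls below epsilon."""
    p_test = np.prod(gaussian_pdf(x_test, mu, var))
    return p_test < epsilon


# Synthetic training data (assumed for illustration): n = 2 features, m = 1000 examples.
rng = np.random.default_rng(0)
x_train = np.stack([
    rng.normal(5.0, 1.0, size=1000),   # feature 0: mean 5, standard deviation 1
    rng.normal(10.0, 2.0, size=1000),  # feature 1: mean 10, standard deviation 2
])

mu, var = fit(x_train)
epsilon = 1e-4  # hypothetical threshold, picked by hand for this toy data

typical_point = np.array([5.0, 10.0])   # close to both feature means
outlier_point = np.array([12.0, 30.0])  # far from both feature means

print(is_anomalous(typical_point, mu, var, epsilon))
print(is_anomalous(outlier_point, mu, var, epsilon))
```

The point near the feature means should be reported as normal, while the point many standard deviations away on both features should be reported as anomalous.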

Written by Pierre Portal