@ -54,7 +54,7 @@ Another factor for these models is the network topology. In a non-static WSN, a
\subsection{Problem definition}
An anomaly is a collection of one or more temporally correlated measurements in a given dataset that seem to be inconsistent with expected results. These measurements can originate from different sensors, and in the context of WSNs even from different nodes. Bosman et al. \cite{bosman2013} and others distinguish between four different kinds of anomalies (cf. Figure~\ref{fig:noisetypes}):
An anomaly is a collection of one or more temporally correlated measurements in a given dataset that seem to be inconsistent with expected results. These measurements can originate from different sensors, and in the context of WSNs even from different nodes. Bosman et al. \cite{bosman2013} and others distinguish between four different kinds of anomalies relevant in WSNs (cf. Figure~\ref{fig:noisetypes}):
\begin{itemize}
\item\emph{Spikes} are short changes with a large amplitude
@ -69,12 +69,20 @@ Detecting constant type anomalies isn't very difficult, as they can simply be cl
A noise anomaly is not the same as a noisy sensor. Working with noisy data is a problem in WSNs, but methods for cleaning noisy data are outside the scope of this survey. Elnahrawy et al. \cite{elnahrawy2003} and Barcelo et al. \cite{barcelo2019} are good starting points for a survey in this direction.
In the general field of anomaly detection, more advanced definitions of anomalies can include patterns (cf. Figure~\ref{fig:patternanomaly}) and other contextual phenomena, but these are much rarer in WSNs, because such networks mostly measure less complex data, such as vibrations or temperature. Therefore, most approaches discussed in this survey do not take these anomalies into account and instead focus on the ones discussed above.
\caption{Pattern-based anomaly corresponding to an Atrial Premature Contraction in an electrocardiogram, image from Chandola et al. \cite{chandola2009}}
\label{fig:patternanomaly}
\end{figure}
The terms outlier and anomaly are often used interchangeably, but they actually denote slightly different phenomena. While an anomaly falls into one of these four categories, only spikes, noise, and some types of drift are considered \emph{outliers}~\cite{chandola2009}, as they are the only ones that produce data outside of the considered ``norm''.
@ -91,15 +99,16 @@ The problem of outlier detection in WSNs is the creation of a model which can us
\subsection{Structure}
First we will look into sensor self-calibration, a method of improving sensor accuracy. Calibrating a sensor removes constant offsets, enabling nodes to compare measurements with one another more easily. If a sensor is in use for a prolonged length of time, it might need to be recalibrated to remove sensor drift.
Then we will look into conventional, model-based approaches to outlier detection, such as statistical or density-based models.
After the introduction and coverage of related work, we will look into sensor self-calibration, a method of improving sensor accuracy. Calibrating a sensor removes constant offsets, enabling nodes to compare measurements with one another more easily. If a sensor is in use for a prolonged length of time, it might need to be recalibrated to remove sensor drift.
We will first look into sensor self-calibration, which aims to remove or reduce drift and constant offsets. Then we will look into conventional model-based techniques for outlier detection, such as probabilistic or density-based models. Finally, we will look into machine learning based approaches to building these models.
Then we will look into conventional, model-based approaches to outlier detection, such as statistical or density-based models, followed by the more recent machine learning based models. Finally, all presented models are summarized in a table and evaluated based on their properties and requirements.
\section{Related work}
\section{Related Work}
Chandola et al. \cite{chandola2009} provide a very comprehensive survey on outlier detection in general, not just focused on WSN. They introduce many key concepts and definitions, but focus more on outliers than anomalies in general.
McDonald et al. \cite{mcdonald2013} survey methods of finding outliers in WSN, with a focus on distributed solutions. They go into a moderate amount of detail on most solutions, but skip over several methods, such as principal component analysis and support vector machines, which were already maturing at that point in time.
@ -199,11 +208,11 @@ After the update phase, we obtain $\hat{x}_{k|k}$, which is our best approximati
Sirisanwannakul et al. take the computed Kalman gain and examine its bias. In normal operation, the gain is biased towards the measurement. If the sensor malfunctions, the gain is biased towards the prediction. If the gain's bias lies between prediction and measurement, the system assumes sensor drift and corrects for it automatically. Since this approach lacks a ground truth measurement, it cannot recalibrate the sensor, but the paper shows that the accumulated error can be reduced by more than 50\%.
A centralized WSN is defined by the existence of a central entity, called the \emph{base station} or \emph{fusion centre}, to which all data is delivered and where it is analyzed. It is often assumed that the base station does not have limits on its processing power or storage. Centralized approaches are not optimal in hostile environments, but that is not our focus here. Since central anomaly detection is closely related to the general field of anomaly detection, we will not go into much detail on these solutions, instead focusing on solutions more specific to the field of WSN.
We consider a classical approach to be anything that uses conventional (non-machine learning) models or algorithms to perform outlier detection. This chapter will first look at
\subsection{Statistical Analysis}
Classical statistical analysis is done by creating a model of the expected data and then finding the probability for each recorded data point. Improbable data points are then deemed outliers. The problem for many statistical approaches is finding this model of the expected data, as it is not always feasible to create it in advance. Such approaches also suffer from poorly fitting models and from changes in the environment \cite{mcdonald2013}, which require frequent updates of the existing model.
Classical statistical analysis is done by creating a statistical model of the expected data and then finding the probability for each recorded data point. Improbable data points are then deemed outliers. The problem for many statistical approaches is finding this model of the expected data, as it is not always feasible to create it in advance, for example when the nature of the phenomenon is not well known or the expected data is too complex. Such approaches are also not very robust to changes in the environment \cite{mcdonald2013}, requiring frequent updates to the model if the environment changes in ways not foreseen by the model.
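As a minimal illustration of this idea, assume (purely for the sake of the example, this is not a model taken from one of the surveyed papers) that the expected data follows a Gaussian distribution whose mean $\mu$ and standard deviation $\sigma$ have been estimated from past measurements. A new measurement $x$ would then be deemed an outlier if it is sufficiently improbable under this model, for instance if
\[
  \frac{|x - \mu|}{\sigma} > k ,
\]
where the threshold $k$ is a hypothetical tuning parameter. With $k = 3$, roughly $0.3\%$ of benign Gaussian data would be flagged. The example also shows why such models need frequent updates: if the environment changes, $\mu$ and $\sigma$ no longer describe the expected data, and the test starts to flag benign measurements or to miss anomalous ones.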
Sheng et al. \cite{sheng2007} propose an approach to global outlier detection, meaning a data point is only regarded as an outlier if its value differs significantly from all values collected over a given time, not just from those of local sensors near the measured one. They propose that the base station requests bucketed histograms of each node's sensor data distribution to reduce the amount of data transmitted. These histograms are polled, combined, and then used to identify outliers by looking at the maximum distance a data point can be away from its nearest neighbors. This method has some problems, as it fails to account for non-Gaussian distributions. Another problem is the use of fixed parameters for outlier detection, which requires prior knowledge of the collected data and the anomaly density. These fixed parameters also have to be updated whenever the properties of the collected data change. Due to the histograms used, this method cannot be used in a shifting network topology.
@ -223,23 +232,46 @@ Outliers can be selected by looking at the density of points as well. Breuning e
Papadimitriou et al. \cite{papadimitriou2003} introduce a parameterless approach. They formulate a method using a local correlation integral (LOCI), which does not require parametrization. It uses a multi-granularity deviation factor (MDEF), which is the relative deviation for a point $p$ in a radius $r$. The MDEF relates the number of neighbors of $p$ to the average number of neighbors over all points in its $r$-neighborhood. LOCI provides an automated way to select good parameters for the MDEF and can detect outliers and outlier clusters with performance comparable to other statistical approaches. They also formulate aLOCI, a linear approximation of LOCI, which also gives accurate results while reducing runtime. This approach can be used in a centralized, decentralized or clustered fashion, depending on the scale of the event of interest. aLOCI seems well suited even for running on the sensor nodes themselves, as it has relatively low computational complexity.
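To make the deviation factor more concrete, the following sketch restates the definition as we understand it from \cite{papadimitriou2003}; the symbols are introduced here for illustration. Let $n(p, \alpha r)$ be the number of points within distance $\alpha r$ of $p$, with a scaling factor $0 < \alpha \le 1$, and let $\hat{n}(p, r, \alpha)$ be the average of $n(q, \alpha r)$ over all points $q$ in the $r$-neighborhood of $p$. Then
\[
  \mathrm{MDEF}(p, r, \alpha) = 1 - \frac{n(p, \alpha r)}{\hat{n}(p, r, \alpha)} .
\]
A point whose local density matches that of its neighborhood has an MDEF close to $0$, while an isolated point has an MDEF close to $1$. A point is flagged as an outlier if, for some radius $r$,
\[
  \mathrm{MDEF}(p, r, \alpha) > k_\sigma \, \sigma_{\mathrm{MDEF}}(p, r, \alpha) ,
\]
where $\sigma_{\mathrm{MDEF}}$ is the standard deviation of $n(q, \alpha r)$ over the same neighborhood, normalized by $\hat{n}(p, r, \alpha)$. The constant $k_\sigma$ is fixed in advance (a value of $3$ is a common choice), so no threshold has to be tuned to the data.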
\caption{An example of reducing a three-dimensional data set to two dimensions using PCA while minimizing the loss of information. The principal component vectors are marked in red.}
\label{fig:pca}
\end{figure*}
Principal components of a point cloud in $\R^n$ are $n$ vectors $p_i$, where $p_i$ defines a line with minimal average square distance to the point cloud while lying orthogonal to all $p_j, j<i$. These $p_i$ define an orthogonal basis of $\R^n$. The length of each $p_i$ is directly proportional to the variance of the data in that direction. Principal Component Analysis (PCA) uses these $p_i$ to perform a change of basis of each given data point. The most common algorithm to perform PCA relies on centering the data set around the mean and then finding the eigenvectors of the covariance matrix of the point cloud \cite{jolliffee2002, macua2010}.
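The covariance-based algorithm can be written out as follows. This is a sketch in notation chosen for this example: $x_1, \dots, x_j \in \R^n$ denote the data points, $\bar{x}$ the sample mean, $C$ the sample covariance matrix, and $\lambda_i$ its eigenvalues.
\[
  \bar{x} = \frac{1}{j} \sum_{i=1}^{j} x_i , \qquad
  C = \frac{1}{j-1} \sum_{i=1}^{j} (x_i - \bar{x})(x_i - \bar{x})^{\top}
\]
\[
  C \, p_i = \lambda_i \, p_i , \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0
\]
Since $C$ is symmetric and positive semi-definite, its eigenvectors can be chosen to be mutually orthogonal, and the eigenvalue $\lambda_i$ equals the variance of the data along the direction $p_i$. Sorting the eigenvectors by decreasing eigenvalue therefore yields the principal components in order of importance.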
When using $\{p_1, \dots, p_k\}, k < n$ as the new orthogonal basis, the dimensionality can be reduced from $n$ to $k$ while retaining as much information as possible, as the dimensions with the lowest variance are discarded. PCA is computationally expensive: given a data matrix $X_{[n\times j]}$ ($j$ collections of $n$ measurements), the complexity is $\mathcal{O}(n^3)$, meaning it grows cubically with the number of measured attributes \cite{yu2017}. Most of this complexity stems from the eigenvalue decomposition used in PCA.
When using $\{p_1, \dots, p_k\}, k < n$ as the new orthogonal basis of the data set, the dimensionality of the data can be reduced from $n$ to $k$ while retaining as much information as possible (cf. Figure~\ref{fig:pca}), as the dimensions with the lowest variance are discarded. PCA is computationally expensive: given a data matrix $X_{[n\times j]}$ ($j$ collections of $n$ measurements), the complexity is $\mathcal{O}(n^3)$, meaning it grows cubically with the number of measured attributes \cite{yu2017}. Most of this complexity stems from the eigenvalue decomposition used in PCA.
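The reduction step can be sketched as follows, in notation introduced for this example: the $p_i$ are normalized to unit length and collected as the columns of a matrix $P_k = [\, p_1 \; \cdots \; p_k \,]$, $\bar{x}$ denotes the sample mean of the data, and $\lambda_i$ denotes the variance of the data along $p_i$. A measurement $x \in \R^n$ is then compressed to $y \in \R^k$ and approximately reconstructed as $\tilde{x}$:
\[
  y = P_k^{\top} (x - \bar{x}) , \qquad
  \tilde{x} = \bar{x} + P_k \, y
\]
The fraction of the total variance that survives the reduction is
\[
  \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} ,
\]
which gives a simple criterion for choosing $k$: pick the smallest $k$ for which this fraction exceeds a desired level, for example $0.99$.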
Chan et al. \cite{chan2012} propose a solution to this problem: they develop two methods that approximate the eigenvalue decomposition by updating the state recursively and reusing large parts of the calculation already done, which reduces the computational complexity. They simulate this algorithm on existing data sets and find that it outperforms existing PCA-based solutions such as \cite{li2000, tien2004}.
Yu et al. \cite{yu2017} recognize that this solution performs well, but is too expensive to run on each individual node in a network. They propose a clustered and iterative way of doing PCA that reduces the complexity on each cluster head to $\mathcal{O}(n^2t)$, where $t$ is the recursion depth. They propose clustering the nodes into groups with cluster heads that have more processing power. The leaf nodes send their samples to the cluster head, which reorganizes and splits the sensor data and, after an initial PCA, can update its measured principal components and covariance matrices more efficiently. During this process, outliers can be identified with relative ease using the known covariance of the data and the calculated principal components. Furthermore, PCA is used to decrease the dimensionality of the sensor data. This compressed data is transmitted to the base station, together with the principal component vectors and the covariance matrix. This allows for later reconstruction of the data with high accuracy, with errors usually below 1\%, while reducing the amount of information sent.
Macua et al. \cite{macua2010} propose a truly decentralized approach: using consensus algorithms to calculate the sample mean, and then approximating the global data covariance matrix. Once a good enough approximation is found, each node can do PCA individually. This approach is not suited for deployment in low-power WSN, as it incurs considerable cost in the form of communication and especially required processing power.
Macua et al. \cite{macua2010} propose a truly decentralized approach: using consensus algorithms to calculate the sample mean, and then approximating the global data covariance matrix. Once a good enough approximation is found, each node can do PCA individually. This approach is not suited for deployment in low-power WSN, as it incurs considerable cost in the form of communication and especially required processing power. This work is still mentioned, as it proved
\subsection{Generalized Hebbian Algorithm}
Ali et al. \cite{ali2015} propose an approach to detect and identify events using the Generalized Hebbian Algorithm (GHA). Event detection is important in anomaly detection, but event identification is almost as important, especially when a sensor network is used to detect an event spanning multiple nodes. They propose a combined algorithm to detect, identify and communicate both local and global events in a WSN. This is achieved by calculating identification ratios, i.e. the percentage each attribute contributed to the event, before broadcasting the detected event.
They start off with an outlier detection scheme using hyper-ellipsoids fitted around 98\% of their data points to detect outliers, using an iterative boundary estimation model based on the model formulated by Moshtaghi et al. \cite{moshtaghi2011} called Forgetting Factor Iterative Data Capture Anomaly Detection (FFIDCAD). It can compute multidimensional boundaries of the local model online in an iterative fashion, reducing the amount of required computation immensely, while also working in non-stationary environments and changing network topologies due to the forgetting factor. A local event is declared after observing more than $q$ outliers in a row, where $q$ is chosen depending on the sampling rate and the required temporal resolution.
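The hyper-ellipsoidal boundary can be sketched as follows. The notation is introduced here for illustration and is not necessarily that of \cite{ali2015} or \cite{moshtaghi2011}: $m_k$ denotes the current estimate of the sample mean, $S_k$ the current estimate of the sample covariance, and $d$ the number of attributes. A measurement $x$ is considered an outlier if it lies outside the ellipsoid, i.e. if
\[
  (x - m_k)^{\top} S_k^{-1} (x - m_k) > t^2 .
\]
Assuming approximately Gaussian data, choosing $t^2$ as the $0.98$ quantile of the chi-squared distribution with $d$ degrees of freedom yields an ellipsoid that covers about 98\% of the benign data. One simple way to picture the effect of the forgetting factor (an illustrative simplification, not the exact update rule of FFIDCAD) is an exponentially weighted mean with a hypothetical factor $0 < \lambda < 1$:
\[
  m_{k+1} = \lambda \, m_k + (1 - \lambda) \, x_{k+1} .
\]
Older measurements are weighted down geometrically, so the boundary follows slow changes in the environment instead of remaining fitted to outdated data.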
Once an event is detected, Ali et al. propose using a Generalized Hebbian Algorithm (GHA) to replace the Eigenvalue Decomposition (EVD) commonly used in offline identification schemes such as PCA. EVD requires large batches of measurements to accurately compute principal components, while GHA can work online in a streaming fashion. They further show that their online GHA-based approach has similar accuracy to offline EVD-based techniques, while vastly reducing computational complexity. Once the eigenvectors are calculated, the last measurement is projected onto the calculated eigenvectors and whitened, creating a vector containing the identification ratios for each attribute.
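The streaming nature of GHA is visible in its update rule, commonly known as Sanger's rule. The following is a sketch of the textbook form, not necessarily the exact variant used in \cite{ali2015}; the learning rate $\eta$ and the weight vectors $w_i$ are symbols introduced for this example. For each new zero-mean measurement $x$, the current estimate $w_i$ of the $i$-th eigenvector is updated as
\[
  y_i = w_i^{\top} x , \qquad
  w_i \leftarrow w_i + \eta \, y_i \Bigl( x - \sum_{l=1}^{i} y_l \, w_l \Bigr) .
\]
The term $y_i \, x$ is a plain Hebbian update, while the subtracted sum removes the components of $x$ that are already explained by $w_1, \dots, w_i$. This keeps the $w_i$ close to unit length and approximately orthogonal to one another. Each measurement is processed once and can then be discarded, so neither a batch of samples nor an explicit covariance matrix has to be stored on the node.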
Ali et al. claim that their algorithm has a complexity of $\mathcal{O}(nd^2)$, compared to $\mathcal{O}(n^2+nd^2)$ for common SVM-based approaches \cite{shahid2012a,shahid2012b}. Here $n$ is the number of measurements and $d$ is the number of attributes. Furthermore, due to the online nature of this approach, communication overhead is much lower, as only detected local events have to be broadcast, instead of the ongoing exchange of support vectors required by the SVM approaches mentioned in Chapter~\ref{cap:svm}.
Most machine learning approaches focus on outlier detection, which is a common problem in WSN, as an outlier is inherently an anomaly. Outlier detection is largely unable to detect drift and has difficulties with noise, but excels at detecting data points or groups which appear to be inconsistent with the other data (spikes, noise, sometimes drift). A common problem is finding outliers in data with an inherently complex structure.
Most machine learning approaches focus either on outlier detection through data classification, or outlier detection through data prediction. The former trains a model to distinguish anomalous from benign data by identifying key features in it, while the latter uses machine learning to build a model of the observed process that is able to predict future measurements. While the second class of model seems to be desirable, it
Supervised learning is the process of training a neural network on a set of labeled data. Acquiring labeled data sets that are applicable to the given situation is often difficult, as it requires the existence of another classification method, or labeling by hand. Furthermore, even if such a data set existed, the class imbalance (total number of positive labels vs. number of negative labels) would render the training data sub-optimal. Lastly, the data generated by a WSN might change over time without being anomalous, requiring frequent retraining \cite{ramotsoela2018}. Out of these circumstances arises the need for unsupervised or semi-supervised anomaly detection methods.
@ -254,14 +286,6 @@ They used the fact, that they could normalize numerical input data to lay in the
The QSSVM was further improved in 2012 by Shahid et al. \cite{shahid2012a, shahid2012b}, who propose three schemes that reduce communication overhead while maintaining detection performance. Their schemes make use of the spatio-temporal and attribute (STA) correlations in the measured data. They accept a worse consensus about the placement of the hypersphere among neighboring nodes in order to reduce the communication overhead. The authors then show that these approaches are comparable in performance to the QSSVM proposed by Rajasegarar et al. if the data correlates well enough inside each neighborhood. It is important to note that this neighborhood information does not rely on nodes being stationary and is therefore usable in a shifting network topology.