Approaches to anomaly detection can be divided into centralized and decentralized ones. We consider an approach centralized when a large part of the computation is done at a single point, or at a later stage during analysis. A decentralized approach implies that a considerable amount of processing is done on the individual nodes, which analyze the data on the fly. For centralized analysis, it is important to differentiate between online and offline detection: online detection runs while the WSN is operating, while offline detection is done after the data has been collected. Online detection often reduces mission duration due to increased power consumption, but it can have the opposite effect if it eliminates a large amount of communication.
\subsection{Anomaly types}
Furthermore, we need to clarify the different kinds of anomalies that can occur in WSN data sets. Bosman et al. \cite{bosman2017} propose four kinds of anomalies that occur in WSN (see also Figure \ref{fig:noisetypes}):
\begin{itemize}
\item\emph{Spikes} are short changes with a large amplitude
\item\emph{Noise} is (an increase of) variance over a given time
\item\emph{Constant} is the sudden absence of noise
\item\emph{Drift} is an offset which increases over time
\end{itemize}
\caption{Spike, noise, constant and drift type anomalies in noisy linear data, image from Bosman et al. \cite{bosman2013}}
\label{fig:noisetypes}
\end{figure}
We will look into sensor self-calibration, which often removes or reduces drift and constant offsets, followed by outlier detection to detect spike, noise and drift type anomalies. A noise anomaly is not the same as a noisy sensor. Working with noisy data is a problem in WSN, but methods for cleaning noisy data are outside the scope of this survey. Elnahrawy et al. \cite{elnahrawy2003} and Barcelo et al. \cite{barcelo2019} are good places to start for readers interested in this topic.
A fifth anomaly type, \emph{sensor failure}, is commonly added in the anomaly detection literature \cite{rajasegarar2008,chandola2009}. Since sensor failure often manifests in one of the four ways mentioned above, and we are not concerned with sensor fault prediction, detection and management here, faulty sensors will not be discussed further.
Maag et al. \cite{maag2017} propose a hybrid solution, in which calibrated sensor arrays are used to calibrate other, non-calibrated arrays in a local network of air pollution sensors over multiple hops with minimal accumulated error. They report 16--60\% lower error rates than other approaches currently in use.
When we speak of a centralized WSN, we mean that there exists a central entity, called the \emph{base station} or \emph{fusion centre}, to which all data is delivered and where it is analyzed. It is often assumed that the base station has no limits on its processing power or storage. Centralized approaches are not optimal in hostile environments, but that is not our focus here. Since centralized anomaly detection is closely related to the general field of anomaly detection, we will not go into much detail on these solutions, and instead focus on solutions more specific to the field of WSN.
\subsection{Statistical analysis}
Classical statistical analysis creates a model of the expected data and then computes the probability of each recorded data point under that model. Improbable data points are deemed outliers. The difficulty for many statistical approaches lies in finding this model of the expected data, as it is not always feasible to create it in advance. Such approaches also suffer from poorly fitting models and from slow changes in the environment \cite{mcdonald2013}.
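As a minimal illustration of this idea, assume (purely for the sake of example, not as a model taken from the cited works) that the expected data follows a normal distribution with mean $\mu$ and standard deviation $\sigma$, both estimated from past measurements. A measurement $x$ can then be scored by its standardized distance from the mean,
\[
  z = \frac{|x - \mu|}{\sigma},
\]
and is deemed an outlier if $z > k$ for a chosen threshold $k$. For $k = 3$, only about $0.27\%$ of the values drawn from the assumed distribution exceed the threshold, so such a point is improbable under the model. If the model is wrong, or $\mu$ and $\sigma$ slowly change with the environment, the same rule produces false alarms or misses anomalies.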
\label{fig:probdistböhm}
\end{figure}
While there are many statistical methods for outlier detection, most follow a similar approach to at least one of the two methods shown here. Most of these are generally not as useful for online detection, as they require
\subsection{Density based analysis}
Outliers can also be identified by looking at the density of points. Breunig et al. \cite{breuning2000} propose calculating a local outlier factor (LOF) for each point based on the local density of its $n$ nearest neighbors. The difficulty lies in selecting a good value for $n$. If $n$ is too small, clusters of outliers might not be detected, while a large $n$ might mark points as outliers even if they lie in a large cluster of fewer than $n$ points. This problem is further exacerbated in a WSN setting, for example when streaming through the last $k$ points, as cluster sizes will not stay constant when incoming data is delayed or lost in transit.
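For reference, the LOF can be written down as follows. The notation is ours and follows the usual textbook presentation rather than the exact symbols of \cite{breuning2000}. Let $d(p,o)$ be the distance between two points, $N_n(p)$ the set of the $n$ nearest neighbors of $p$, and $d_n(o)$ the distance from $o$ to its $n$-th nearest neighbor. With the reachability distance $\mathrm{reach}_n(p,o) = \max\{d_n(o),\, d(p,o)\}$, the local reachability density and the LOF of $p$ are
\[
  \mathrm{lrd}_n(p) = \left( \frac{1}{|N_n(p)|} \sum_{o \in N_n(p)} \mathrm{reach}_n(p,o) \right)^{-1},
  \qquad
  \mathrm{LOF}_n(p) = \frac{1}{|N_n(p)|} \sum_{o \in N_n(p)} \frac{\mathrm{lrd}_n(o)}{\mathrm{lrd}_n(p)} .
\]
A value close to $1$ means that $p$ lies in a region about as dense as that of its neighbors, while values well above $1$ indicate an outlier. The dependence of every term on $N_n(\cdot)$ makes the sensitivity to the choice of $n$ explicit.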
Papadimitriou et al. \cite{papadimitriou2003} introduce a parameterless approach based on the local correlation integral (LOCI). It uses a multi-granularity deviation factor (MDEF), which is the relative deviation of the local neighborhood density of a point $p$ from the average local neighborhood density of the points within radius $r$ of $p$. LOCI provides an automated way to select good parameters for the MDEF and can detect outliers and outlier clusters with performance comparable to other statistical approaches. They also formulate aLOCI, a linear approximation of LOCI, which also gives accurate results while reducing runtime.
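Written out, and with the caveat that the notation below is our paraphrase of \cite{papadimitriou2003}: let $n(p, \alpha r)$ be the number of points within distance $\alpha r$ of $p$, with $0 < \alpha \le 1$, and let $\hat{n}(p, r, \alpha)$ be the average of $n(q, \alpha r)$ over all points $q$ in the $r$-neighborhood of $p$. Then
\[
  \mathrm{MDEF}(p, r, \alpha) = 1 - \frac{n(p, \alpha r)}{\hat{n}(p, r, \alpha)} .
\]
A point whose surroundings are as dense as those of its neighbors has an MDEF close to $0$, while an isolated point has an MDEF approaching $1$. LOCI evaluates this quantity over a range of radii $r$ and flags $p$ when the MDEF exceeds a small multiple of its normalized standard deviation within the neighborhood, which is why no neighborhood size has to be fixed in advance.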
\subsection{Principal component analysis}
Principal components of a point cloud in $\R^n$ are $n$ vectors $p_i$, where $p_i$ defines a line with minimal average squared distance to the point cloud while being orthogonal to all $p_j, j<i$. These $p_i$ define an orthogonal basis of $\R^n$. The length of each $p_i$ is directly proportional to the variance of the data in that direction. Principal Component Analysis (PCA) uses these $p_i$ to perform a change of basis for each given data point. The most common algorithm to perform PCA relies on centering the data set around the mean and then finding the eigenvectors of the covariance matrix of the point cloud \cite{jolliffee2002, macua2010}.
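The covariance-based algorithm can be summarized in three steps. The symbols $x_s$, $\bar{x}$, $C$ and $\lambda_i$ are our own shorthand for $j$ samples $x_1, \dots, x_j \in \R^n$, their mean, their sample covariance matrix and its eigenvalues:
\[
  \bar{x} = \frac{1}{j} \sum_{s=1}^{j} x_s,
  \qquad
  C = \frac{1}{j-1} \sum_{s=1}^{j} (x_s - \bar{x})(x_s - \bar{x})^{\top},
  \qquad
  C\, p_i = \lambda_i\, p_i, \quad \lambda_1 \ge \dots \ge \lambda_n \ge 0 .
\]
If the eigenvectors $p_i$ are normalized to unit length, the eigenvalue $\lambda_i$ is the variance of the data along $p_i$, so scaling $p_i$ by $\lambda_i$ yields the vectors described above.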
When using $\{p_1, \dots, p_k\}, k < n$ as the new orthogonal basis, the dimensionality can be reduced from $n$ to $k$ while retaining as much information as possible, as the dimensions with the lowest variance are discarded. PCA is computationally expensive: given a data matrix $X_{[n\times j]}$ ($j$ collections of $n$ measurements), the complexity is $\mathcal{O}(n^3)$, meaning it grows cubically with the number of measured attributes \cite{yu2017}. Most of this complexity stems from the eigenvalue decomposition used in PCA.
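As a sketch of this reduction step, assume unit-length principal components collected as columns of a matrix $P_k = [p_1, \dots, p_k]$, and let $\bar{x}$ denote the sample mean and $\lambda_i$ the variance along $p_i$ (these symbols are ours). A measurement vector $x \in \R^n$ is compressed to $y \in \R^k$ and later reconstructed as $\hat{x}$ by
\[
  y = P_k^{\top} (x - \bar{x}),
  \qquad
  \hat{x} = \bar{x} + P_k\, y .
\]
The fraction of the total variance that survives the compression is
\[
  \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i},
\]
which gives a simple criterion for choosing $k$. A large reconstruction error $\|x - \hat{x}\|$ for an individual measurement indicates that it does not follow the correlation structure of the remaining data, which is one way PCA can be used to flag outliers.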
Chan et al. \cite{chan2012} propose a solution to this problem. They develop two methods that approximate the eigenvalue decomposition by updating the state recursively and reusing large parts of earlier calculations, which reduces the computational complexity. They simulate the algorithm on existing data sets and find that it outperforms existing PCA-based solutions such as \cite{li2000, tien2004}.
Yu et al. \cite{yu2017} recognize that this solution performs well, but not well enough to run on each individual node in a network. They propose a clustered and iterative way of doing PCA that reduces the complexity on each cluster head to $\Oc(n^2t)$, where $t$ is the recursion depth. The nodes are clustered into groups with cluster heads that have more processing power. The leaf nodes send their samples to the cluster head, which reorganizes and splits the sensor data and, after an initial PCA, can update its principal components and covariance matrices more efficiently. During this process, outliers can be identified with relative ease using the known covariance of the data and the calculated principal components. Furthermore, PCA is used to reduce the dimensionality of the sensor data. The compressed data is transmitted to the base station together with the principal component vectors and the covariance matrix. This allows later reconstruction of the data with high accuracy, with errors usually below 1\% as shown in the paper, while reducing the amount of information sent.
Macua et al. \cite{macua2010} propose a truly decentralized approach: using consensus algorithms to calculate the sample mean and then approximating the global data covariance matrix. Once a good enough approximation is found, each node can perform PCA individually. This approach is not suited for deployment in low-power WSN, as it incurs considerable cost in the form of communication and especially processing power.
Most machine learning approaches focus on outlier detection, which is a common problem in WSN, as an outlier is inherently an anomaly. Outlier detection is largely unable to detect drift and has difficulties with noise, but excels at detecting data points or groups which appear to be inconsistent with the other data (i.e. spikes). A common problem is finding outliers in data with an inherently complex structure.
Supervised learning is the process of training a neural network on a set of labeled data. Acquiring labeled data sets that are applicable to the given situation is often difficult, as it requires the existence of another classification method, or labeling by hand. Furthermore, even if such a data set existed, the class imbalance (total number of positive labels vs. number of negative labels) would render the training data sub-optimal. Lastly, the data generated by a WSN might change over time without being anomalous, requiring frequent retraining \cite{ramotsoela2018}. Out of these circumstances arises the need for unsupervised or semi-supervised anomaly detection methods.
We will look into a couple of different approaches to outlier detection:
\subsection{Support vector machines (SVMs)}
Rajasegarar et al. \cite{rajasegarar2010} use SVMs, which leverage a kernel function to map the input space to a higher-dimensional feature space. This allows the SVM to model highly nonlinear patterns of normal behavior in a flexible manner: patterns that are difficult to classify in the problem space become more easily recognizable, and therefore classifiable, in the feature space. Once the data is mapped into the feature space, hyperellipsoids are fitted to the data points to define regions of the feature space in which data is classified as normal.
While this approach works well to find outliers in the data, it is also computationally expensive and incurs a large communication overhead. To decrease computational complexity, only a single hyperellipsoid is fitted to the data set. This method is called a one-class support vector machine (OCSVM). Wang et al. \cite{wang2006} originally created a model of an OCSVM; however, it required solving a computationally complex second-order cone programming problem, making it unsuitable for distributed use. Rajasegarar et al. \cite{rajasegarar2007, rajasegarar2010} improved on this OCSVM in a couple of ways.
They used the fact that numerical input data can be normalized to lie in the vicinity of the origin inside the feature space, together with the results of Laskov et al. \cite{laskov2004}, which showed that normalized numerical data is one-sided, always lying in the positive quadrants. This led to the formulation of a centered hyperellipsoidal SVM (CESVM) model, which vastly reduces computational complexity to a linear problem. Furthermore, they introduce a one-class quarter-sphere SVM (QSSVM), which reduces the communication overhead. They conclude, however, that the technique is still unfit for decentralized use because of the large remaining communication overhead, as a consensus on the radii and other parameters is still required.
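To indicate why centering at the origin simplifies the problem, the quarter-sphere formulation can be sketched as follows. This is our recollection of the formulation in the quarter-sphere literature, and the symbols $m$, $\phi$, $R$, $\xi_i$, $\nu$ and $\alpha_i$ are notation chosen here. For $m$ data points $x_i$ mapped into the feature space by $\phi$, one searches for the smallest sphere of radius $R$ around the origin that contains most of the data:
\[
  \min_{R,\, \xi} \; R^2 + \frac{1}{\nu m} \sum_{i=1}^{m} \xi_i
  \qquad \mathrm{s.t.} \qquad
  \|\phi(x_i)\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0 ,
\]
where $\nu \in (0,1]$ controls the fraction of points allowed to fall outside the sphere. With the kernel $k(x_i, x_i) = \|\phi(x_i)\|^2$, the corresponding dual problem is
\[
  \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i\, k(x_i, x_i)
  \qquad \mathrm{s.t.} \qquad
  \sum_{i=1}^{m} \alpha_i = 1, \quad 0 \le \alpha_i \le \frac{1}{\nu m} ,
\]
which is linear in $\alpha$ because the sphere's center is fixed and no longer has to be optimized. Points with $\|\phi(x_i)\|^2 > R^2$ are classified as outliers.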
\subsection{Extreme learning}
When working decentralized in an environment where data is funneled into sinks, it is still possible to obtain additional data without additional overhead, just by listening to other nodes' broadcasts. This data can be fed into various prediction models, which can then be used to calculate a confidence level for the node's own measurements.
Bosman et al. \cite{bosman2017} look at the performance of recursive least squares (RLS) and the online sequential extreme learning machine (OS-ELM) approach to train a single-layer feed-forward neural network (SLFN). These are compared to first-degree polynomial function approximation (FA) and sliding window mean prediction. The article shows that incorporating neighborhood information improves anomaly detection only in cases where the data set is well correlated and shows low spatial entropy, as is common in most natural monitoring applications. When the data set does not correlate well, or there is too much spatial entropy, the methods described in the article fail to predict anomalies. It concludes that neighborhood aggregation is not useful beyond 5 neighbors, as such a large data set will fail to meet the aforementioned conditions. The exact size of the optimal neighborhood will vary with network topology and sensor modality.
Here, all four types of anomalies were accounted for in the data set, but there was no analysis of how well each kind of anomaly was detected.
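The simplest predictor compared above, the sliding window mean, illustrates how prediction-based detection works in general. The window length $w$ and the threshold $\delta$ are illustrative symbols of ours and not parameters taken from \cite{bosman2017}. The prediction $\hat{x}_t$ for the measurement $x_t$ at time $t$ and the resulting prediction error $e_t$ are
\[
  \hat{x}_t = \frac{1}{w} \sum_{i=1}^{w} x_{t-i},
  \qquad
  e_t = |x_t - \hat{x}_t| ,
\]
and $x_t$ is flagged as anomalous when $e_t > \delta$. The learning-based methods follow the same scheme but replace $\hat{x}_t$ with the output of the trained model, which may additionally take overheard neighbor measurements as input.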
\subsection{Deep learning approaches}
Deep learning techniques for anomaly detection in WSN aim at solving a slightly different problem than the other methods mentioned thus far. As the amount of data produced by WSN increases, whether through higher node counts, higher sensor counts, or the addition of high-output sensors such as cameras, traditional outlier detection algorithms might not be capable of keeping up \cite{chalapathy2019}.
In such environments, the analysis is often moved to the cloud \cite{yu2017}, removing some of the restrictions originally imposed by WSN. While this survey will not discuss topics such as image recognition or anomaly detection in video \cite{kiran2018}, we will highlight some interesting results using deep neural networks to predict or detect anomalies in sensor networks.
Zhang et al. \cite{zhang2018} use LSTM neural networks to analyze and predict the working condition of a water turbine. A long short-term memory (LSTM) neural network is a kind of recurrent neural network that contains short-term memory blocks consisting of memory cells which can hold on to state information, making it possible to analyze time series such as stock market data or to perform natural language processing. The downside of LSTM models, and of machine learning in general, is the amount of data required to train them. Zhang et al. collected sufficient data, including anomalies, over the span of three months. They removed noise, labeled outliers, and then used the result as training data.
They found that they can not only predict future sensor measurements with high accuracy (root mean square error below $0.01$, even for complex sensor patterns), but can also identify and, to an extent, predict failures with their model (Figure \ref{fig:zhangpump}).
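For reference, the root mean square error over $N$ predictions $\hat{x}_t$ of the measurements $x_t$ is defined as
\[
  \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} (x_t - \hat{x}_t)^2} ,
\]
where the symbols are ours. Since the RMSE is expressed in the units of the data, the quoted bound of $0.01$ is only meaningful relative to the scale of the sensor values used in the paper.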