@ -47,7 +47,7 @@ A Wireless Sensor Network (WSN) is commonly defined as a collection of battery p
The power required to transmit data is often the largest contributing factor to the lifetime of each node, as it drains the battery \cite{sheng2007}. Especially if the network collects large amounts of data or spans large areas, a lot of energy can be saved by reducing the number and size of the transmissions. An ideal solution would be to not send unimportant data at all. This creates the need for anomaly detection in WSNs, which enables nodes to identify important data themselves. Energy savings are, however, not the only reason why anomaly detection is interesting. Some WSNs are deployed to detect phenomena such as forest fires \cite{hefeeda2007wireless} or to monitor active volcanoes \cite{werner2006deploying}. In these cases, anomaly detection is not only used to limit the required communication, but also to fulfill the core purpose of the network.
Not all approaches to anomaly detection in WSNs are able to run directly on the node. This survey therefore differentiates between \emph{decentralized} (algorithms running directly on the node) and \emph{centralized} (running at a central location) methods. A decentralized approach is not always beneficial, as some networks are less restricted by their energy budget (for example by having a power supply) and would rather use greater computational power and a complete set of data (meaning data from all sensors, not just ones in a local area) to improve their detection and/or prediction accuracy. This is often encountered in industrial settings \cite{ramotsoela2018}.
Not all approaches to anomaly detection in WSNs are able to run directly on the node. This survey therefore differentiates between \emph{decentralized} (algorithms running directly on the node) and \emph{centralized} (running at a central location) methods. A decentralized approach is not always beneficial, as some networks are less restricted by their energy budget (for example by having a power supply or being frequently serviced by personnel) and would rather use greater computational power and a complete set of data (meaning data from all sensors, not just ones in a local area) to improve their detection and/or prediction accuracy. This is often encountered in industrial settings \cite{ramotsoela2018}.
Another factor for these models is the network topology. In a non-static WSN, a model using neighborhood information has to account for changes in the surrounding network topology, as the number of neighbors changes, or previously measured neighbor data no longer belongs to the current neighborhood. If the node keeps track of previous measurements, it also needs to take into account how its changes in position might influence the measured data.
@ -73,13 +73,13 @@ In the general field of anomaly detection, more advanced definitions of anomalie
\caption{Pattern-based anomaly corresponding to an Atrial Premature Contraction in an electrocardiogram, image from Chandola et al. \cite{chandola2009}.}
\label{fig:patternanomaly}
\label{fig:patternanomaly}
\end{figure}
\end{figure}
@ -101,29 +101,24 @@ The problem of outlier detection in WSNs is the creation of a model which can us
After the introduction and coverage of related work, we will look into sensor self-calibration, a method of improving sensor accuracy. Calibrating a sensor removes constant offsets, enabling nodes to compare measurements with one another more easily. If a sensor is in use for a prolonged period of time, it might need to be recalibrated to remove sensor drift.
Then we will look into conventional, model-based approaches to outlier detection, such as statistical or density-based models, followed by the more recent machine learning based models. Finally, all presented models are summarized in a table and evaluated based on their properties and requirements.
Afterwards we will look into a collection of different outlier detection methods, ranging from statistical methods to machine learning.
\section{Related Work}
Chandola et al. \cite{chandola2009} provide a very comprehensive survey on outlier detection in general, not just focused on WSN. They introduce many key concepts and definitions, but focus more on outliers than anomalies in general.
Chandola et al. \cite{chandola2009} provide a very comprehensive survey on anomaly detection in general, not just focused on WSN. They introduce many key concepts and definitions, but focus more on outliers than anomalies in general.
O'Reilly et al. \cite{oreilly2014} look into anomaly detection in WSNs in the specific context of non-stationary environments, meaning environments where the ``normal'' state evolves over time instead of remaining static. Due to the nature of the problem, almost all approaches presented there have some machine learning aspects to them, as they need to first detect when a change of model is required, and then create a new model that conforms to the new data sensed by the network.
McDonald et al. \cite{mcdonald2013} survey methods of finding outliers in WSNs, with a focus on distributed solutions. They go into a moderate amount of detail on most solutions, but skip over methods such as principal component analysis and support vector machines, which were already maturing at that point in time. Instead, they only present distance and density based approaches.
McDonald et al. \cite{mcdonald2013} survey methods of finding outliers in WSNs, with a focus on distributed solutions. They go into a moderate amount of detail on most solutions, but skip over methods such as principal component analysis and support vector machines, which were already maturing at that point in time.
Barcelo-Ordinas et al. \cite{barcelo2019} provide a very in-depth reference study for sensor self-calibration, analyzing 39 different approaches in several different categories. This survey is covered further in the section on sensor self-calibration.
Ramotsoela et al. \cite{ramotsoela2018} survey anomaly detection in industrial settings, where machine learning is preferred due to the observed phenomena being more complex. The survey covers both intrusion detection and outlier detection methods, and compiles a table of 17 different approaches to anomaly detection. They look at six fundamentally different approaches and score them based on accuracy, prior knowledge, complexity and data prediction. They look more closely at $k$-nearest neighbor models, but find problems similar to those mentioned in chapters \ref{sec:distance} and \ref{sec:density}.
Ramotsoela et al. \cite{ramotsoela2018}
Further information concerning advanced machine learning models such as deep learning techniques is provided by Chalapathy et al. \cite{chalapathy2019} and Kakanakova et al. \cite{kakanakova2017}. Neither of these surveys focuses on WSNs, but both propose methods which are applicable to the general field.
Chalapathy et al. \cite{chalapathy2019}
Kakanakova et al. \cite{kakanakova2017}
Barcelo-Ordinas et al. \cite{barcelo2019} survey self-calibration methods for WSNs,
@ -208,8 +203,16 @@ After the update phase, we obtain $\hat{x}_{k|k}$, which is our best approximati
Sirisanwannakul et al. take the computed Kalman gain and examine its bias. In normal operation, the gain is biased towards the measurement. If the sensor malfunctions, the gain is biased towards the prediction. If the gain's bias lies between prediction and measurement, the system assumes sensor drift and corrects for it automatically. Since this approach lacks a ground truth measurement, it cannot recalibrate the sensor, but the paper shows that the accumulated error can be reduced by more than 50\%.
We consider a classical approach to be anything that uses conventional (non-machine learning) models or algorithms to perform outlier detection. This chapter will first look at
This chapter will analyze a couple of fundamentally different approaches to outlier detection. The approaches are roughly ordered by age, where newer approaches come last. We will start with basic methods that are used outside of WSNs and transition to more specific applications. All approaches covered here are listed in Table~\ref{tbl:comparison} at the end of the survey and analyzed by a couple of key metrics:
\begin{itemize}
\item\emph{Prior knowledge}: Does an approach require any prior knowledge, for example for constructing models beforehand or for training machine learning models?
\item\emph{Centralized/Decentralized}: Is the outlier detection performed on individual nodes, or at a centralized sink? Some methods work both ways, and some work in a clustered approach.
\item\emph{Required topology}: If an approach requires a static topology, nodes must be stationary.
\item\emph{Communication}: How much communication is required by this approach? ``Normal'' means about the same as streaming all data to the sink, ``Prohibitive'' means that the approach is not usable as-is and requires some optimization.
\item\emph{Recalibration}: Does the model need recalibration or updates when the environment around it changes?
\end{itemize}
\subsection{Statistical Analysis}
Classical statistical analysis is done by creating a statistical model of the expected data and then finding the probability of each recorded data point under that model. Improbable data points are then deemed outliers. The problem for many statistical approaches is finding this model of the expected data, as it is not always feasible to create it in advance, for example when the nature of the phenomenon is not well known or when the expected data is too complex. Such models are also not very robust to changes in the environment \cite{mcdonald2013} and require frequent updates if the environment changes in ways not foreseen by the model.
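
As a minimal illustration of this idea, assume (purely for the sake of the example) that the expected measurements follow a normal distribution with mean $\mu$ and standard deviation $\sigma$, both estimated from previously recorded data. The symbols $\mu$, $\sigma$ and the threshold $c$ are assumptions of this sketch and are not taken from a specific surveyed method. A new measurement $x$ would then be flagged as an outlier if
\[
	\frac{|x - \mu|}{\sigma} > c .
\]
With the common choice $c = 3$, about $99.7\%$ of the data generated by the assumed model falls inside the accepted range. A temperature sensor with $\mu = 20$ and $\sigma = 0.5$ would therefore flag any reading outside the interval $[18.5, 21.5]$. If the environment changes so that the true mean drifts away from $\mu$, normal readings start to be flagged, which illustrates why such models need to be updated.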
@ -226,59 +229,68 @@ Since this process not only detects outliers, but does a complete clustering of
\label{fig:probdistböhm}
\label{fig:probdistböhm}
\end{figure}
\end{figure}
\subsection{Density Based Analysis}
Outliers can be selected by looking at the density of points as well. Breunig et al. \cite{breuning2000} propose a method of calculating a local outlier factor (LOF) for each point based on the local density of its $n$ nearest neighbors. The difficulty lies in selecting a good value for $n$. If $n$ is too small, clusters of outliers might not be detected, while a large $n$ might mark points as outliers even if they are part of a large cluster of fewer than $n$ points. This problem is further exacerbated when we try to use this in a WSN setting, for example by streaming through the last $k$ points, as the cluster size will not stay constant because incoming data might be delayed or lost in transit.
Papadimitriou et al. \cite{papadimitriou2003} introduce a parameterless approach. They formulate a method using a local correlation integral (LOCI), which does not require parametrization. It uses a multi-granularity deviation factor (MDEF), which is the relative deviation for a point $p$ in a radius $r$. The MDEF compares the number of points in the local neighborhood of $p$ with the average of that number over all points in the $r$-neighborhood of $p$. LOCI provides an automated way to select good parameters for the MDEF and can detect outliers and outlier clusters with performance comparable to other statistical approaches. They also formulate aLOCI, a linear approximation of LOCI, which also gives accurate results while reducing runtime. This approach can be used in a centralized, decentralized or clustered fashion, depending on the scale of the event of interest. aLOCI seems well suited even for running on the sensor nodes themselves, as it has relatively low computational complexity.
\subsection{Distance Based Analysis}\label{sec:distance}
An older solution to finding outliers in data is the distance based approach, which assigns an anomaly score to each data point based on the distance to its $k$ nearest neighbors \cite{zhang2006detecting}. This approach, however, fails at detecting outliers in a system with two or more clusters that do not have the same density. Figure~\ref{fig:densityproblem} shows two clusters $C_1$ and $C_2$ with different densities. The point $p_1$ will either be incorrectly identified as a non-outlier, or the whole of $C_1$ will be identified as outliers together with $p_1$.
\caption{Two sets of clusters $C_1$ and $C_2$, and two outliers $p_1$ and $p_2$. Image from Chandola et al. \cite{chandola2009}.}
\label{fig:densityproblem}
\end{figure}
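
One common way to formalize the distance based score described above is sketched below. The notation ($s_k$, $N_k$, $d$, $\tau$) is introduced here for illustration only, and individual methods differ in the exact definition. Let $N_k(x)$ denote the set of the $k$ nearest neighbors of a data point $x$ and let $d(\cdot,\cdot)$ be a distance measure such as the Euclidean distance. Then
\[
	s_k(x) = \frac{1}{k} \sum_{y \in N_k(x)} d(x, y),
\]
and $x$ is reported as an outlier if $s_k(x)$ exceeds a global threshold $\tau$. Because $\tau$ is the same for all points, a value small enough to flag a point lying just outside a dense cluster also flags every point of a sparser cluster whose typical neighbor distances exceed $\tau$. This is the problem illustrated in Figure~\ref{fig:densityproblem}.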
\subsection{Distance Based Approaches}
\subsection{Density Based Analysis}\label{sec:density}
Outliers can be selected by looking at the density of points as well. If done correctly, the problem described above can be prevented. Breunig et al. \cite{breuning2000} propose a method of calculating a local outlier factor (LOF) for each point based on the local density of its $n$ nearest neighbors. The difficulty lies in selecting a good value for $n$. If $n$ is too small, clusters of outliers might not be detected, while a large $n$ might mark points as outliers even if they are part of a large cluster of fewer than $n$ points. This problem is further exacerbated when we try to use this in a WSN setting, for example by streaming through the last $k$ points, as the cluster size will not stay constant when incoming data is delayed or lost in transit.
Papadimitriou et al. \cite{papadimitriou2003} introduce a parameterless approach. They formulate a method using a local correlation integral (LOCI), which does not require parametrization. It uses a multi-granularity deviation factor (MDEF), which is the relative deviation for a point $p$ in a radius $r$. The MDEF compares the number of points in the local neighborhood of $p$ with the average of that number over all points in the $r$-neighborhood of $p$. LOCI provides an automated way to select good parameters for the MDEF and can detect outliers and outlier clusters with performance comparable to other statistical approaches. They also formulate aLOCI, a linear approximation of LOCI, which also gives accurate results while reducing runtime. This approach can be used in a centralized, decentralized or clustered fashion, depending on the scale of the event of interest. aLOCI seems well suited even for running on the sensor nodes themselves, as it has relatively low computational complexity.
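
For reference, the MDEF can be written down as follows. This is a sketch in the notation commonly used for LOCI \cite{papadimitriou2003}; the exact symbols should be checked against the original paper. Let $n(p, \alpha r)$ be the number of points within distance $\alpha r$ of $p$, where $0 < \alpha < 1$ is a scaling parameter, and let $\hat{n}(p, r, \alpha)$ be the average of $n(q, \alpha r)$ over all points $q$ in the $r$-neighborhood of $p$. Then
\[
	\mathrm{MDEF}(p, r, \alpha) = 1 - \frac{n(p, \alpha r)}{\hat{n}(p, r, \alpha)} .
\]
A point lying in a region that is as dense as its surroundings has an MDEF close to $0$, while an isolated point has an MDEF close to $1$. Writing $\sigma_{\hat{n}}(p, r, \alpha)$ for the standard deviation of $n(q, \alpha r)$ over the same neighborhood, a point is flagged as an outlier if, for some radius $r$,
\[
	\mathrm{MDEF}(p, r, \alpha) > c \, \frac{\sigma_{\hat{n}}(p, r, \alpha)}{\hat{n}(p, r, \alpha)},
\]
where the constant $c$ is a fixed multiple (to our understanding $c = 3$ in the original formulation). Since the threshold is derived from the data itself and all radii are examined, no neighborhood size has to be chosen by the user.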
\subsection{Principal Component Analysis}
Another way of detecting outliers is by computing the Principal Component Analysis (PCA) of the collected data. This way one can find the variance of the collected data in each axis. If a measured data point is far outside the expected variance ranges, it can be flagged as anomalous. PCA can also be used to reduce the number of dimensions a set of data contains while minimizing the loss of meaningful information.
\caption{An example of reducing a three-dimensional dataset to two dimensions using PCA to minimize loss of information. The PCA vectors are marked red.}
\label{fig:pca}
\label{fig:pca}
\end{figure*}
\end{figure*}
Principal components of a point cloud in $\R^n$ are $n$ vectors $p_i$, where $p_i$ defines a line with minimal average square distance to the point cloud while lying orthogonal to all $p_j, j<i$. These $p_i$ define an orthogonal basis of $\R^n$. The length of each $p_i$ is directly proportionate to the variance of the data in that direction. Principal Component Analysis (PCA) uses these $p_i$ to perform a change of basis of each given data point. The most common algorithm to perform PCA relies on centering the data set around the mean and then finding the eigenvectors of the covariance matrix of the point cloud \cite{jolliffee2002, macua2010}.
The principal components of a point cloud in $\R^n$ are $n$ vectors $p_i$, where $p_i$ defines a line with minimal average square distance to the point cloud while lying orthogonal to all $p_j, j<i$. These $p_i$ define an orthogonal basis of $\R^n$. The length of each $p_i$ is directly proportionate to the variance of the data in that direction. The $p_i$ obtained from the PCA can be used to perform a change of basis of each given data point. The most common algorithm to perform PCA relies on centering the data set around the mean and then finding the eigenvectors of the covariance matrix of the point cloud \cite{jolliffee2002, macua2010}.
When using $\{p_1, \dots, p_k\}, k < n$ as the new orthogonal basis of the data set, the dimensional complexity of the data can be reduced from $n$ to $k$ while retaining as much information as possible (cf. Figure~\ref{fig:pca}), as the dimensions with the lowest variance are discarded. PCA is computationally expensive: given a data matrix $X_{[n\times j]}$ ($j$ collections of $n$ measurements), the complexity is $\mathcal{O}(n^3)$, meaning it grows cubically with the number of measured attributes \cite{yu2017}. Most of this complexity stems from the eigenvalue decomposition used in PCA.
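
The computation described above can be summarized as follows. The symbols $\bar{x}$, $C$, $\lambda_i$ and $P_k$ are introduced here for illustration, and the sketch assumes unit-length eigenvectors, so that the variance is carried by the eigenvalues rather than by the length of the $p_i$. Writing the $j$ measurement vectors as $x_1, \dots, x_j \in \R^n$, the data is first centered and the sample covariance matrix is formed:
\[
	\bar{x} = \frac{1}{j}\sum_{l=1}^{j} x_l, \qquad
	C = \frac{1}{j-1}\sum_{l=1}^{j} (x_l - \bar{x})(x_l - \bar{x})^T .
\]
The principal components are the eigenvectors of $C$, ordered by decreasing eigenvalue:
\[
	C p_i = \lambda_i p_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0 ,
\]
where $\lambda_i$ is the variance of the data along $p_i$. With $P_k = [\, p_1 \; \cdots \; p_k \,]$, a data point $x$ is reduced to $k$ dimensions and approximately reconstructed by
\[
	y = P_k^T (x - \bar{x}), \qquad \tilde{x} = \bar{x} + P_k \, y .
\]
The cubic cost mentioned above arises from the eigenvalue decomposition of the $n \times n$ matrix $C$.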
Chan et al. \cite{chan2012} propose a solution to this problem. They develop two methods to approximate the eigenvalue decomposition by updating the state recursively and reusing large parts of the previous calculation, which reduces the computational complexity. They simulate this algorithm on existing data sets and find that it outperforms existing PCA based solutions such as \cite{li2000, tien2004}.
Yu et al. \cite{yu2017} recognize that this solution performs well, but is too expensive to run on each individual node in a network. They propose a clustered and iterative way of doing PCA that reduces the complexity on each cluster head to $\Oc(n^2t)$, where $t$ is the recursion depth. To this end, the nodes are clustered into groups with cluster heads which have more processing power. The leaf nodes send their samples to the cluster head, which then reorganizes and splits the sensor data and, after an initial PCA, can update its principal components and covariance matrices more efficiently. During this process, outliers can be identified with relative ease using the known covariance of the data and the calculated principal components. Furthermore, PCA is used to decrease the dimensional complexity of the sensor data. This compressed data is transmitted to the base station together with the principal component vectors and the covariance matrix. This allows for later reconstruction of the data with high accuracy, with errors usually below 1\%, while reducing the amount of information sent.
Macua et al. \cite{macua2010} propose a truly decentralized approach: using consensus algorithms to calculate the sample mean and then approximating the global data covariance matrix. Once a good enough approximation is found, each node can do PCA individually. This approach is not suited for deployment in low-power WSNs, as it incurs considerable cost in the form of communication and especially processing power. This work is still mentioned, as it proved
\subsection{Generalized Hebbian Algorithm}
Ali et al. \cite{ali2015} propose an approach to detect and identify events using the Generalized Hebbian Algorithm (GHA). Event detection is important in anomaly detection, but event identification is almost equally important, especially when a sensor network is used to detect an event spanning multiple nodes. They propose a combined algorithm to detect, identify and communicate both local and global events in a WSN. This is achieved by calculating identification ratios, i.e. the percentage each attribute contributed to the event, before broadcasting the detected event.
Ali et al. \cite{ali2015} propose an approach to detect and identify events using the Generalized Hebbian Algorithm (GHA). Event detection is important in anomaly detection, but event identification is almost equally important, especially when a sensor network is used to detect an event spanning multiple nodes and sensors. They propose a combined algorithm to detect, identify and communicate both local and global events in a WSN. This is achieved by calculating identification ratios, i.e. the percentage each attribute contributed to the event, before broadcasting the detected event.
They start off with an outlier detection scheme using hyper-ellipsoids fitted around 98\% of their data points to detect outliers, using an iterative boundary estimation model based on the model formulated by Moshtaghi et al. \cite{moshtaghi2011} called Forgetting Factor Iterative Data Capture Anomaly Detection (FFIDCAD). It can compute multidimensional boundaries of the local model online in an iterative fashion, reducing the amount of required computation immensely, while also working in non-stationary environments and under changing network topology due to the forgetting factor. A local event is declared after observing more than $q$ outliers in a row, where $q$ is chosen depending on sampling rate and required temporal resolution.
They start off with an outlier detection scheme using hyper-ellipsoids fitted around 98\% of their data points to detect outliers, using an iterative boundary estimation model based on the model formulated by Moshtaghi et al. \cite{moshtaghi2011} called Forgetting Factor Iterative Data Capture Anomaly Detection (FFIDCAD). It can compute multidimensional boundaries of the local model online in an iterative fashion, reducing the amount of required computation immensely, while also working in non-stationary environments and under changing network topology due to the forgetting factor. The forgetting factor enables the model to forget older data points that do not fit the newer data. A local event is declared after observing more than $q$ outliers in a row, where $q$ is chosen depending on sampling rate and required temporal resolution.
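
The hyper-ellipsoidal boundary can be sketched as follows. The symbols $m$, $S$ and $t$ are chosen here for illustration, and the use of a chi-squared quantile assumes approximately normally distributed data, which is an assumption of this sketch. With sample mean $m$ and sample covariance $S$ of the $d$-dimensional measurements, a point $x$ lies inside the boundary if its squared Mahalanobis distance satisfies
\[
	(x - m)^T S^{-1} (x - m) \le t^2 .
\]
Choosing $t^2$ as the $0.98$ quantile of the chi-squared distribution with $d$ degrees of freedom encloses about $98\%$ of such data, and points violating the inequality are treated as outliers. In an iterative scheme, $m$ and $S^{-1}$ are updated with every new measurement instead of being recomputed from a stored batch.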
Once an event is detected, Ali et al. propose using a Generalized Hebbian Algorithm (GHA) to replace the Eigenvalue Decomposition (EVD) commonly used in offline identification schemes such as PCA. EVD requires large batches of measurements to accurately compute principal components, while GHA can work online in a streaming fashion. They further show that their online GHA-based approach has similar accuracy to offline EVD based techniques, while vastly reducing computational complexity. Once the eigenvectors are calculated, the last measurement is projected onto the calculated eigenvectors and whitened, creating a vector containing the identification ratios for each attribute.
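
For orientation, the update rule of the GHA in its standard form (Sanger's rule) is reproduced below. Whether Ali et al. use exactly this variant is not stated here, so it should be read as a generic sketch; the symbols $W_t$, $\eta$, $m$ and $\mathrm{LT}[\cdot]$ are introduced for illustration. Let $W_t$ be an $m \times d$ matrix whose rows approximate the first $m$ eigenvectors of the data covariance, $x_t$ the current mean-centered measurement and $y_t = W_t x_t$ its projection. Then
\[
	W_{t+1} = W_t + \eta \left( y_t x_t^T - \mathrm{LT}\!\left[ y_t y_t^T \right] W_t \right),
\]
where $\eta$ is a small learning rate and $\mathrm{LT}[\cdot]$ sets all entries above the diagonal to zero. Each update uses only the current measurement, which is why the rule can run in a streaming fashion without storing batches of data or forming a covariance matrix.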
Ali et al. claim that their algorithm has complexity of $\mathcal{O}(nd^2)$, compared to $\mathcal{O}(n^2+nd^2)$ of common SVM based approaches \cite{shahid2012a,shahid2012b}. Here $n$ is the number of measurements and $d$ is the number of attributes. Furthermore, due to the online nature of this approach, communication overhead is much lower, as only detected local events have to be broadcast, instead of the ongoing exchange of support vectors that have to be broadcast in the SVM approaches mentioned in Chapter~\ref{cap:svm}.
Ali et al. claim that their algorithm has complexity of $\mathcal{O}(nd^2)$, compared to $\mathcal{O}(n^2+nd^2)$ of common SVM based approaches \cite{shahid2012a,shahid2012b}. Here $n$ is the number of measurements and $d$ is the number of attributes. Furthermore, due to the online nature of this approach, communication overhead is much lower, as only detected local events have to be broadcast.
Most machine learning approaches focus either on outlier detection through data classification, or outlier detection through data prediction. The former trains a model to distinguish anomalous from benign data by identifying key features in it, while the latter uses machine learning to build a model of the observed process that is able to predict future measurements. While the second class of model seems to be desirable, it
Most machine learning approaches focus either on outlier detection through data classification, or outlier detection through data prediction. The former trains a model to distinguish anomalous from benign data by identifying key features in it, while the latter uses machine learning to build a model of the observed process that is able to predict future measurements. While the second class of model seems to be desirable, it also adds complexity, making it difficult to implement well in a distributed fashion.
Supervised learning is the process of training a model, such as a neural network, on a set of labeled data. Acquiring labeled data sets that are applicable to the given situation is often difficult, as it requires the existence of another classification method, or labeling by hand. Furthermore, even if such a data set existed, the class imbalance (total number of positive labels vs. number of negative labels) would render the training data sub-optimal. Lastly, the data generated by a WSN might change over time without being anomalous, requiring frequent retraining \cite{ramotsoela2018}. Out of these circumstances arises the need for unsupervised or semi-supervised anomaly detection methods.
We will look into a couple of different approaches to outlier detection:
We will look into a couple of different approaches to outlier detection using machine learning techniques:
SVMs leverage a kernel function to map the input space to a higher-dimensional feature space. This allows highly nonlinear patterns of normal behavior to be modeled in a flexible manner: patterns that are difficult to classify in the problem space become more easily recognizable, and therefore classifiable, in the feature space. Once the data is mapped into the feature space, hyperellipsoids or other shapes are fitted to the data points to define regions of the feature space that classify the data as normal or anomalous. This allows SVM-based models to find even pattern-based anomalies.
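To make the role of the kernel concrete: the mapping $\phi$ into the feature space never has to be computed explicitly, since only inner products are needed. The notation below is our own, and the Gaussian kernel is shown as one common choice, not necessarily the one used in the works cited here:
\[
  k(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle, \qquad
  k_{\mathrm{RBF}}(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{y}\|^{2}}{2\sigma^{2}} \right),
\]
where $\sigma > 0$ is a width parameter.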
While this approach works well to find outliers in the data, it is also computationally expensive and incurs a large communication overhead. In an attempt to decrease computational complexity, only a single hyperellipsoid is fitted to the data set. This method is called a one-class support vector machine (OCSVM). Wang et al. \cite{wang2006} originally created an OCSVM model; however, it required the solution of a computationally complex second-order cone programming problem, making it unsuitable for distributed use. Rajasegarar et al. \cite{rajasegarar2007, rajasegarar2010} improved on this OCSVM in a couple of ways.
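For illustration, consider the simpler spherical case (an assumption made here for brevity; the symbols are our own). If the boundary is a sphere in feature space with center $\mathbf{c} = \sum_i \alpha_i \phi(\mathbf{x}_i)$ and radius $R$, a measurement $\mathbf{x}$ is declared anomalous when
\[
  \|\phi(\mathbf{x}) - \mathbf{c}\|^{2}
  = k(\mathbf{x}, \mathbf{x}) - 2 \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x}) + \sum_{i,j} \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) > R^{2} .
\]
The coefficients $\alpha_i$ and the radius $R$ result from training; the test itself only requires kernel evaluations against the stored training points.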
@ -289,38 +301,37 @@ The QSSVM was further improved in 2012 by Shahid et al. \cite{shahid2012a, shahi
\subsection{Extreme Learning}
When working in a decentralized fashion in an environment where data is funneled into sinks, nodes can still obtain additional data without extra overhead simply by listening to other nodes' broadcasts.
Extreme learning machines (ELM) are machine learning models consisting of nodes organized in layers, connected by edges. The first layer is called the \emph{input layer} and the last one is called the \emph{output layer}. All layers in between are called \emph{hidden layers}. An ELM is a so-called \emph{feed-forward} network, meaning the graph of nodes and edges is acyclic. Huang et al. \cite{huang2011extreme,huang2015extreme} show that ELMs can outperform SVMs in classification applications.
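In the single-hidden-layer case, an ELM can be written compactly. The following is the textbook formulation in our own notation, with $L$ hidden nodes, activation function $g$, and $N$ training pairs $(\mathbf{x}_i, \mathbf{t}_i)$:
\[
  f(\mathbf{x}) = \sum_{j=1}^{L} \beta_j \, g(\mathbf{w}_j^{\top} \mathbf{x} + b_j), \qquad
  H_{ij} = g(\mathbf{w}_j^{\top} \mathbf{x}_i + b_j), \qquad
  \hat{B} = H^{\dagger} T .
\]
Here $B$ stacks the output weights $\beta_j$ and $T$ stacks the targets $\mathbf{t}_i$. The hidden-layer parameters $\mathbf{w}_j$ and $b_j$ are drawn at random and left fixed; only the output weights are fitted, using the Moore-Penrose pseudoinverse $H^{\dagger}$ of the hidden-layer output matrix. Training therefore reduces to a single linear least-squares problem.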
Bosman et al. \cite{bosman2013, bosman2017} look at the performance of recursive least squares (RLS) and the online sequential extreme learning machine (OS-ELM) approach to train a single-layer feed-forward neural network (SLFN). This decreases the computational complexity and enables this approach to run online on each node. They further incorporate first-degree polynomial function approximation (FA) and sliding window mean prediction into their model. They show that incorporating neighborhood information improves anomaly detection only in cases where the data set is well correlated and shows low spatial entropy, as is common in most natural monitoring applications. When the data set does not correlate well, or there is too much spatial entropy, the methods they describe fail to predict anomalies. They conclude that neighborhood aggregation is not useful beyond 5 neighbors, as such a large data set will fail to meet the aforementioned conditions. The exact size of the optimal neighborhood will vary with network topology and sensor modality.
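The recursive least squares update can be sketched as follows, in its standard textbook form rather than the exact variant used by Bosman et al. Here $\mathbf{h}_t$ is the feature (hidden-layer output) vector of the sample at time $t$, $y_t$ the target, $\beta_t$ the output weights, $P_t$ the running estimate of the inverse correlation matrix, and $\lambda \in (0, 1]$ a forgetting factor; all symbols are our own:
\begin{eqnarray*}
  \mathbf{k}_t &=& \frac{P_{t-1} \mathbf{h}_t}{\lambda + \mathbf{h}_t^{\top} P_{t-1} \mathbf{h}_t}, \\
  \beta_t &=& \beta_{t-1} + \mathbf{k}_t \left( y_t - \mathbf{h}_t^{\top} \beta_{t-1} \right), \\
  P_t &=& \frac{1}{\lambda} \left( P_{t-1} - \mathbf{k}_t \mathbf{h}_t^{\top} P_{t-1} \right).
\end{eqnarray*}
Each step needs only the previous state $(\beta_{t-1}, P_{t-1})$ and the newest sample, so no measurement history has to be stored on the node. With $\lambda = 1$ and one sample per update, the OS-ELM update takes the same form.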
Here, all four types of anomalies were accounted for in the data set, but there was no analysis of how well each kind of anomaly was detected.
\caption{LSTM prediction results of water pump sensor data from Zhang et al. \cite{zhang2018}}
\label{fig:zhangpump}
\end{figure}
\subsection{Deep Learning Approaches}
Deep learning techniques for anomaly detection in WSN aim at solving a slightly different problem than the other methods mentioned thus far. As the amount of data that WSN produce increases, whether through a higher node count, a higher sensor count, or the addition of high-output sensors such as cameras, traditional outlier detection algorithms might not be capable of keeping up \cite{chalapathy2019}.
In such environments, the analysis is often moved to the cloud \cite{yu2017}, removing some of the restrictions originally introduced by WSN. While this paper will not discuss topics such as image recognition or anomaly detection in video \cite{kiran2018}, we will highlight some interesting results that use deep neural networks to predict or detect anomalies in sensor networks.
Zhang et al. \cite{zhang2018} use LSTM neural networks to analyze and predict the working condition of a water turbine. A Long Short-Term Memory (LSTM) neural network is a kind of recurrent neural network that contains short-term memory blocks consisting of memory cells which can hold on to state information, making it possible to analyze time series such as stock market data or to perform natural language processing. The downside of LSTM models, and of machine learning in general, is the amount of data required to train them. Zhang et al. collected sufficient data, including anomalies, over the span of three months. They removed noise, labeled outliers, and then used the result as training data.
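For reference, a memory cell in the commonly used LSTM formulation is updated as follows. This is the generic form in our own notation; the exact variant used by Zhang et al. is not specified here. With input $\mathbf{x}_t$, hidden state $\mathbf{h}_t$, cell state $\mathbf{c}_t$, logistic function $\sigma$ and element-wise product $\odot$:
\begin{eqnarray*}
  \mathbf{f}_t &=& \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f), \\
  \mathbf{i}_t &=& \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i), \\
  \mathbf{o}_t &=& \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o), \\
  \tilde{\mathbf{c}}_t &=& \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c), \\
  \mathbf{c}_t &=& \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t, \\
  \mathbf{h}_t &=& \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{eqnarray*}
The forget gate $\mathbf{f}_t$ controls how much of the previous cell state is retained, which is what allows the cell to hold on to state information over many time steps.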
They found that they can not only predict future sensor measurements with high accuracy (root mean square error below $0.01$, even for complex sensor patterns), but can also identify and, to an extent, predict failures with their model (Figure~\ref{fig:zhangpump}).
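For reference, the root mean square error over $N$ predictions is defined as
\[
  \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{t=1}^{N} \left( x_t - \hat{x}_t \right)^{2} },
\]
where $x_t$ is the measured value and $\hat{x}_t$ the predicted one (notation ours). Since the RMSE is expressed in the units of the data, a bound of $0.01$ is only meaningful relative to the scale of the sensor values, which is not stated in the summary above.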
Kakanakova et al. \cite{kakanakova2017} look at a more generalized form of outlier detection using deep neural networks called Deep Belief Networks (DBN). A DBN consists of a composition of so-called Restricted Boltzmann Machines (RBM), where the output of each RBM serves as the input for the next. The input of the first RBM serves as the input of the DBN, and the last RBM's output is the output of the whole DBN.
An RBM is a graph of nodes connected by weights, consisting of two types of nodes: visible and hidden. Weighted connections only span between hidden and visible nodes, meaning there are no connections among the hidden nodes (similar to a one-layer neural network). The RBM has an input node for each dimension of the input vector, plus two nodes for outlier flags and bias. During training of an RBM, the weights of the connections and the values of the hidden nodes are changed to best fit the training data.
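In the standard binary formulation (stated here in our own notation; the cited work may use a variant), an RBM with visible vector $\mathbf{v}$, hidden vector $\mathbf{h}$, weight matrix $W$ and bias vectors $\mathbf{a}$, $\mathbf{b}$ assigns each configuration the energy
\[
  E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^{\top} \mathbf{v} - \mathbf{b}^{\top} \mathbf{h} - \mathbf{v}^{\top} W \mathbf{h} .
\]
Because hidden nodes are not connected to each other, each one can be evaluated independently given the visible layer:
\[
  p(h_j = 1 \mid \mathbf{v}) = \sigma\left( b_j + \sum_i v_i W_{ij} \right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} .
\]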
Training a DBN is done by training the first RBM, freezing its weights, and using the values of the hidden nodes as inputs for the next RBM. Kakanakova et al. demonstrate that this type of deep neural network can learn behavior that is too complex even for SVM approaches, and show that the DBN outperforms SVM approaches on their synthetic data sets. They note that while a DBN can outperform these other methods in complex tasks, DBNs are not suited for simpler problems, as training becomes less effective with lower-complexity problems.
\caption{LSTM prediction results of water pump sensor data from Zhang et al. \cite{zhang2018}.}
\label{fig:zhangpump}
\end{figure}
\section{Conclusion}
@ -351,7 +362,7 @@ Training an DBN is done by training the first RBM, freezing it's weights and usi
\label{tbl:comparison}
\label{tbl:comparison}
\end{table*}
\end{table*}
Anomaly detection in WSN is a relatively new addition to the general field of anomaly detection, but has already become a rather complex landscape of solutions, as many experts in their respective fields have used their knowledge to find solutions to these new problems. This survey attempts to capture this diversity in methods and introduces many fundamentally different approaches. In order to organize the approaches, we first defined the four anomaly types that are expected in WSN, and then looked at methods that detect or remove them.
First, we looked at solutions for sensor drift and offset and found that, while sensor calibration is an important step in preventing these, non-blind calibration adds a considerable amount of work, either in extrapolating results into the WSN or in calibrating with another high-quality sensor that often needs to be brought into close proximity of the sensor. We looked into a real-world application of blind sensor calibration and confirmed the problems with this approach, as cumulative errors cannot be corrected without ground truth.