A Wireless Sensor Network (WSN) is commonly defined as a collection of battery-powered nodes which communicate using a low-bandwidth and low-power wireless transceiver. Each node contains an array of sensors and collects data on its surroundings. This offers a versatile platform that can be deployed to perform various tasks, such as monitoring a wide range of physical or environmental conditions, e.g. temperature, humidity, pollution, noise or motion \cite{xie2011anomaly}. WSN can also be deployed over large areas at a comparatively low cost and even track the behavior of animals \cite{cassens2017automated}. The environment they are deployed in also imposes restrictions on the nodes, for example that they be lightweight and/or relatively cheap. In most cases, it is preferable to prolong the lifetime of each node as much as possible.
The power required to transmit data is often the largest factor determining the lifetime of each node, as it drains the battery \cite{sheng2007}. Especially if the network collects large amounts of data or spans large areas, a lot of energy can be saved by reducing the number and size of transmissions. An ideal solution would be to not send unimportant data at all; this gives rise to the need for anomaly detection in WSN, enabling nodes to identify important data themselves. This is, however, not the only reason why anomaly detection is interesting. Some WSN are deployed to detect phenomena such as forest fires \cite{hefeeda2007wireless} or to monitor active volcanoes \cite{werner2006deploying}. In these cases, anomaly detection is not only used to limit the required communication, but also to fulfill the core purpose of the network.
Not all approaches to anomaly detection in WSN are able to run directly on the node; this survey therefore differentiates between \emph{decentralized} (algorithms running directly on the node) and \emph{centralized} (running at a central location) methods. A decentralized approach is not always beneficial, as some networks are less restricted by their energy budget (for example because they have a power supply or are frequently serviced by personnel) and would rather use greater computational power and a complete set of data (meaning data from all sensors, not just those in a local area) to improve their detection and/or prediction accuracy. This is often encountered in industrial settings \cite{ramotsoela2018}.
\subsection{Problem definition}
An anomaly is a collection of one or more temporally correlated measurements in a given dataset that seems inconsistent with the expected results. These measurements can originate from different sensors, and in the context of WSN even from different nodes. Bosman et al. \cite{bosman2013} and others distinguish between four kinds of anomalies relevant in WSN (cf. Figure~\ref{fig:noisetypes}); a short synthetic illustration of these types is sketched after the list:
\begin{itemize}
\item\emph{Spikes} are short changes with a large amplitude
\item\emph{Noise} is (an increase of) variance over a given time
\item\emph{Constant} is the sudden absence of noise
\item\emph{Drift} is an offset which increases over time
\end{itemize}
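To make these categories concrete, the following minimal sketch (synthetic values, not taken from any of the cited datasets) injects each of the four anomaly types into a clean signal:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(1000)
clean = 20 + 2 * np.sin(2 * np.pi * t / 200)      # smooth "temperature" baseline
signal = clean + rng.normal(0, 0.1, t.size)       # nominal sensor noise

# Spike: short change with a large amplitude
signal[300] += 15.0

# Noise: increased variance over a given time window
signal[450:500] += rng.normal(0, 2.0, 50)

# Constant: sudden absence of noise (sensor reports a frozen value)
signal[600:650] = signal[599]

# Drift: an offset that increases over time
signal[800:] += np.linspace(0, 5, signal[800:].size)
\end{verbatim}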
O'Reilly et al. \cite{oreilly2014} look into anomaly detection in WSN in the specific context of non-stationary environments, meaning environments where the ``normal'' state evolves over time and is not static. Due to the nature of the problem, almost all approaches presented there have some machine-learning aspect to them, as they need to first detect when a change of model is required and then create a new model that conforms to the new data sensed by the network.
McDonald et al. \cite{mcdonald2013} survey methods of finding outliers in WSN, with a focus on distributed solutions. They go into a moderate amount of detail on most solutions, but skip over several methods such as principal component analysis (see chapter \ref{cap:pca}) and support vector machines (see chapter \ref{cap:svm}), which were already maturing at that point in time. Instead, they only present distance- and density-based approaches.
Barcelo-Ordinas et al. \cite{barcelo2019} provide a very in-depth reference study for sensor self-calibration, analyzing 39 different approaches in several categories. This survey is covered further in the section on sensor self-calibration.
Ramotsoela et al. \cite{ramotsoela2018} survey anomaly detection in industrial settings, where machine learning is preferred due to the observed phenomena being more complex. The survey covers both intrusion detection and outlier detection methods and compiles a table of 17 different approaches to anomaly detection. They examine six fundamentally different approaches and score them based on accuracy, prior knowledge, complexity and data prediction. They take a closer look at k-nearest-neighbor models but encounter problems similar to those mentioned in chapters \ref{sec:distance} and \ref{sec:density}.
Further information concerning advanced machine learning models such as deep learning techniques is provided by Chalapathy et al. \cite{chalapathy2019} and Kakanakova et al. \cite{kakanakova2017}. Neither of these surveys focuses on WSN, but both propose methods which are applicable to the general field.
\subsection{Problems in Blind Self-Calibration Approaches}
The central problem in self-calibration is predicting the error of a given sensor. This is done by comparing the sensor output to so-called ground-truth data. If no ground-truth data is available to the node, it has to be approximated.
Kumar et al. \cite{kumar2013} propose a solution that uses no ground-truth sensors and can be used online in a distributed fashion. It uses spatial Kriging (Gaussian interpolation) and Kalman filtering (a linear approximation model accounting for noise, explained in detail in \ref{sec:kalman}) on neighborhood data in order to reduce noise and remove drift. It assumes that sensor drift over a large number of sensors will cancel out. This solution suffers from accumulative error due to the missing ground truth, as the system has no point of reference or general model to rely on. The uncertainty of the model, and thereby the accumulative error, can be reduced by increasing the number of sensors used. A common method for gaining more measurements is increasing the network density \cite{wang2016}, or switching from a single-sensor approach to sensor fusion. Barcelo-Ordinas et al. \cite{barcelo2018} explore the possibility of adding multiple copies of the same kind of sensor to each node. All of these approaches are shown to reduce the accumulative error inherent in blind self-calibration approaches but cannot completely negate it. This is a problem for networks that are planned to operate over large time spans (e.g. multiple years). In those cases, non-blind calibration might be a better-suited solution.
Non-blind, also known as reference-based, calibration approaches rely on known-good reference information. This data is often gathered from much more expensive sensors, which often come with restrictions on their use, e.g. local weather stations not reporting continuous data, and not at the exact location of the WSN.
One of the easier methods is simply calibrating the sensors in a laboratory setting (e.g. \cite{ramanathan2006}). A recently calibrated sensor is used for calibration within a controlled environment pre- and/or post-deployment as per the manufacturer's specifications, and the calibration parameters are applied to the collected data. While this improves the accuracy of the measured data, it is of limited usefulness if live readings from the network need to be accurate, or if data is compared between neighbors while high sensor drift is expected.
An approach by Hasenfratz et al. \cite{hasenfratz2012} can calibrate low-cost gas sensors instantly with a calibrated sensor nearby, enabling calibration in the field without the need for a controlled environment or laboratory setting. They make use of the fact that air measurements are continuous and vary only slightly over short distances. This of course comes with a tradeoff in accuracy, but they show that the calibration is as good as the manufacturer's. An ozone sensor calibrated using this scheme is only off by $\pm2$ppb (parts per billion) when compared to a high-quality calibrated ozone sensor, despite the manufacturer's claimed accuracy of $\pm20$ppb. While these results are remarkable, it is not always feasible to visit every sensor in a WSN.
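At its core, such reference-based calibration often boils down to estimating a gain and offset for the low-cost sensor from co-located readings of a calibrated reference. The sketch below illustrates only this basic idea with made-up numbers; the cited approaches use considerably more elaborate models:
\begin{verbatim}
import numpy as np

def fit_calibration(raw, reference):
    """Least-squares gain/offset calibration of a low-cost sensor
    against co-located reference measurements."""
    gain, offset = np.polyfit(raw, reference, deg=1)
    return gain, offset

def apply_calibration(raw, gain, offset):
    return gain * np.asarray(raw) + offset

# Raw ozone readings vs. a calibrated reference sensor (made-up numbers)
raw = np.array([12.0, 18.5, 25.1, 33.0, 41.2])
ref = np.array([15.0, 21.0, 27.5, 35.5, 43.0])
gain, offset = fit_calibration(raw, ref)
corrected = apply_calibration(raw, gain, offset)
\end{verbatim}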
Maag et al. \cite{maag2017} propose a solution to this problem. They formulate a hybrid scheme in which calibrated sensor arrays are used to calibrate other, non-calibrated arrays in a local network of air pollution sensors over multiple hops with minimal accumulative error. They show 16--60\% lower error rates than other iterative approaches currently in use.
\subsection{An Example for Blind Calibration}\label{sec:kalman}
Sirisanwannakul et al. \cite{Sirisanwannakul2021} use a blind, centralized approach in which humidity sensors are calibrated using Kalman filtering in combination with a neural network to detect and counteract sensor drift. Kalman filtering consists of two phases, prediction and update. Given the previous state of knowledge at step $k-1$, consisting of an estimated system state and its uncertainty, a Kalman filter calculates a prediction of the next system state and its uncertainty. This is called the prediction phase. Then a new (possibly skewed) measurement is observed and used to compute an updated estimate of the current state and its uncertainty. This is called the update phase. The filter is recursive in nature and can be computed on limited hardware in real time, making it useful for many different anomaly detection applications.
Kalman filters are based on a linear dynamical system in discrete time and represent the system state as vectors and matrices of real numbers. In order to apply a Kalman filter, the observed process must be modeled in a specific state-space structure.
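In the standard formulation, the state $x_k$ evolves according to a linear transition model with additive Gaussian process noise, and each measurement $z_k$ is a linear function of the state corrupted by Gaussian measurement noise:
\begin{align}
    x_k &= A x_{k-1} + B u_k + w_k, & w_k &\sim \mathcal{N}(0, Q),\\
    z_k &= H x_k + v_k, & v_k &\sim \mathcal{N}(0, R),
\end{align}
where $A$ is the state transition model, $B$ maps the (optional) control input $u_k$, $H$ is the observation model, and $Q$ and $R$ are the process and measurement noise covariances.

A minimal one-dimensional sketch of the two phases, using a simple random-walk state model and made-up noise parameters rather than the model of Sirisanwannakul et al., could look as follows:
\begin{verbatim}
import numpy as np

def kalman_1d(measurements, q=1e-4, r=0.25, x0=0.0, p0=1.0):
    """Minimal 1-D Kalman filter with a random-walk state model.
    q: process noise variance, r: measurement noise variance."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # prediction phase: the state is assumed constant, only uncertainty grows
        p = p + q
        # update phase: blend the prediction with the new measurement
        k = p / (p + r)              # Kalman gain
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return np.array(estimates)

# Example: noisy humidity readings around a slowly varying true value
rng = np.random.default_rng(0)
truth = np.linspace(40.0, 42.0, 200)
readings = truth + rng.normal(0.0, 0.5, truth.size)
smoothed = kalman_1d(readings)
\end{verbatim}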
\end{itemize}
\subsection{Statistical Analysis}
Classical statistical analysis is done by creating a statistical model of the expected data and then computing the probability of each recorded data point (similar to Kalman filters). Improbable data points are then deemed outliers. The problem for many statistical approaches is finding this model of the expected data, as it is not always feasible to create it in advance when the nature of the observed phenomenon is not well known or the expected data is too complex. Such models are also not very robust to changes in the environment \cite{mcdonald2013}, requiring frequent updates if the environment changes in ways not foreseen by the model.
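As a minimal illustration of this idea (not taken from any of the cited works, all values made up), the sketch below fits a Gaussian model to a window of past measurements and flags new points whose z-score exceeds a fixed threshold:
\begin{verbatim}
import numpy as np

def gaussian_outliers(history, new_points, threshold=3.0):
    """Flag points that are improbable under a Gaussian model of past data
    (the classic three-sigma rule; threshold chosen arbitrarily)."""
    mu, sigma = np.mean(history), np.std(history)
    z = np.abs((np.asarray(new_points) - mu) / sigma)
    return z > threshold

history = np.random.default_rng(1).normal(20.0, 0.5, 500)  # "expected" data
new = [20.3, 19.8, 25.7, 20.1]                              # 25.7 is improbable
flags = gaussian_outliers(history, new)                     # [False, False, True, False]
\end{verbatim}
Exactly this reliance on a fixed, pre-specified model and threshold is what makes such approaches brittle in changing environments.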
Sheng et al. \cite{sheng2007} propose an approach to global outlier detection, meaning a data point is only regarded as an outlier if its value differs significantly from all values collected over a given time, not just from values of nearby sensors. They propose that the base station requests bucketed histograms of each node's sensor data distribution to reduce the amount of data transmitted. These histograms are polled, combined, and then used to identify outliers by looking at the maximum distance a data point can be from its nearest neighbors. This method has some problems, as it fails to account for non-Gaussian distributions. Another problem is the use of fixed parameters for outlier detection, requiring prior knowledge of the collected data and the anomaly density; these parameters also have to be updated whenever those characteristics change. Due to the histograms used, this method cannot be applied in a shifting network topology.
Böhm et al. \cite{böhm2008} propose a solution not only for non-Gaussian distributions, but also for noisy data. They define a general probability distribution function (PDF) with an exponential distribution function (EDF) as a basis, which is better suited to fitting non-Gaussian data, as seen in Figure~\ref{fig:probdistböhm}. They then outline an algorithm in which the data is split into clusters, an EDF is fitted to each cluster, and outliers are discarded. This method does not require any prior parametrization and is therefore more robust to configuration errors.
Since this process does not only detect outliers but performs a complete clustering of the given data, it is computationally much more expensive than other outlier detection methods. On the other hand, the complete clustering can be used in offline analysis and produces good results more quickly than PCA or similar algorithms; outlier detection is more a byproduct of the clustering than the end result.
\caption{Difference of fitting a Gaussian PDF and a customized exponential PDF. Image from \cite{böhm2008}.}
\label{fig:probdistböhm}
\end{figure}
Outliers can also be selected by looking at the density of points. If done correctly, the problem described above (Figure~\ref{fig:densityproblem}) can be prevented. Breunig et al. \cite{breuning2000} propose a method of calculating a local outlier factor (LOF) for each point based on the local density of its $n$ nearest neighbors. The difficulty lies in selecting a good value for $n$. If $n$ is too small, clusters of outliers might not be detected, while a large $n$ might mark points as outliers even though they belong to a legitimate cluster of fewer than $n$ points. This problem is further exacerbated when the method is used in a WSN setting, for example by streaming through the last $k$ points, as cluster sizes will not stay constant when incoming data is delayed or lost in transit.
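A minimal centralized sketch of LOF, using the scikit-learn implementation rather than anything a resource-constrained node would run, applied to synthetic two-dimensional data:
\begin{verbatim}
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Two dense clusters of "normal" readings plus a few scattered outliers
rng = np.random.default_rng(7)
normal = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(4, 0.3, (100, 2))])
outliers = rng.uniform(-3, 7, (5, 2))
X = np.vstack([normal, outliers])

# n_neighbors corresponds to the parameter n discussed above
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 marks points deemed outliers
scores = -lof.negative_outlier_factor_      # larger values mean more anomalous
\end{verbatim}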
Papadimitriou et al. \cite{papadimitriou2003} introduce a parameterless approach. They formulate a method using a local correlation integral (LOCI), which does not require parametrization. It uses a multi-granularity deviation factor (MDEF), which is the relative deviation of the neighborhood count of a point $p$ at radius $r$ from the average neighborhood count of the points around it; a point whose neighborhood is much sparser than those of its neighbors receives a large deviation factor. LOCI provides an automated way to select good parameters for the MDEF and can detect outliers and outlier clusters with comparable performance to other statistical approaches. They also formulate aLOCI (approximate LOCI), a linear approximation of LOCI, which also gives accurate results while reducing runtime. This approach can be used in a centralized, decentralized or clustered fashion, depending on the scale of the event of interest. aLOCI even seems suitable for running on the sensor nodes themselves, as it has relatively low computational complexity and can adapt to shifting environments.
\caption{An example of reducing a three-dimensional dataset to two dimensions using PCA to minimize the loss of information. PCA vectors are marked in red.}
\label{fig:pca}
\end{figure*}
Yu et al. \cite{yu2017} recognize that this solution performs well, but is computationally too expensive to run on each individual node in a network. They propose a clustered and iterative way of doing PCA that reduces the complexity on each cluster head down to $\Oc(n^2t)$, where $t$ is the recursion depth. They propose clustering the nodes into groups with cluster heads which have more processing power. The leaf nodes send their samples to the cluster head, which then reorganizes and splits the sensor data and, after an initial PCA, can update its measured principal components and covariance matrices more efficiently. During this process, outliers can be identified with relative ease using the known covariance of the data and the calculated principal components. Furthermore, PCA is used to decrease the dimensional complexity of the sensor data. This compressed data is transmitted to the base station, together with the principal component vectors and the covariance matrix. This allows for later reconstruction of the data with high accuracy, with errors usually below 1\%, while reducing the amount of information sent.
Macua et al. \cite{macua2010} propose a truly decentralized approach: using consensus algorithms to calculate the sample mean and then approximating the global data covariance matrix. Once a good enough approximation is found, each node can perform PCA individually. This approach is not suited for deployment in low-power WSN, as it incurs considerable cost in the form of communication and especially processing power. While this approach is a good proof of concept, distributed approaches to PCA do not yet seem feasible to implement in WSN. Distributed PCA has instead found more use in database settings \cite{balcan2014improved}.
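The core idea shared by these PCA-based schemes, detecting samples that are poorly represented by the leading principal components, can be illustrated with the following centralized sketch (synthetic data, arbitrary threshold; not the clustered scheme of Yu et al.):
\begin{verbatim}
import numpy as np
from sklearn.decomposition import PCA

def pca_outliers(X, n_components=1, quantile=0.99):
    """Flag the samples with the largest PCA reconstruction error (top 1% here)."""
    pca = PCA(n_components=n_components).fit(X)
    reconstructed = pca.inverse_transform(pca.transform(X))
    errors = np.linalg.norm(X - reconstructed, axis=1)
    return errors > np.quantile(errors, quantile), errors

# Correlated three-dimensional sensor data with one injected anomaly
rng = np.random.default_rng(3)
base = rng.normal(0, 1, (500, 1))
X = np.hstack([base, 0.8 * base, -0.5 * base]) + rng.normal(0, 0.05, (500, 3))
X[10] = [3.0, -3.0, 3.0]         # breaks the correlation structure
flags, errors = pca_outliers(X)
\end{verbatim}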
\subsection{Generalized Hebbian Algorithm}
Ali et al. \cite{ali2015} propose an approach to detect and identify events using the Generalized Hebbian Algorithm (GHA). Event detection is important in anomaly detection, but event identification is almost equally important, especially when a sensor network is used to detect an event spanning multiple nodes and sensors. They propose a combined algorithm to detect, identify and communicate events in a WSN, covering both local and global events. This is achieved by calculating identification ratios, i.e. the percentage each attribute contributed to the event, before broadcasting the detected event.
They start off with an outlier detection scheme using hyper-ellipsoids fitted around 98\% of their data points, using an iterative boundary estimation model based on the model formulated by Moshtaghi et al. \cite{moshtaghi2011}, called Forgetting Factor Iterative Data Capture Anomaly Detection. It can compute multidimensional boundaries of the local model online in an iterative fashion, reducing the amount of required computation immensely, while also working in non-stationary environments and under changing network topology thanks to the forgetting factor, which enables the model to forget older data points that do not fit the newer data. A local event is declared after observing more than $q$ outliers in a row, where $q$ is chosen depending on the sampling rate and the required temporal resolution.
Once an event is detected, Ali et al. propose using the GHA to replace the eigenvalue decomposition (EVD) commonly used in offline identification schemes such as PCA. EVD requires large batches of measurements to accurately compute principal components, while the GHA can work online in a streaming fashion. They further show that their online GHA-based approach has similar accuracy to offline EVD-based techniques while vastly reducing computational complexity. Once the eigenvectors are calculated, the last measurement is projected onto them and whitened, creating a vector containing the identification ratios for each attribute.
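The GHA update itself is compact enough to run on a node. The sketch below shows a generic streaming implementation of Sanger's rule, not Ali et al.'s full detection and identification pipeline; the learning rate, dimensions and input statistics are made up:
\begin{verbatim}
import numpy as np

def gha_update(W, x, lr=0.01):
    """One streaming GHA (Sanger's rule) update. W has shape (k, d); its rows
    converge towards the top-k principal directions of a zero-mean input stream."""
    y = W @ x                                             # project the new sample
    W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

rng = np.random.default_rng(5)
d, k = 4, 2
W = rng.normal(0, 0.1, (k, d))
# Zero-mean, correlated samples standing in for multi-attribute sensor readings
C = np.array([[1.0, 0.8, 0.2, 0.0],
              [0.8, 1.0, 0.2, 0.0],
              [0.2, 0.2, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.1]])
for _ in range(5000):
    x = rng.multivariate_normal(np.zeros(d), C)
    W = gha_update(W, x)
# Rows of W now roughly approximate the two leading eigenvectors of C
\end{verbatim}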
SVMs leverage a kernel function to map the input space to a higher-dimensional feature space. This allows modeling highly nonlinear patterns of normal behavior in a flexible manner: patterns that are difficult to classify in the problem space become more easily recognizable, and therefore classifiable, in the feature space. Once the data is mapped into the feature space, hyper-ellipsoids or other shapes are fitted to the data points to define regions of the feature space that classify the data as normal or anomalous. This allows SVM-based models to find even pattern-based anomalies.
While this approach works well for finding outliers in the data, it is also computationally expensive and incurs a large communication overhead. In an attempt to decrease the computational complexity, only a single hyper-ellipsoid is fitted to the data set; this method is called a one-class support vector machine. Originally, Wang et al. \cite{wang2006} created a one-class SVM (OCSVM) model; however, it required the solution of a computationally complex second-order cone programming problem, making it unusable for distributed deployment. Rajasegarar et al. \cite{rajasegarar2007, rajasegarar2010} improved on this OCSVM in a couple of ways.
They used the fact that numerical input data can be normalized to lie in the vicinity of the origin of the feature space, and furthermore the results of Laskov et al. \cite{laskov2004}, which showed that normalized numerical data is one-sided, always lying in the positive quadrants. This led to the formulation of a centered hyper-ellipsoidal SVM (CESVM) model, which reduces the computational complexity to a linear problem. Furthermore, they introduce a one-class quarter-sphere SVM (QSSVM), which reduces the communication overhead. They conclude, however, that the technique is still unfit for decentralized use because of the large remaining communication overhead, as a consensus on the radii and other parameters is still required.
The QSSVM was further improved in 2012 by Shahid et al. \cite{shahid2012a, shahid2012b}, who propose three schemes that reduce the communication overhead while maintaining detection performance. Their propositions make use of the spatio-temporal and attribute (STA) correlations in the measured data and accept a worse consensus about the placement of the hyper-sphere among neighboring nodes in order to reduce the communication overhead. They then show that these approaches are comparable in performance to the QSSVM proposed by Rajasegarar et al. if the data correlates well enough inside each neighborhood. It is important to note that this neighborhood information does not rely on nodes being stationary and is therefore usable in a shifting network topology.
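For reference, a generic kernel one-class SVM (the scikit-learn implementation, not the CESVM or QSSVM variants discussed above) applied to made-up two-attribute sensor readings looks as follows:
\begin{verbatim}
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(11)
# Training data: normal temperature/humidity pairs from a single node
train = np.column_stack([rng.normal(21.0, 0.5, 300), rng.normal(45.0, 2.0, 300)])
# New measurements, the last one being anomalous
new = np.array([[21.2, 44.0], [20.8, 46.5], [30.0, 20.0]])

# nu bounds the fraction of training points treated as outliers; the RBF kernel
# plays the role of the feature-space mapping discussed above
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train)
labels = model.predict(new)       # +1 = normal, -1 = anomalous
\end{verbatim}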
Zhang et al. \cite{zhang2018} use LSTM neural networks to analyze and predict the working condition of a water turbine. A Long Short-Term Memory (LSTM) neural network is a kind of recurrent neural network that contains short-term memory blocks consisting of memory cells which can hold on to state information, making it possible to analyze time series such as stock market data or to perform natural language processing. The downside of LSTM models, and machine learning in general, is the amount of data required to train them. Zhang et al. collected sufficient data, including anomalies, over the span of three months. They removed noise, labeled outliers and then used this as training data.
They found that they can not only predict future sensor measurements with high accuracy (root-mean-square error below $0.01$, even for complex sensor patterns) but can also identify and, to an extent, predict failures with their model (Figure~\ref{fig:zhangpump}).
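The underlying pattern, training a network to predict the next measurement and flagging an anomaly when the prediction error becomes large, can be sketched as follows (a minimal PyTorch example with made-up window sizes and threshold, not the architecture of Zhang et al.):
\begin{verbatim}
import torch
import torch.nn as nn

class NextStepLSTM(nn.Module):
    """Predict the next sensor value from a window of past values."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # prediction for the next step

# Train on a clean, synthetic series
series = torch.sin(torch.linspace(0, 20, 500)).unsqueeze(-1)
windows = torch.stack([series[i:i + 30] for i in range(400)])
targets = torch.stack([series[i + 30] for i in range(400)])
model = NextStepLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(windows), targets)
    loss.backward()
    opt.step()

# Flag an anomaly when the prediction error exceeds a fixed (arbitrary) threshold
with torch.no_grad():
    pred = model(series[400:430].unsqueeze(0)).item()
is_anomaly = abs(pred - series[430].item()) > 0.05
\end{verbatim}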
Kakanakova et al. \cite{kakanakova2017} look at a more generalized form of outlier detection using deep neural networks called Deep Belief Networks (DBN). A DBN consists of a composition of so-called Restricted Boltzmann Machines (RBM), where the output of each RBM serves as the input for the next. The input of the first RBM serves as the input of the DBN, and the last RBM's output is the output of the whole DBN.
An RBM is a graph of nodes connected by weights, consisting of two types of nodes, visible and hidden nodes. Weighted connections only span between hidden and visible nodes, meaning there are no connections among the hidden nodes or among the visible nodes (similar to a one-layer neural network). The RBM has an input node for each dimension of the input vector, plus two nodes for outlier flags and bias. During training of an RBM, the weights of the connections and the values of the hidden nodes are changed to best fit the training data.
Training a DBN is done by training the first RBM, freezing its weights and using the values of its hidden nodes as inputs for the next RBM. Kakanakova et al. show that this type of deep neural network can learn behavior that is too complex even for SVM approaches, and that the DBN outperforms SVM approaches on their synthetic data sets. They note that while a DBN can outperform these other methods on complex tasks, DBNs are not suited for simpler problems, as training becomes less effective the lower the complexity of the problem.
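This greedy, layer-wise training can be sketched by stacking scikit-learn's Bernoulli RBMs (a generic illustration with made-up layer sizes and binary-coded inputs, not the exact architecture of Kakanakova et al., which additionally includes outlier-flag and bias nodes):
\begin{verbatim}
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(13)
# Binary-coded feature vectors standing in for preprocessed sensor data
X = (rng.random((500, 16)) > 0.5).astype(float)

# Greedy layer-wise training: each RBM learns on the previous RBM's hidden activations
rbm1 = BernoulliRBM(n_components=12, learning_rate=0.05, n_iter=20, random_state=0)
h1 = rbm1.fit_transform(X)
rbm2 = BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=20, random_state=0)
h2 = rbm2.fit_transform(h1)
# h2 is the DBN's top-level representation; an outlier score or classifier
# would be attached on top of it
\end{verbatim}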
\begin{figure}
\hline
\end{tabular}
\end{adjustbox}
\caption{A comparison of the approaches investigated in this survey. The column ``Prior knowledge'' marks whether information about the measured process is required beforehand to construct a model or train a neural network. ``Centralized/Decentralized'' marks where the algorithm is run. ``Required topology'' indicates whether nodes are able to move around or must be stationary. Communication cost is compared to ``normal'' behavior, where all data is transmitted to a base station; ``Low'' implies a reduction in communication, while ``Prohibitive'' marks approaches that require more communication than is feasible in most WSN. The column ``Recalibration'' indicates whether the model requires recalibration or retraining upon a change in the environment. ``Basis'' gives the name of the underlying algorithm.}
\label{tbl:comparison}
\end{table*}
Then we looked at different ways of detecting outliers using statistical or density-based approaches in a centralized manner, followed by a large number of decentralized approaches using methods like PCA, SVM, GHA and ELM. We saw that SVMs are a great solution for more complex outlier detection due to their ability to model highly nonlinear but normal behavior, but that they require a lot more communication than other approaches such as PCA or GHA while not performing much better in most common cases. We saw how neighborhood data can be used to detect local anomalies using ELM, with performance directly proportional to the correlation inside the neighborhood.
Finally, we took a look at some deep learning approaches and the challenges that come with them. We saw great performance of LSTM- and DBN-based approaches in modeling and predicting much more complex data than is encountered in conventional WSN. We also saw that their application is limited in conventional, low-power WSN due to their large computational complexity and the requirement of some form of labeled data for training. In order to use deep learning techniques in a decentralized fashion, a lot more research is still required in this field.
All covered approaches are again summarized in Table~\ref{tbl:comparison} and categorized by the factors mentioned in the introduction of Chapter~\ref{cap:outlierdet}.