diff --git a/References.bib b/References.bib index 706c3d4..e0bba1d 100644 --- a/References.bib +++ b/References.bib @@ -419,3 +419,38 @@ isbn="978-3-030-68133-3" year={2017}, publisher={ACM New York, NY, USA} } +@inproceedings{cassens2017automated, + title={Automated Encounter Detection for Animal-Borne Sensor Nodes}, + author={Cassens, Bj{\"o}rn and Ripperger, Simon and Hierold, Martin and Mayer, Frieder and Kapitza, R{\"u}diger}, + booktitle={EWSN}, + pages={120--131}, + year={2017} +} +@inproceedings{hefeeda2007wireless, + title={Wireless sensor networks for early detection of forest fires}, + author={Hefeeda, Mohamed and Bagheri, Majid}, + booktitle={2007 IEEE International Conference on Mobile Adhoc and Sensor Systems}, + pages={1--6}, + year={2007}, + organization={IEEE} +} +@article{werner2006deploying, + title={Deploying a wireless sensor network on an active volcano}, + author={Werner-Allen, Geoffrey and Lorincz, Konrad and Ruiz, Mario and Marcillo, Omar and Johnson, Jeff and Lees, Jonathan and Welsh, Matt}, + journal={IEEE Internet Computing}, + volume={10}, + number={2}, + pages={18--25}, + year={2006}, + publisher={IEEE} +} +@article{xie2011anomaly, + title={Anomaly detection in wireless sensor networks: A survey}, + author={Xie, Miao and Han, Song and Tian, Biming and Parvin, Sazia}, + journal={Journal of Network and Computer Applications}, + volume={34}, + number={4}, + pages={1302--1325}, + year={2011}, + publisher={Elsevier} +} diff --git a/paper.tex b/paper.tex index e9ef9c6..89dacd5 100644 --- a/paper.tex +++ b/paper.tex @@ -30,23 +30,31 @@ \affiliation{\institution{Universität Augsburg}} \begin{abstract} - Anomaly detection is an important problem in data science, which is encountered often when data is collected and analyzed. An anomaly is often defined as a measurement that is inconsistent with the expected results. 
Since anomaly detection can be applied to many different environments, a multitude of different research contexts and application domains exist in which anomaly detection is researched. Anomaly detection in wireless sensor networks (WSN) is a relatively new addition to the field. + Anomaly detection is an important problem in data science, which is encountered often when data is collected and analyzed. An anomaly is often defined as a measurement that is inconsistent with the expected results. Since anomaly detection can be applied to many different environments, a multitude of different research contexts and application domains exist in which anomaly detection is researched. Anomaly detection in wireless sensor networks (WSN) is a relatively new addition to the field, where a lot of active research is done and new methods are proposed regularly. - The context of WSN introduces a lot of interesting new challenges, as nodes are often small devices running on battery power and cannot do complex computation on their own. Furthermore, in WSNs communication is often not perfect and messages get lost during operation. Any protocols that incur additional communication must have a good justification, as communication is expensive. All these factors create a unique environment, in which not many existing solutions to the problem are applicable. + The context of WSN introduces a lot of interesting new challenges, as nodes are often small devices running on battery power and cannot do complex computation on their own. Furthermore, in WSNs, communication is often not perfect and messages get lost during operation. Any protocols that incur additional communication must have a good justification, as communication is expensive. All these factors create a unique environment, in which not many previously existing solutions to the problem are applicable without adaptation. - This paper will focus solely on anomaly detection in sensor data collected by the WSN. 
+ This survey defines four types of anomalies and then examines two fundamental problems related to anomaly detection. First, sensor self-calibration is explored as a method to improve sensor reliability. Then, different methods of detecting outliers are presented and evaluated: conventional model-based approaches, such as statistical or density-based models, come first, followed by the newer machine-learning-based approaches to outlier detection. Finally, all approaches presented in this paper are tabulated and evaluated based on communication overhead, requirements of prior knowledge, centralization, required network topology and more. \end{abstract} \keywords{Wireless Sensor Networks, Anomaly detection, Outlier detection, Sensor calibration, Drift detection} \maketitle -\section{Overview} +\section{Introduction} -There are many different approaches to anomaly detection, a common way to classify these is by their place of computation. An approach is considered centralized, when a large chunk of the computation is done at a single point, or at a later stage during analysis. A decentralized approach implies that a considerable amount of processing is done on the individual nodes, doing analysis while being deployed. It is also important to differentiate between online and offline detection. Online detection can run while the WSN is operating, while offline detection is done after the data is collected or during pauses of operation. Online detection often reduces mission duration due to increased power consumption, but can also have the opposite effect, if the analysis done can be used to reduce the amount of communication required for the WSN to function. +A Wireless Sensor Network (WSN) is commonly defined as a collection of battery-powered nodes, which communicate using a low-bandwidth and low-power wireless transceiver. Each node contains an array of sensors and collects data on its surroundings. 
This offers a versatile platform that can be deployed to perform various tasks, such as monitoring a wide range of physical or environmental conditions, e.g. temperature, humidity, pollution, noise, motion and more \cite{xie2011anomaly}. WSNs can also be deployed over large areas at a comparatively low cost and even track the behavior of animals \cite{cassens2017automated}. The environment they are deployed in also imposes restrictions on nodes, for example to be lightweight and/or relatively cheap. In most cases, it is preferable to prolong the lifetime of each node as long as possible. -\subsection{Anomaly Types} -We need to clarify the different kinds of anomalies that can occur in WSN data sets. Commonly, four different kinds of anomalies that occur in WSN are considered (c.f. Figure~\ref{fig:noisetypes}): +The power required to transmit data is often the largest factor limiting the lifetime of each node, as it drains the battery \cite{sheng2007}. Especially if the network collects large amounts of data, or spans large areas, a lot of energy can be saved by reducing the number and size of the transmissions. An ideal solution would be to not send the unimportant data at all; thus arises the need for anomaly detection in WSNs, enabling nodes to identify important data themselves. This is, however, not the only reason why anomaly detection is interesting. Some WSNs are deployed to detect phenomena such as forest fires \cite{hefeeda2007wireless}, or monitor active volcanoes \cite{werner2006deploying}. In these cases, anomaly detection is not only used to limit the required communication, but also to fulfill the core purpose of the network. + +Not all approaches to anomaly detection in WSNs are able to run directly on the node; therefore, this survey will differentiate between \emph{decentralized} (algorithms running directly on the node) and \emph{centralized} (running at a central location) methods. 
It is not always beneficial to have a decentralized approach, as some networks are less restricted by their energy (for example by having a power supply) and would rather use greater computational power and a complete set of data (meaning data from all sensors, not just ones in a local area) to improve their detection and/or prediction accuracy. This is often encountered in industrial settings \cite{ramotsoela2018}. + +Another factor for these models is the network topology. In a non-static WSN, a model using neighborhood information has to account for changes in the network topology surrounding it, as the number of neighbors may change, or the data they measured previously may no longer belong to the current neighborhood. If the node keeps track of previous measurements, it also needs to take into account how its changes in position might influence the measured data. + + + +\subsection{Problem definition} +An anomaly is a collection of one or more temporally correlated measurements in a given dataset that seem to be inconsistent with expected results. These measurements can originate from different sensors, and in the context of WSNs even from different nodes. Bosman et al. \cite{bosman2013} and others distinguish between four different kinds of anomalies (cf. Figure~\ref{fig:noisetypes}): \begin{itemize} \item \emph{Spikes} are short changes with a large amplitude @@ -55,18 +63,61 @@ We need to clarify the different kinds of anomalies that can occur in WSN data s \item \emph{Drift} is an offset which increases over time \end{itemize} +A fifth anomaly type, \emph{sensor failure}, is commonly added to anomaly detection \cite{rajasegarar2008,chandola2009}. Since sensor failure often manifests itself in the four ways mentioned above, and we are not interested in sensor fault prediction, detection and management here, faulty sensors will not be discussed further. 
+ +Detecting constant-type anomalies is not very difficult, as they can simply be classified as areas of data for which the second (numerical) derivative is zero, and no complex models are needed to identify them. Therefore, they will not be covered any further in this survey. Instead, we will focus on two separate problems: self-calibration of sensors and outlier detection, as both are closely connected to anomaly detection in general. + +A noise anomaly is not the same as a noisy sensor. Working with noisy data is a problem in WSNs, but we will not focus on methods of cleaning noisy data, as this is outside the scope of this survey. Elnahrawy et al. \cite{elnahrawy2003} and Barcelo et al. \cite{barcelo2019} are great places to start a survey in this direction. + \begin{figure} \includegraphics[width=8.5cm]{img/anomaly_types.png} \caption{Spike, noise, constant and drift type anomalies in noisy linear data, image from Bosmal et al. \cite{bosman2013}} \label{fig:noisetypes} \end{figure} -We will first look into sensor self-calibration, which often removes or reduces drift and constant offsets. Then we will look into model based techniques for outlier detection, and then into machine learning based approaches. Outlier detection is able to detect spikes, noise and drift type anomalies, while it has difficulties detecting constant type anomalies. + + +The terms outlier and anomaly are often used interchangeably, but they denote slightly different phenomena. While an anomaly falls into one of these four categories, only spikes, noise, and some types of drift are considered \emph{outliers} \cite{chandola2009}, as they are the only ones that produce data outside of the considered ``norm''. + + +\subsubsection{Self-calibration} + +The main problem of self-calibrating WSNs is obtaining ground-truth data for each node in the network. This data is required to calculate the offset between the node's measurements and the ground truth in order to calibrate it. 
Since it is often infeasible to visit every node in a network, methods need to be formulated to approximate ground-truth data from either non-calibrated sensors or a calibrated sensor located some distance away. + +\subsubsection{Outlier detection} + +The problem of outlier detection in WSNs is the creation of a model which can use past data to either predict or classify measurements. If the model is able to predict future measurements, outliers are simply detected by their deviation from the predicted value, while models that can classify data can directly label measurements as anomalous. Another aspect of outlier detection in WSNs is the fact that each node does not possess perfect knowledge about all measurements. A method can collect information inside a neighborhood, which might increase its detection accuracy, but this will also incur a considerable cost in the form of power consumption. + + +\subsection{Structure} + +First, we will look into sensor self-calibration, a method of improving sensor accuracy. Calibrating a sensor will remove constant offsets, enabling nodes to compare measurements between one another more easily. If a sensor is in use for a prolonged length of time, it might need to be recalibrated to remove sensor drift. + +Then we will look into conventional, model-based approaches to outlier detection, such as statistical models or density-based models. + +We will first look into sensor self-calibration, which aims to remove or reduce drift and constant offsets. Then we will look into conventional model-based techniques for outlier detection, such as probabilistic models or density-based models. Finally, we will look into machine-learning-based approaches to building these models. + + +\section{Related work} +Chandola et al. \cite{chandola2009} provide a very comprehensive survey on outlier detection in general, not just focused on WSNs. They introduce many key concepts and definitions, but focus more on outliers than anomalies in general. 
+McDonald et al. \cite{mcdonald2013} survey methods of finding outliers in WSNs, with a focus on distributed solutions. They go into a moderate amount of detail on most solutions, but skip over a lot of methods such as principal component analysis and support vector machines, which were already maturing at that point in time. + + + + +Ramotsoela et al. \cite{ramotsoela2018} + + +Chalapathy et al. \cite{chalapathy2019} + + +Kakanakova et al. \cite{kakanakova2017} + + +Barcelo-Ordinas et al. \cite{barcelo2019} survey self-calibration methods for WSNs, + -A Noise anomaly is not the same as a noisy sensor, working with noisy data is a problem in WSN, but we will not focus on methods of cleaning noisy data, as it is not in the scope of this survey. Elnahrawy et al. \cite{elnahrawy2003} and Barcelo et al. \cite{barcelo2019} are a great places to start a survey in this direction. -A fifth anomaly type, \emph{sensor failure}, is commonly added to anomaly detection \cite{rajasegarar2008,chandola2009}. Since sensor failure often manifests in these four different ways mentioned above, and we are not interested in sensor fault prediction, detection and management here, faulty sensors will not be discussed further. - \section{Sensor Drift and Self-Calibration} Advancements in energy storage density, processing power and sensor availability have increased the possible length of deployment of many WSN. This increase in sensor lifetime, together with an increase in node count due to reduced part cost \cite{wang2016}, as well as the introduction of the Internet of Things (IoT) have brought forth new problems in sensor calibration and drift detection \cite{dehkordi2020}. Increasing the amount of collected data and the length of time over which it is collected introduces a need for better quality control of the sensors that data came from. Ni et al. \cite{ni2009} noticed drift as high as 200\% in soil CO$_2$ sensors, while Buonadonna et al. 
\cite{buonadonna2005} noticed that his light sensors (which were calibrated to the manufacturer's specification) were performing very poorly when measured against laboratory equipment. It is out of these circumstances, that the need arises for better and more frequent sensor calibration. @@ -148,11 +199,11 @@ After the update phase, we obtain $\hat{x}_{k|k}$, which is our best approximati Sirisanwannakul et al. takes the computed Kalman gain and compares its bias. In normal operation, the gain is biased towards the measurement. If the sensor malfunctions, the bias is towards the prediction. But if the gains bias is between prediction and measurement, the system assumes sensor drift and corrects automatically. Since this approach lacks a ground truth measurement it cannot recalibrate the sensor, but the paper shows that accumulative error can be reduced by more than 50\%. -\section{Outlier detection - model-based approaches} +\section{Anomaly detection - model-based approaches} A centralized WSN is defined by the existence of a central entity, called the \emph{base station} or \emph{fusion centre}, where all data is delivered to and analyzed. It is often assumed, that the base station does not have limits on its processing power or storage. Centralized approaches are not optimal in hostile environments, but that is not our focus here. Since central anomaly detection is closely related to the general field of anomaly detection, we will not go into much detail on these solution, instead focusing on covering solutions more specific to the field of WSN. \subsection{Statistical Analysis} -Classical Statistical analysis is done by creating a model of the expected data and then finding the probability for each recorded data point. Improbable data points are then deemed outliers. The problem for many statistical approaches is finding this model of the expected data, as it is not always feasible to create it in advance. 
It also bears the problem of bad models or slow changes in the environment \cite{mcdonald2013}. +Classical statistical analysis is done by creating a model of the expected data and then finding the probability for each recorded data point. Improbable data points are then deemed outliers. The problem for many statistical approaches is finding this model of the expected data, as it is not always feasible to create it in advance. It also bears the problem of bad models or changes in the environment \cite{mcdonald2013}, requiring frequent updates of the existing model. Sheng et al. \cite{sheng2007} propose an approach to global outlier detection, meaning a data point is only regarded as an outlier, if their value differs significantly from all values collected over a given time, not just from local sensors near the measured one. They propose that the base station requests bucketed histograms of each nodes sensors data distribution to reduce the data transmitted. These histograms are polled, combined, and then used to analyze outliers by looking at the maximum distance a data point can be away from his nearest neighbors. This method bears some problems, as it fails to account for non gaussian distribution. Another problem is the use of fixed parameters for outlier detection, requiring prior knowledge of the data collected and anomaly density. These fixed parameters also require an update, whenever these parameters change. Due to the histograms used, this method cannot be used in a shifting network topology. @@ -187,7 +238,7 @@ Macua et al. \cite{macua2010} propose a truly decentralized approach: Using cons -\section{Outlier Detection - Machine Learning Approaches} +\section{Anomaly Detection - Machine Learning Approaches} Most machine learning approaches focus on outlier detection, which is a common problem in WSN, as an outlier is inherently an anomaly. 
Outlier detection is largely unable to detect drift and has difficulties wih noise, but excels at detecting data points or groups which appear to be inconsistent with the other data (spikes, noise, sometimes drift). A common problem is finding outliers in data with an inherently complex structure. Supervised learning is the process of training a neural network on a set of labeled data. Acquiring labeled data sets that are applicable to the given situation is often difficult, as it requires the existence of another classification method, or labeling by hand. Furthermore, even if a data set would exist, the class imbalance (total number of positive labels vs number of negative labels) would render such training data sub-optimal. And lastly, the data generated by a WSN might change over time without being anomalous, requiring frequent retraining \cite{ramotsoela2018}. Out of these circumstances arises the need for unsupervised or semi-supervised anomaly detection methods.