\documentclass[review=true, screen]{ocsmnar}


% Only used to typeset XeLaTeX-Logo below.
\usepackage{metalogo}


% Adjust this to the language used.
\usepackage[british]{babel}


\begin{document}


\title{Anomaly detection in wireless sensor networks: A survey}


\seminar{SVS} % Selbstorganisation in verteilten Systemen


\semester{Sommersemester 2020}


\author{Anton Lydike}
\affiliation{\institution{Universität Augsburg}}

\begin{abstract}
  Anomaly detection is an important problem in data science, which is encountered every time when data is collected. Since this is done in so many different environments, many different research contexts and application domains in which anomaly detection was researched exist. Anomaly detection in wireless sensor networks (WSN) is a relatively new addition to anomaly detection in general, and this survey will focus on that context in particular. 

  The context of WSN introduces a lot of interesting new challenges, as nodes are often small devices running on battery power and cannot be do much computation on their own. Furthermore, in WSNs communication is often not perfect and messages can and will get lost during operation. Any protocols that incur additional communication must have a good justification, as communication is expensive. All these factors create a unique environment, in which not many existing solutions to the problem are applicable. 
  
  In this paper, we will not discuss anomaly detection in hostile environments, or intrusion detection, but rather focus solely on anomaly detection in sensor data collected by the WSN.
  

 % - no intrusion detection
 % - grobe übersicht
 % - begriffe klären
 % - methoden aufzählen (ca 5 bereiche)
 % - aufteilen nach methoden
 % - weitere sources
 % - ergebnisse kurz vorstellen
\end{abstract}

\keywords{Wireless Sensor Networks, Anomaly detection, Outlier detection, Centralized anomaly detection, Distributed anomaly detection}

\maketitle

\section{Overview}

There are many different approaches to anomaly detection, we will differentiate between centralized and decentralized approaches. An approach is considered centralized, when a large chunk of the computation is done at a single point, or at a later stage during analysis. A decentralized approach implies that a considerable amount of processing is done on the individual nodes, doing analysis on the fly. When analysis is done centralized, it is important to differentiate between online and offline detection. Online detection can run while the WSN is operating, while offline detection is done after the data is collected. Offline detection methods can often be modified to work online, but will require an existing dataset.

\subsection{Anomaly types}
Furthermore we need to clarify the different kinds of anomalies that can occur in WSN datasets:

\begin{itemize}
  \item \emph{Spikes} are short changes with a large amplitude
  \item \emph{Noise} is an increase of variance over time
  \item \emph{Drift} is an offset which increases over time
\end{itemize}

Not all methods can detect all three types of anomalies equally, therefore we will note down if this was accounted for in each method and how good the detection was, for each given type.


\section{Centralized approaches}
When we speak of a centralized WSN, we mean, that there exists a central entity, called the \emph{base station}, where all data is delivered to. In our analysis, it is often assumed, that the base station does not have limits on its processing power. The base station will summarize the received data until it has a complete set and can then use this set to determine global outliers and other anomalies such as clock drift over the course of the whole operation, as it has a complete history for each given node. A centralized approach is not optimal in hostile environments, but that is not our focus here. Since this environment is closely related to the general field of anomaly detection, we will not go into much detail on these solution, instead focusing on covering just the basics.

\subsection{Statistical analysis}
Classical Statistical analysis is done by creating a model of the expected data and then finding the probability for each recorded data point. Improbable data points are then deemed outliers. The problem for many statistical approaches is finding this model of the expected data, as it's not always feasible to create it in advance. It also bears the problem of bad models or slow changes in the environment \cite{mcdonald2013}.

Sheng et al. \cite{sheng2007} proposes a rather naive approach, where histograms of each node are polled, combined, and then analyzed for outliers by looking at the maximum distance a data point can be away from his nearest neighbors. This solution has several problems, as it incurs a considerable communication overhead and fails to account for non gaussian distribution. It also requires choosing new parameters every time the expected data changes suddenly. 

Böhm et al. \cite{böhm2008} proposes a solution not only to non gaussian distributions, but also to noisy data. He defines a general probability distribution function (PDF) with an exponential distribution function (EDF) as a basis, which is better suited to fitting around non gaussian data as seen in figure \ref{fig:probdistböhm}. He then outlines an algorithm where the data is split into clusters, for each cluster an EDF is fitted and outliers are discarded.

\begin{figure}
  \includegraphics[width=8.5cm]{img/probability-dist-böhm.png}
  \caption{Difference of fitting a gaussian probability PDF and a customized exponential PDF. Image from \cite{böhm2008}.}
  \label{fig:probdistböhm}
\end{figure}

While there are many statistical methods for outlier detection, most follow a similar approach to at least one of the two methods shown here. Most of these are generally not as useful for online detection. 

\subsection{Density based analysis}
Outliers can be selected by looking at the density of points as well. Breuning et al. \cite{breuning2000} proposes a method of calculating a local outlier factor (LOF) of each point based on the local density of its $n$ nearest neighbors. The problem lies in selecting good values for $n$. If $n$ is too small, clusters of outliers might not be detected, while a large $n$ might mark points as outliers, even if they are in a large cluster of $<n$ points. This problem is further exasperated when we try to use this in a WSN setting, for example by streaming through the last $k$ points, as cluster size will not stay constant as incoming data might be delayed or lost in transit.

Papadimitriou et al. \cite{papadimitriou2003} introduces a parameterless approach. The paper formulates a method using a local correlation integral (LOCI), which does not require parametrization. It uses a multi-granularity deviation factor (MDEF), which is the relative deviation for a point $p$ in a radius $r$. The MDEF is simply the number of nodes in an $r$-neighborhood divided by the sum of all points in the same neighborhood. LOCI provides an automated way to select good parameters for the MDEF and can detect outliers and outlier-clusters with comparable performance to other statistical approaches. They also formulate aLOCI, a linear approximation of LOCI, which also gives accurate results while reducing runtime.


\section{Machine learning approaches}
Most machine learning approaches focus on outlier detection, which is a common problem in WSN, as an outlier is inherently an anomaly. Outlier detection is largely unable to detect drift and has difficulties wih noise, but excels at detecting data points or groups which appear to be inconsistent with the other data (spikes). A common problem is finding outliers in data with an inherently complex structure.

It is impossible to create an exhaustive list of classifiers to define what is and isn't an anomaly. Therefore it is difficult to generate labeled training data for machine learning. Furthermore, the data generated by a WSN might change over time without being anomalous, requiring frequent retraining. Out of these circumstances arises the need for unsupervised anomaly detection methods.

We will look into a couple different approaches to outlier detection 

\subsection{Support vector machines (SVMs)}
Rajasegarar et al. \cite{rajasegarar2010} uses SVMs, which leverage a kernel function to map the input space to a higher dimensional feature space. This allows the SVM to then model highly nonlinear patterns of normal behavior in a flexible manner. This means, that patterns that are difficult to classify in the problem space, become more easily recognizable and therefore classifiable in the feature space. Once the data is mapped into the feature space, hyperelipsoids are fitted to the data points to define regions of the feature space that classify the data as normal.

While this approach works well to find outliers in the data, it is also computationally expensive. In an attempt to To decrease computational complexity, only a single hyperelipsoid is fitted to the dataset. This method is called a one-class support vector machine. Originally Wang et al. \cite{wang2006} created a model of a one-class SVM (OCSVM), however the solution required the solution of a computationally complex second-order cone pro-gramming problem, making it unusable for distributed usage. Rajasegarar et al. \cite{rajasegarar2010} improved on this OCSVM in a couple of ways.

They used the fact, that they could normalize numerical input data to lay in the vicinity of the origin inside the feature space. They furthermore used the results of Laskov et al. \cite{laskov2004} which showed, that normalized numerical data is one-sided, always lying in the positive quadrants, to formulate a centered-hyperelipsoidal SVM (CESVM) which vastly reduces computational complexity to a linear problem. Furthermore they introduce a one-class quarter-sphere SVM (QSSVM) with reduced communication. They conclude however, that the technique ist still unfit for decentralized use because of the large remaining communication overhead, as a consensus for the radiuses and other parameters is still required.

The QSSVM was improved in 2012 by Shahid et al. \cite{shahid2012a}, \cite{shahid2012b}, proposing three schemes that reduce communication overhead while maintaining detection performance. His propositions make use of the spatio-temporal \& attribute (STA) correlations in the measured data. These propositions accept worse consensus about the placement of the hypersphere among neighboring nodes in order to reduce the communication overhead. He then shows, that his approaches are comparable in performance to the QSSVM proposed by Rajasegarar et al. if the data correlates well enough inside each neighborhood. It is important to note, that neighborhood information relies on nodes being mostly stationary, as the assumptions on the STA correlations will fail in a shifting network topology.

As far as we are aware, there are no SVM solutions that solve the problem of a dynamic network topology.


\subsection{Generalized hebbian algorithm}
coming soon \cite{ali2015}

\subsection{Principal Component Analysis}
\cite{oreilly2016}


\section{Decentralized approaches using passive neighborhood data}
When working decentralized with no additional overhead, it is still possible to obtain additional data, just by listening to other nodes broadcasts. This data can be fed into various prediction models which can then be used to calculate a confidence level for the nodes own measurements. 

Bosman et al. \cite{bosman2017} looks at the performance of recursive last squares (RLS) and the online sequential extreme learning machine (OS-ELM) approach to train a single-layer feed-forward neural network (SLFN). These are compared to first degree polynomial function approximation (FA) and sliding window mean prediction. The article shows, that incorporation neighborhood information improves anomaly detection only in cases where the dataset is well-correlated and shows low spatial entropy, as is common in most natural monitoring applications. When the dataset does not correlate well, or there is too much spatial entropy, the methods described in this paper fail to predict anomalies. It concludes, that neighborhood aggregation is not useful beyond 5 neighbors, as such a large dataset will fail to meet the aforementioned conditions. The exact size of the optimal neighborhood will vary with network topology and sensor modality.

Here, all three types of anomalies were accounted for in the dataset, but there was no analysis, how good the detection was for each kind of anomaly.

\section{Non-stationary data} \cite{oreilly2014}


\section{Conclusion}


\bibliographystyle{alpha}
\bibliography{References}


\end{document}
%TODO: Generalized hebbian algorithm
%TODO: non stationary data
%TODO: for each failure mode, more descriptions
%TODO: sensor drift detection
%      - Blind:
%        - https://ieeexplore.ieee.org/abstract/document/8589548
%        - http://nicsefc.ee.tsinghua.edu.cn/media/publications/2016/IEEE%20JSEN_202.pdf
%      - Machine learning:
%        - https://dl.acm.org/doi/abs/10.1145/2736697 https://scihubtw.tw/10.1145/2736697 online/offline
%        - https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8884207 extreme learning - is trained on sensors that are drift-corrected by other algorithms
%TODO: sensor failure detection