Online Internet Traffic Monitoring System Using Spark Streaming
Owing to the explosive growth of Internet traffic, network operators must be able to monitor the state of the entire network and manage their network resources efficiently. Traditional network analysis methods, which usually run on a single machine, are no longer suitable for huge volumes of traffic data because of their limited processing capacity. Big data frameworks, such as Hadoop and Spark, can handle such analysis jobs even for large amounts of network traffic. However, Hadoop and Spark are inherently designed for offline data analysis. To cope with streaming data, various stream-processing frameworks have been proposed, such as Storm, Flink, and Spark Streaming. In this study, we propose an online Internet traffic monitoring system based on Spark Streaming. The system comprises three parts, namely, the collector, the messaging system, and the stream processor. We use TCP performance monitoring as a use case to show how network monitoring can be performed with the proposed system. We conducted experiments on a cluster in standalone mode, which showed that our system performs well for large-scale Internet traffic measurement and monitoring.
Cyberspace is dynamic and vulnerable to attacks. Network providers therefore need to monitor the status of their networks in real time. Online Internet traffic monitoring technologies have been extensively studied. In 1999, Paxson proposed the Bro system to detect network intruders in real time. Bro first captures a packet stream using libpcap and then reduces the incoming stream to a series of higher-level events using an event engine. Paxson also proposed a custom scripting language, called Bro scripts, which is executed by the policy script interpreter to handle these events. Although Bro is single-threaded, it can be set up in a high-throughput cluster environment. Similar systems include Snort and Suricata, which are inherently based on single-machine computing. Various related studies have been conducted on online Internet traffic measurement and monitoring using Spark. Gupta et al. used Spark Streaming to analyze the network in real time.
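The layering described for Bro (packet capture, an event engine, and policy handlers) can be illustrated with a minimal Python sketch. This is not Bro's scripting language or API: the packet dictionaries, the event name `new_connection`, the `on` decorator, and the port-23 policy are all hypothetical and serve only to show how a packet stream is reduced to events that handlers then act on.

```python
from collections import defaultdict

# Event name -> list of registered policy handlers (hypothetical registry).
handlers = defaultdict(list)


def on(event_name):
    """Register a function as a policy handler for one event type."""
    def register(func):
        handlers[event_name].append(func)
        return func
    return register


def event_engine(packets):
    """Reduce raw packets to higher-level events.

    Here the only event is the first packet seen for a
    (source, destination, destination port) triple.
    """
    seen = set()
    for pkt in packets:
        conn = (pkt["src"], pkt["dst"], pkt["dport"])
        if conn not in seen:
            seen.add(conn)
            yield "new_connection", conn


alerts = []


@on("new_connection")
def flag_telnet(conn):
    # Hypothetical policy: report any connection to port 23.
    if conn[2] == 23:
        alerts.append(conn)


def run(packets):
    """Feed packets through the event engine and dispatch each event."""
    for name, payload in event_engine(packets):
        for handler in handlers[name]:
            handler(payload)


run([
    {"src": "10.0.0.1", "dst": "10.0.0.2", "dport": 80},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "dport": 80},
    {"src": "10.0.0.3", "dst": "10.0.0.2", "dport": 23},
])
```

Separating event generation from policy keeps the per-packet work small and lets policies change without touching the capture path. Note that the loop above still runs on a single machine.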
In this study, we propose an online Internet traffic monitoring system based on Spark Streaming. The platform can efficiently process large volumes of traffic data, so the network status can be monitored in real time, and it is robust enough to tolerate a failure without aborting the entire monitoring process. Big data platforms, such as Hadoop and Spark, provide an efficient way of processing huge amounts of data. For example, the MapReduce model and its open-source implementation, Hadoop, have been widely adopted by the big data analytics community because of their simplicity and ease of programming. Our contributions in this study are as follows:
- We propose a distributed architecture for an online Internet traffic measurement and monitoring system.
- We implement a parallel algorithm for monitoring TCP performance parameters, such as delay and retransmission ratio, with very low processing latency.
- We conduct experiments showing that the proposed system is feasible and efficient.
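To make the TCP performance monitoring contribution listed above more concrete, the following is a minimal sketch of how delay and retransmission ratio might be computed for one micro-batch of packet records. It is plain Python rather than the Spark Streaming API, and it is not the algorithm of the proposed system, which this section does not spell out. The `Record` layout, the flow key, and the rule for detecting retransmissions (a data segment whose sequence number was already seen in the batch) are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    ts: float   # capture time in seconds
    flow: str   # hypothetical flow key, normalised to the data direction
    kind: str   # "data" or "ack"
    seq: int    # data: first sequence number; ack: unused
    end: int    # data: seq + payload length; ack: acknowledgment number


def batch_metrics(batch):
    """Return {flow: (retransmission_ratio, mean_delay_or_None)} for one batch."""
    by_flow = defaultdict(list)
    for rec in batch:                      # "map" step: key each record by flow
        by_flow[rec.flow].append(rec)

    result = {}
    for flow, recs in by_flow.items():     # "reduce" step: one pass per flow
        recs.sort(key=lambda r: r.ts)
        first_sent = {}                    # end sequence number -> first send time
        seen, retransmitted = set(), set()
        data_count = retrans_count = 0
        delays = []
        for r in recs:
            if r.kind == "data":
                data_count += 1
                if r.seq in seen:          # sequence number already seen
                    retrans_count += 1
                    retransmitted.add(r.end)
                else:
                    seen.add(r.seq)
                    first_sent[r.end] = r.ts
            elif r.end in first_sent and r.end not in retransmitted:
                # Delay sample: ACK time minus first send time. Samples for
                # retransmitted segments are skipped because they are ambiguous.
                delays.append(r.ts - first_sent.pop(r.end))
        ratio = retrans_count / data_count if data_count else 0.0
        mean_delay = sum(delays) / len(delays) if delays else None
        result[flow] = (ratio, mean_delay)
    return result


batch = [
    Record(0.000, "a->b", "data", 1, 101),
    Record(0.040, "a->b", "ack", 0, 101),
    Record(0.050, "a->b", "data", 101, 201),
    Record(0.350, "a->b", "data", 101, 201),   # retransmission
    Record(0.390, "a->b", "ack", 0, 201),
]
metrics = batch_metrics(batch)
```

Because each flow is processed independently, the per-flow step could in principle be distributed across workers by keying records on the flow identifier. State that spans micro-batches, such as a segment sent in one batch and acknowledged in the next, is not handled in this sketch.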
With the growth of Internet traffic, traditional network analysis methods that work on single machines are no longer suitable. Existing approaches take advantage of big data frameworks to improve processing efficiency. However, these approaches mainly focus on offline data analysis. In this study, we proposed an online Internet traffic monitoring system that uses Spark Streaming. We demonstrated that Internet measurement and monitoring can be treated as a stream analysis problem and handled by a stream-processing platform. Experimental results show that our system achieves good performance and robustness. In the future, we will implement collectors that capture packets from switches through port mirroring so that our system can analyze all the traffic passing through the monitored networks. We will also test its performance in practice and compare it with traditional single-server systems in terms of scalability and reliability.