Clustering Approach Based on Mini Batch Kmeans for Intrusion Detection System over Big Data
An Intrusion Detection System (IDS) provides an important basis for network defense. With the development of cloud computing and social networks, massive amounts of data are generated, which inevitably puts considerable pressure on IDS. It therefore becomes crucial to efficiently divide big data into different classes according to data features, so that one can further determine from the class information whether a given behavior is normal. Although the clustering approach based on Kmeans for IDS has been well studied, using it directly in a big data environment is problematic. On the one hand, the efficiency of data clustering needs to be improved. On the other hand, unlike classification, clustering has no unified evaluation indicator, and thus it is necessary to study which indicator is more suitable for evaluating the clustering results of IDS. In this study, we propose a clustering method for IDS based on Mini Batch Kmeans combined with Principal Component Analysis (PCA). Firstly, a preprocessing method is proposed to digitize the strings, and the dataset is then normalized so as to improve the clustering efficiency. Secondly, PCA is used to reduce the dimension of the processed dataset, aiming to further improve the clustering efficiency, and then the Mini Batch Kmeans method is used for data clustering. More specifically, we use Kmeans++ to initialize the cluster centers in order to keep the algorithm from getting trapped in a local optimum; in addition, we choose the Calinski-Harabasz (CH) indicator so that the clustering result is more easily determined. Compared with the other methods, the experimental results and the time complexity analysis show that our proposed method is effective and efficient. Above all, our proposed clustering method can be used for IDS over big data.
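The preprocessing step described above, digitizing string features and then normalizing, might look like the following minimal sketch. It is an illustration under assumptions rather than the authors' procedure: the integer coding by order of first appearance, the min-max scaling to [0, 1], the function names, and the three toy records (loosely modeled on KDDCUP99-style fields) are all hypothetical choices.

```python
from typing import Dict, List, Sequence, Union

Row = Sequence[Union[str, int, float]]


def digitize_strings(rows: Sequence[Row]) -> List[List[float]]:
    """Replace every string by an integer code, one codebook per column.

    Assumption: codes are assigned in order of first appearance.
    """
    codebooks: Dict[int, Dict[str, int]] = {}
    encoded: List[List[float]] = []
    for row in rows:
        new_row: List[float] = []
        for col, value in enumerate(row):
            if isinstance(value, str):
                book = codebooks.setdefault(col, {})
                # A new string gets the next free code; a known one keeps its code.
                new_row.append(float(book.setdefault(value, len(book))))
            else:
                new_row.append(float(value))
        encoded.append(new_row)
    return encoded


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column to [0, 1]; constant columns are mapped to 0.0.

    Assumption: min-max scaling is used here only as one common normalization.
    """
    n_cols = len(rows[0])
    lows = [min(r[c] for r in rows) for c in range(n_cols)]
    highs = [max(r[c] for r in rows) for c in range(n_cols)]
    return [
        [
            (r[c] - lows[c]) / (highs[c] - lows[c]) if highs[c] > lows[c] else 0.0
            for c in range(n_cols)
        ]
        for r in rows
    ]


# Made-up records: duration, protocol, service, source bytes.
records = [
    [0, "tcp", "http", 181],
    [0, "udp", "domain_u", 45],
    [2, "tcp", "smtp", 1000],
]
encoded = digitize_strings(records)
normalized = min_max_normalize(encoded)
```

After this step every feature is numeric and lies on a comparable scale, which keeps any single large-valued feature from dominating the distance computations in the later clustering stage.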
Data mining is an intelligent data analysis technique that finds useful knowledge from big data. Data mining was first introduced to intrusion detection in 1998, and many researchers have since engaged in this issue. In general, research on IDS based on data mining focuses mainly on two aspects, namely clustering and classification. More specifically, in the clustering setting there is no label for each data record in the initial dataset, and the objective of the clustering algorithm is to put similar data records in the same class. That means the behavior of a packet will be marked as normal or abnormal according to the characteristics of the data, so as to achieve the purpose of clustering data.
Firstly, we propose using Mini Batch Kmeans with PCA (PMBKM) for IDS clustering; the experimental results show that the clustering efficiency is greatly improved. Secondly, we use Kmeans++ to initialize the cluster centers so as to keep the algorithm from getting trapped in a local optimum and further ensure the effectiveness of the clustering results. Thirdly, we consider the CH indicator for the evaluation of clustering results; the experimental results show that it makes it easy to determine which K-value is the best one. Last but not least, not only the 10% dataset but also the full KDDCUP99 dataset is tested in this study. Above all, our proposed PMBKM can be used for IDS clustering over big data.
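The clustering ingredients listed above (Kmeans++ initialization, mini-batch updates, and the CH indicator for choosing K) can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: the per-center learning rate 1/count is the commonly described mini-batch update, and the function names, batch size, iteration count, seed, and toy data are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Points = List[List[float]]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points: Points, k: int, rng: random.Random) -> Points:
    """Pick centers with probability proportional to squared distance
    from the nearest center chosen so far (needs >= k distinct points)."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def nearest(p: Sequence[float], centers: Points) -> int:
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def mini_batch_kmeans(points: Points, k: int, batch_size: int = 8,
                      n_iter: int = 50, seed: int = 0) -> Tuple[Points, List[int]]:
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    counts = [0] * k
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        assigned = [nearest(p, centers) for p in batch]  # assign before updating
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers, [nearest(p, centers) for p in points]


def calinski_harabasz(points: Points, labels: List[int]) -> float:
    """Ratio of between-cluster to within-cluster dispersion; higher is better."""
    n, dim, clusters = len(points), len(points[0]), sorted(set(labels))
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for j in clusters:
        members = [p for p, lab in zip(points, labels) if lab == j]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)
    if k < 2 or n <= k or within == 0.0:
        return 0.0  # index undefined; treated as the worst score in this sketch
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two well-separated groups of ten points each.
blob_a = [[0.1 * i, 0.1 * (i % 3)] for i in range(10)]
blob_b = [[10 + 0.1 * i, 10 + 0.1 * (i % 3)] for i in range(10)]
data = blob_a + blob_b

centers, labels = mini_batch_kmeans(data, k=2)
scores = {k: calinski_harabasz(data, mini_batch_kmeans(data, k)[1])
          for k in range(2, 5)}
best_k = max(scores, key=scores.get)
```

Because each iteration touches only a small batch rather than the full dataset, the per-iteration cost does not grow with the number of records, which is the efficiency argument for using the mini-batch variant on large data.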
In this study, we proposed a clustering method based on Mini Batch Kmeans with PCA (PMBKM) for intrusion detection systems. Taking the classic IDS dataset KDDCUP99 as an example, both the 10% dataset and the full dataset are tested. Firstly, we preprocess the given dataset, and then the PCA method is used to reduce the dimension so as to improve the clustering efficiency. The Mini Batch Kmeans method is then used to cluster the processed dataset. Compared with Kmeans (KM), Kmeans with PCA (PKM), and Mini Batch Kmeans (MBKM), the experimental results show that our proposed PMBKM is effective and efficient. Above all, PMBKM can be used for intrusion detection systems over big data. In future work, we will study clustering methods over fog computing.
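The dimension-reduction step summarized above can be illustrated with a small PCA sketch. To stay dependency-free it finds the principal components by power iteration with deflation, which is a substitute chosen for this example and not something the text specifies; the function names, iteration count, seed, and toy data are likewise assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mat_vec(m: Matrix, v: Sequence[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def center_and_covariance(rows: Matrix) -> Tuple[Matrix, Matrix]:
    """Subtract column means and return (centered data, sample covariance)."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - means[d] for d in range(dim)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    return centered, cov


def top_component(cov: Matrix, n_iter: int = 200, seed: int = 0) -> Tuple[List[float], float]:
    """Power iteration: repeatedly apply cov and renormalize."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in cov]  # strictly positive start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in this direction
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue


def pca(rows: Matrix, n_components: int) -> Tuple[Matrix, Matrix]:
    """Return (projected data, components) for the leading directions."""
    centered, cov = center_and_covariance(rows)
    components: Matrix = []
    for _ in range(n_components):
        v, lam = top_component(cov)
        components.append(v)
        # Deflation: remove the variance already explained by v.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    projected = [[sum(a * b for a, b in zip(r, c)) for c in components]
                 for r in centered]
    return projected, components


# Toy data: three features, but all variance lies along the direction (1, 2, 0).
rows = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
projected, components = pca(rows, n_components=1)
```

In this toy case a single component captures all of the variance, so the three-dimensional records are reduced to one coordinate each without losing information; on real data the number of retained components is a trade-off between clustering speed and retained variance.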