Open Access Article

Comparative Study of Techniques for Alleviating Class Imbalance in Spam Classification

Gopalkrishna Waja

Section: Research Paper, Product Type: Journal Paper
Volume 9, Issue 8, Pages 38-45, Aug 2021

CrossRef DOI: https://doi.org/10.26438/ijcse/v9i8.3845

Online published on Aug 31, 2021

Copyright © Gopalkrishna Waja. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Gopalkrishna Waja, “Comparative Study of Techniques for Alleviating Class Imbalance in Spam Classification,” International Journal of Computer Sciences and Engineering, Vol.9, Issue.8, pp.38-45, 2021.

MLA Style Citation: Gopalkrishna Waja. "Comparative Study of Techniques for Alleviating Class Imbalance in Spam Classification." International Journal of Computer Sciences and Engineering 9.8 (2021): 38-45.

APA Style Citation: Gopalkrishna Waja (2021). Comparative Study of Techniques for Alleviating Class Imbalance in Spam Classification. International Journal of Computer Sciences and Engineering, 9(8), 38-45.

BibTex Style Citation:
@article{Waja_2021,
author = {Gopalkrishna Waja},
title = {Comparative Study of Techniques for Alleviating Class Imbalance in Spam Classification},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {8 2021},
volume = {9},
Issue = {8},
month = {8},
year = {2021},
issn = {2347-2693},
pages = {38-45},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=5376},
doi = {10.26438/ijcse/v9i8.3845},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
DO - 10.26438/ijcse/v9i8.3845
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=5376
TI - Comparative Study of Techniques for Alleviating Class Imbalance in Spam Classification
T2 - International Journal of Computer Sciences and Engineering
AU - Gopalkrishna Waja
PY - 2021
DA - 2021/08/31
PB - IJCSE, Indore, INDIA
SP - 38
EP - 45
IS - 8
VL - 9
SN - 2347-2693
ER -


Abstract

Class imbalance is inarguably one of the most significant and common problems faced when training supervised machine learning models to identify anomalies. In domains such as spam filtering, medical diagnosis, and intrusion detection, far more data is available for the negative class than for the positive class. Training a traditional machine learning model on such data biases it toward the negative class at the expense of the positive class, so the model gives a false sense of accuracy and undermines its own purpose. Owing to the importance of this problem, several techniques have been developed to tackle it, and this paper provides an empirical comparative evaluation of a range of these techniques for mitigating the adverse effects of class imbalance in spam classification. I compare the effect of eight resampling techniques, including ROS, SMOTE, ADASYN, Near-Miss, and Tomek links, on the performance of eight learning classifiers, which were selected carefully to cover diverse classification strategies. In addition, the performance of four ensemble learning methods, including EasyEnsemble and SMOTEBoost, is contrasted when trained on an imbalanced dataset. The AUC-ROC metric, calculated using stratified 5-fold cross-validation, was used to evaluate the effect of the different imbalance-handling techniques. Furthermore, statistical tests were performed on the results to identify the best model for spam classification on the dataset used.
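The evaluation protocol described in the abstract (resample, train, score by AUC-ROC under stratified 5-fold cross-validation) can be sketched as follows. This is a minimal, dependency-free illustration and not the author's implementation: the function names, the synthetic two-feature data, and the nearest-centroid scorer are all assumptions made for the example, and only random oversampling (ROS) is shown. The toy scorer stands in for the paper's classifiers and is itself only weakly sensitive to class ratio. The point to note is that resampling is applied to the training folds only, so each test fold keeps its natural imbalance.

```python
import random

def stratified_folds(labels, k, rng):
    """Split indices into k folds that preserve the class ratio."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def random_oversample(X, y, rng):
    """ROS: duplicate randomly chosen minority samples until classes balance."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]

def auc_roc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def centroid_scorer(X, y):
    """Toy stand-in classifier: higher score means closer to the positive centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    c1 = centroid([x for x, label in zip(X, y) if label == 1])
    c0 = centroid([x for x, label in zip(X, y) if label == 0])
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return lambda x: dist(x, c0) - dist(x, c1)

def cross_validated_auc(X, y, k=5, seed=0):
    """Mean AUC-ROC over k stratified folds, oversampling the training part only."""
    rng = random.Random(seed)
    aucs = []
    for test_idx in stratified_folds(y, k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_tr, y_tr = random_oversample(
            [X[i] for i in train_idx], [y[i] for i in train_idx], rng
        )
        score = centroid_scorer(X_tr, y_tr)
        aucs.append(
            auc_roc([y[i] for i in test_idx], [score(X[i]) for i in test_idx])
        )
    return sum(aucs) / k

# Synthetic 9:1 imbalanced data (an assumption; label 1 plays the role of spam).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(450)]
X += [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(50)]
y = [0] * 450 + [1] * 50

mean_auc = cross_validated_auc(X, y)
print(f"mean AUC-ROC over 5 stratified folds: {mean_auc:.3f}")
```

Swapping `random_oversample` for another resampler, or `centroid_scorer` for another model, leaves the rest of the loop unchanged, which is what makes a technique-by-classifier comparison of this kind straightforward to run.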

Keywords / Index Terms

Imbalance, spam classification, resampling, ensemble learners, statistical test
