Machine Learning for Arabic Text Classification: A Comparative Study
Keywords: Arabic text classification, Natural language processing, Feature extraction, Machine learning
The ultimate aim of Machine Learning (ML) is to make machines act like humans. In particular, ML algorithms are widely used to classify texts. Text classification is the process of assigning texts to a predefined set of categories based on their content, and it contributes to improving information retrieval on the Web. In this paper, we focus on Arabic text classification, since a large community worldwide uses this language. The Arabic text classification process consists of three main steps: preprocessing, feature extraction, and classification with an ML algorithm. This paper presents a comparative empirical study to determine which combination of feature extraction technique and ML algorithm performs best on Arabic documents. To this end, we implemented 160 classifiers by combining 5 feature extraction techniques with 32 ML algorithms, and we made these classifiers openly accessible for the benefit of the AI and NLP communities. Experiments were carried out on a large open dataset. The comparison reveals that the TFIDF-Perceptron combination performs best.
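To make the three-step process concrete, the following is a minimal sketch of one such combination, TF-IDF features fed to a perceptron, written in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in this study: the toy documents, the labels, the smoothed IDF formula, and all function names are hypothetical, and preprocessing is reduced to whitespace tokenization.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: whitespace tokenization only. Real Arabic preprocessing
    # (normalization, stop-word removal, stemming) is omitted from this sketch.
    return text.split()

def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc, idf):
    # Sparse, L2-normalized TF-IDF vector; terms unseen during fitting are dropped.
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def score(weights, vec):
    # Dot product between a class weight vector and a sparse document vector.
    return sum(weights.get(t, 0.0) * v for t, v in vec.items())

def train_perceptron(vectors, labels, epochs=10):
    # Multiclass perceptron: one weight vector per class, updated only on mistakes.
    classes = sorted(set(labels))
    weights = {c: defaultdict(float) for c in classes}
    for _ in range(epochs):
        for vec, gold in zip(vectors, labels):
            pred = max(classes, key=lambda c: score(weights[c], vec))
            if pred != gold:
                for t, v in vec.items():
                    weights[gold][t] += v
                    weights[pred][t] -= v
    return weights

def predict(weights, vec):
    return max(sorted(weights), key=lambda c: score(weights[c], vec))

# Hypothetical toy corpus (two categories, two documents each).
train_docs = [
    "فاز الفريق في المباراة",
    "سجل اللاعب هدفا في المباراة",
    "ارتفع سعر النفط في السوق",
    "انخفض سهم البنك في السوق",
]
train_labels = ["sport", "sport", "economy", "economy"]
test_docs = ["فاز اللاعب في المباراة", "ارتفع سهم النفط في السوق"]

idf = fit_idf(train_docs)
model = train_perceptron([tfidf(d, idf) for d in train_docs], train_labels)
predictions = [predict(model, tfidf(d, idf)) for d in test_docs]
print(predictions)
```

Swapping `tfidf` for another feature extractor, or `train_perceptron` for another learner, yields a different classifier; this is the sense in which pairing each feature extraction technique with each ML algorithm produces the set of combinations compared here.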
Copyright (c) 2022 Djelloul Bouchiha, Abdelghani Bouziane, Noureddine Doumi
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.