Semi-Automatic Classification and Duplicate Detection From Human Loss News Corpus

Automatic news collection systems rely on a crawler that extracts articles from different news portals. The extracted articles must then be processed to determine their category, e.g. sports, politics, or showbiz. This process poses two main challenges. The first is to place each article under the right category. The second is to detect duplicates: when news is extracted from multiple sources, the same story is likely to appear on many portals, and failing to detect such duplicates can lead to inconsistent statistics after the news text is pre-processed. The problem is more pertinent for human loss news, such as articles on crime and accidents, because the system may count the same incident many times and produce misleading statistics. To address these problems, this research makes the following contributions. First, a corpus of human loss news in different categories has been developed by gathering data from well-known and authentic news websites; the corpus also includes a number of duplicate articles. Second, different classification approaches have been compared empirically to find the most suitable text classifier for the sub-categories of human loss news. Lastly, methods for detecting duplicate news in the corpus have been proposed and compared, combining different pre-processing techniques with two widely used similarity measures, cosine similarity and the Jaccard coefficient. The results show that conventional text classifiers remain relevant and perform well in text classification tasks: multinomial Naive Bayes (MNB) achieved an accuracy of 89.5%. For duplicate news detection, the Jaccard coefficient performed much better than cosine similarity across the pre-processing variations, with an average accuracy of 83.16%.
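
The two similarity measures named above can be illustrated with a minimal sketch. It is not the pipeline used in this research: the stop-word list, the tokenizer, the 0.5 threshold, and the function names (`tokenize`, `jaccard`, `cosine`, `is_duplicate`) are all assumptions chosen for illustration, and cosine similarity is computed here over raw term counts.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption, not the paper's).
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "to", "was", "were"}


def tokenize(text, remove_stop_words=True):
    """Lowercase the text, split it into alphanumeric tokens, optionally drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient: size of the shared vocabulary over size of the combined vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty documents
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(tokens_a, tokens_b):
    """Cosine similarity between the raw term-count vectors of two documents."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(text_a, text_b, measure=jaccard, threshold=0.5):
    """Flag two articles as duplicates when their similarity reaches the threshold.

    The 0.5 threshold is an illustrative assumption, not a value from the paper.
    """
    return measure(tokenize(text_a), tokenize(text_b)) >= threshold
```

For example, `is_duplicate("Three people were killed in a road accident in Lahore", "Road accident in Lahore killed three people")` flags the pair as duplicates under these assumptions, because both headlines reduce to the same set of tokens once stop words are removed. Passing `measure=cosine` switches the measure, and `remove_stop_words=False` in `tokenize` gives one simple pre-processing variation to compare against.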
