A K-Means Based Multi-level Text Clustering Algorithm for Retrieval of Research Information

Academic researchers in institutions of higher learning and research institutes rely on research outputs and metadata throughout their work, both to learn about existing research and to identify potential collaborators. Research outputs range from academic theses, journal and conference articles, books, and book chapters to datasets, while research metadata includes authors, affiliations, research areas, and projects, among others. However, accessing and retrieving relevant research outputs and metadata remains a major challenge. As a result, research is duplicated, opportunities for networking are fewer, and scientific fraud is harder to detect. Academic research outputs and metadata therefore need to be made readily available and easy to retrieve. The main purpose of this work is to develop a tailor-made information retrieval approach for research information and related metadata. To that end, the paper presents a multi-level text clustering algorithm for retrieving scholarly research outputs and metadata from a central repository through a web-based interface. The algorithm first clusters SQL data records that represent metadata at the first level, then retrieves and clusters text documents representing research outputs at the second level. The algorithm was tested on retrieving information in the areas of text clustering, cloud computing, banking, HIV/AIDS, food security, and cancer. The results show that it enables researchers to retrieve information relevant to their information needs. To enable further enhancements and improvements, the algorithm will be released to the public domain for use in similar application domains or extension by other researchers.
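The two-level idea can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the text features, the distance measure, the number of clusters, or how a cluster is selected for a query, so the bag-of-words vectors, whitespace tokenization, deterministic centroid initialization, and all function and variable names below (`multilevel_search`, `metadata`, `documents`, and so on) are assumptions made for illustration. Metadata records are represented as plain strings standing in for SQL rows.

```python
import math
from collections import Counter

def build_vocab(texts):
    # Shared vocabulary from whitespace tokens (an assumed, simplistic tokenizer).
    return sorted({w for t in texts for w in t.lower().split()})

def to_vector(text, vocab):
    # Length-normalized term-frequency vector; words outside vocab are ignored.
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    # Lloyd's K-means. Centroids start at the first k vectors so the
    # sketch is deterministic; real systems usually randomize this.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def nearest_cluster_members(items, texts, query, k):
    # Cluster the texts, then keep the items whose cluster centroid
    # lies closest to the query vector.
    vocab = build_vocab(texts)
    vectors = [to_vector(t, vocab) for t in texts]
    k = min(k, len(vectors))
    labels, centroids = kmeans(vectors, k)
    q = to_vector(query, vocab)
    best = min(range(k), key=lambda c: sq_dist(q, centroids[c]))
    return [item for item, lab in zip(items, labels) if lab == best]

def multilevel_search(metadata, documents, query, k1=2, k2=2):
    ids = list(metadata)
    # Level 1: cluster the metadata records (stand-ins for SQL rows).
    level1 = nearest_cluster_members(ids, [metadata[i] for i in ids], query, k1)
    # Level 2: cluster the full texts linked to the records kept at level 1.
    return nearest_cluster_members(level1, [documents[i] for i in level1], query, k2)

# Hypothetical toy repository: record id -> metadata string / full text.
metadata = {
    1: "text clustering kmeans",
    2: "cloud computing storage",
    3: "text clustering retrieval",
    4: "cloud computing security",
}
documents = {
    1: "kmeans text clustering of documents",
    2: "cloud storage for large files",
    3: "text retrieval with inverted index",
    4: "security in cloud platforms",
}
result = multilevel_search(metadata, documents, "text clustering")
print(result)  # -> [1]
```

In this toy run, the first level narrows the four records to the two whose metadata mentions text clustering, and the second level then separates their full texts and keeps the one closest to the query. Filtering on compact metadata before touching full documents is the design choice the sketch is meant to show; the specific parameters are placeholders.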
