Term frequency – function of document frequency: a new term weighting scheme for enterprise information retrieval

In today's business environment, enterprises are under increasing pressure to process the vast amounts of data they produce every day. One response is to focus on business intelligence (BI) applications and to increase commercial added value through business analytics. Term weighting, which converts documents into vectors in the term space, is a vital task in enterprise Information Retrieval (IR), text categorisation, text analytics and related applications. When determining the weight of a term in a document, the traditional TF-IDF scheme considers only the term's occurrence frequency within the document and across the entire document set, so some meaningful terms do not receive an appropriate weight. In this article, we propose a new term weighting scheme called Term Frequency – Function of Document Frequency (TF-FDF) to address this issue. Instead of using a monotonically decreasing function such as Inverse Document Frequency, FDF uses a convex function that dynamically adjusts weights according to the significance of the words in a document set. This function can be manually tuned based on the distribution of the most meaningful words that semantically represent the document set. Our experiments show that TF-FDF achieves a higher Normalised Discounted Cumulative Gain in IR than TF-IDF and its variants, and improves the accuracy of relevance ranking of the IR results.
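The contrast between a monotonically decreasing IDF and a tunable function of document frequency can be illustrated with a minimal sketch. The abstract does not give the FDF formula, so `bump_fdf` below is a hypothetical placeholder (a simple bump with made-up `peak` and `width` parameters) and should not be read as the authors' function. The names `weight_documents`, `idf` and `bump_fdf`, and the toy corpus, are likewise assumptions for illustration only.

```python
import math
from collections import Counter


def idf(df, n_docs):
    """Classic inverse document frequency: shrinks monotonically as df grows."""
    return math.log(n_docs / df)


def bump_fdf(df, n_docs, peak=0.5, width=0.2):
    """HYPOTHETICAL stand-in for a tunable function of document frequency.

    It is largest when df / n_docs equals `peak` and falls off on both sides.
    This is NOT the formula from the paper; it only shows a non-monotonic,
    manually tunable alternative plugged into the same pipeline.
    """
    ratio = df / n_docs
    return math.exp(-((ratio - peak) ** 2) / (2 * width ** 2))


def weight_documents(docs, df_function):
    """Return one {term: weight} dict per tokenised document.

    weight = (raw term frequency in the document) * df_function(df, n_docs)
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {term: count * df_function(doc_freq[term], n_docs)
             for term, count in tf.items()}
        )
    return weighted


# Toy corpus (invented for this sketch).
docs = [
    ["invoice", "payment", "the"],
    ["invoice", "shipment", "the"],
    ["contract", "the"],
    ["audit", "the"],
]

tfidf = weight_documents(docs, idf)
tffdf = weight_documents(docs, bump_fdf)

print(tfidf[0])
print(tffdf[0])
```

In this toy corpus, IDF ranks the rarer term "payment" above "invoice", while the placeholder function, tuned to peak at a document-frequency ratio of 0.5, ranks "invoice" higher. Which terms deserve the higher weight depends on the document set, which is why the function is described as manually tunable.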
