Information retrieval on documents methodology based on entropy filtering methodologies

An information retrieval problem arises when the target information is not available 'literally' in the set of documents. For problems in which the goal is to find 'hidden' information, it is important to develop hybrid methodologies, improve existing ones, or design new ones. In this work the authors address the identification of the most informative pieces of data in a collection of documents, in order to obtain the best result in a subsequent fuzzy clustering stage. The aim is to find similarities between the documents and a reference target, and so establish relationships tied to a non-literal feature. We propose applying the well-known entropy term weighting scheme and then present several subsequent procedures for selecting the data of interest. This procedure captures the largest amount of information in the smallest amount of data. Applying a specific selection procedure to a group of words after the entropy weighting gives more information with which to differentiate and separate the documents. This yields considerable gains in processing time and in the correct fuzzy clustering of the document collection.
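
As a rough illustration of the entropy weighting step, the following minimal sketch computes a global entropy weight per term and keeps the highest-weighted terms. It assumes the common log-entropy formulation, g(t) = 1 + sum_j p_tj log(p_tj) / log(n), where p_tj is the share of term t's occurrences that fall in document j and n is the number of documents. The abstract does not state which variant the authors use, and the function names `entropy_weights` and `select_terms`, as well as the simple top-k rule, are hypothetical stand-ins for the selection procedures the paper describes.

```python
import math
from collections import Counter


def entropy_weights(docs):
    """Map each term to a global entropy weight in [0, 1].

    docs is a list of tokenized documents (lists of strings).
    A weight near 1 means the term is concentrated in few documents;
    a weight near 0 means it is spread evenly over all of them.
    Assumed formulation: g(t) = 1 + sum_j p_tj * log(p_tj) / log(n).
    """
    n = len(docs)
    counts = [Counter(doc) for doc in docs]

    # Total frequency of each term over the whole collection.
    totals = Counter()
    for c in counts:
        totals.update(c)

    weights = {}
    for term, global_freq in totals.items():
        h = 0.0
        for c in counts:
            tf = c.get(term, 0)
            if tf:
                p = tf / global_freq
                h += p * math.log(p)
        # With a single document the entropy is undefined; fall back to 1.
        weights[term] = 1.0 + h / math.log(n) if n > 1 else 1.0
    return weights


def select_terms(weights, k):
    """Hypothetical selection rule: keep the k highest-weighted terms.

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


if __name__ == "__main__":
    docs = [["entropy", "the"], ["fuzzy", "the"]]
    w = entropy_weights(docs)
    print(select_terms(w, 2))
```

In this toy collection the word "the" appears equally in both documents and so receives a weight of about zero, while "entropy" and "fuzzy" each appear in a single document and receive the maximum weight. The selected terms would then form the reduced vocabulary passed on to the fuzzy clustering stage.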
