Using the Web 1T 5-Gram Database for Attribute Selection in Formal Concept Analysis to Correct Overstemmed Clusters

As part of information retrieval processes, words are often stemmed to a common root. The Porter stemming algorithm is a rule-based suffix-removal process. Stemming can be viewed as a way of clustering related words under one common stem. Sometimes the Porter algorithm places unrelated words in the same cluster, an error known as overstemming. This experiment attempts to correct such clusters using Formal Concept Analysis (FCA). FCA is the process of deriving formal concepts from a given formal context. A formal context consists of a set of objects, a set of attributes, and a binary relation that indicates which attributes each object possesses. A formal concept is a pair made up of a set of objects and a set of attributes that are closed with respect to each other: the attributes are exactly those shared by all of the objects, and the objects are exactly those possessing all of the attributes. Using the Cranfield document collection, this experiment built a comparison measure between the words in each stemmed cluster from the Google Web 1T 5-gram data set. When FCA was used to correct the clusters, the level of success varied with the error threshold allowed.
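The definitions of formal context and formal concept can be made concrete with a minimal Python sketch. It is not the experiment's implementation: it enumerates concepts by brute force over object subsets, which is only practical for tiny contexts. The words and attributes below are invented toy data, not values taken from the Cranfield collection or the Web 1T 5-gram data set, and the function names are hypothetical.

```python
from itertools import combinations

# Toy formal context (invented for illustration): each object (a word from
# one stemmed cluster) maps to the set of attributes it possesses.
context = {
    "generate":   {"power", "electricity"},
    "generation": {"power", "electricity"},
    "general":    {"army"},
    "generous":   {"gift"},
}

def common_attributes(objects, context, all_attributes):
    """Return the attributes shared by every object in `objects`."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)

def objects_with(attributes, context):
    """Return the objects that possess every attribute in `attributes`."""
    return frozenset(o for o, attrs in context.items() if attributes <= attrs)

def formal_concepts(context):
    """Enumerate all formal concepts as (extent, intent) pairs.

    Brute force: for each subset of objects, take the shared attributes
    (the intent), then the objects possessing all of them (the extent).
    Exponential in the number of objects, so suitable for toy data only.
    """
    all_attributes = set().union(*context.values())
    objects = sorted(context)
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context, all_attributes)
            extent = objects_with(intent, context)
            concepts.add((extent, intent))
    return concepts

concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

In this toy context, "generate" and "generation" share a concept because they have identical attribute sets, while "general" and "generous" fall into concepts of their own. That is the kind of separation a cluster correction would rely on, assuming the attributes are chosen well.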
