Lexicon-based Document Representation

G. Virginia and H. S. Nguyen
Address for correspondence: Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland

It is a big challenge for an information retrieval system (IRS) to interpret the queries made by users, particularly because the common form of query consists of very few terms. The tolerance rough set model (TRSM), an extension of rough set theory, has demonstrated its ability to enrich document representation in terms of semantic relatedness. However, system efficiency is at stake because the weight vector created by TRSM (TRSM-representation) is much less sparse than the document vector it enriches. We mapped the terms occurring in the TRSM-representation to terms in a lexicon, so that the final representation of a document was a weight vector consisting only of terms that occur in the lexicon (LEX-representation). The LEX-representation can be viewed as a compact form of the TRSM-representation in a lower-dimensional space, and it eliminates all informal terms that previously occurred in the TRSM vector. Given these properties, we may expect a more efficient system. We used recall and precision, measures commonly used in information retrieval, to evaluate the effectiveness of the LEX-representation. Our examination showed that the effectiveness of the LEX-representation is comparable to that of the TRSM-representation, while its efficiency should be better.

We concluded that lexicon-based document representation is another alternative for representing a document while taking semantics into account. We are inclined to combine the LEX-representation with linguistic computation, such as tagging and feature selection, in order to retrieve more relevant terms with high weight. With regard to the TRSM method, enhancing the quality of the tolerance classes is crucial because the method is fully reliant on them. We plan to combine other resources, such as Wikipedia Indonesia, to generate better tolerance classes.
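The mapping from TRSM-representation to LEX-representation, and the recall and precision measures used for evaluation, can be illustrated with a minimal Python sketch. It assumes that a term is matched to the lexicon by exact string equality and that a document vector is stored as a term-to-weight dictionary. The abstract does not specify the matching rule, so this is an illustration rather than the authors' implementation. The function names and the toy terms and weights are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def to_lex_representation(trsm_weights: Dict[str, float],
                          lexicon: Set[str]) -> Dict[str, float]:
    """Keep only the weighted terms that also occur in the lexicon.

    Assumption: exact string match between vector terms and lexicon entries.
    """
    return {term: weight
            for term, weight in trsm_weights.items()
            if term in lexicon}


def precision_recall(retrieved: Iterable[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    """Set-based precision and recall for a single query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical toy data: "bkn" stands in for an informal term
# that is absent from the lexicon and is therefore dropped.
trsm_vector = {"buku": 0.8, "kitab": 0.3, "bkn": 0.2}
lexicon = {"buku", "kitab", "baca"}
lex_vector = to_lex_representation(trsm_vector, lexicon)

# Hypothetical ranking result and relevance judgments for one query.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"],
                                     ["d1", "d3", "d5"])
```

In this sketch the LEX vector can never have more non-zero entries than the TRSM vector, which is the property the efficiency argument relies on. Whether retrieval effectiveness is preserved still has to be measured, for example with the precision and recall function shown.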
