Search by multiple examples

It is often difficult for users to express their information needs with keywords. Search-By-Multiple-Examples (SBME), a promising approach to this problem, allows users to specify their information needs as a set of relevant documents rather than as a set of keywords. Most studies on SBME adopt Positive-Unlabeled learning (PU learning) techniques, treating the user-provided examples (referred to as query examples) as the positive set and the entire data collection as the unlabeled set. However, treating the entire collection as the unlabeled set is inefficient, as its size can be huge. In addition, the query examples are treated as relevant to a single topic, although they are often relevant to multiple topics. Moreover, because the query examples are far fewer than the unlabeled data, system performance may degrade dramatically due to the class imbalance problem. Finally, the experiments conducted in these studies have not taken into account the settings of online search, which differ greatly from controlled experimental scenarios. This proposed research seeks to improve SBME by exploring: (1) how to predict users' information needs by modeling document content with probabilistic topic models; and (2) how to address the class imbalance problem by reducing the size of the unlabeled data and adopting suitable machine learning techniques. We will also conduct extensive experiments to better evaluate SBME, using query-example sets of different sizes to simulate users' information needs.
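The PU-learning setup described above can be sketched as follows. This is a minimal illustration, not the proposed system: it uses a Rocchio-style centroid classifier over TF-IDF vectors and random under-sampling of the unlabeled set, and all document texts are hypothetical toy data.

```python
# Sketch: PU learning for SBME with under-sampling of the unlabeled set.
# The query examples form the positive set; a small random sample of the
# unlabeled collection is treated as (noisy) negatives to balance the classes.
import math
import random
from collections import Counter

def tfidf(docs):
    """TF-IDF weight vectors (as dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vecs):
    c = Counter()
    for v in vecs:
        for t, w in v.items():
            c[t] += w / len(vecs)
    return dict(c)

# User-provided query examples (the positive set) ...
positives = [
    "relevance feedback improves retrieval".split(),
    "query expansion for document retrieval".split(),
    "ranking documents by relevance".split(),
]
# ... and a much larger unlabeled collection (note the class imbalance).
unlabeled = [
    "convolutional networks for image recognition".split(),
    "document retrieval with language models".split(),
    "protein structure prediction".split(),
    "stochastic gradient descent analysis".split(),
    "relevance ranking in web search".split(),
    "graph algorithms for social networks".split(),
] * 40

all_vecs = tfidf(positives + unlabeled)
pos_vecs, unl_vecs = all_vecs[:len(positives)], all_vecs[len(positives):]

# Under-sample the unlabeled set down to the size of the positive set.
random.seed(0)
neg_idx = random.sample(range(len(unl_vecs)), k=len(pos_vecs))
pos_c = centroid(pos_vecs)
neg_c = centroid([unl_vecs[i] for i in neg_idx])

# Rank every unlabeled document by its Rocchio score: similarity to the
# positive centroid minus similarity to the sampled-negative centroid.
scores = [cosine(v, pos_c) - cosine(v, neg_c) for v in unl_vecs]
ranking = sorted(range(len(unlabeled)), key=lambda i: -scores[i])
print(" ".join(unlabeled[ranking[0]]))
```

A real SBME system would replace the toy texts with the collection's document vectors and could substitute any classifier (e.g., an SVM) for the centroid scorer; the under-sampling step is the part that addresses the class imbalance problem discussed above.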
