Automatic new topic identification using multiple linear regression

The purpose of this study is to provide automatic new topic identification for search engine query logs and to estimate the effect of the statistical characteristics of search engine queries on new topic identification. By applying multiple linear regression and multi-factor ANOVA to a sample data log from the Excite search engine, we demonstrate that statistical characteristics of Web search queries, such as time interval, search pattern, and the position of a query in a user session, affect shifts to a new topic. Multiple linear regression also proves to be a successful tool for estimating topic shifts and continuations. The findings provide statistical evidence for the relationship between the non-semantic characteristics of Web search queries and the occurrence of topic shifts and continuations.
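As a rough illustration of the regression step only (not the ANOVA), the sketch below fits an ordinary least-squares model that maps the three query characteristics named above to a shift/continuation indicator. It is a minimal sketch under stated assumptions, not the study's implementation: the search-pattern category names, the dummy coding, the 0.5 decision threshold, the function names, and the toy rows are all hypothetical and do not come from the Excite log.

```python
import numpy as np

# Hypothetical search-pattern categories (assumption; not taken from the study).
PATTERNS = ["new", "reformulation", "next_page"]


def design_matrix(time_interval, pattern, position):
    """Columns: intercept, time interval, pattern dummies (first level dropped), position."""
    time_interval = np.asarray(time_interval, dtype=float)
    position = np.asarray(position, dtype=float)
    pattern = np.asarray(pattern)
    # Dummy-code the categorical search pattern; PATTERNS[0] is the baseline.
    dummies = [(pattern == p).astype(float) for p in PATTERNS[1:]]
    return np.column_stack(
        [np.ones_like(time_interval), time_interval, *dummies, position]
    )


def fit(X, y):
    """Ordinary least-squares coefficients for y ~ X."""
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef


def predict_shift(X, coef, threshold=0.5):
    """Label a query as a topic shift when the fitted value reaches the threshold.

    The 0.5 cut-off is an assumption made for this sketch.
    """
    return (X @ coef) >= threshold


# Toy rows (fabricated for illustration): seconds since the previous query,
# search pattern, position of the query in the session, and a 0/1 shift label.
X_demo = design_matrix(
    [5, 300, 20, 900, 45, 1200, 10, 700],
    ["next_page", "new", "reformulation", "new",
     "reformulation", "new", "next_page", "reformulation"],
    [2, 3, 2, 4, 3, 5, 4, 2],
)
y_demo = [0, 1, 0, 1, 0, 1, 0, 1]

coef_demo = fit(X_demo, y_demo)
labels_demo = predict_shift(X_demo, coef_demo)
```

Here `coef_demo` holds one coefficient per column of the design matrix, and `labels_demo` is a boolean array marking the queries the fitted model would flag as topic shifts. On a real log, the model would be fitted on one portion of the sessions and evaluated on another.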
