Term Weighting in Short Documents for Document Categorization , Keyword Extraction and Query Expansion

This thesis focuses on term weighting in short documents. I propose weighting approaches for assessing the importance of terms for three tasks: (1) document categorization, which aims to classify documents such as tweets into categories, (2) keyword extraction, which aims to identify and extract the most important words of a document, and (3) keyword association modeling, which aims to identify links between keywords and use them for query expansion. As the focus of text mining is shifting toward datasets that hold usergenerated content, for example, social media, the type of data used in the text mining research is changing. The main characteristic of this data is its shortness. For example, a user status update usually contains less than 20 words. When using short documents, the biggest challenge in term weighting comes from the fact that most words of a document occur only once within the document. This is called hapax legomena and we call it Term Frequency = 1, or TF=1 challenge. As many traditional feature weighting approaches, such as Term Frequency Inverse Document Frequency, are based on the occurrence frequency of each word within a document, these approaches do not perform well with short documents. The first contribution of this thesis is a term weighting approach for doc-

[1]  J. Knott The organization of behavior: A neuropsychological theory , 1951 .

[2]  Don R. Swanson,et al.  Probabilistic models for automatic indexing , 1974, J. Am. Soc. Inf. Sci..

[3]  R. Shiffrin,et al.  Search of associative memory. , 1981 .

[4]  Jesfis Peral,et al.  Heuristics -- intelligent search strategies for computer problem solving , 1984 .

[5]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[6]  Kenneth Ward Church,et al.  Inverse Document Frequency (IDF): A Measure of Deviations from Poisson , 1995, VLC@ACL.

[7]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[8]  Yukio Ohsawa,et al.  KeyGraph: automatic indexing by co-occurrence graph based on building construction metaphor , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[9]  Miguel A. Andrade-Navarro,et al.  Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families , 1998, Bioinform..

[10]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[11]  Carl Gutwin,et al.  Domain-Specific Keyphrase Extraction , 1999, IJCAI.

[12]  Carl Gutwin,et al.  KEA: practical automatic keyphrase extraction , 1999, DL '99.

[13]  B. Schölkopf,et al.  Advances in kernel methods: support vector learning , 1999 .

[14]  Dunja Mladenic,et al.  Feature Selection for Unbalanced Class Distribution and Naive Bayes , 1999, ICML.

[15]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[16]  Anette Hulth,et al.  Automatic Keyword Extraction Using Domain Knowledge , 2001, CICLing.

[17]  Yaakov HaCohen-Kerner,et al.  Automatic Extraction of Keywords from Abstracts , 2003, KES.

[18]  Anette Hulth,et al.  Improved Automatic Keyword Extraction Given More Linguistic Knowledge , 2003, EMNLP.

[19]  Peter D. Turney Coherent Keyphrase Extraction via Web Mining , 2003, IJCAI.

[20]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[21]  David R. Karger,et al.  Tackling the Poor Assumptions of Naive Bayes Text Classifiers , 2003, ICML.

[22]  Mitsuru Ishizuka,et al.  Keyword extraction from a single document using word co-occurrence statistical information , 2004, Int. J. Artif. Intell. Tools.

[23]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[24]  Geoff Holmes,et al.  Multinomial Naive Bayes for Text Categorization Revisited , 2004, Australian Conference on Artificial Intelligence.

[25]  Fabio Crestani,et al.  Application of Spreading Activation Techniques in Information Retrieval , 1997, Artificial Intelligence Review.

[26]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[27]  Peter D. Turney Learning Algorithms for Keyphrase Extraction , 2000, Information Retrieval.

[28]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[29]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[30]  Anette Hulth,et al.  Enhancing Linguistically Oriented Automatic Keyword Extraction , 2004, NAACL.

[31]  Fabrizio Sebastiani Text Categorization , 2005, Encyclopedia of Database Technologies and Applications.

[32]  Tommi S. Jaakkola,et al.  Using term informativeness for named entity detection , 2005, SIGIR '05.

[33]  B. Magnini,et al.  A Keyphrase-Based Approach to Summarization : the LAKE System at DUC-2005 , 2005 .

[34]  Anita Krishnakumar anita Building a kNN classifier for the Reuters-21578 collection , 2006 .

[35]  Min-Yen Kan,et al.  Keyphrase Extraction in Scientific Publications , 2007, ICADL.

[36]  P. Smith,et al.  A review of ontology based query expansion , 2007, Inf. Process. Manag..

[37]  Ilyas Cicekli,et al.  Using lexical chains for keyword extraction , 2007, Inf. Process. Manag..

[38]  George Forman,et al.  BNS feature scaling: an improved representation over tf-idf for svm text classification , 2008, CIKM '08.

[39]  Xiaojun Wan,et al.  CollabRank: Towards a Collaborative Approach to Single-Document Keyphrase Extraction , 2008, COLING.

[40]  Alan Ritter,et al.  Unsupervised Modeling of Twitter Conversations , 2010, NAACL.

[41]  Teng-Sheng Moh,et al.  Can You Judge a Man by His Friends? - Enhancing Spammer Detection on the Twitter Microblogging Platform Using Friends and Followers , 2010, ICISTM.

[42]  Patrick Paroubek,et al.  Twitter as a Corpus for Sentiment Analysis and Opinion Mining , 2010, LREC.

[43]  Virgílio A. F. Almeida,et al.  Detecting Spammers on Twitter , 2010 .

[44]  M. Chuah,et al.  Spam Detection on Twitter Using Traditional Classifiers , 2011, ATC.

[45]  Claudio Carpineto,et al.  A Survey of Automatic Query Expansion in Information Retrieval , 2012, CSUR.

[46]  Mika Timonen Categorization of Very Short Documents , 2012, KDIR.

[47]  Das Amrita,et al.  Mining Association Rules between Sets of Items in Large Databases , 2013 .

[48]  Aaas News,et al.  Book Reviews , 1893, Buffalo Medical and Surgical Journal.