Link-based Text Classification

A key challenge for machine learning is mining richly structured datasets in which the objects are linked in some way. Links among objects exhibit patterns that can help with many machine learning tasks but are usually hard to capture with traditional statistical models. Interest in this area has surged recently, fueled largely by web and hypertext mining, but also by the mining of social networks, bibliographic citation data, epidemiological data, and other domains best described by a linked or graph structure. In this paper we propose a framework for modeling link distributions: a link-based model that supports discriminative models describing both the link distributions and the attributes of linked objects. We use a structured logistic regression model that captures both content and links. We systematically evaluate several variants of our link-based model on a range of datasets, including both web and citation collections. In all cases, using the link distribution improves classification accuracy.
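
To make the idea of combining content with a link distribution concrete, the following is a minimal sketch, not the model evaluated in the paper. It assumes that the link distribution of a document can be summarized as a count of its neighbors' class labels, and that content and link features are scored together by one multinomial logistic (softmax) model. The class names, feature values, weights, and the functions `link_count_features` and `predict` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

CLASSES = ["ml", "db", "ir"]  # hypothetical class labels


def link_count_features(neighbor_labels, classes=CLASSES):
    """Count how many linked neighbors fall in each class (assumed link summary)."""
    counts = Counter(neighbor_labels)
    return [counts.get(c, 0) for c in classes]


def softmax(scores):
    """Turn raw class scores into probabilities, shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(content_x, link_x, content_w, link_w, classes=CLASSES):
    """Score each class linearly over content and link features, then normalize."""
    scores = []
    for k in range(len(classes)):
        s = sum(w * x for w, x in zip(content_w[k], content_x))
        s += sum(w * x for w, x in zip(link_w[k], link_x))
        scores.append(s)
    probs = softmax(scores)
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs


# Hand-picked (not learned) weights: one row per class.
content_w = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
link_w = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]

content_x = [1.0, 0.0]  # content alone leans toward "ml"
link_x = link_count_features(["db", "db", "db", "ml"])  # most neighbors are "db"

label, probs = predict(content_x, link_x, content_w, link_w)
```

In this toy setting the content features alone favor "ml", but three of the four linked neighbors are labeled "db", so the combined score selects "db". In practice the weights would be fit to labeled data rather than set by hand.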
