AutoPCS: A Phrase-Based Text Categorization System for Similar Texts

Nearly all text classification methods assign texts to predefined categories according to the terms that appear in them. State-of-the-art text classification methods usually take single words as terms, since this performs well on several well-known datasets; some researchers have even pointed out that phrases do not improve classification accuracy, or improve it only marginally. However, we found that this is not always true when categorizing texts about similar topics in the same domain. With words alone, such texts cannot be categorized effectively, because they share nearly the same word set. We therefore hypothesize that the results can be improved by also using phrases as terms. To test this hypothesis, we propose our own phrase extraction method and select a suitable feature selection method and classifier through an experimental study on a data set of paper abstracts from the field of databases. Based on these results, we also develop a system called AutoPCS, which helps experts choose relevant topics for newly submitted papers from a predefined topic list using only their abstracts.
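
The observation that similar texts share nearly the same word set can be illustrated with a minimal sketch. It is not the phrase extraction method proposed in the paper, which the abstract does not describe; it assumes, purely for illustration, that a phrase is an adjacent word pair (bigram). The names `extract_terms` and `jaccard` and the two example texts are hypothetical.

```python
from collections import Counter


def extract_terms(text, use_phrases=True):
    """Count word terms and, optionally, adjacent-word bigram terms.

    Assumption: a "phrase" is simply two adjacent words. This stands in
    for the paper's own phrase extraction, which is not specified here.
    """
    words = text.lower().split()
    terms = Counter(words)
    if use_phrases:
        terms.update(" ".join(pair) for pair in zip(words, words[1:]))
    return terms


def jaccard(terms_a, terms_b):
    """Jaccard overlap between the term sets of two texts."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b)


# Two made-up texts that use exactly the same words in a different order.
doc_a = "query optimization for data stream"
doc_b = "data query for stream optimization"

words_only = jaccard(extract_terms(doc_a, False), extract_terms(doc_b, False))
with_phrases = jaccard(extract_terms(doc_a), extract_terms(doc_b))

print(f"overlap with words only:   {words_only:.3f}")
print(f"overlap with words+phrases: {with_phrases:.3f}")
```

With words only, the two texts are indistinguishable; once bigrams are added as terms, their term sets diverge, which is the kind of signal a classifier could exploit to separate similar topics.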
