Using Ontologies to Improve Document Classification with Transductive Support Vector Machines

Many applications of automatic document classification require accurate learning from little training data. Semi-supervised classification uses both labeled and unlabeled data for training. This technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial. Meanwhile, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose using ontologies to improve the accuracy and efficiency of semi-supervised document classification. We use support vector machines, which are among the most effective algorithms studied for text classification. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results from applying our algorithm to three different datasets. Our experiments show an accuracy increase of 4% on average, and up to 20%, compared with the traditional semi-supervised model.
