Effectiveness of information extraction, multi-relational, and multi-view learning for predicting gene deletion experiments

We focus on the problem of predicting the outcome of gene deletion experiments. To build a model that describes the underlying biological system well, we aim to use all available data sources effectively, including unlabeled data, relational data, and abstracts of research papers. We study the effectiveness of transduction and co-training for exploiting unlabeled data, investigate a propositionalization approach that uses gene interaction data, and study the benefit of text classification and information extraction for utilizing scientific abstracts. The task is one of the two data mining problems of the KDD Cup 2002; the solution that we describe achieved the highest score in one of the two subtasks and received an "Honorable Mention" for the overall task. Our results shed light on the benefits and limitations of several machine learning techniques for this large-scale application.
