Automatic categorization of bioscience literature containing QTL information

In this paper we apply text categorization methods to the problem of classifying literature that contains Quantitative Trait Locus (QTL) information. Our work focuses on building an automatic categorization system, based on Support Vector Machines (SVM), that targets QTL information for various species. We propose a text representation strategy that combines words and phrases and effectively improves classification accuracy. By studying literature containing QTL information and other species-related publications, we determined representative phrases and detected abbreviations to form one set of features. This set was used together with a second set, the words selected by Chi value, to represent text samples. We used a portion of the QTL-related literature for particular species to construct the system, and then tested it on data covering multiple plants and species. The experimental results indicate that our work may help further research on constructing QTL information databases.

Keywords: Automatic categorization; bioscience literature; QTL information.
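The two-part text representation described in the abstract can be illustrated with a minimal sketch. It assumes that "Chi value" refers to the standard 2x2 chi-square statistic between a word and the QTL class, which the abstract does not spell out. The function names, the toy documents, and the phrase list are hypothetical and are not taken from the paper. The SVM training step is omitted; the resulting vectors could be passed to any SVM implementation.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each word by the 2x2 chi-square statistic against the positive class.

    Assumption: this is the usual document-frequency formulation; the paper's
    exact formula is not given in the abstract.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        words = set(text.lower().split())  # count each word once per document
        (pos_df if y else neg_df).update(words)
    scores = {}
    for w in set(pos_df) | set(neg_df):
        a, b = pos_df[w], neg_df[w]    # documents containing w (positive, negative)
        c, d = n_pos - a, n_neg - b    # documents lacking w (positive, negative)
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[w] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def featurize(text, words, phrases):
    """Binary vector: selected words first, then representative phrases."""
    lowered = text.lower()
    tokens = set(lowered.split())
    word_part = [1 if w in tokens else 0 for w in words]
    phrase_part = [1 if p in lowered else 0 for p in phrases]
    return word_part + phrase_part


# Hypothetical toy corpus: label 1 marks QTL-related text, 0 marks other text.
docs = [
    "qtl mapping of grain yield in rice",
    "qtl analysis for plant height in maize",
    "protein structure of yeast kinase",
    "gene expression in mouse liver",
]
labels = [1, 1, 0, 0]

scores = chi_square_scores(docs, labels)
top_words = sorted(scores, key=scores.get, reverse=True)[:1]
phrases = ["quantitative trait locus"]  # hypothetical hand-picked phrase
vector = featurize("QTL mapping for quantitative trait locus in wheat", top_words, phrases)
```

Keeping the word features and the phrase features in separate parts of the vector makes it easy to compare a words-only representation against the combined one, which is the kind of comparison the abstract implies.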
