Data and Text Mining Techniques for In-Domain and Cross-Domain Applications

In the big data era, a vast amount of data is generated in different domains, from social media to news feeds, from health care to genomic functionalities. Addressing a problem often requires harnessing multiple disparate datasets. Data from different domains may follow different modalities, each with its own representation, distribution, scale and density. For example, text is usually represented as discrete, sparse word-count vectors, whereas an image is represented by pixel intensities. Many Data Mining and Machine Learning techniques have been proposed in the literature and have achieved significant success in knowledge engineering tasks such as classification, regression and clustering. Nevertheless, some challenging questions remain when tackling a new problem: How should the problem be represented? Which approach is best among the many available? What information should be used in the Machine Learning task, and how should it be represented? Are there other domains from which knowledge can be borrowed? This dissertation proposes representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way to represent a classical classification problem: instead of using one instance for each object (a document, a gene, a social post, etc.) to be classified, it proposes using a pair of objects, or an object-class pair, with the relationship between them as the label. This approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows new categories to be added efficiently during classification. The same idea is also used to extract conversational threads from an unregulated pool of messages and to classify the biomedical literature according to the genomic features it discusses.
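The object-class pair representation can be illustrated with a minimal Python sketch. It is not the dissertation's actual method: the function names, the bag-of-words category profiles and the single cosine-similarity feature are all assumptions chosen for illustration. Each document is paired with every category, and the pair is labelled by whether the document belongs to that category.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Represent a text as a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def category_profiles(docs, labels):
    """Hypothetical category representation: summed word counts of its documents."""
    profiles = {}
    for text, label in zip(docs, labels):
        profiles.setdefault(label, Counter()).update(bag_of_words(text))
    return profiles


def build_pairs(docs, labels, profiles):
    """Turn each (document, category) pair into one instance.

    The features describe the relationship between the two objects
    (here only their cosine similarity, an illustrative choice), and
    the label says whether the document belongs to the category.
    """
    pairs = []
    for text, label in zip(docs, labels):
        vec = bag_of_words(text)
        for cat, profile in profiles.items():
            features = [cosine(vec, profile)]
            pairs.append((features, cat == label))
    return pairs


docs = ["gene protein cell", "goal match team"]
labels = ["bio", "sport"]
profiles = category_profiles(docs, labels)
pairs = build_pairs(docs, labels, profiles)
```

Because the resulting instances describe relationships rather than specific categories, a binary model trained on them does not depend on a fixed category set. Under these assumptions, adding a category amounts to building one more profile, which is one way to read the claim about efficiently adding new categories.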
