Automated Classification and Categorization of Mathematical Knowledge

The Mathematics Subject Classification (MSC) is a common system for categorizing mathematical papers and knowledge. We present results of machine learning of the MSC on the full texts of papers in the mathematical digital libraries DML-CZ and NUMDAM. The F1-measure achieved on the classification task for top-level MSC categories exceeds 89%. We also describe and evaluate our methods for measuring the similarity of papers in the digital library based on their full texts.
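As a reminder of what the reported figure measures, the F1-measure is the harmonic mean of precision and recall. The following is a minimal sketch for a single category, assuming each paper carries exactly one top-level MSC code; the function name `f1_for_category` and the sample labels are hypothetical and are not taken from the paper's data or evaluation code.

```python
def f1_for_category(true_labels, predicted_labels, category):
    """F1 for one category, treated as a yes/no decision per paper."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == category and truth == category:
            tp += 1  # correctly assigned to the category
        elif pred == category:
            fp += 1  # assigned to the category by mistake
        elif truth == category:
            fn += 1  # belongs to the category but was missed
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical top-level MSC codes for five papers (illustrative only).
truth = ["03", "03", "11", "11", "35"]
pred = ["03", "11", "11", "11", "35"]
score_11 = f1_for_category(truth, pred, "11")
```

How per-category scores are averaged into one overall figure (for example micro- or macro-averaging) is a separate choice that this sketch does not make.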
