Large-Scale Experiments for Mathematical Document Classification

The ever-increasing amount of digitally available information is both a curse and a blessing. On the one hand, users have increasingly large amounts of information at their fingertips. On the other hand, assessing and refining web search results becomes more and more tiresome and difficult for non-experts in a domain. Established digital libraries therefore offer specialized collections with a certain degree of quality. This quality can largely be attributed to the great effort invested in the semantic enrichment of the provided documents, e.g. by annotating them with respect to a domain-specific taxonomy. This process is still done manually in many domains, e.g. chemistry (CAS), medicine (MeSH), or mathematics (MSC). But as the amount of data grows, this manual task becomes increasingly time-consuming and expensive. The only solution to this problem seems to be to employ automated classification algorithms, but it is difficult to draw conclusions for a real-world scenario from the evaluations done in previous research. We therefore conducted a large-scale feasibility study on a real-world data set from one of the biggest mathematical digital libraries, Zentralblatt MATH, with a special focus on practical applicability.
