A hybrid approach to discover semantic hierarchical sections in scholarly documents

Scholarly documents are usually composed of sections, each of which serves a different purpose by conveying specific context. Automatically identifying these sections would let us understand the semantics of each part of a document, such as what the introduction covers, which methodologies were used, what types of experiments were run, and what trends were observed. We propose a set of hybrid algorithms to 1) automatically identify section boundaries, 2) recognize standard sections, and 3) build a hierarchy of sections. Our algorithms achieve an F-measure of 92.38% on section boundary detection, an average accuracy of 96% on standard section recognition, and an accuracy of 95.51% on the section positioning task.
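As a reading aid for the boundary detection figure, the sketch below shows how an F-measure can be computed from predicted and ground-truth section boundaries. It assumes boundaries are represented as line indices and that a prediction counts as correct only on an exact match. The function name `boundary_f_measure` and the sample data are hypothetical, and this is not necessarily the evaluation protocol used to obtain the numbers above.

```python
def boundary_f_measure(predicted, gold):
    """Return (precision, recall, F-measure) for predicted section boundaries.

    Assumption: each boundary is a line index, and a predicted boundary
    is a true positive only if it exactly matches a gold boundary.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0

    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: four gold boundaries, one predicted off by a line.
gold_boundaries = [3, 10, 25, 40]
predicted_boundaries = [3, 10, 26, 40]
precision, recall, f_measure = boundary_f_measure(predicted_boundaries, gold_boundaries)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

Exact matching is the strictest choice; an evaluation that tolerates small offsets would count the near miss at line 26 as correct and report a higher score.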
