Automatic extraction of titles from general documents using machine learning

We propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether automatic title extraction from general documents is possible. As a case study, we consider extraction from Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint respectively), use them as training data to train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly use formatting information, such as font size, as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. In an experiment on intranet data, precision and recall for title extraction from Word are 0.810 and 0.837 respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895 respectively. Other important new findings in this work are that we can train models in one domain and apply them to another domain and, more surprisingly, that we can even train models in one language and apply them to another language. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
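The pipeline described above (annotate titles, train a model on formatting features, extract titles with the trained model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Unit` record, the four formatting features, the toy documents, and the choice of a plain perceptron are hypothetical and are not taken from the paper, which does not specify its features or models at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One text unit (e.g. a line or paragraph) with hypothetical formatting fields."""
    text: str
    font_size: float
    bold: bool
    centered: bool
    index: int            # position in the document, 0 = first unit
    is_title: bool = False  # annotation used only for training


def features(unit, doc):
    """Map a unit to formatting features, relative to its own document."""
    max_size = max(u.font_size for u in doc)
    return [
        1.0,                              # bias term
        unit.font_size / max_size,        # font size relative to the largest in the doc
        1.0 if unit.bold else 0.0,
        1.0 if unit.centered else 0.0,
        1.0 / (1 + unit.index),           # earlier units get a larger value
    ]


def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(docs, epochs=20):
    """Plain perceptron: update the weights whenever a unit is misclassified."""
    w = [0.0] * 5
    for _ in range(epochs):
        for doc in docs:
            for u in doc:
                x = features(u, doc)
                y = 1 if u.is_title else -1
                if y * score(w, x) <= 0:
                    w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def extract_title(doc, w):
    """Return the text of the highest-scoring unit in the document."""
    return max(doc, key=lambda u: score(w, features(u, doc))).text


# Toy annotated training documents (invented for this sketch).
train_docs = [
    [
        Unit("Quarterly Sales Report", 24, True, True, 0, is_title=True),
        Unit("Prepared by the finance team", 12, False, True, 1),
        Unit("Revenue grew in all regions.", 11, False, False, 2),
    ],
    [
        Unit("Confidential", 10, False, False, 0),
        Unit("Project Roadmap", 28, True, True, 1, is_title=True),
        Unit("This document outlines milestones.", 11, False, False, 2),
    ],
]

weights = train_perceptron(train_docs)

# An unseen document: the title is not the first unit and is not centered.
test_doc = [
    Unit("Internal memo", 10, False, False, 0),
    Unit("Travel Policy Update", 20, True, False, 1),
    Unit("All employees must book through the portal.", 11, False, False, 2),
]

print(extract_title(test_doc, weights))
```

Because the features depend only on formatting and position, not on the words themselves, a sketch like this suggests why models of this kind might transfer across domains and languages, as the abstract reports.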
