Paper information - Docear's PDF inspector: title extraction from PDF files


In this demo paper, we present Docear's PDF Inspector (DPI). DPI extracts titles from academic PDF files by applying a simple heuristic: the largest text on the first page of a PDF is assumed to be the title. This heuristic achieves accuracies of around 70% and outperforms the tools ParsCit and SciPlore Xtract in both run time and accuracy. In addition, DPI is released under the free open-source license GPL 2+ at http://www.docear.org, is written in Java, and runs on any major operating system.
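The heuristic described in the abstract can be illustrated with a minimal sketch. This is not DPI's actual code: it assumes the text of the first page has already been extracted into runs with known font sizes (a real tool would obtain these from a PDF parser), and the names `TitleHeuristic`, `TextSpan`, and `extractTitle`, as well as the font-size tolerance, are hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TitleHeuristic {
    /** A run of first-page text with its font size (hypothetical input type). */
    static final class TextSpan {
        final String text;
        final float fontSize;

        TextSpan(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the text set in the largest font on the first page,
     * joined in reading order, or an empty string if there is no text.
     */
    static String extractTitle(List<TextSpan> firstPageSpans) {
        // Pass 1: find the largest font size among non-blank spans.
        float maxSize = 0f;
        for (TextSpan span : firstPageSpans) {
            if (!span.text.trim().isEmpty() && span.fontSize > maxSize) {
                maxSize = span.fontSize;
            }
        }

        // Pass 2: collect every span set in that size, so a title that
        // wraps over several lines is kept whole.
        List<String> parts = new ArrayList<>();
        for (TextSpan span : firstPageSpans) {
            // The tolerance is an assumed value that absorbs rounding noise.
            boolean isLargest = Math.abs(span.fontSize - maxSize) < 0.01f;
            if (!span.text.trim().isEmpty() && isLargest) {
                parts.add(span.text.trim());
            }
        }
        return String.join(" ", parts);
    }
}
```

Collecting all spans at the maximum size, rather than only the first one, is a design choice of this sketch to handle titles that span multiple lines. Such a heuristic would likely fail on pages where other text, such as a journal name or banner, is set larger than the title.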

Jöran Beel | Stefan Langer | Marcel Genzmehr | Christoph Müller
