FLUX-CIM: flexible unsupervised extraction of citation metadata

In this paper we propose a knowledge-base approach to help extract the correct components of citations in any given format. Unlike related approaches that rely on manually built knowledge bases (KBs) to recognize the components of a citation, our approach constructs the KB automatically from an existing set of sample metadata records from a given area (e.g., computer science or health sciences). It does not rely on patterns that encode the specific delimiters of a particular citation style. It is also unsupervised, in the sense that it does not rely on a learning method that requires a training phase. These features give our technique a high degree of automation and flexibility. To demonstrate the effectiveness and applicability of the proposed approach, we ran experiments in which we applied it to extract information from citations in papers from two different domains. The results indicate precision and recall levels above 94% and perfect extraction for the large majority of the citations tested.
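
To make the idea of an automatically constructed KB more concrete, the following is a minimal sketch and not the algorithm of the paper, whose details the abstract does not give. It assumes that the KB simply counts how often each term occurs in each metadata field of the sample records, and that a citation term is assigned the field in which it was seen most often. The names `build_kb`, `label_terms`, and `SAMPLE_RECORDS`, as well as the toy records themselves, are hypothetical and serve only as illustration.

```python
import re
from collections import Counter, defaultdict

# Toy metadata records (invented for illustration, not real data).
SAMPLE_RECORDS = [
    {"author": "Ana Lima",
     "title": "citation parsing methods",
     "venue": "digital libraries journal"},
    {"author": "Joao Reis",
     "title": "metadata extraction survey",
     "venue": "information systems letters"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_kb(records):
    """Count, for every term, how often it occurs in each field.

    Assumption: a plain term-to-field frequency table stands in for the KB.
    """
    kb = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            for term in tokenize(value):
                kb[term][field] += 1
    return kb


def label_terms(citation, kb):
    """Label each citation term with its most frequent field in the KB.

    Terms that never occur in the sample records are labeled None.
    No delimiter or citation-style pattern is consulted.
    """
    labels = []
    for term in tokenize(citation):
        counts = kb.get(term)
        field = counts.most_common(1)[0][0] if counts else None
        labels.append((term, field))
    return labels


kb = build_kb(SAMPLE_RECORDS)
labels = label_terms("Lima, A. Metadata parsing. Digital Libraries Journal.", kb)
```

In this sketch the punctuation of the citation plays no role in the labeling, which mirrors the claim that no style-specific delimiters are needed. Terms missing from the KB (here the initial `a`) stay unlabeled; a fuller method would have to resolve them from context.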
