Paper information - A Hybrid Machine Learning Approach for Information Extraction

Information Extraction (IE) aims to extract from textual documents only the relevant data required by the user. In this paper, we propose a hybrid machine learning approach for IE on semi-structured texts that combines conventional text classification techniques with Hidden Markov Models (HMMs). In this approach, a text classifier generates an initial output, which is then refined by an HMM to provide a globally optimal extraction. An implemented prototype was used to extract information from bibliographic references, reaching a consistent gain in performance through the use of the HMM.
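The two-stage idea in the abstract can be illustrated with a small sketch. It assumes one plausible reading of the approach: the classifier labels each token independently, those labels are treated as the HMM's observations, and Viterbi decoding picks the globally most probable sequence of true field labels. The field names, the `viterbi` helper, and all probabilities below are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations.

    Works in log space; all probabilities are assumed to be strictly positive.
    """
    first = observations[0]
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][first]) for s in states}]
    backpointers = []
    for obs in observations[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this step.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            column[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs])
            pointers[s] = best
        scores.append(column)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical fields of a bibliographic reference.
STATES = ["author", "title", "year"]

# Hypothetical HMM parameters (illustrative only, not from the paper).
START_P = {"author": 0.8, "title": 0.1, "year": 0.1}
TRANS_P = {
    "author": {"author": 0.70, "title": 0.25, "year": 0.05},
    "title":  {"author": 0.05, "title": 0.80, "year": 0.15},
    "year":   {"author": 0.05, "title": 0.05, "year": 0.90},
}
# emit_p[true_field][classifier_label]: the classifier is assumed right 80% of the time.
EMIT_P = {
    "author": {"author": 0.8, "title": 0.1, "year": 0.1},
    "title":  {"author": 0.1, "title": 0.8, "year": 0.1},
    "year":   {"author": 0.1, "title": 0.1, "year": 0.8},
}

# Stage 1 (assumed): per-token labels from a text classifier, with one
# isolated mistake ("author" in the middle of the title).
classifier_output = ["author", "author", "title", "author", "title", "year"]

# Stage 2: the HMM refines the labels by considering the whole sequence.
refined = viterbi(classifier_output, STATES, START_P, TRANS_P, EMIT_P)
print(refined)
# ['author', 'author', 'title', 'title', 'title', 'year']
```

With these made-up parameters, the isolated "author" label inside the title is relabeled as "title", because a title-to-author-to-title jump is much less likely than a single classifier error. In practice the transition and emission probabilities would be estimated from labeled references rather than set by hand.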

Ricardo B. C. Prudêncio | Flávia de Almeida Barros | Eduardo F. A. Silva