A Hybrid Machine Learning Approach for Information Extraction

Information Extraction (IE) aims to extract from textual documents only the data relevant to the user's needs. In this paper, we propose a hybrid machine learning approach to IE on semi-structured texts that combines conventional text classification techniques with Hidden Markov Models (HMMs). In this approach, a text classifier generates an initial output, which is then refined by an HMM to provide a globally optimal extraction. An implemented prototype was used to extract information from bibliographic references, and the HMM refinement yielded a consistent gain in performance.