Machine learning for information extraction

This dissertation presents several novel machine learning techniques and applies them to information extraction. It addresses four information extraction subtasks: part-of-speech tagging, entity extraction, coreference resolution, and relation extraction. Each task is formalized as a learning problem, and appropriate learning algorithms are developed and applied to it. Part-of-speech tagging is studied as a multi-class classification problem, and the SNoW (Sparse Network of Winnows) learning system is applied to learn a part-of-speech classifier. A comprehensive experimental evaluation of the system confirms that it is appropriate for NLP applications. Entity extraction is addressed in conjunction with coreference resolution: a classification approach is presented for entity extraction, and coreference resolution is treated from the decoding perspective. The dissertation describes novel decoding algorithms that, given local coreference decisions, produce a globally coherent interpretation of document entities. Relation extraction is studied as a classification problem, and kernel methods are applied to learn the relation classifiers. Novel kernels are defined in terms of shallow parses, and efficient algorithms are given for computing them. The study evaluates the kernel approach experimentally, with positive results. Finally, the dissertation combines the constituent solutions into a single coherent information extraction system and concludes that machine learning is a viable methodology for designing natural language processing applications.
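For readers unfamiliar with the Winnow update rule that gives SNoW its name, the following is a minimal sketch of a single binary Winnow unit with multiplicative promotion and demotion over sparse Boolean features. It is not the SNoW system or the dissertation's tagger: the function names, the promotion factor `alpha`, the threshold choice, and the toy data are all assumptions made for illustration only.

```python
from typing import List, Sequence, Set, Tuple

Example = Tuple[Set[int], bool]  # (indices of active features, label)


def predict(weights: Sequence[float], theta: float, active: Set[int]) -> bool:
    """Predict positive when the summed weight of active features reaches theta."""
    return sum(weights[i] for i in active) >= theta


def train_winnow(
    examples: Sequence[Example],
    n_features: int,
    alpha: float = 2.0,   # assumed promotion factor; demotion uses 1 / alpha
    epochs: int = 50,
) -> Tuple[List[float], float]:
    """Mistake-driven training of one Winnow unit (illustrative sketch)."""
    weights = [1.0] * n_features
    theta = float(n_features)  # assumed threshold: number of features
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            if predict(weights, theta, active) == label:
                continue  # weights change only on mistakes
            mistakes += 1
            # Promote active features on a missed positive,
            # demote them on a false positive.
            factor = alpha if label else 1.0 / alpha
            for i in active:
                weights[i] *= factor
        if mistakes == 0:
            break  # a full pass without errors: stop early
    return weights, theta


# Hypothetical toy data: the target concept is "feature 0 or feature 1 is active".
toy_examples: List[Example] = [
    ({0, 3}, True),
    ({1, 4}, True),
    ({2, 5}, False),
    ({3, 4, 6}, False),
    ({0, 1}, True),
    ({5, 6, 7}, False),
]
toy_weights, toy_theta = train_winnow(toy_examples, n_features=8)
```

Because only the weights of active features are touched on each mistake, the update cost depends on the number of active features rather than on the size of the feature space, which is one reason Winnow-style learners are often considered for sparse, high-dimensional NLP feature sets.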