Boosting text segmentation via progressive classification

A novel approach for reconciling tuples stored as free text into an existing attribute schema is proposed. The basic idea is to subject the available text to progressive classification, i.e., a multi-stage classification scheme where, at each intermediate stage, a classifier is learnt that analyzes the textual fragments not reconciled at the end of the previous steps. Classification is accomplished by an ad hoc exploitation of traditional association mining algorithms, and is supported by a data transformation scheme which takes advantage of domain-specific dictionaries/ontologies. A key feature is the capability of progressively enriching the available ontology with the results of the previous stages of classification, thus significantly improving the overall classification accuracy. An extensive experimental evaluation shows the effectiveness of our approach.

[1]  I. V. Ramakrishnan,et al.  Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences , 2004, CoopIS/DOA/ODBASE.

[2]  Horacio Rodríguez,et al.  A Machine Learning Approach to POS Tagging , 2000, Machine Learning.

[3]  Zoubin Ghahramani,et al.  Proceedings of the 24th international conference on Machine learning , 2007, ICML 2007.

[4]  Lifang Gu,et al.  Record Linkage: Current Practice and Future Directions , 2003 .

[5]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[6]  Stephen Soderland,et al.  Learning Information Extraction Rules for Semi-Structured and Free Text , 1999, Machine Learning.

[7]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[8]  Eugene Agichtein,et al.  Mining reference tables for automatic text segmentation , 2004, KDD.

[9]  Raymond J. Mooney,et al.  Relational Learning of Pattern-Match Rules for Information Extraction , 1999, CoNLL.

[10]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[11]  Sunita Sarawagi,et al.  Automatic segmentation of text into structured records , 2001, SIGMOD '01.

[12]  Brad Adelberg,et al.  NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents , 1998, SIGMOD '98.

[13]  Elio Masciari,et al.  Web wrapper induction: a brief survey , 2004, AI Commun..

[14]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[15]  Julian M. Kupiec,et al.  Robust part-of-speech tagging using a hidden Markov model , 1992 .

[16]  Eric Brill,et al.  Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging , 1995, CL.

[17]  Panagiotis G. Ipeirotis,et al.  Duplicate Record Detection: A Survey , 2007 .

[18]  Ramakrishnan Srikant,et al.  Mining generalized association rules , 1995, Future Gener. Comput. Syst..

[19]  Markus Junker,et al.  Learning for Text Categorization and Information Extraction with ILP , 1999, Learning Language in Logic.

[20]  Dayne Freitag,et al.  Toward General-Purpose Learning for Information Extraction , 1998, ACL.

[21]  Wynne Hsu,et al.  Integrating Classification and Association Rule Mining , 1998, KDD.