Information extraction from unstructured web text

In the past few years, the World Wide Web has emerged as an important source of data, much of it in the form of unstructured text. This thesis describes an extensible model for information extraction that takes advantage of the unique characteristics of Web text and leverages existing search engine technology to ensure the quality of the extracted information. The key features of our approach are the use of lexico-syntactic patterns, Web-scale statistics, and unsupervised or semi-supervised learning methods. Our information extraction model has been instantiated and extended to solve a diverse set of information extraction tasks: subclass and related class extraction, relation property learning, the acquisition of salient product features and corresponding user opinions from customer reviews, and, finally, the mining of commonsense information from the Web for the benefit of integrated AI systems.
