Cluster-Based Pattern Recognition in Natural Language Text

Acknowledgements

I would like to thank my adviser, Prof. Tishby, for his guidance and assistance in producing this work, and for his suggestions and positive input. I would also like to thank Beata Beigman Klebanov for her constant help and advice throughout this work, including (but definitely not limited to) the contribution of the parsed data used here. Also deserving of thanks are my family, for their support, and especially my grandmother, Rose Brody, for her confidence in my achievements.

Abstract

This work presents the Clustered Clause structure, which uses information-based clustering and dependencies between sentence components to provide a simplified and generalized model of a grammatical clause. We show that this representation, which is based on dependencies within the sentence, enables us to detect complex textual relations at a higher level of context. The relations we detect are of interest in themselves, as linguistic phenomena, and are also highly suited for use in certain linguistic and cognitive tasks. We define and search for several types of patterns, moving from basic patterns to more complex ones, and from patterns within the sentence to those involving entire sentences. We present examples of recognized patterns of each type, along with descriptions of several interesting phenomena detected by our method. We assess the quality of the results and demonstrate the importance of the clustering and dependency model we chose. The principles behind our method are largely domain-independent, and can therefore be applied to other forms of structured sequential data as well.
