Semantic Frame-Based Document Representation for Comparable Corpora

Document representation is a fundamental problem for text mining. Many efforts have been done to generate concise yet semantic representation, such as bag-of-words, phrase, sentence and topic-level descriptions. Nevertheless, most existing techniques counter difficulties in handling monolingual comparable corpus, which is a collection of monolingual documents conveying the same topic. In this paper, we propose the use of frame, a high-level semantic unit, and construct frame-based representations to semantically describe documents by bags of frames, using an information network approach. One major challenge in this representation is that semantically similar frames may be of different forms. For example, "radiation leaked" in one news article can appear as "the level of radiation increased" in another article. To tackle the problem, a text-based information network is constructed among frames and words, and a link-based similarity measure called SynRank is proposed to calculate similarity between frames. As a result, different variations of the semantically similar frames are merged into a single descriptive frame using clustering, and a document can then be represented as a bag of representative frames. It turns out that frame-based document representation not only is more interpretable, but also can facilitate other text analysis tasks such as event tracking effectively. We conduct both qualitative and quantitative experiments on three comparable news corpora, to study the effectiveness of frame-based document representation and the similarity measure SynRank, respectively, and demonstrate that the superior performance of frame-based document representation on different real-world applications.

[1]  Daniel Gildea,et al.  The Proposition Bank: An Annotated Corpus of Semantic Roles , 2005, CL.

[2]  Claire Cardie,et al.  An Analysis of Statistical and Syntactic Phrases , 1997, RIAO.

[3]  Daniel Jurafsky,et al.  Automatic Labeling of Semantic Roles , 2002, CL.

[4]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[5]  Michael Healy,et al.  Theory and Applications of Ontology: Computer Applications , 2010 .

[6]  Mirella Lapata,et al.  Using Semantic Roles to Improve Question Answering , 2007, EMNLP.

[7]  Stan Matwin,et al.  Feature Engineering for Text Classification , 1999, ICML.

[8]  Carlo Strapparava,et al.  Corpus-based and Knowledge-based Measures of Text Semantic Similarity , 2006, AAAI.

[9]  W. B. Cavnar,et al.  N-gram-based text categorization , 1994 .

[10]  Christiane Fellbaum,et al.  Combining Local Context and Wordnet Similarity for Word Sense Identification , 1998 .

[11]  R. Bekkerman Distributional Word Clusters vs , 2006 .

[12]  C. Fillmore FRAME SEMANTICS AND THE NATURE OF LANGUAGE * , 1976 .

[13]  Thomas F. La Porta,et al.  IntruMine: Mining Intruders in Untrustworthy Data of Cyber-physical Systems , 2012, SDM.

[14]  Dekang Lin,et al.  Phrase Clustering for Discriminative Learning , 2009, ACL.

[15]  Regina Barzilay,et al.  Sentence Alignment for Monolingual Comparable Corpora , 2003, EMNLP.

[16]  Brendan J. Frey,et al.  Non-metric affinity propagation for unsupervised image categorization , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[17]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[18]  Jimmy J. Lin,et al.  Extracting Structural Paraphrases from Aligned Monolingual Corpora , 2003, IWP@ACL.

[19]  Yihong Gong,et al.  Multi-Document Summarization using Sentence-based Topic Models , 2009, ACL.

[20]  Wei-Ying Ma,et al.  Locality preserving indexing for document representation , 2004, SIGIR '04.

[21]  Jennifer Widom,et al.  SimRank: a measure of structural-context similarity , 2002, KDD.

[22]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[23]  Roberto Basili,et al.  Complex Linguistic Features for Text Classification: A Comprehensive Study , 2004, ECIR.

[24]  Yizhou Sun,et al.  P-Rank: a comprehensive structural similarity measure over information networks , 2009, CIKM.

[25]  Vlado Keselj,et al.  Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering , 2005, CIKM '05.

[26]  Ran El-Yaniv,et al.  Distributional Word Clusters vs. Words for Text Categorization , 2003, J. Mach. Learn. Res..

[27]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[28]  Bo Zhao,et al.  A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration , 2012, Proc. VLDB Endow..

[29]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[30]  Shi Bing,et al.  Inductive learning algorithms and representations for text categorization , 2006 .

[31]  James Allan,et al.  Text classification and named entities for new event detection , 2004, SIGIR '04.

[32]  Oren Etzioni,et al.  Semantic Role Labeling for Open Information Extraction , 2010, HLT-NAACL 2010.

[33]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[34]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .