CTMS: A Comparative Text Mining System

Many applications need to compare multiple text collections to find commonalities and differences in their topical themes, a task we refer to as comparative text mining. In this paper, we present a general comparative text mining system (CTMS). CTMS takes any two collections of text and generates a list of cross-collection themes, each paired with its associated collection-specific themes. Every theme is linked to representative passages in each collection. Themes are represented as word distributions, and the underlying comparative mining algorithm is based on a probabilistic mixture model. The system carries out all stages of text mining, from data cleaning and preprocessing to the actual mining and post-processing, so that users can perform comparative analysis between any two collections and navigate the extracted theme space. The system can potentially be applied to a broad range of areas, including opinion summarization, business intelligence, and text summarization.
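
To make the idea of "themes as word distributions" combined in a probabilistic mixture model more concrete, the following is a minimal sketch of how a word's probability might be computed in one collection. It assumes one common formulation of a cross-collection mixture: a background distribution mixed with themes, where each theme blends a shared (cross-collection) distribution with a collection-specific one. The abstract does not spell out the exact model, so the function name `word_prob`, the weights `lambda_b` and `lambda_c`, and the toy distributions are all illustrative assumptions rather than the system's actual implementation.

```python
from typing import Dict, List

Dist = Dict[str, float]  # a theme: word -> probability


def word_prob(
    word: str,
    background: Dist,
    common_themes: List[Dist],
    specific_themes: List[Dist],
    theme_weights: List[float],
    lambda_b: float = 0.5,
    lambda_c: float = 0.5,
) -> float:
    """Probability of `word` in one collection under an assumed mixture.

    lambda_b: weight of the background distribution (hypothetical name).
    lambda_c: weight of the shared theme versus the collection-specific
              theme (hypothetical name).
    theme_weights: mixing weights over themes, assumed to sum to 1.
    """
    themed = 0.0
    for j, pi in enumerate(theme_weights):
        common = common_themes[j].get(word, 0.0)
        specific = specific_themes[j].get(word, 0.0)
        # Each theme blends its shared and collection-specific distributions.
        themed += pi * (lambda_c * common + (1.0 - lambda_c) * specific)
    return lambda_b * background.get(word, 0.0) + (1.0 - lambda_b) * themed


# Toy data (invented for illustration only).
background = {"the": 0.8, "battery": 0.1, "screen": 0.1}
common_themes = [{"battery": 0.5, "screen": 0.5}]
specific_themes_a = [{"battery": 0.9, "screen": 0.1}]
theme_weights = [1.0]

p_battery = word_prob("battery", background, common_themes,
                      specific_themes_a, theme_weights)
print(round(p_battery, 2))
```

In a full system, the theme distributions and mixing weights would be estimated from the two collections (for example with an EM-style procedure) rather than fixed by hand as they are here.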
