Untangling Text Data Mining

The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. Perhaps for this reason, there has been little work in text data mining to date, and most people who have talked about it have either conflated it with information access or have not made use of text directly to discover heretofore unknown information. In this paper I will first define data mining, information access, and corpus-based computational linguistics, and then discuss the relationship of these to text data mining. The intent behind these contrasts is to draw attention to exciting new kinds of problems for computational linguists. I describe examples of what I consider to be real text data mining efforts and briefly outline recent ideas about how to pursue exploratory data analysis over text.
