Error-Driven Boolean-Logic-Rule-Based Learning for Mining Chatroom Conversations

The ephemeral nature of human communication via networks today poses interesting and challenging problems for information technologists. The sheer volume of communication in venues such as email, newsgroups, and chat precludes manual techniques of information management. Currently, no systematic mechanisms exist for accumulating these artifacts of communication in a form that lends itself to the construction of models of semantics [5]. In essence, dynamic techniques of analysis are needed if textual data of this nature is to be effectively mined. At Lehigh University we are developing a text mining tool for analysis of chat-room conversations. Project goals concentrate on the development of functionality to answer questions such as “What topics are being discussed in a chat-room?”, “Who is discussing which topics?” and “Who is interacting with whom?” The objective is to develop technology that can automatically identify such patterns of interaction in both social and semantic terms. In this article we present our preliminary findings for a novel technique developed to identify threads of conversation in multi-topic, multi-person chat-rooms. This is the first step towards building models of social and semantic interaction. We term our technique Error-Driven Boolean-Logic-Rule-Based Learning (BLogRBL), a variation on Brill’s Transformation Based Learning [11] [12] [13]. Similar to Brill’s method, rules are automatically derived from templates during learning. It differs from Brill’s technique in that rules take the form of complex expressions of combinational logic. We report on the scope and design of our technique and discuss preliminary results.

1.0 Background And Motivation

The goal of the project is to develop a computational approach towards understanding social and semantic interactions in textual media.
Chat-room conversation has been identified as the communication medium of interest due to its increasing popularity and the need for research in the area. The 430 million daily instant messages on AOL’s network alone provide a treasure trove of knowledge [3]. Chat conversation is radically different from other media due to its often informal nature. Existing text mining techniques rely on more structured, formal corpora containing research papers, abstracts, technical reports, etc. Approaches toward understanding the dynamics of chat conversation are limited, and as usage grows the need for automated analysis increases. Because chat conversations change continuously, dynamic modeling of social interactions and their contextual topics is a genuine research challenge.

This research is being conducted at the behest of the Intelink intelligence network. Intelink is a secure military communications channel used for critical exchanges of information. Intelink’s goal is to monitor chat conversation over the network and map relationships between users and their topics of conversation to determine the appropriateness of usage and the effectiveness of the communication network. They are interested in information such as the frequency of employee communication, the topics discussed, conversational participants and the focus of the conversations. This research, however, has applications beyond the scope of Intelink’s needs. In fact, the techniques under development apply to any organization with an internal communications network, and perhaps to Internet users in general – patrons of chat services such as AOL Instant Messenger (AIM) and IRC (Internet Relay Chat) could benefit from utilizing such a tool.

1.1 Modeling Social & Semantic Relations

The application under development at Lehigh University to model social and semantic interactions is the Social Semantic Builder (SSB).
The SSB is a relational modeling tool that utilizes the HDDI [6][2][1] text mining infrastructure, and models relationships between distinct conceptual and/or behavioral abstractions. The purpose of the SSB is to determine, analyze, and model the relationships and interactions between these abstract relational entities.

As an example, consider the domain of research papers. Abstractions within this application field include authors of the papers and the concepts they explore. Corpora are constructed and used to cluster instances of the abstract entities according to their co-relational properties. In this example, two separate models would be created, one of authors who write together and another of concepts that are similar within the document space. Continuing the example, the SSB would combine these two models to create a meta-level model between research paper authors and their conceptual content. The meta-level model could then be used to analyze and discover previously unknown relationships between authors and their content. This would allow questions such as “What topics are being discussed?”, “Who is discussing which topics?” and “Who is authoring with whom?” to be answered. Although we have presented the SSB in the context of research article authors and content, social and semantic modeling can take place in any domain involving multiple authors and content.

2.0 Application Domain: Chat

The Social Semantic Builder is a utility with a variety of applications. We have discussed authors and research papers, but there are also students and courses, journalists and newspaper articles, etc. The SSB is designed in such a fashion that its general relational structure could be deployed for mapping between various types of entities. Each application has its own particular issues that need to be addressed, and in this article we address those issues relevant to the analysis of chat conversations.
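The research-paper example from Section 1.1 can be illustrated with a small sketch. It is not the SSB implementation: the SSB clusters entities using the HDDI infrastructure, whereas this sketch substitutes plain co-occurrence counts as a stand-in. The toy corpus, the author and concept names, and the functions `cooccurrence` and `author_concept_links` are all hypothetical and introduced only for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy corpus: each document lists its authors and the concepts it explores.
papers = [
    {"authors": ["alice", "bob"], "concepts": ["indexing", "clustering"]},
    {"authors": ["bob", "carol"], "concepts": ["clustering"]},
    {"authors": ["alice"], "concepts": ["indexing", "retrieval"]},
]


def cooccurrence(docs, key):
    """Count how often two values of `key` appear in the same document."""
    counts = Counter()
    for doc in docs:
        # Sorting makes each unordered pair map to a single dictionary key.
        for a, b in combinations(sorted(set(doc[key])), 2):
            counts[(a, b)] += 1
    return counts


def author_concept_links(docs):
    """Meta-level view: how often each author is associated with each concept."""
    links = defaultdict(Counter)
    for doc in docs:
        for author in doc["authors"]:
            links[author].update(set(doc["concepts"]))
    return links


author_model = cooccurrence(papers, "authors")    # who writes with whom
concept_model = cooccurrence(papers, "concepts")  # which concepts appear together
meta_model = author_concept_links(papers)         # who writes about what
```

Under these assumptions, `author_model` addresses “Who is authoring with whom?”, while `meta_model` addresses “Who is discussing which topics?”. Replacing documents with conversation threads gives the corresponding chat-domain view.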
Some of the questions particular to chat are “Who are the participants in a particular conversation?”, “What are they talking about?”, “How focused is their conversation?”, “How are the participants socially interacting?” and “What forms of language do they use to express themselves?” A user (such as Intelink) interested in such chat relational models would be able to use the SSB for extracting such information. Chat conversational documents would be input into the SSB, and it would create models of chat participants and conversational topics. A user then could use the models to associate topics with participants, observing which participants discussed a particular topic. Information such as which participants were involved in discussions together, and the topic of those discussions, would be readily available. The basic questions of “What topics are being discussed in a chat-room?”, “Who is discussing which topics?” and “Who is interacting with whom?” can thus be answered.

2.1 Chat Input Issues

The SSB accepts input in XML form, utilizing the HDDI infrastructure for processing the documents. At present, the HDDI System processes the input and then constructs a collection of statistics for analysis. The models for the social and semantic domains are then created and linked together by the Social Semantic Model Builder. Some application domains, such as research papers and newsgroup postings, map easily to this input format. The authors are identified at the beginning, and the body of the document or posting can be tagged as content. Chat conversation is not so structured. Chat is often a continuous medium with users entering and leaving a given chat room. Furthermore, even though a chat room may have many users logged in, not all of them may be participating. Those users who are involved do not all participate in the same discussion with all the other users.
Often there are several conversations simultaneously taking place between users – a single participant may also be involved in multiple conversations at once. It is an extremely chaotic environment and at first glance seems to lack consistent structure. In this situation, various chat conversations are interlaced throughout multiple postings and extracting “authors” and their content into single cohesive units for input to the SSB is a daunting task.

Furthermore, in our research we have observed that there are numerous categories of chat. Factors such as the number of participants, the topic(s) of chat, the familiarity of users with each other, etc. lead to radically different conversation styles. If a conversation is between acquaintances discussing a common topic, for example, the conversational flow tends to be informal with little attention paid to grammar. If the session is, however, a help session or a discussion medium for a focused topic in which the users don’t know one another, the conversation is typically focused and formal and a broader usage of vocabulary is observed.

Our objective is to partition chat data into collections of postings composed of two or more authors discussing a single topic, creating input (that we refer to as items) for analysis by the SSB. Thus, each item consists of postings relevant to a single topic, and the users who participated in that topic. Within this framework, the names/screen names of the posters identify authors and the postings identify content. A co-authorship relationship is defined between users based on the content of their postings, and separate content and author clusters are created by the SSB. Each SSB input item is thus a thread of conversation or discussion revolving around a single topic or a group of very similar topics.

1 Margaret A. Root defines postings as single messages entered into a network communication system (e.g., chat room or Usenet Newsgroup) [18].
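The notion of an item described in Section 2.1 can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the SSB input format: the `Posting` and `Item` classes, the `thread_id` field, and the sample log are hypothetical. In particular, the sketch assumes each posting has already been assigned to a thread, which is precisely the step the learning technique is meant to perform.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class Posting:
    author: str     # screen name of the poster
    text: str       # content of the single message
    thread_id: int  # hypothetical label, assumed to come from a prior thread-identification step


@dataclass
class Item:
    thread_id: int
    postings: list = field(default_factory=list)

    @property
    def authors(self):
        """Distinct screen names that posted in this item, in sorted order."""
        return sorted({p.author for p in self.postings})


def build_items(postings):
    """Group interleaved postings into items, keeping threads with two or more authors."""
    by_thread = defaultdict(list)
    for p in postings:
        by_thread[p.thread_id].append(p)
    items = [Item(tid, ps) for tid, ps in sorted(by_thread.items())]
    return [it for it in items if len(it.authors) >= 2]


def coauthor_pairs(items):
    """Count how often each pair of authors takes part in the same item."""
    counts = defaultdict(int)
    for it in items:
        for a, b in combinations(it.authors, 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical interleaved chat log with three threads.
log = [
    Posting("ann", "anyone tried the new kernel?", 1),
    Posting("bob", "lunch at noon?", 2),
    Posting("cat", "yes, it boots fine", 1),
    Posting("ann", "sure, noon works", 2),
    Posting("dan", "hello?", 3),
]
items = build_items(log)
pairs = coauthor_pairs(items)
```

Here the single-author thread is dropped because an item requires two or more authors, and `ann` appears in two items at once, mirroring a participant engaged in multiple simultaneous conversations.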

3.0 Conversational Flow As Threads

As noted, in online chat environments there are often multiple discussions taking place, and within a particular room or channel authors will participate in multiple discussions, or items [9]. As a result, in the log of a chat session various items overlap as content from multiple discussions is interlaced. We thus define a thread or item as a collection of multiple authors’ postings grouped together by semantic similarity. Intui

[1] Marc Smith, et al. Conversation trees and threaded chats, 2000, CSCW '00.

[2] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques with Java implementations, 2002, SGMD.

[3] Masaki Murata, et al. Topic search for intelligent network news reader HISHO, 2000, SAC '00.

[4] Ken Samuel, et al. An Investigation of Transformation-Based Learning in Discourse, 1998, ICML.

[5] William M. Pottenger, et al. Distributed Information Management, 2001.

[6] J. Ross Quinlan, et al. Induction of Decision Trees, 1986, Machine Learning.

[7] Steven M. Drucker, et al. Alternative interfaces for chat, 1999, UIST '99.

[8] Ken Samuel, et al. Dialogue Act Tagging with Transformation-Based Learning, 1998, ACL.

[9] Eric Brill, et al. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging, 1995, CL.

[10] Earl Rennison, et al. Galaxy of news: an approach to visualizing and understanding expansive news landscapes, 1994, UIST '94.

[11] Eric Brill. A Report of Recent Progress in Transformation-Based Error-Driven Learning, 1994, HLT.

[12] Eric Brill, et al. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging, 1995, VLC@ACL.

[13] Warren Sack, et al. Conversation map: a content-based Usenet newsgroup browser, 2000, IUI '00.

[14] F. D. Bouskila. The Role of Semantic Locality in Hierarchical Distributed Dynamic Indexing and Information Retrieval, 1999.

[15] Yong-Bin Kim, et al. HDDI™: Hierarchical Distributed Dynamic Indexing, 2001.

[16] William M. Pottenger, et al. The Role of the HDDI Collection Builder in Hierarchical Distributed Dynamic Indexing, 2004.