Knowledge Transfer and Opinion Detection in the TREC 2006 Blog Track

The paper describes the opinion detection system developed at Carnegie Mellon University for the TREC 2006 Blog track. The system performed a two-stage process: passage retrieval and opinion detection. Due to the lack of training data for the TREC Blog corpus, online opinion reviews from other domains, such as movie reviews and product reviews, were used as training data. Knowledge transfer was performed to make the cross-domain learning possible. Logistic regression ranked sentence-level opinions against objective statements. The evaluation shows that the algorithm is effective in the task.

Introduction

The Blog track is a new task in the TREC 2006 evaluation. The main task of the track is “opinion detection” in the domain of online blogs posted from Dec 2005 to Feb 2006. The posts and comments are from Technorati, Bloglines, Blogpulse and other web hosts. The system developed at Carnegie Mellon University for the opinion detection task consists of the modules described below.

Data Preprocessing

The data from NIST are mainly XML files with tags similar to those of previous TREC web collections, such as WT10G or GOV2. Three types of files are provided by NIST: permalinks (HTML documents containing the posts), RSS feeds, and blog homepages (HTML documents containing the homepages of the feeds). Permalinks contain the actual content of the corpus and are the main target of this task. Indexing, retrieval and opinion detection are all performed on permalink documents. Further study should involve RSS feeds, since they reveal the structure and network of multiple blog posts.

Due to the messy nature of online HTML, data cleaning is an important preprocessing step. Two approaches were tried. The first used the built-in HTML cleaning functions of the latest Indri 2.3.1 toolkit [1]. Additional preprocessing was done to handle stylesheets and JavaScript, which were not handled by the current version of Indri.
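As a rough illustration of the additional stylesheet and JavaScript handling mentioned above, the following minimal sketch removes `<script>` and `<style>` elements with a regular expression before a document is handed to an indexer. The function name `strip_script_and_style` and the regex-based approach are assumptions made for illustration; the source does not say how this step was implemented.

```python
import re

# Hypothetical helper: not part of Indri and not the authors' code.
# Matches a <script> or <style> element together with its contents.
_BLOCK_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_script_and_style(html: str) -> str:
    """Replace every <script>/<style> element with a single space."""
    return _BLOCK_RE.sub(" ", html)


if __name__ == "__main__":
    page = "<p>Great phone</p><script>track();</script><style>p{color:red}</style>"
    print(strip_script_and_style(page))
```

A regular expression is enough for a sketch, but malformed blog markup (for example an unclosed `<script>` tag) would need a tolerant HTML parser instead.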
The index was then built using the “trecweb” format supported by Indri. The unit for indexing and retrieval is one permalink document, i.e., one blog post with its following comments. However, it soon became clear that indexing the raw HTML files directly in Indri limits the flexibility of gathering further information from the raw text, for example sentence structure, paragraph information and part-of-speech tags, which could be important for opinion detection in later stages. These could all be obtained by adding functions to Indri; however, given the programming effort required and the time constraints, the first approach was discarded.

The second approach was to transform the HTML files as closely as possible into regular text files. This was done in several steps.

Removing HTML tags, scripts, stylesheets: A wrapper was created on top of a tool called “striptags” from the REAP project [2]. The resulting text documents looked much neater; however, advertisements and text from side bars and menus still remained. These are all noise in the main text. Another module with machine-learned patterns could possibly remove them. However, such noise would also be filtered out automatically by retrieval and opinion detection in the later stages, so leaving it in caused little harm to the final results.

1 This work was done while the author was at the Language Technologies Institute at Carnegie Mellon University.

Removing Non-English characters: The TREC 2006 Blog corpus contains non-English posts. Characters with ASCII codes less than 32 or greater than 126 were removed from the corpus.

Sentence Splitting: A modified version of UIUC’s sentence splitter [3] was used to annotate the corpus with <s> and </s> tags that identify the beginning and end of each sentence.

Creating Artificial Paragraphs: Original line breaks in the text were preserved as paragraph boundaries.
Moreover, long original paragraphs were broken into paragraphs of around 100 words, without crossing sentence boundaries.

Removing Dummy Sentences: If a sentence contained only punctuation, or just a single number, it was removed. This step filtered out the form counters that are common in Web documents. Moreover, if any word occurred more than N times within a sentence (N=20 in these tests), that sentence was removed. This step filtered out advertisements and web category anchor text, which are not important content of a blog and hence are irrelevant to any potential query.

The cleaned documents have the following format:

<DOC>
<TEXT>
<PARAGRAPH>
<s>blah blah blah</s>
<s>blah blah blah</s>
...
</PARAGRAPH>
<PARAGRAPH>
...
</TEXT>
</DOC>

Query Formulation

The topic file that NIST provided contains topics, each with title, description and narrative parts. Based on the title field, the topics could be classified into 6 categories [Figure 1].

Figure 1: blog topic category distribution
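The cleaning steps described under Data Preprocessing (character filtering, artificial paragraphs and dummy-sentence removal) can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, sentence splitting is assumed to have been done upstream, and keeping newlines during character filtering and counting words case-insensitively are assumptions that the source does not state.

```python
import re
from collections import Counter

MAX_WORD_REPEATS = 20   # N = 20, the value given in the text
PARAGRAPH_WORDS = 100   # approximate length of an artificial paragraph


def strip_non_printable(text: str) -> str:
    """Drop characters with code below 32 or above 126.

    Assumption: newlines are kept so that original line breaks can still
    mark paragraph boundaries.
    """
    return "".join(ch for ch in text if ch == "\n" or 32 <= ord(ch) <= 126)


def is_dummy(sentence: str) -> bool:
    """True if the sentence is only punctuation, a single number,
    or contains a word repeated more than MAX_WORD_REPEATS times."""
    stripped = (re.sub(r"[^A-Za-z0-9]", "", w) for w in sentence.split())
    tokens = [t.lower() for t in stripped if t]
    if not tokens:
        return True  # nothing but punctuation or whitespace
    if len(tokens) == 1 and tokens[0].isdigit():
        return True  # a lone number, e.g. a hit counter
    return max(Counter(tokens).values()) > MAX_WORD_REPEATS


def split_paragraph(sentences, target=PARAGRAPH_WORDS):
    """Group sentences into chunks of roughly `target` words,
    never splitting inside a sentence."""
    chunks, current, size = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and size + n_words > target:
            chunks.append(current)
            current, size = [], 0
        current.append(sentence)
        size += n_words
    if current:
        chunks.append(current)
    return chunks


def to_doc(paragraphs):
    """Render lists of sentences in the <DOC>/<TEXT>/<PARAGRAPH>/<s> layout."""
    lines = ["<DOC>", "<TEXT>"]
    for sentences in paragraphs:
        kept = [s for s in sentences if not is_dummy(s)]
        for chunk in split_paragraph(kept):
            lines.append("<PARAGRAPH>")
            lines.extend(f"<s>{s}</s>" for s in chunk)
            lines.append("</PARAGRAPH>")
    lines += ["</TEXT>", "</DOC>"]
    return "\n".join(lines)
```

For example, `to_doc([["Good film.", "42"]])` keeps the first sentence and drops the lone number, producing a single `<PARAGRAPH>` with one `<s>` element.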

[1] Michael I. Jordan, et al. Multi-task feature selection, 2006.

[2] Rajat Raina, et al. Abstract, 1997, Veterinary Record.