CoBAn: A Context Based Approach for Text Classification

1.1. The Concept

The proposed approach first identifies key terms that serve as an initial indication of the presence of sentiment and then analyzes the context in which they appear. This enables us to detect even small amounts of relevant text hidden in a much larger section. The original rule-based model, presented in [1], was evaluated in the field of data leakage detection. It used predefined formulae to determine the "confidentiality score" of the analyzed text. The model was able to detect small amounts of rephrased confidential text hidden in larger nonconfidential documents, a task that proved difficult both for fingerprinting algorithms [2, 3] and for bag-of-words (BOW) classifiers.
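
The two-step idea described above (locate key terms, then examine the surrounding context) can be illustrated with a minimal sketch. It does not reproduce the predefined formulae of the original model, which are not given in this section. The function names, the window size `span`, the key-term weights, and the scoring rule (key-term weight multiplied by one plus the number of matching context terms) are all assumptions made for illustration only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def context_windows(tokens, key_terms, span=5):
    """Yield (key_term, context_tokens) for every key-term occurrence.

    The context is up to `span` tokens on each side of the key term.
    """
    for i, tok in enumerate(tokens):
        if tok in key_terms:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            yield tok, left + right


def score_text(text, key_terms, context_terms, span=5):
    """Hypothetical scoring rule, not the formula of the original model.

    key_terms:     dict mapping a key term to an illustrative weight
    context_terms: dict mapping a key term to the set of terms expected
                   to appear near it in relevant text
    """
    tokens = tokenize(text)
    total = 0.0
    for key, context in context_windows(tokens, key_terms, span):
        expected = context_terms.get(key, ())
        matched = sum(1 for t in context if t in expected)
        # A key term in a supporting context counts more than one alone.
        total += key_terms[key] * (1 + matched)
    return total


if __name__ == "__main__":
    key_terms = {"merger": 1.0}  # illustrative weight
    context_terms = {"merger": {"acquisition", "confidential"}}
    hidden = ("Lunch is at noon. The confidential merger and acquisition "
              "plan is attached. See you later.")
    benign = "The merger of two rivers is scenic."
    print(score_text(hidden, key_terms, context_terms))
    print(score_text(benign, key_terms, context_terms))
```

Because the score depends only on the neighbourhood of each key term, a short relevant passage contributes the same amount regardless of how much unrelated text surrounds it, which is the property the paragraph above attributes to the context-based approach.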

[1] Lior Rokach, et al. Wikipedia-based query performance prediction, 2014, SIGIR.

[2] Tao Tao, et al. Language Model Information Retrieval with Document Expansion, 2006, NAACL.

[3] Justin Zobel, et al. Methods for Identifying Versioned and Plagiarized Documents, 2003, J. Assoc. Inf. Sci. Technol.

[4] Jaime G. Carbonell, et al. Document Representation and Query Expansion Models for Blog Recommendation, 2008, ICWSM.

[5] John D. Lafferty, et al. A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval, 2017, SIGF.

[6] Daniel Shawcross Wilkerson, et al. Winnowing: local algorithms for document fingerprinting, 2003, SIGMOD '03.

[7] Yoshua Bengio, et al. The Curse of Dimensionality for Local Kernel Machines, 2005.

[8] Monika Henzinger, et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms, 2006, SIGIR.

[9] Reda Alhajj, et al. Effectiveness of template detection on noise reduction and websites summarization, 2013, Inf. Sci.

[10] ChengXiang Zhai, et al. Statistical Language Models for Information Retrieval, 2008, NAACL.

[11] Yuval Elovici, et al. CoBAn: A context based model for data leakage prevention, 2014, Inf. Sci.

[12] W. Bruce Croft, et al. A language modeling approach to information retrieval, 1998, SIGIR '98.

[13] Michel Verleysen, et al. The Curse of Dimensionality in Data Mining and Time Series Prediction, 2005, IWANN.

[14] Djoerd Hiemstra, et al. Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term, 2002, SIGIR '02.

[15] W. Bruce Croft, et al. A general language model for information retrieval, 1999, CIKM '99.

[16] ChengXiang Zhai, et al. Positional language models for information retrieval, 2009, SIGIR.

[17] W. Bruce Croft, et al. Relevance-Based Language Models, 2001, SIGIR '01.

[18] Boleslaw K. Szymanski, et al. Taming the Curse of Dimensionality in Kernels and Novelty Detection, 2004, WSC.

[19] J. Ross Quinlan, et al. Induction of Decision Trees, 1986, Machine Learning.