Author attribution on streaming data

Novel authors appearing in a streaming data source, such as evolving social media, is a problem that has not been addressed until now. Existing author attribution techniques deal with datasets in which the total number of authors does not change between the training and the testing of the classifiers. This study focuses on the question, “What happens if new authors are added to the system over time?” We also address the cases in which some authors do not stay: they may disappear over time or reappear after a while. We propose stream mining approaches to solve the problem. The test scenarios are built on the existing IMDB62 data set, which is already widely used to evaluate author attribution algorithms. We use our own shuffling algorithms to create the effect of novel authors. Before stream mining, POS tagging and TF-IDF are applied for feature extraction. We also apply a bi-tag approach, in which two consecutive tags are treated as a new feature. With the novel techniques proposed for the first time in this paper, the success rate for authorship attribution on streaming text data increases from 35% to 61%.
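The bi-tag and TF-IDF feature extraction step can be illustrated with a minimal sketch. The abstract does not specify the tagger, the tag set, or the exact TF-IDF variant, so everything here is an assumption: the input is taken to be already POS-tagged (Penn Treebank style tags are used only as an example), the weighting is plain term frequency times log inverse document frequency, and the names `bi_tags`, `tf_idf`, and `tagged_reviews` are hypothetical.

```python
import math
from collections import Counter


def bi_tags(tags):
    """Join each POS tag with its successor into one feature token.

    Example: ['DT', 'NN', 'VBZ'] -> ['DT_NN', 'NN_VBZ']
    """
    return [f"{first}_{second}" for first, second in zip(tags, tags[1:])]


def tf_idf(docs):
    """Weight each feature by term frequency times log inverse document frequency.

    docs is a list of token lists; the result is one {feature: weight} dict per
    document. This is an unsmoothed variant chosen for brevity (an assumption,
    not necessarily the weighting used in the paper).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each feature once per document

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append({
            feature: (count / total) * math.log(n_docs / doc_freq[feature])
            for feature, count in counts.items()
        })
    return vectors


# Hypothetical, already-tagged texts (one tag sequence per review).
tagged_reviews = [
    ["DT", "NN", "VBZ", "JJ"],
    ["PRP", "VBD", "DT", "NN"],
    ["DT", "NN", "VBD", "RB"],
]

# Each document's features are its single tags plus its bi-tags.
features = [tags + bi_tags(tags) for tags in tagged_reviews]
vectors = tf_idf(features)
```

In this toy input, features that occur in every review, such as `DT` or `DT_NN`, receive zero weight, while features unique to one review, such as `VBZ_JJ`, receive positive weight. The resulting vectors would then be passed to whichever stream classifier is used.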
