Recursive Style Breach Detection with Multifaceted Ensemble Learning

We present a supervised approach to style change detection. The task is to predict whether a given text document contains changes in writing style and to find the exact positions where those changes occur. We combine a TF.IDF representation of the document with features engineered specifically for the task, and we make predictions with an ensemble of diverse classifiers: SVM, Random Forest, AdaBoost, MLP, and LightGBM. Whenever the model detects a style change in a document, we apply it recursively to locate the specific positions of the change. Our approach powered the winning system for the PAN@CLEF 2018 task on Style Change Detection.
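The recursive step can be illustrated with a minimal sketch. It assumes a binary detector that answers whether a span of sentences contains a style change, and it assumes that positive spans are simply split in half. The abstract does not state the splitting strategy, so the names `find_breaches`, `has_change`, and the toy detector `mixed_case` are hypothetical and stand in for the trained ensemble.

```python
from typing import Callable, List, Sequence

# Hypothetical detector type: True if the span contains a style change.
# In the described system this role would be played by the trained ensemble.
Detector = Callable[[Sequence[str]], bool]


def find_breaches(
    sentences: Sequence[str],
    has_change: Detector,
    offset: int = 0,
) -> List[int]:
    """Return sentence indices at which a style change is estimated to begin.

    Assumption: spans flagged by the detector are halved and each half is
    re-examined. If neither half is flagged on its own, the change is
    attributed to the boundary between the two halves.
    """
    if len(sentences) < 2 or not has_change(sentences):
        return []
    mid = len(sentences) // 2
    left = find_breaches(sentences[:mid], has_change, offset)
    right = find_breaches(sentences[mid:], has_change, offset + mid)
    if not left and not right:
        return [offset + mid]
    return left + right


def mixed_case(span: Sequence[str]) -> bool:
    """Toy detector for illustration only: flags spans mixing upper/lower case."""
    return len({s.isupper() for s in span}) > 1


if __name__ == "__main__":
    doc = ["a", "b", "c", "X", "Y", "Z", "W", "V"]
    print(find_breaches(doc, mixed_case))
```

One limitation of this halving scheme is that a change lying exactly on a split boundary goes unreported when one of the halves also contains a change, so a practical system would likely need a finer search near the boundaries.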
