J48S: A Sequence Classification Approach to Text Analysis Based on Decision Trees

Sequences play a major role in the extraction of information from data. As an example, in business intelligence, they can be used to track the evolution of customer behaviors over time or to model relevant relationships. In this paper, we focus our attention on the domain of contact centers, where sequential data typically take the form of oral or written interactions, and word sequences often play a major role in text classification, and we investigate the connections between sequential data and text mining techniques. The main contribution of the paper is a new machine learning algorithm, called J48S, that associates semantic knowledge with telephone conversations. The proposed solution is based on the well-known C4.5 decision tree learner, and it is natively able to mix static, that is, numeric or categorical, data and sequential ones, such as texts, for classification purposes. The algorithm, evaluated in a real business setting, is shown to provide competitive classification performances compared with classical approaches, while generating highly interpretable models and effectively reducing the data preparation effort.

[1]  Mario Cannataro,et al.  Protein-to-protein interactions: Technologies, databases, and algorithms , 2010, CSUR.

[2]  Hoai Bac Le,et al.  Efficient algorithms for simultaneously mining concise representations of sequential patterns based on extended pruning conditions , 2018, Eng. Appl. Artif. Intell..

[3]  Mohammed J. Zaki,et al.  SPADE: An Efficient Algorithm for Mining Frequent Sequences , 2004, Machine Learning.

[4]  Jianyong Wang,et al.  Mining sequential patterns by pattern-growth: the PrefixSpan approach , 2004, IEEE Transactions on Knowledge and Data Engineering.

[5]  Antonio Gomariz,et al.  VGEN: Fast Vertical Mining of Sequential Generator Patterns , 2014, DaWaK.

[6]  Mark A. Hall,et al.  Correlation-based Feature Selection for Machine Learning , 2003 .

[7]  Johannes Gehrke,et al.  Sequential PAttern mining using a bitmap representation , 2002, KDD.

[8]  Manuel Campos,et al.  Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information , 2014, PAKDD.

[9]  Frédérik Cailliau,et al.  Mining Automatic Speech Transcripts for the Retrieval of Problematic Calls , 2013, CICLing.

[10]  Siau-Cheng Khoo,et al.  Mining and Ranking Generators of Sequential Pattern , 2008, SDM 2008.

[11]  Roque Marín,et al.  ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences , 2013, PAKDD.

[12]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[13]  Hong-Yeop Song,et al.  A New Criterion in Selection and Discretization of Attributes for the Generation of Decision Trees , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[14]  Avishai Mandelbaum,et al.  Telephone Call Centers: Tutorial, Review, and Research Prospects , 2003, Manuf. Serv. Oper. Manag..

[15]  J. Ross Quinlan,et al.  Improved Use of Continuous Attributes in C4.5 , 1996, J. Artif. Intell. Res..

[16]  Nizar R. Mabroukeh,et al.  A taxonomy of sequential pattern mining algorithms , 2010, CSUR.

[17]  Elizabeth Chang,et al.  Past, present and future of contact centers: a literature review , 2017, Bus. Process. Manag. J..

[18]  Meghna Abhishek Pandharipande,et al.  A novel approach to identify problematic call center conversations , 2012, 2012 Ninth International Conference on Computer Science and Software Engineering (JCSSE).

[19]  Philip S. Yu,et al.  Direct mining of discriminative and essential frequent patterns via model-based search tree , 2008, KDD.

[20]  Jean-Luc Gauvain,et al.  CallSurf: Automatic Transcription, Indexing and Structuration of Call Center Conversational Speech for Knowledge Extraction and Query by Content , 2008, LREC.

[21]  Jiawei Han,et al.  BIDE: efficient mining of frequent closed sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[22]  J. Ross Quinlan,et al.  Simplifying Decision Trees , 1987, Int. J. Man Mach. Stud..

[23]  Jiawei Han,et al.  Discriminative Frequent Pattern Analysis for Effective Classification , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[24]  J. Rissanen,et al.  Modeling By Shortest Data Description* , 1978, Autom..

[25]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[26]  Xifeng Yan,et al.  CloSpan: Mining Closed Sequential Patterns in Large Datasets , 2003, SDM.