Emerging directions in predictive text mining

In recent years, text mining has grown rapidly as data scientists turn their attention to analyzing unstructured data. The main drivers of this growth are big data and complex applications in which the information in text is combined with other kinds of information to build predictive models. These applications require highly efficient and scalable algorithms to meet their overall performance demands. In this context, six main directions are identified in which text mining research is heading: Deep Learning, Topic Models, Graphical Modeling, Summarization, Sentiment Analysis, and Learning from Unlabeled Text. Each direction has its own motivations and goals, although the concepts overlap somewhat because of the common themes of text and prediction. The predictive models typically involve meta-information or tags that could be added to the text. These tags can then be used in other text processing tasks such as information extraction. While the boundary between text mining and natural language processing is becoming increasingly blurred, the importance of predictive models for applications involving text means that there is still substantial growth potential within the traditional subfields of text mining. These data-centric directions are also likely to influence future research in natural language processing, especially for resource-poor languages and multilingual texts.

WIREs Data Mining Knowl Discov 2015, 5:155–164. doi: 10.1002/widm.1154
