Business environmental analysis for textual data using data mining and sentence-level classification

The purpose of this paper is to propose a methodology for classifying a large amount of unstructured textual data into the categories of business environmental analysis frameworks.

This paper uses machine learning to classify a vast amount of unstructured textual data by category of business environmental analysis framework. Producing a large volume of high-quality training data for a machine-learning-based system is generally costly, so semi-supervised learning techniques are used to improve classification performance. Additionally, the lack-of-features problem that traditional classification systems suffer from is addressed by adding semantic features derived from word embedding, a recent technique in text mining.

The proposed methodology can be used for various business environmental analyses, and the system is fully automated in both the training and classifying phases. Semi-supervised learning can address the problem of insufficient training data. The proposed semantic features can help improve traditional classification systems.

This paper focuses on classifying sentences that contain business environmental analysis information in a large number of documents. However, the proposed methodology is limited in supporting the advanced analyses that directly help managers establish strategies, because it does not summarize the environmental variables implied in the classified sentences. Advanced summarization and recommendation techniques could extract the environmental variables from those sentences and thereby assist managers in establishing effective strategies.

The feature selection technique developed in this paper has not been used in traditional systems for business and industry, and it allows the whole process to be fully automated. It is also practical in that it can be applied to various business environmental analysis frameworks. In addition, the system is more economical than traditional systems because of semi-supervised learning, and it can resolve the lack-of-features problem that traditional systems suffer from. This work is valuable for analyzing environmental factors and establishing strategies for companies.
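The combination of semi-supervised learning and embedding-based semantic features described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name a specific algorithm, so the sketch assumes self-training with a nearest-centroid classifier over averaged word vectors. The two-dimensional vectors in `EMB`, the category names, the margin threshold, and all function names are hypothetical and chosen only for illustration.

```python
from math import sqrt

# Hypothetical 2-d "word embeddings"; a real system would load trained vectors.
EMB = {
    "competitor": (1.0, 0.1), "rival": (0.9, 0.2), "market": (0.8, 0.3),
    "regulation": (0.1, 1.0), "law": (0.2, 0.9), "policy": (0.3, 0.8),
}

def sentence_vector(sentence):
    """Average the embeddings of known words; zero vector if none are known."""
    vecs = [EMB[w] for w in sentence.lower().split() if w in EMB]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def centroids(labeled):
    """Mean sentence vector per category from (sentence, label) pairs."""
    groups = {}
    for sentence, label in labeled:
        groups.setdefault(label, []).append(sentence_vector(sentence))
    return {
        label: tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))
        for label, vecs in groups.items()
    }

def predict(cents, sentence):
    """Return (best label, margin between the best and second-best score)."""
    v = sentence_vector(sentence)
    scored = sorted(((cosine(v, c), lab) for lab, c in cents.items()), reverse=True)
    second = scored[1][0] if len(scored) > 1 else 0.0
    return scored[0][1], scored[0][0] - second

def self_train(labeled, unlabeled, threshold=0.3, max_rounds=5):
    """Self-training: repeatedly add confidently classified sentences to the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(labeled)
        confident, rest = [], []
        for sentence in pool:
            label, margin = predict(cents, sentence)
            if margin >= threshold:
                confident.append((sentence, label))
            else:
                rest.append(sentence)
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = rest
    return centroids(labeled), labeled

# Usage: two seed sentences, three unlabeled ones.
seed = [("competitor market", "competition"), ("regulation law", "legal")]
unlabeled = ["rival market", "policy law", "weather today"]
model, labeled = self_train(seed, unlabeled)
print(predict(model, "competitor"))
```

Sentences whose words are absent from the embedding table get a zero vector and a zero margin, so they are never added to the training set. A production system would need a comparable confidence rule to keep self-training from reinforcing its own mistakes.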
