Locality Alignment Discriminant Analysis for Visualizing Regional English

In this paper, we propose a novel dimensionality reduction algorithm, locality alignment discriminant analysis (LADA), for visualizing regional English. In LADA, the proposed intrinsic and penalty graphs measure the similarity between each pair of textual slices, which better characterizes intra-class compactness and inter-class separability. The projection matrix obtained by the method is orthogonal, which eliminates redundancy between projection directions and is more effective at preserving the intrinsic geometry and improving discriminative ability. To evaluate the algorithm, a corpus of regional written English is designed and collected. Articles are split into slices, and each slice is transformed into a 140-dimensional data point using 140 text style markers. LADA is then applied to recognize the variations present in regional written English, and the similarity among different types of English can be observed in the resulting data plots. Visualization results and numerical comparisons indicate that LADA outperforms other existing algorithms on regional English data because it better preserves the local discriminative information embedded in the data, which makes it suitable for pattern classification.
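The abstract does not give the LADA objective, so the following is only a minimal sketch of the general idea it describes: build an intrinsic graph (same-class neighbors) and a penalty graph (different-class neighbors) over the data points, then find an orthogonal projection that pulls the former together and pushes the latter apart. Everything specific here is an assumption made for illustration: the 0/1 k-nearest-neighbor weights, the trace-difference criterion (swapped in because its eigenvectors are orthonormal by construction), the function names, and the synthetic 140-dimensional data standing in for the style-marker vectors.

```python
import numpy as np

def knn_graph(X, y, k, same_class):
    """Symmetric 0/1 adjacency: each point is linked to its k nearest
    neighbors from its own class (intrinsic) or from other classes (penalty)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        mask = (y == y[i]) if same_class else (y != y[i])
        mask[i] = False  # never link a point to itself
        cand = np.flatnonzero(mask)
        nearest = cand[np.argsort(d2[i, cand])[:k]]
        W[i, nearest] = 1.0
    return np.maximum(W, W.T)

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W

def orthogonal_projection(X, y, dim=2, k=5):
    """Assumed criterion (not the paper's): maximize
    tr(V^T X^T (L_penalty - L_intrinsic) X V) subject to V^T V = I."""
    Xc = X - X.mean(axis=0)
    L_int = laplacian(knn_graph(Xc, y, k, same_class=True))
    L_pen = laplacian(knn_graph(Xc, y, k, same_class=False))
    M = Xc.T @ (L_pen - L_int) @ Xc
    M = (M + M.T) / 2.0  # remove floating-point asymmetry
    _, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return vecs[:, ::-1][:, :dim]  # top eigenvectors, orthonormal columns

# Synthetic stand-in for slices described by 140 style markers (3 classes).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 140)) for c in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 30)

V = orthogonal_projection(X, y, dim=2, k=5)
Z = (X - X.mean(axis=0)) @ V  # 2-D coordinates for plotting
```

Because `V` comes from the eigendecomposition of a symmetric matrix, its columns are mutually orthogonal, which is the property the abstract credits with removing redundancy between projection directions. The rows of `Z` can be passed to any scatter-plot routine, colored by `y`, to produce a data plot of the kind the abstract mentions.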
