PROSE AND POETRY CLASSIFICATION AND BOUNDARY DETECTION USING WORD ADJACENCY NETWORK ANALYSIS

Word adjacency networks constructed from written works reflect differences in the structure of prose and poetry. We present a method to disambiguate prose and poetry by analyzing parameters of word adjacency networks, such as the clustering coefficient, average path length, and average degree. We determine the parameters relevant for disambiguation using linear discriminant analysis (LDA) and the effect size criterion. The accuracy of the method is 74.9 ± 2.9% for the training set and 73.7 ± 6.4% for the test set, both of which exceed the acceptable classifier requirement of 67.3%. The approach is also useful for locating text boundaries within a single article: a boundary falls within the window in which a significant change in clustering coefficient is observed. Results indicate that a window size of 75 words is optimal for detecting the text boundaries.
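As an illustration of the quantities named above, the following is a minimal sketch, not the authors' implementation. It assumes an undirected, unweighted network in which consecutive words are linked, a simple regular-expression tokenizer, and non-overlapping windows; the function names (`build_adjacency_network`, `windowed_clustering`, and so on) are hypothetical, and the LDA step is not shown.

```python
import re
from collections import deque


def build_adjacency_network(text):
    """Link each word to the word that follows it (undirected, unweighted)."""
    words = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    graph = {w: set() for w in words}
    for a, b in zip(words, words[1:]):
        if a != b:  # skip self-loops produced by immediately repeated words
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_degree(graph):
    """Mean number of distinct neighbors per word."""
    if not graph:
        return 0.0
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def clustering_coefficient(graph):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    if not graph:
        return 0.0
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        nbr_list = list(nbrs)
        # Count links among the neighbors of this node.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbr_list[j] in graph[nbr_list[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def windowed_clustering(text, window=75):
    """Clustering coefficient of each consecutive, non-overlapping window."""
    words = text.split()
    return [
        clustering_coefficient(build_adjacency_network(" ".join(words[i:i + window])))
        for i in range(0, len(words), window)
    ]
```

Under these assumptions, a candidate boundary would be sought where consecutive values returned by `windowed_clustering` differ markedly; the threshold for a "significant" change is left to the reader, since the abstract does not specify it.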
