Paper Information - Syntactic and Semantic Analysis and Visualization of Unstructured English Texts

Syntactic and Semantic Analysis and Visualization of Unstructured English Texts

People have complex thoughts, and they often express them in complex sentences using natural language. This complexity may facilitate efficient communication among an audience that shares the same knowledge base. For a different or new audience, however, such compositions become cumbersome to understand and analyze. Analyzing such compositions using syntactic or semantic measures is a challenging task and forms a basic step in natural language processing. In this dissertation I explore and propose a number of new techniques to analyze and visualize the syntactic and semantic patterns of unstructured English texts. The syntactic analysis is done through a proposed visualization technique that categorizes and compares English compositions based on their reading complexity metrics. For the semantic analysis I use Latent Semantic Analysis (LSA) to analyze the hidden patterns in complex compositions. I have used this technique to analyze comments from a social visualization web site in order to detect irrelevant ones (e.g., spam). The patterns of collaboration are also studied through statistical analysis. Word sense disambiguation is used to determine the correct sense of a word in a sentence or composition. Applying a textual similarity measure, based on different word similarity measures and word sense disambiguation, to collaborative text snippets from a social collaborative environment reveals a direction for untying the knots of complex hidden patterns of collaboration.
INDEX WORDS: Readability, Complexity depth of field, Grammatical structure, Visualization, Chernoff faces, Web mining, Web information retrieval, Online social visualization, Recommendation, Composition style, Matrix, Latent Semantic Analysis, Online collaborative web site, Social media, Co-occurrence frequency, Pattern searching, Statistical analysis, Categorical data, Semantic similarity, Word sense disambiguation, Natural text, Natural Language, Social network.
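
The abstract refers to "reading complexity metrics" without naming them here. As a minimal sketch of what one such metric looks like, the snippet below computes the Automated Readability Index (ARI), chosen only for illustration; the function name, the tokenization rules, and the sample texts are assumptions and are not taken from the dissertation.

```python
import re


def automated_readability_index(text: str) -> float:
    """Return ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43.

    Tokenization here is a simplifying assumption: sentences are split on
    runs of '.', '!' or '?', and words are runs of letters or digits.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(word) for word in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


# Hypothetical sample texts: a higher score indicates harder reading.
simple = "The cat sat. The dog ran."
dense = ("Analysis of such compositions using syntactic or semantic "
         "measures is a challenging job.")

print(round(automated_readability_index(simple), 2))
print(round(automated_readability_index(dense), 2))
```

Scores like these could serve as one input dimension when comparing compositions; how the dissertation combines or visualizes several metrics is not specified in this summary.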

Saurav Karmakar | S. Karmakar
