Syntactic and Semantic Analysis and Visualization of Unstructured English Texts

People have complex thoughts, and they often express them in complex natural-language sentences. This complexity can make communication efficient among readers who share the same knowledge base, but for a different or new audience such compositions become cumbersome to understand and analyze. Analyzing these compositions with syntactic or semantic measures is challenging, and it is a foundational step in natural language processing. In this dissertation I explore and propose several new techniques for analyzing and visualizing the syntactic and semantic patterns of unstructured English texts. The syntactic analysis relies on a proposed visualization technique that categorizes and compares English compositions according to their reading complexity metrics. For the semantic analysis I use Latent Semantic Analysis (LSA) to uncover hidden patterns in complex compositions. I apply this technique to comments from a social visualization web site in order to detect irrelevant ones (e.g., spam). Patterns of collaboration are also studied through statistical analysis. Word sense disambiguation is used to determine the correct sense of a word in a sentence or composition. Applying a textual similarity measure, built on several word similarity measures and on word sense disambiguation, to text snippets from a social collaborative environment points toward a way of untangling the complex hidden patterns of collaboration.
INDEX WORDS: Readability, Complexity depth of field, Grammatical structure, Visualization, Chernoff faces, Web mining, Web information retrieval, Online social visualization, Recommendation, Composition style, Matrix, Latent Semantic Analysis, Online collaborative web site, Social media, Co-occurrence frequency, Pattern searching, Statistical analysis, Categorical data, Semantic similarity, Word sense disambiguation, Natural text, Natural Language, Social network.
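As a rough illustration of what a reading complexity metric computes, the sketch below implements the Automated Readability Index, one widely used readability formula. The abstract does not name the specific metrics used in the dissertation, so the choice of this index, the function name, and the naive regex-based sentence and word splitting are all assumptions made only for illustration.

```python
import re


def automated_readability_index(text: str) -> float:
    """Automated Readability Index:
    4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    """
    # Naive segmentation (an assumption): sentences end at ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters or digits; punctuation is ignored.
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


# Two hypothetical compositions with different complexity.
simple_text = "The cat sat. The dog ran."
dense_text = "Analysis of unstructured compositions using syntactic measures is challenging."

simple_score = automated_readability_index(simple_text)
dense_score = automated_readability_index(dense_text)
print(f"simple: {simple_score:.2f}, dense: {dense_score:.2f}")
```

A higher score indicates text that is harder to read. Computing several such indexes per composition would yield the kind of multi-metric profile that a visualization could then categorize and compare.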
