Processing highly variant language using incremental model selection

This dissertation demonstrates a framework for incremental model selection and processing of highly variant speech transcripts and user-generated text. The system reduces natural language processing (NLP) ambiguity by segmenting text by domain, so that domain-specific downstream processes can analyze each segment independently. The system receives a tokenized text input stream. At every word, an Indicator Function calculates a quantitative feature signal, which we call an Indicator Value Signal, that runs in parallel with the input stream. An event controller monitors this feature signal for domain changes and segments the stream into feature chunks. The event controller can activate slowly over large spans of text, or rapidly and intrasententially. As the event controller marks each domain change with an event signal, the pipeline processes assigned to specific Indicator Function values are executed to process the segment and to add further feature signals to the feature signal stack. At the end of the pipeline, the feature signals are unified to produce a single annotated output stream. To exemplify the framework, this dissertation makes three additional contributions. The first is a novel short-string language identification system that calculates our Indicator Value Signal. The second is a machine transliteration system that converts the Arabizi chat alphabet into Arabic script. The third is a modular part-of-speech tagger for multilingual code-mixing. The short-string language identification system extracts an n-gram and selects the closest of 373 reference languages using a Support Vector Machine (SVM) classifier trained on a matrix of language model measurements. This classifier learns patterns of similarity and divergence of a language's tokens across all reference languages, leading to high accuracy on in-domain n-grams from a legal corpus as well as on out-of-domain tokens from an English-Egyptian Arabic code-mixing microblog corpus.
The machine transliteration system converts Arabizi, a Latinized Arabic chat alphabet, into Arabic script so that existing NLP tools can be applied to Arabic chat text. A parallel, word-aligned corpus of the chat alphabet was collected from a dozen Arabic speakers. From this corpus we induced a probabilistic mapping of cross-dialect Arabizi characters to Arabic script and used it to train a highly accurate transducer. The multilingual part-of-speech tagger demonstrates the modularity of our framework. We find that segmenting text by language before tagging, and then applying homogeneous single-language models, is competitive with heterogeneous multilingual tagging models. We compare the two approaches on a speech transcript of English-Spanish code-mixing. In addition to language identification, we consider a range of alternative indicator functions, such as genre identification, entropy, and gender identification, which could add a language adaptation ability on top of existing NLP systems and improve accuracy and performance when processing variant language. To summarize, this dissertation provides an architecture for NLP that allows for better handling of complicated language variation. To demonstrate the model, we introduce a short-string language identification system with state-of-the-art accuracy, the first research on machine transliteration for a chat alphabet, and a modular part-of-speech tagger for multilingual code-mixing.
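
The segment-then-dispatch flow described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the names `indicator`, `segment`, `run_pipeline`, `SPANISH_WORDS`, and `PROCESSORS` are hypothetical, the indicator function is a toy word-list lookup standing in for the SVM-based language identifier, and the event controller is simplified to fire on every change in the indicator value.

```python
from itertools import groupby
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon; a stand-in for a real indicator function
# such as a short-string language identifier.
SPANISH_WORDS = {"pero", "no", "tengo", "tiempo"}


def indicator(token: str) -> str:
    """Toy indicator function: returns one indicator value per token."""
    return "es" if token.lower() in SPANISH_WORDS else "en"


def segment(tokens: List[str],
            indicator_fn: Callable[[str], str]) -> List[Tuple[str, List[str]]]:
    """Simplified event controller: start a new chunk whenever the
    indicator value changes between adjacent tokens."""
    # The indicator signal runs in parallel with the token stream.
    signal = [indicator_fn(tok) for tok in tokens]
    chunks = []
    for value, group in groupby(zip(signal, tokens), key=lambda pair: pair[0]):
        chunks.append((value, [tok for _, tok in group]))
    return chunks


def run_pipeline(tokens: List[str],
                 indicator_fn: Callable[[str], str],
                 processors: Dict[str, Callable[[List[str]], List[Tuple[str, str]]]]
                 ) -> List[Tuple[str, str]]:
    """Send each chunk to the process assigned to its indicator value,
    then unify the results into one annotated output stream."""
    annotated = []
    for value, chunk in segment(tokens, indicator_fn):
        annotated.extend(processors[value](chunk))
    return annotated


# Placeholder downstream processes: each only labels its tokens. A real
# pipeline would call a domain-specific tagger or transliterator here.
PROCESSORS = {
    "en": lambda chunk: [(tok, "en") for tok in chunk],
    "es": lambda chunk: [(tok, "es") for tok in chunk],
}

if __name__ == "__main__":
    stream = "I want to go pero no tengo tiempo".split()
    for token, label in run_pipeline(stream, indicator, PROCESSORS):
        print(token, label)
```

Because the controller in this sketch reacts to every change in the signal, it models only the rapid, intrasentential case; a controller that activates slowly over large spans would need some smoothing or thresholding of the signal before segmenting.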
