Two-Fold Filtering for Chinese Subcategorization Acquisition with Diathesis Alternations Used as Heuristic Information

Automatically acquired lexicons with subcategorization information have been shown to be accurate and useful for some purposes, but their accuracy still leaves room for improvement, and their usefulness in many applications remains to be investigated. This paper proposes a two-fold filtering method that, in experiments, markedly improved the performance of a Chinese subcategorization acquisition system, reaching an improved precision of 76.94% and a recall of 83.83%, which makes the acquired lexicon much more practical for further manual proofreading and other NLP uses. To the best of our knowledge, these figures represent the best overall performance reported to date in Chinese subcategorization acquisition and in comparable research on other languages.
