Different Approaches to Bilingual Text Classification Based on Grammatical Inference Techniques

Bilingual documentation has become common in many official institutions and private companies. In this scenario, the categorization of bilingual text is a useful tool that can also be applied in the machine translation field. To tackle this classification task, different approaches are proposed. On the one hand, two finite-state transducer algorithms from the grammatical inference domain are discussed. On the other hand, the well-known naive Bayes approximation is presented along with a possible model based on n-gram language models. Experiments carried out on a bilingual corpus demonstrate the adequacy of these methods and the relevance of a second information source in text classification, as supported by classification error rates. A relative reduction of 29% with respect to the best previous results on the monolingual version of the same task was obtained.
