Can Chinese Web Pages Be Classified with English Data Sources?

As the World Wide Web in China grows rapidly, mining knowledge from Chinese Web pages is becoming increasingly important. Web mining usually relies on machine learning techniques that require large amounts of labeled data to train reliable models. Although the number of Chinese Web pages is growing quickly, labeled Chinese data remain scarce. Labeled English Web pages, by contrast, are relatively plentiful. Although expressed in a different language, these labeled data share a substantial amount of semantic information with Chinese pages and can be used to help classify them. In this paper, we propose an information bottleneck based approach to this cross-language classification problem. Our algorithm first translates all the Chinese Web pages into English. Then all the Web pages, both the translated Chinese pages and the English ones, are encoded through an information bottleneck that allows only a limited amount of information to pass. To retain as much useful information as possible under this constraint, the parts that Chinese and English Web pages have in common tend to be encoded to the same code (i.e., class label), which makes the cross-language classification accurate. We evaluated our approach on Web pages collected from the Open Directory Project (ODP). The experimental results show that our method significantly improves on several existing supervised and semi-supervised classifiers.
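The intuition behind encoding pages through a bottleneck can be illustrated with a toy calculation. The sketch below is not the paper's algorithm; it only assumes that information is measured as mutual information between pages and words, and that a "code" is a cluster index shared by several pages. The count matrix, the helper names `mutual_information` and `encode`, and the two candidate code assignments are all hypothetical. It shows that assigning translated Chinese pages to the same code as English pages with similar word usage preserves more information than a mismatched assignment.

```python
import math

def mutual_information(joint):
    """Mutual information (in bits) between rows and columns of a count table."""
    total = sum(sum(row) for row in joint)
    row_p = [sum(row) / total for row in joint]
    col_p = [sum(row[j] for row in joint) / total for j in range(len(joint[0]))]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, value in enumerate(row):
            p = value / total
            if p > 0:  # zero cells contribute nothing
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi

def encode(joint, codes, n_codes):
    """Pass pages through the bottleneck: sum the rows that share a code."""
    merged = [[0.0] * len(joint[0]) for _ in range(n_codes)]
    for row, code in zip(joint, codes):
        for j, value in enumerate(row):
            merged[code][j] += value
    return merged

# Hypothetical page-by-word counts.
# Rows 0-1: labeled English pages (classes 0 and 1).
# Rows 2-3: Chinese pages after translation into English.
counts = [
    [4, 1, 0, 0],
    [0, 0, 3, 2],
    [3, 2, 0, 0],
    [0, 0, 2, 3],
]

full = mutual_information(counts)                              # before encoding
matched = mutual_information(encode(counts, [0, 1, 0, 1], 2))  # similar pages share a code
mismatched = mutual_information(encode(counts, [0, 1, 1, 0], 2))

print(f"before encoding: {full:.3f} bits")
print(f"matched codes:   {matched:.3f} bits")
print(f"mismatched:      {mismatched:.3f} bits")
```

With only two codes available, some information is lost under either assignment, but the matched assignment loses less. Choosing codes to minimize this loss is the sense in which a bottleneck pushes the common part of the two page collections toward the same class label.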
