Estimating Language Models Using Hadoop and HBase

This thesis presents the construction of a large-scale distributed n-gram language model using the Hadoop MapReduce platform and the distributed database HBase. We propose a method that focuses on the time cost and storage size of the model, exploring different HBase table structures and compression approaches. The method is applied to build training and testing processes on top of the Hadoop MapReduce framework and HBase. The experiments evaluate and compare different table structures when training unigram, bigram, and trigram models on 100 million words, and the results suggest that a table based on the half n-gram structure is a good choice for a distributed language model. The results of this work can be applied and further developed in machine translation and other large-scale distributed language processing areas.
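
As a rough illustration of the counting step that such a pipeline involves, the sketch below shows a plain Hadoop MapReduce job in Java that counts trigrams in a whitespace-tokenized corpus. The class names (NGramCount, NGramMapper, SumReducer) and the fixed order N = 3 are assumptions made for this example, not the thesis code; in the thesis the resulting counts would be stored in an HBase table (for instance one using the half n-gram structure) rather than written back to HDFS as text.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical example: count trigrams with one MapReduce pass.
public class NGramCount {

    public static class NGramMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int N = 3;                  // n-gram order (assumption)
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ngram = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = line.toString().trim().split("\\s+");
            // Emit every trigram in the line with a count of 1.
            for (int i = 0; i + N <= tokens.length; i++) {
                StringBuilder sb = new StringBuilder(tokens[i]);
                for (int j = 1; j < N; j++) {
                    sb.append(' ').append(tokens[i + j]);
                }
                ngram.set(sb.toString());
                context.write(ngram, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text ngram, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            total.set(sum);
            // The thesis pipeline would write this count into an HBase table
            // instead of emitting a text record to HDFS.
            context.write(ngram, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram count");
        job.setJarByClass(NGramCount.class);
        job.setMapperClass(NGramMapper.class);
        job.setCombinerClass(SumReducer.class);   // partial sums cut shuffle volume
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A plausible mapping for a half n-gram table, to give one hedged example, would use the n-gram context (the first n-1 words) as the HBase row key and the final word as the column qualifier, so that all observed continuations of a context share a single row and can be fetched with one lookup during probability estimation.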
