Decentralized Word2Vec Using Gossip Learning

Advanced NLP models require huge amounts of data from various domains to produce high-quality representations. It would therefore be beneficial for large public and private organizations to pool their corpora during training. However, factors such as legislation and users' emphasis on data privacy may prevent centralized orchestration and data sharing among these organizations. For this scenario, we investigate how gossip learning, a massively parallel, data-private, decentralized protocol, compares to a shared-dataset solution. We find that applying Word2Vec in a gossip learning framework is viable. Without any tuning, the results are comparable to a traditional centralized setting, with a loss of quality as low as 4.3%. Furthermore, the results are up to 54.8% better than independent local training.
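To make the setting concrete, the following is a minimal Python sketch of how Word2Vec-style training could run under a gossip learning protocol: each node trains a skip-gram model on its private corpus and periodically pushes its parameters to a random peer, which merges them. The abstract does not specify the protocol details, so the push-based exchange, the averaging merge rule, and all names and hyperparameters here (Node, local_step, merge, VOCAB, DIM) are illustrative assumptions, not the paper's exact method.

```python
# Hypothetical gossip-learning sketch: private skip-gram training with
# negative sampling, plus model exchange by parameter averaging.
import random
import numpy as np

VOCAB, DIM, NODES, ROUNDS = 100, 16, 8, 50  # toy sizes, not from the paper
rng = np.random.default_rng(0)

class Node:
    def __init__(self, corpus):
        self.corpus = corpus                         # private (center, context) pairs
        self.emb = rng.normal(0, 0.1, (VOCAB, DIM))  # word embeddings
        self.ctx = rng.normal(0, 0.1, (VOCAB, DIM))  # context embeddings

    def local_step(self, lr=0.05, negatives=5):
        # One SGD pass of skip-gram with negative sampling over local data only.
        for center, context in self.corpus:
            samples = [(context, 1.0)] + [
                (int(rng.integers(VOCAB)), 0.0) for _ in range(negatives)
            ]
            for word, label in samples:
                score = 1.0 / (1.0 + np.exp(-self.emb[center] @ self.ctx[word]))
                grad = score - label
                v = self.emb[center].copy()          # keep pre-update embedding
                self.emb[center] -= lr * grad * self.ctx[word]
                self.ctx[word] -= lr * grad * v

    def merge(self, other):
        # Gossip merge: average local parameters with the received model.
        self.emb = (self.emb + other.emb) / 2
        self.ctx = (self.ctx + other.ctx) / 2

# Each node holds a disjoint private corpus (here: random toy pairs).
nodes = [Node([(int(rng.integers(VOCAB)), int(rng.integers(VOCAB)))
               for _ in range(200)]) for _ in range(NODES)]

for _ in range(ROUNDS):
    for node in nodes:
        node.local_step()                            # train on private data
    sender = random.choice(nodes)                    # push model to a random peer
    receiver = random.choice([n for n in nodes if n is not sender])
    receiver.merge(sender)
```

No raw corpus ever leaves a node; only model parameters travel between peers, which is what allows the protocol to respect data-privacy constraints while still approaching the quality of a shared-dataset solution.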
