Estimation of cross-lingual news similarities using text-mining methods

In this research, we propose two machine-learning algorithms for extracting cross-lingual news pairs from financial news articles. Every second, innumerable text data, including all kinds of news, reports, messages, reviews, comments, and tweets, are generated on the Internet, written not only in English but also in other languages such as Chinese, Japanese, and French. Taking advantage of the multilingual text resources provided by Thomson Reuters News, we developed two estimation algorithms for extracting cross-lingual news pairs from these resources. Our first method is a novel structure that effectively exploits word-level information together with a machine-learning method for this task. Our second method is based on a bidirectional Long Short-Term Memory (LSTM) and calculates cross-lingual semantic text similarity for both long and short texts. With these methods, when an important news article is published, users can read similar news articles written in their native language.
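The core idea of cross-lingual similarity estimation can be illustrated with a minimal sketch: if words from both languages are mapped into a shared embedding space (as in bilingual word-embedding approaches), a news pair can be scored by the cosine similarity of averaged sentence vectors. The toy vectors and word entries below are hypothetical placeholders, not the paper's trained model, and the averaging step stands in for the learned LSTM encoder:

```python
import numpy as np

# Toy shared bilingual embedding space (hypothetical vectors, NOT the
# paper's trained model): translation-equivalent words are placed near
# each other so that cross-lingual comparison is meaningful.
EMBED = {
    # English tokens
    "stock": np.array([0.9, 0.1, 0.0]),
    "price": np.array([0.8, 0.2, 0.1]),
    "rises": np.array([0.1, 0.9, 0.0]),
    "rain":  np.array([0.0, 0.1, 0.9]),
    # Japanese tokens (assumed already segmented by a morphological analyzer)
    "株価":   np.array([0.85, 0.15, 0.05]),
    "上昇":   np.array([0.15, 0.85, 0.05]),
    "雨":     np.array([0.05, 0.10, 0.90]),
}

def sentence_vector(tokens):
    """Average the word vectors of a tokenized sentence (stand-in for an
    LSTM sentence encoder)."""
    vecs = [EMBED[t] for t in tokens if t in EMBED]
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cross_lingual_similarity(sent_a, sent_b):
    return cosine_similarity(sentence_vector(sent_a), sentence_vector(sent_b))

en = ["stock", "price", "rises"]
ja_related = ["株価", "上昇"]      # "stock price" / "rise"
ja_unrelated = ["雨"]             # "rain"

print(cross_lingual_similarity(en, ja_related) >
      cross_lingual_similarity(en, ja_unrelated))  # prints True
```

A matching news pair scores higher than an unrelated one even though the two sentences share no surface vocabulary, which is the property the learned models exploit at scale.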
