Sentence generation from a bag of words using N-gram model

We present a method for generating sentences from a given bag of words. Sentence generation is useful in applications such as text summarization and question answering. Our goal is to generate all correct sentences that can be formed from a given bag of words. The technique we apply is an N-gram language model, trained on a text corpus so that only plausible candidate sequences are generated. For N input words, rather than treating all N! permutations as candidate sequences, we generate fewer than N! candidates by applying a depth-first search (DFS) filtering technique at run time. We use two corpora: a text corpus and a corpus annotated with part-of-speech (POS) tags. We extract all valid POS tag trigrams from the annotated corpus. Each generated candidate sequence has a probability score. Candidate sequences are ranked by matching them against the valid POS trigram signatures and by their probability scores. Preliminary experiments with this model show promising results.
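To make the idea of DFS filtering concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a bigram model with unsmoothed maximum-likelihood estimates, prunes any partial sequence whose last bigram was never seen in the training text, and ranks complete candidates first by whether all their POS trigrams are valid and then by log-probability. The toy corpus, the `tag_of` lookup, the `valid_trigrams` set, and all function names are hypothetical and chosen only for illustration; how the two ranking criteria are actually combined is an assumption.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count unigrams and bigrams; "<s>" marks the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def generate(bag, unigrams, bigrams):
    """DFS over orderings of `bag`, pruning prefixes that end in an unseen bigram."""
    results = []

    def dfs(prefix, remaining, logp):
        if not remaining:
            results.append((" ".join(prefix[1:]), logp))
            return
        tried = set()
        for i, word in enumerate(remaining):
            if word in tried:
                continue  # a repeated word gives the same subtree; expand it once
            tried.add(word)
            count = bigrams[(prefix[-1], word)]
            if count == 0:
                continue  # prune: this extension never occurred in the corpus
            step = math.log(count / unigrams[prefix[-1]])  # unsmoothed MLE
            dfs(prefix + [word], remaining[:i] + remaining[i + 1:], logp + step)

    dfs(["<s>"], list(bag), 0.0)
    return results

def pos_trigrams_valid(sentence, tag_of, valid_trigrams):
    """True if every POS tag trigram of the sentence is in the valid set."""
    tags = [tag_of[w] for w in sentence.split()]
    return all(t in valid_trigrams for t in zip(tags, tags[1:], tags[2:]))

def rank(candidates, tag_of, valid_trigrams):
    """Sort by POS-signature match first, then by log-probability (assumed order)."""
    return sorted(
        candidates,
        key=lambda c: (pos_trigrams_valid(c[0], tag_of, valid_trigrams), c[1]),
        reverse=True,
    )

# Toy data, invented for this example.
corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "a dog saw the cat",
]
tag_of = {"the": "DT", "a": "DT", "dog": "NN", "cat": "NN",
          "mouse": "NN", "chased": "VBD", "saw": "VBD"}
valid_trigrams = {("DT", "NN", "VBD"), ("NN", "VBD", "DT"), ("VBD", "DT", "NN")}

unigrams, bigrams = train_bigrams(corpus)
bag = ["cat", "the", "chased", "dog", "the"]
candidates = rank(generate(bag, unigrams, bigrams), tag_of, valid_trigrams)

for sentence, logp in candidates:
    print(f"{logp:.3f}  {sentence}")
```

On this toy input the search never expands most of the 5! = 120 orderings, because almost every prefix is cut off at the first unseen bigram. A real system would need smoothing or a backoff scheme so that unseen but valid word pairs are not discarded outright.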
