Corpus-Based Generation of Content and Form in Poetry

We employ a corpus-based approach to generate content and form in poetry. The main idea is to use two different corpora: one to provide semantic content for new poems, and the other to generate a specific grammatical and poetic structure. The approach uses text mining methods, morphological analysis, and morphological synthesis to produce poetry in Finnish. We present some promising results obtained via the combination of these methods, together with preliminary evaluation results of poetry generated by the system.

In order to automatically obtain the world knowledge necessary for building the content, we use text mining on a background corpus. We construct a word association network based on word co-occurrences in the corpus and then use this network to control the topic and semantic coherence of the poetry when we generate it.

We solve many issues with the form, especially the grammar, by using a grammar corpus. Instead of using an explicit, generative specification of the grammar, we take random instances of actual language use from the grammar corpus and copy their grammatical structure to the generated poetry. We do this by substituting most words in the example text with words that are related to the given topic in the word association network.

Our current focus is on testing these corpus-based principles and their capability to produce novel poetry of good quality on a given topic. At this stage of research, we have not yet considered rhyme, rhythm, or other phonetic features of the form. These will be added in the future, as will more elaborate mechanisms for controlling the content.

As a result of the corpus-based design, the input to the current poetry generator consists of the background corpus, the grammar corpus, and the topic of the poem. In the intended use case, the topic is directly controlled by the user, but we allow the grammar corpus to influence the content, too. Control over form is indirect, exercised through the choice of the two corpora.
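The two corpus-based steps described above, building a word association network from co-occurrences and then substituting words in an example text, can be illustrated with a minimal sketch. It is not the system's implementation: it assumes sentence-level co-occurrence with raw counts as the association strength, uses a toy English corpus, and replaces morphological analysis and synthesis with a hypothetical `FUNCTION_WORDS` list that decides which words to keep. All function names are illustrative, and the example line is fixed here rather than drawn at random from a grammar corpus.

```python
from collections import Counter
from itertools import combinations

# Hypothetical function-word list (an assumption of this sketch); the actual
# system relies on morphological analysis rather than a fixed list.
FUNCTION_WORDS = {"the", "a", "of", "in", "and", "is", "on", "to"}


def build_network(sentences):
    """Count how often two distinct content words share a sentence."""
    network = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()) - FUNCTION_WORDS)
        for a, b in combinations(words, 2):
            network[(a, b)] += 1  # keys are alphabetically ordered pairs
    return network


def associated_words(network, topic):
    """Return words linked to the topic, strongest association first."""
    scores = Counter()
    for (a, b), count in network.items():
        if a == topic:
            scores[b] += count
        elif b == topic:
            scores[a] += count
    # Sort by descending count, then alphabetically so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked]


def fill_template(template, candidates):
    """Keep the template's function words; replace the rest with candidates."""
    pool = iter(candidates)
    result = []
    for word in template.lower().split():
        if word in FUNCTION_WORDS:
            result.append(word)
        else:
            # Fall back to the original word if the candidates run out.
            result.append(next(pool, word))
    return " ".join(result)


# Toy background corpus and a single example line standing in for the grammar corpus.
background = [
    "the sea is cold and deep",
    "waves of the sea hit the shore",
    "the cold sea and the cold wind",
    "the wind is on the shore",
]
network = build_network(background)
print(fill_template("the light of a morning", associated_words(network, "sea")))
```

In this sketch the grammatical skeleton of the example line survives because its function words are left in place, while the topic steers which content words fill the remaining slots. A fuller version would analyze each replaced word morphologically and inflect the substitute to match, which is the role the morphological analysis and synthesis module plays for Finnish.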

The only directly language-dependent component in the system is an off-the-shelf module for morphological analysis and synthesis. The current version of our poetry generation system works in Finnish, whose rich morphology adds another characteristic to the current implementation. However, we believe that the flexible corpus-based design will be useful in transferring the ideas to other languages, as well as in developing applications that can adapt to new styles and contents. A possible application could be a news service on the web, with a poem of the day automatically generated from recent news and possibly triggering, in the mind of the reader, new views on the events of the world.

After briefly reviewing related work in the next section, we will describe the corpus-based approach in more detail. Then, we will give some examples of generated poetry, with rough English translations. We have carried out an empirical evaluation of the generated poetry with twenty subjects, with encouraging results. We will describe this evaluation and its results, and will then conclude by discussing the proposed approach and the planned future work.