Detecting Inflection Patterns in Natural Language by Minimization of Morphological Model

One of the most important steps in text processing and information retrieval is stemming: reducing words to stems that express their base meaning, e.g., bake, baked, bakes, baking → bak-. We suggest an unsupervised method for recognizing such inflection patterns automatically, with no a priori information about the given language, based exclusively on a list of words extracted from a large text. For a given word list V we construct two sets of strings, stems S and endings E, such that each word from V is a concatenation of a stem from S and an ending from E. To select an optimal model, we minimize the total number of elements in S and E. Though such a simplistic model does not reflect many phenomena of real natural language morphology, it shows surprisingly promising results on different European languages. In addition to its practical value, we believe that this can also shed light on the nature of human language.
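
The minimization criterion can be illustrated on a toy word list. The following is only a minimal sketch, not the optimization procedure used in the paper: it assumes that stems are non-empty and that the empty string is an admissible ending, and it finds the minimum of |S| + |E| by exhaustive search, which is feasible only for a handful of words. The function name `minimal_model` and the sample word list are hypothetical.

```python
from itertools import product


def minimal_model(words):
    """Return (cost, stems, endings) minimizing |S| + |E| by brute force.

    Assumptions (not taken from the paper): every stem is non-empty,
    and the empty string counts as a valid ending.
    """
    # For each word, enumerate every split into (stem, ending).
    options = [[(w[:i], w[i:]) for i in range(1, len(w) + 1)] for w in words]

    best = None
    # Try every combination of one split per word.
    for choice in product(*options):
        stems = {stem for stem, _ in choice}
        endings = {ending for _, ending in choice}
        cost = len(stems) + len(endings)
        if best is None or cost < best[0]:
            best = (cost, stems, endings)
    return best


words = ["walk", "walks", "walked", "jump", "jumps", "jumped"]
cost, stems, endings = minimal_model(words)
print(cost, sorted(stems), sorted(endings))
# prints: 5 ['jump', 'walk'] ['', 'ed', 's']
```

For this list the minimum is reached only by the split into stems walk, jump and endings (empty), -s, -ed. The search space grows exponentially with the number of words, so a realistic vocabulary would need a heuristic or approximate optimizer in place of the exhaustive loop.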