The Typology of Unknown Words: An Experimental Study of Two Corpora

Most current state-of-the-art natural language processing (NLP) systems, when presented with real-life texts, have problems recognizing each and every word present in the input. Depending on the application, the consequences can be severe. For example, in a machine translation system the quality of the processing may suffer, and sometimes further processing may even be impossible. There are two main reasons why a word might not be recognized and thus be considered unknown by the system: