Automatic Sublanguage Identification for a New Text

A number of theoretical studies have been devoted to the notion of sublanguage, which mainly concerns linguistic phenomena restricted by the domain or context. Furthermore, there are some successful NLP systems which have explicitly or implicitly addressed sublanguage restrictions (e.g. TAUM-METEO, ATR). This suggests the following two objectives for future NLP research: automatic linguistic knowledge acquisition for a sublanguage, and automatic definition of a sublanguage and identification of it for a new text. Both issues have become realistic owing to the appearance of large corpora. Despite the recent bloom of research on the first objective, there has been little work on the second. If the second objective is achieved, NLP systems will be able to adapt to the sublanguage before processing the text, which will be a significant help in automatic processing. This paper reports a preliminary experiment aimed at the second objective. It is conducted on about MB of Wall Street Journal corpus. We built article clusters (sublanguages) based on word appearance, and for each test article the closest cluster among the set of clusters is chosen. The comparison between the new articles and the clusters shows that the sublanguage identification succeeds and that the method is promising. In addition, the result of an experiment using only the first two sentences of each article indicates the feasibility of applying this method to speech recognition or other systems which cannot access the whole article prior to processing.
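The identification step described above, choosing the closest article cluster for a new text, can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the clustering procedure, so everything here is an assumption made for illustration: the clusters are given by hand, each cluster is summarized by its pooled word counts, and closeness is cosine similarity between word-count vectors. The names `bag_of_words`, `cosine`, `closest_cluster`, and `example_clusters` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_cluster(article, clusters):
    """Return the name of the cluster whose pooled word counts are most similar.

    `clusters` maps a cluster name to a list of article texts. The cosine
    measure is an assumption, not the measure used in the paper.
    """
    article_bag = bag_of_words(article)
    profiles = {
        name: sum((bag_of_words(text) for text in texts), Counter())
        for name, texts in clusters.items()
    }
    return max(profiles, key=lambda name: cosine(article_bag, profiles[name]))


# Toy, hand-made clusters standing in for automatically built article clusters.
example_clusters = {
    "finance": [
        "stocks fell as the bank raised interest rates",
        "investors sold shares and bonds",
    ],
    "weather": [
        "rain and wind are expected tomorrow",
        "snow fell in the mountains overnight",
    ],
}

print(closest_cluster("bank shares and interest rates", example_clusters))
print(closest_cluster("wind and snow tomorrow", example_clusters))
```

Under the same assumptions, the first-two-sentences setting corresponds to passing only the opening sentences of an article to `closest_cluster` instead of its full text.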