Information extraction as a basis for portable text classification systems

Knowledge-based natural language processing systems have succeeded at many tasks, but building an appropriate knowledge base often requires many person-months of effort. As a result, these systems are not portable across domains. This knowledge-engineering bottleneck must be addressed before knowledge-based systems will be practical for real-world applications. This dissertation addresses the knowledge-engineering bottleneck for a natural language processing task called "information extraction". It presents AutoSlog, a system that automatically constructs dictionaries for information extraction, given an appropriate training corpus. In the terrorism domain, AutoSlog used a training corpus and five person-hours of effort to create a dictionary that achieved 98% of the performance of a hand-crafted dictionary that took approximately 1500 person-hours to build. This dissertation also describes three algorithms that use information extraction to support high-precision text classification. As more information becomes available on-line, intelligent information retrieval will be crucial for navigating the information highway efficiently and effectively. The approach presented here represents a compromise between keyword-based techniques and in-depth natural language processing. The text classification algorithms classify texts with high accuracy by using an underlying information extraction system to represent linguistic phrases and contexts. Experiments in the terrorism domain suggest that increasing the amount of linguistic context can improve performance. Both AutoSlog and the text classification algorithms are evaluated in three domains: terrorism, joint ventures, and microelectronics. An important aspect of this dissertation is that AutoSlog and the text classification systems can be easily ported across domains.