Mining Numbers in Text Using Suffix Arrays and Clustering Based on Dirichlet Process Mixture Models

We propose a system for searching text with ranges of numbers. Both queries and retrieved strings can contain strings as well as numbers (e.g., “200–800 dollars”). The system is based on suffix arrays augmented with number information, so that numbers can be searched for by words, and words by numbers. Further, the system clusters each extracted collection of numbers with a Dirichlet Process Mixture of Gaussians so that the numbers can be treated appropriately.
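
To make the two search directions concrete, the following is a minimal sketch and not the system described here. It assumes a token-level suffix array built naively, plus a separate sorted index of numeric tokens for range lookups. All function names (`tokenize`, `build_suffix_array`, `find_phrase`, `numbers_before_word`, `words_after_range`) and the sample corpus are hypothetical, and the Dirichlet process clustering step is omitted.

```python
import re
from bisect import bisect_left, bisect_right

def tokenize(text):
    # Assumption: words and plain decimal numbers are the only tokens.
    return re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower())

def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions by token sequence.
    # Fine for a toy corpus; real systems use faster algorithms.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_phrase(tokens, sa, phrase):
    # Suffixes are sorted, so their length-k prefixes are sorted too and
    # every suffix starting with `phrase` forms one contiguous block.
    # (Materializing the prefixes is O(n); it keeps the sketch short.)
    k = len(phrase)
    prefixes = [tokens[i:i + k] for i in sa]
    lo = bisect_left(prefixes, phrase)
    hi = bisect_right(prefixes, phrase)
    return sorted(sa[lo:hi])

def build_number_index(tokens):
    # Hypothetical "number information": (value, position) pairs by value.
    return sorted((float(t), i) for i, t in enumerate(tokens) if t[0].isdigit())

def numbers_before_word(tokens, sa, word):
    # Words -> numbers: numbers that immediately precede `word`.
    return [float(tokens[i - 1])
            for i in find_phrase(tokens, sa, [word])
            if i > 0 and tokens[i - 1][0].isdigit()]

def words_after_range(tokens, num_index, lo, hi):
    # Numbers -> words: tokens that follow a number in [lo, hi].
    start = bisect_left(num_index, (lo, -1))
    end = bisect_right(num_index, (hi, len(tokens)))
    return [(v, tokens[i + 1])
            for v, i in num_index[start:end]
            if i + 1 < len(tokens)]

corpus = ("The camera costs 350 dollars. The lens costs 1200 dollars "
          "and weighs 800 grams. A strap is 15 dollars.")
tokens = tokenize(corpus)
sa = build_suffix_array(tokens)
num_index = build_number_index(tokens)

by_word = numbers_before_word(tokens, sa, "dollars")
by_range = words_after_range(tokens, num_index, 200, 800)
print(by_word)
print(by_range)
```

Here `by_word` answers a word query with numbers, and `by_range` answers a range query such as 200 to 800 with the words that follow each matching number. A collection like `by_word` is the kind of input that the clustering step would then group.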