An Approach to Text Mining using Information Extraction

In this paper we describe our approach to text mining by introducing TextMiner. We perform term and event extraction on each document to find features that are likely to be meaningful in the domain, and then apply data mining to the extracted features that label each document. The system consists of two major components: the Text Analysis component and the Data Mining component. The Text Analysis component converts semi-structured data, such as documents, into structured data stored in a database. The Data Mining component applies data mining techniques to the output of the Text Analysis component. We apply our approach to the financial domain (a collection of financial documents), and our main goals are: a) to manage all the available information, for example by classifying documents into appropriate categories, and b) to “mine” the data in order to “discover” useful knowledge. The system is designed primarily to support two languages, English and Greek.
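
The two-component design can be illustrated with a minimal sketch. It is not the TextMiner implementation: the term lexicon, the table layout, the function names and the choice of co-occurrence counting as the mining step are all assumptions made for illustration. A simple lexicon lookup stands in for term and event extraction, and an in-memory SQLite table stands in for the database.

```python
import re
import sqlite3
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon of financial terms (an assumption, not from the paper).
DOMAIN_TERMS = {"merger", "acquisition", "dividend", "shares", "profit"}


def extract_terms(text):
    """Return the domain terms found in the text (stand-in for term/event extraction)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {token for token in tokens if token in DOMAIN_TERMS}


def analyse(documents, conn):
    """Text analysis step: store one (doc_id, term) row per extracted feature."""
    conn.execute("CREATE TABLE IF NOT EXISTS features (doc_id INTEGER, term TEXT)")
    for doc_id, text in enumerate(documents):
        rows = [(doc_id, term) for term in sorted(extract_terms(text))]
        conn.executemany("INSERT INTO features VALUES (?, ?)", rows)
    conn.commit()


def mine_cooccurrences(conn, min_support=2):
    """Data mining step: keep term pairs that label at least min_support documents."""
    terms_by_doc = {}
    for doc_id, term in conn.execute("SELECT doc_id, term FROM features"):
        terms_by_doc.setdefault(doc_id, set()).add(term)
    counts = Counter()
    for terms in terms_by_doc.values():
        counts.update(combinations(sorted(terms), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


documents = [
    "The merger raised profit and the dividend.",
    "Profit fell after the merger was announced.",
    "The company issued new shares.",
]
conn = sqlite3.connect(":memory:")
analyse(documents, conn)
frequent_pairs = mine_cooccurrences(conn)
print(frequent_pairs)
```

The mining step reads only the stored features, never the raw text, which mirrors the separation between the two components described above.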
