Exploring the use of fuzzy signature for text mining

Classical approaches to traditional text mining problems, such as document indexing, document clustering, or text classification, represent text as a bag of words. Words, the units of the representation, are determined by tokenization, using e.g. whitespace and punctuation characters as separators. Bag-of-words methods have difficulty with the non-segmented text typical of some Asian languages, because tokenization can no longer be used to determine the representation units. Several solutions have been proposed; among them, frequent max substring mining is adopted here because of its language independence and its favourable speed and storage requirements. In this paper we present a fuzzy signature based solution that uses frequent max substrings for non-segmented document representation, and we propose how it could be applied to some typical text mining tasks. We show how the flexibility of fuzzy signatures can be exploited for these tasks. With the proposed concept, complex decision models in text mining may be constructed more effectively in the future.
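
To make the contrast between the two representations concrete, the following is a minimal Python sketch, not the method used in the paper. The function names `bag_of_words` and `frequent_max_substrings`, the parameters `min_count` and `min_len`, and the brute-force enumeration are all assumptions made for illustration. In particular, "frequent max substring" is read here as a substring that occurs at least `min_count` times and is not contained in any longer substring that is also frequent; the paper's own definition and data structure may differ.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count tokens obtained by splitting on whitespace and punctuation."""
    tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    return Counter(tokens)


def frequent_max_substrings(text, min_count=2, min_len=2):
    """Naive illustration for non-segmented text (assumed definition).

    Returns substrings that occur at least `min_count` times (overlapping
    occurrences are counted) and are not contained in a longer substring
    that is also frequent. Brute force: only suitable for short strings.
    """
    n = len(text)
    # Count every substring of length >= min_len, one count per start position.
    counts = Counter(
        text[i:j] for i in range(n) for j in range(i + min_len, n + 1)
    )
    frequent = {s for s, c in counts.items() if c >= min_count}
    # Keep only substrings that are maximal with respect to containment.
    return {
        s: counts[s]
        for s in frequent
        if not any(s != t and s in t for t in frequent)
    }


# Segmented text: tokenization yields the representation units directly.
print(bag_of_words("Text mining, text clustering."))

# Non-segmented text: no separators, so units are mined from repetition.
print(frequent_max_substrings("abcabcab"))  # {'abcab': 2}
```

The second call needs no separators at all, which is the property that makes substring mining attractive for non-segmented languages. A practical implementation would replace the quadratic enumeration with a suffix-based index.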
