A document classification approach by GA feature extraction based corner classification neural network

The CC4 neural network is a new type of corner classification training algorithm for three-layered feed-forward neural networks. CC4 has been used successfully in the metasearch engine Anvish. When the documents are of nearly the same size, the CC4 neural network is an effective document classification algorithm. In general, however, document sizes differ greatly, and CC4 uses the whole dictionary as the vector space, so many documents are represented by sparse vectors. This paper proposes GA-CC4, a neural network based on feature extraction. GA feature extraction selects the feature items that genuinely represent the documents in the document set, and these items form the set of feature items. Each document is then represented using this set of feature items combined with the document frequency. This method reduces the number of dimensions used to represent the documents, which addresses the precision problem caused by differing document sizes. It also maps the scalar features to the Boolean input of the neural network by binary coding, which improves the quality of the network's input data.
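The binary-coding step described in the abstract can be illustrated with a minimal sketch. It assumes the feature items have already been selected (the GA step is not shown), that the scalar feature for each item is its occurrence count in the document, and that each scalar is clipped to a fixed bit width. The function name `encode_document`, the `bits` parameter, and the sample feature items are hypothetical and are not taken from the paper.

```python
from collections import Counter


def encode_document(tokens, feature_items, bits=3):
    """Map a tokenized document to a Boolean vector over a reduced feature set.

    Assumption: the scalar feature for each item is its count in the document,
    clipped to the largest value that fits in `bits` bits.
    """
    counts = Counter(tokens)
    max_value = (1 << bits) - 1  # largest scalar representable in `bits` bits
    vector = []
    for item in feature_items:
        value = min(counts[item], max_value)  # clip so the code width stays fixed
        # Emit the binary code of the scalar, most significant bit first.
        vector.extend((value >> shift) & 1 for shift in range(bits - 1, -1, -1))
    return vector


# Hypothetical feature set and document, for illustration only.
feature_items = ["neural", "network", "cluster"]
doc = "neural network training for a neural network classifier".split()
encoded = encode_document(doc, feature_items, bits=3)
print(encoded)
```

The resulting vector has `len(feature_items) * bits` Boolean entries regardless of document length, in contrast to a vector whose dimension is the size of the whole dictionary.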
