Hybrid Approach to Web Content Outlier Mining Without Query Vector

Mining outliers from large datasets is like finding needles in a haystack. Sifting through dynamic, unstructured, and ever-growing web data for outliers is even more challenging. This paper presents HyCOQ, a hybrid algorithm that draws on the strengths of n-gram-based and word-based systems. Experimental results obtained using embedded motifs without a dictionary show significant improvement over using a domain dictionary, irrespective of the type of data used (words, n-grams, or hybrid). Recall also improves remarkably with hybrid documents compared with raw words and n-grams without a domain dictionary.

Reda Alhajj | Ken Barker | Malik Agyemang
