Multilayer SOM With Tree-Structured Data for Efficient Document Retrieval and Plagiarism Detection

This paper proposes a new document retrieval (DR) and plagiarism detection (PD) system based on a multilayer self-organizing map (MLSOM). Each document is modeled by a rich tree-structured representation, and a SOM-based system provides a computationally efficient solution. Instead of relying on keywords or individual lines, the proposed scheme uses a full document as the query for retrieval and PD. The tree-structured representation organizes document features hierarchically at the document, page, and paragraph levels. It can therefore reflect underlying context that is difficult to obtain from the word-frequency information in current use. We show that tree-structured data is effective for DR and PD. To handle the tree-structured representation efficiently, we use an MLSOM algorithm that the authors previously developed for image retrieval; in this study, it serves as an effective clustering algorithm. Using the MLSOM, local matching techniques are developed for comparing text documents, and two novel MLSOM-based PD methods are proposed. Detailed simulations are conducted, and the experimental results corroborate that the proposed approach is computationally efficient and accurate for DR and PD.
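The document, page, and paragraph hierarchy described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the node features or the matching procedure, so the sketch assumes plain word counts at every node, a form feed (`\f`) as the page delimiter, blank lines as paragraph delimiters, and direct cosine comparison of paragraph nodes in place of the MLSOM. The names `Node`, `build_tree`, and `local_match` and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the document tree: a document, a page, or a paragraph."""
    level: str
    features: Counter
    children: list = field(default_factory=list)


def term_counts(text):
    # Lowercased word counts; an assumed stand-in for the paper's node features.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def build_tree(text, page_sep="\f"):
    """Build document -> pages -> paragraphs (page_sep is an assumed delimiter)."""
    pages = []
    for page_text in text.split(page_sep):
        paras = [Node("paragraph", term_counts(p))
                 for p in re.split(r"\n\s*\n", page_text) if p.strip()]
        if paras:
            pages.append(Node("page", term_counts(page_text), paras))
    return Node("document", term_counts(text), pages)


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def paragraphs(tree):
    # Flatten the tree to its paragraph-level leaves, in reading order.
    return [para for page in tree.children for para in page.children]


def local_match(query, candidate, threshold=0.8):
    """Return (query_idx, candidate_idx, similarity) for similar paragraph pairs.

    Brute-force comparison is used here for clarity; the paper instead uses
    the MLSOM to avoid comparing every pair.
    """
    hits = []
    for i, q in enumerate(paragraphs(query)):
        for j, c in enumerate(paragraphs(candidate)):
            sim = cosine(q.features, c.features)
            if sim >= threshold:
                hits.append((i, j, sim))
    return hits
```

Calling `local_match(build_tree(suspect_text), build_tree(source_text))` returns the paragraph pairs whose word-count vectors are nearly parallel, which gives a rough sense of how paragraph-level (local) matching can flag copied passages even when the two documents differ as a whole.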
