TagLearner: A P2P Classifier Learning System from Collaboratively Tagged Text Documents

The amount of text data on the Internet is growing rapidly. Online text repositories for news agencies, digital libraries, and other organizations currently store gigabytes and terabytes of data. Large amounts of unstructured text pose a serious challenge for data mining and knowledge extraction. End-user participation coupled with distributed computation can play a crucial role in meeting these challenges. In many applications involving classification of text documents, web users often participate in the tagging process. This collaborative tagging results in the formation of large-scale Peer-to-Peer (P2P) systems which can function, scale, and self-organize in the presence of a highly transient population of nodes and do not need a central server for coordination. In this paper, we describe TagLearner, a P2P classifier learning system for extracting patterns from text data, in which end users can participate both in labeling the data and in building a distributed classifier on it. We present a novel, asynchronous distributed classification algorithm based on linear programming. The paper also provides extensive empirical results on text data obtained from an online repository, the NSF Abstracts Data.
