Towards automatic column-based data object clustering for multilingual databases

The amount of data in computer applications is growing tremendously, which makes organizing large volumes of data crucial. Recently, many researchers have come to regard clustering as one of the important approaches to handling data across a wide range of research domains. Examples include Topic Detection and Tracking (TDT), multilingual document clustering, multilingual news clustering, text clustering and web records. Data clustering is normally time-consuming and challenging because it involves heavy programming or scripting. For online news, clustering analysis is especially needed because news across the globe changes every second and can come from any web source, in multiple languages. This paper proposes a system architecture for automatic data object clustering in multilingual databases for online news, web records and text mining. The architecture provides an overview of a virtual scheme that handles data objects within the database tables as part of the database management system. The proposed architecture will provide a platform for quick extraction, data arrangement and data grouping based on pattern similarities. It will thus improve query processing performance in multilingual databases without the need to code or script for interface programming. This is the first attempt to apply a data clustering technique prior to data extraction in a database application that holds semi-structured and structured data (web records).
