Visual integration tool for heterogeneous data type by unified vectorization

Data integration is the problem of combining data residing at different sources and providing the user with a unified view of these data. One of its critical issues is the detection of similar entities based on their content. This task is complex for three reasons: the data types of the databases are heterogeneous, the schemas of the databases are unfamiliar and heterogeneous as well, and the number of records is large and time-consuming to analyze. To address these problems, we extend our earlier work by introducing a new measure that handles heterogeneous textual and numerical data types for coincident meaning extraction. First, to accommodate the heterogeneous data types, we propose a new weight called Bin Frequency - Inverse Document Bin Frequency (BF-IDBF) for effective heterogeneous data pre-processing and classification by unified vectorization. Second, to handle the unfamiliar data structure, we use an unsupervised algorithm, the Self-Organizing Map (SOM). Finally, to help the user explore and browse the semantically similar entities among the copious amount of data, we use a SOM-based visualization tool to map the database tables based on their semantic content.
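The idea of unified vectorization can be illustrated with a minimal sketch. The abstract does not give the BF-IDBF formula, so the code below is only an assumption modeled on the familiar TF-IDF weight: numeric values are mapped to bin labels, text is split into words, and both kinds of token receive a frequency times inverse-document-frequency weight. The names `to_tokens`, `weighted_vectors`, and `bin_edges` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def to_tokens(value, bin_edges):
    """Turn one table cell into tokens (illustrative assumption, not the paper's method)."""
    if isinstance(value, (int, float)):
        # Count how many edges the value reaches; that count is the bin index.
        idx = sum(value >= edge for edge in bin_edges)
        return [f"bin_{idx}"]
    # Text cells are lower-cased and split on whitespace.
    return str(value).lower().split()


def weighted_vectors(documents, bin_edges):
    """Build one TF-IDF-style vector per document over a shared text/bin vocabulary."""
    token_lists = [
        [tok for cell in doc for tok in to_tokens(cell, bin_edges)]
        for doc in documents
    ]
    vocab = sorted({tok for toks in token_lists for tok in toks})
    n_docs = len(token_lists)
    # Number of documents in which each token (word or bin) occurs.
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))

    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append([
            (counts[t] / len(toks)) * math.log(n_docs / doc_freq[t]) if toks else 0.0
            for t in vocab
        ])
    return vocab, vectors


if __name__ == "__main__":
    # Two toy "records" mixing text and numbers; the bin edges are arbitrary.
    docs = [["red apple", 3.5], ["green apple", 42.0]]
    vocab, vecs = weighted_vectors(docs, bin_edges=[10, 100])
    print(vocab)
    print(vecs)
```

Vectors of this kind, with text and numeric bins sharing one vocabulary, are the sort of input a Self-Organizing Map could then cluster and display.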
