Inferring variable labels using outlines of data in Data Jackets by considering similarity and co-occurrence

The Data Jacket (DJ) is a technique for sharing information about data by summarizing the data in natural language while the data themselves remain hidden. In DJs, variables are described by variable labels (VLs), which are the names or meanings of the variables, and the utility of data is estimated through combinations of VLs. However, DJs do not always contain VLs, because the rules for describing DJs cannot compel data owners to enter all relevant information. When VLs are missing, DJs that could in principle be combined cannot be matched through string matching of their VLs. In this paper, we propose a method for inferring the VLs of a DJ from the text in its outline. We focus on the similarity among the outlines of DJs and create two models for inferring VLs: one based on the similarity of the outlines and one based on the co-occurrence of VLs. We implemented the models using a similarity matrix and a co-occurrence matrix, and applied the proposed method to two types of test data: the DJs of public data and the DJs of business data. The experimental results show that our method is significantly superior to the technique that uses only string matching of the VLs.
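The two models described above can be illustrated with a minimal Python sketch. This is not the authors' formulation: the whitespace tokenizer, the raw-count cosine similarity, the one-step propagation through co-occurrence counts, the toy DJs, and all function names are assumptions chosen only to show how outline similarity and VL co-occurrence could each produce label scores.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_scores(query_outline, training_djs):
    """Score each VL by the summed outline similarity of the DJs that contain it."""
    q = bag_of_words(query_outline)
    scores = defaultdict(float)
    for outline, labels in training_djs:
        sim = cosine(q, bag_of_words(outline))
        for label in labels:
            scores[label] += sim
    return dict(scores)


def cooccurrence_counts(training_djs):
    """Count how often two distinct VLs appear in the same DJ."""
    counts = defaultdict(float)
    for _, labels in training_djs:
        for a in labels:
            for b in labels:
                if a != b:
                    counts[(a, b)] += 1.0
    return counts


def cooccurrence_scores(sim_scores, counts):
    """Propagate similarity scores one step through the co-occurrence counts."""
    scores = defaultdict(float)
    for (a, b), c in counts.items():
        scores[b] += sim_scores.get(a, 0.0) * c
    return dict(scores)


# Hypothetical toy DJs: (outline text, set of variable labels).
TRAINING_DJS = [
    ("daily weather observations in tokyo", {"date", "temperature", "precipitation"}),
    ("retail store sales records", {"date", "product", "sales amount"}),
    ("hourly weather and air quality in osaka", {"date", "temperature", "pm2.5"}),
]


def infer_labels(query_outline, training_djs=TRAINING_DJS, top_k=3):
    """Rank candidate VLs for a DJ that has an outline but no VLs."""
    sim = similarity_scores(query_outline, training_djs)
    co = cooccurrence_scores(sim, cooccurrence_counts(training_djs))
    ranked = sorted(co, key=lambda label: (-co[label], label))
    return ranked[:top_k]


if __name__ == "__main__":
    print(infer_labels("weather forecast for kyoto"))
```

In this sketch, a label such as "product" receives no score from outline similarity alone for a weather-related outline, yet it obtains a small score through co-occurrence with "date". That is the kind of candidate that plain string matching of VLs cannot surface when a DJ has no VLs at all.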
