First Study on Data Readiness Level

We introduce the idea of Data Readiness Level (DRL) to measure the relative richness of data to answer specific questions often encountered by data scientists. We first approach the problem in its full generality explaining its desired mathematical properties and applications and then we propose and study two DRL metrics. Specifically, we define DRL as a function of at least four properties of data: Noisiness, Believability, Relevance, and Coherence. The information-theoretic based metrics, Cosine Similarity and Document Disparity, are proposed as indicators of Relevance and Coherence for a piece of data. The proposed metrics are validated through a text-based experiment using Twitter data.

[1]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[2]  Zellig S. Harris,et al.  Distributional Structure , 1954 .

[3]  Brian D. Davison,et al.  Empirical study of topic modeling in Twitter , 2010, SOMA '10.

[4]  K. Burger Papers In Structural And Transformational Linguistics , 2016 .

[5]  Delbert Dueck,et al.  Clustering by Passing Messages Between Data Points , 2007, Science.

[6]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[7]  ChengXiang Zhai,et al.  Statistical Language Models for Information Retrieval , 2008, NAACL.

[8]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[9]  Scott Sanner,et al.  Improving LDA topic models for microblogs via tweet pooling and automatic labeling , 2013, SIGIR.

[10]  David M. Blei,et al.  Probabilistic topic models , 2012, Commun. ACM.

[11]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[12]  H. Sebastian Seung,et al.  Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[13]  David A. Belsley,et al.  Regression Analysis and its Application: A Data-Oriented Approach.@@@Applied Linear Regression.@@@Regression Diagnostics: Identifying Influential Data and Sources of Collinearity , 1981 .

[14]  Yi Ma,et al.  Robust principal component analysis? , 2009, JACM.

[15]  Peter J. Rousseeuw,et al.  Clustering by means of medoids , 1987 .

[16]  Yun He,et al.  A generalized divergence measure for robust image registration , 2003, IEEE Trans. Signal Process..