Prevent Low-Quality Analytics by Automatic Selection of the Best-Fitting Training Data

Data analysis pipelines consist of a sequence of analysis tools. Most of these tools are based on supervised machine learning techniques and thus rely on labeled training data. Selecting appropriate training data has a crucial impact on analytics quality. Yet, most of the time, domain experts who construct analysis pipelines neglect this task. They rely on default training data sets, for example because they do not know which other training data sets exist and what they are used for. Default training data sets, however, may be very different from the domain-specific input data that is to be analyzed, leading to low-quality results. Moreover, these input data sets are usually unlabeled, so analytics quality cannot be measured with evaluation metrics. Our contribution is a method that (1) indicates the expected quality to the domain expert while the analysis pipeline is being constructed, without the need for labels, and (2) automatically selects the best-fitting training data. It is based on a measurement of the similarity between input and training data. In our evaluation, we consider the part-of-speech tagger tool and show that Latent Semantic Analysis (LSA) and Cosine Similarity are suited as indicators for the quality of analysis results and as a basis for an automatic selection of the best-fitting
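The similarity-based selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain whitespace tokenization and raw term-frequency vectors (no LSA step), and the function names, corpus names, and toy texts are all hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count lowercased whitespace tokens (a deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def select_best_training_set(input_text, training_sets):
    """Return (name, score) of the candidate training set most similar to the input."""
    input_vec = term_frequencies(input_text)
    scores = {
        name: cosine_similarity(input_vec, term_frequencies(text))
        for name, text in training_sets.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy corpora standing in for real labeled training data sets.
candidates = {
    "newswire": "the government announced new economic policy today",
    "tweets": "lol omg so gonna watch the game tonight",
}
unlabeled_input = "omg gonna watch the game lol"

best_name, best_score = select_best_training_set(unlabeled_input, candidates)
print(best_name, round(best_score, 3))
```

In this sketch the similarity score plays both roles named in the abstract: it needs no labels, so it can be shown to the domain expert as a rough quality indicator, and the highest-scoring candidate is the one selected as training data.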
