Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques

Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets are often independently curated and commonly stored in unstructured spreadsheets. Standardized annotations are essential for synthesis studies that span investigators, but they are often not used in practice. Accurately combining spreadsheet records from different studies therefore requires tedious and error-prone human curation, which imposes a significant time and cost barrier on synthesis research. We propose Synthesize, an algorithm inspired by information retrieval that automatically merges unstructured data based on both column labels and column values. Applied to cancer and ecological datasets, the Synthesize algorithm achieved accuracies of roughly 85–100%. We further implement Synthesize in an open-source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts spreadsheets in comma-separated values (CSV) format as input, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy-to-use graphical user interface with which the user can finish combining any data the algorithm did not merge and so reach perfect accuracy. Future work will add unit detection so that continuous data can be merged automatically, and will extend the algorithm to other data formats, including databases.
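
The abstract does not spell out how Synthesize scores candidate merges, so the following is only a minimal sketch of the general idea of matching columns on both labels and values. It assumes a plain bag-of-tokens cosine similarity and a greedy pairing step; the names `column_profile`, `cosine`, and `match_columns`, the `threshold` parameter, and the toy tables are all hypothetical and are not taken from the Synthesizer code base.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


def column_profile(label, values):
    """Bag of tokens drawn from a column's label and its cell values."""
    bag = Counter(tokens(label))
    for value in values:
        bag.update(tokens(value))
    return bag


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_columns(table_a, table_b, threshold=0.5):
    """Greedily pair columns of two tables (dicts of label -> list of values).

    The threshold is an illustrative assumption, not a value from the paper.
    """
    scored = sorted(
        (
            (cosine(column_profile(la, va), column_profile(lb, vb)), la, lb)
            for la, va in table_a.items()
            for lb, vb in table_b.items()
        ),
        reverse=True,
    )
    pairs, used_a, used_b = {}, set(), set()
    for score, la, lb in scored:
        if score >= threshold and la not in used_a and lb not in used_b:
            pairs[la] = lb
            used_a.add(la)
            used_b.add(lb)
    return pairs


# Toy spreadsheets from two hypothetical studies with differing labels.
study_1 = {"Gender": ["male", "female", "male"], "Tumor stage": ["T1", "T2", "T1"]}
study_2 = {"sex": ["female", "male", "male"], "stage": ["T2", "T1", "T3"]}
matches = match_columns(study_1, study_2)
print(matches)
```

In this toy case the labels "Gender" and "sex" share no tokens, so the pairing is driven entirely by the overlapping cell values, which illustrates why using values as well as labels can help. Columns left unpaired would correspond to the cases a user finishes by hand in the graphical interface.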
