Cloud-Based Phrase Mining and Analysis of User-Defined Phrase-Category Association in Biomedical Publications.

The rapid accumulation of biomedical textual data has far exceeded human capacity for manual curation and analysis, necessitating novel text-mining tools to extract biological insights from large volumes of scientific reports. The Context-aware Semantic Online Analytical Processing (CaseOLAP) pipeline, developed in 2016, quantifies user-defined phrase-category relationships through the analysis of textual data and has many biomedical applications. We have developed a protocol for a cloud-based environment that supports an end-to-end phrase-mining and analysis platform. Our protocol includes data preprocessing (e.g., downloading, extracting, and parsing text documents), indexing and searching with Elasticsearch, creating a functional document structure called Text-Cube, and quantifying phrase-category relationships using the core CaseOLAP algorithm. The data preprocessing step generates key-value mappings for all documents involved. The preprocessed data is then indexed so that documents mentioning the entities can be searched, which in turn facilitates Text-Cube creation and CaseOLAP score calculation. The raw CaseOLAP scores are interpreted using a series of integrative analyses, including dimensionality reduction, clustering, and temporal and geographical analyses. Additionally, the CaseOLAP scores are used to create a graphical database, which enables semantic mapping of the documents. CaseOLAP defines phrase-category relationships in a manner that is accurate (identifies relationships), consistent (highly reproducible), and efficient (processes 100,000 words/sec). Following this protocol, users can access a cloud-computing environment to support their own configurations and applications of CaseOLAP. This platform offers enhanced accessibility and empowers the biomedical community with phrase-mining tools for widespread biomedical research applications.
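
The flow described above (key-value document mappings, grouping documents into a Text-Cube, and scoring phrase-category associations) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy corpus, the entity list, and the function names are hypothetical, the in-memory token count stands in for the Elasticsearch search step, and the relative-share score is a simple placeholder rather than the CaseOLAP score itself, whose formula is not given here.

```python
from collections import defaultdict

# Hypothetical key-value mapping: document ID -> (category, text).
documents = {
    "d1": ("heart failure", "titin mutations alter titin stiffness"),
    "d2": ("heart failure", "troponin levels rise"),
    "d3": ("arrhythmia", "troponin and ion channel variants"),
}
# Hypothetical user-defined entities (single-word phrases for simplicity).
entities = ["titin", "troponin"]


def build_text_cube(docs):
    """Group document IDs by category (a one-dimensional Text-Cube)."""
    cube = defaultdict(list)
    for doc_id, (category, _text) in docs.items():
        cube[category].append(doc_id)
    return dict(cube)


def count_mentions(docs, cube, names):
    """Count entity mentions in each category cell of the cube.

    A plain token count stands in for the indexed search step.
    """
    counts = {cat: {name: 0 for name in names} for cat in cube}
    for cat, doc_ids in cube.items():
        for doc_id in doc_ids:
            tokens = docs[doc_id][1].lower().split()
            for name in names:
                counts[cat][name] += tokens.count(name)
    return counts


def relative_share(counts):
    """Placeholder score (NOT CaseOLAP): the fraction of an entity's
    mentions that fall in each category."""
    totals = defaultdict(int)
    for per_cat in counts.values():
        for name, n in per_cat.items():
            totals[name] += n
    return {
        cat: {name: (n / totals[name] if totals[name] else 0.0)
              for name, n in per_cat.items()}
        for cat, per_cat in counts.items()
    }


cube = build_text_cube(documents)
counts = count_mentions(documents, cube, entities)
shares = relative_share(counts)
for cat, per_cat in shares.items():
    print(cat, per_cat)
```

The resulting category-by-entity table has the same shape as the score matrix that the downstream analyses (dimensionality reduction, clustering, and so on) would consume, which is the only point the sketch is meant to convey.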
