Paper Information - Cloud-Based Phrase Mining and Analysis of User-Defined Phrase-Category Association in Biomedical Publications.

The rapid accumulation of biomedical textual data has far exceeded the human capacity for manual curation and analysis, necessitating novel text-mining tools to extract biological insights from large volumes of scientific reports. The Context-aware Semantic Online Analytical Processing (CaseOLAP) pipeline, developed in 2016, quantifies user-defined phrase-category relationships through the analysis of textual data and has many biomedical applications. We have developed a protocol for a cloud-based environment that supports an end-to-end phrase-mining and analysis platform. The protocol includes data preprocessing (e.g., downloading, extracting, and parsing text documents), indexing and searching with Elasticsearch, creating a functional document structure called Text-Cube, and quantifying phrase-category relationships using the core CaseOLAP algorithm. Data preprocessing generates key-value mappings for all documents involved. The preprocessed data are then indexed so that documents containing the entities of interest can be searched, which in turn facilitates Text-Cube creation and CaseOLAP score calculation. The resulting raw CaseOLAP scores are interpreted through a series of integrative analyses, including dimensionality reduction, clustering, temporal, and geographical analyses. Additionally, the CaseOLAP scores are used to create a graphical database, which enables semantic mapping of the documents. CaseOLAP defines phrase-category relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient (processes 100,000 words/sec) manner. Following this protocol, users can access a cloud-computing environment to support their own configurations and applications of CaseOLAP. This platform offers enhanced accessibility and empowers the biomedical community with phrase-mining tools for widespread biomedical research applications.
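The flow described in the abstract (key-value document mappings, a category-keyed Text-Cube, entity counting, and a phrase-category score) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy corpus, the entity list, and the function names are hypothetical, plain in-memory token matching stands in for the Elasticsearch search step, and `toy_scores` is a simplified stand-in rather than the actual CaseOLAP scoring algorithm.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: document ID -> (category, text).
# This mimics the key-value mappings produced by preprocessing.
DOCS = {
    "d1": ("cardiomyopathy", "titin mutations alter titin stiffness"),
    "d2": ("cardiomyopathy", "titin and myosin in heart muscle"),
    "d3": ("arrhythmia", "ion channel defects and myosin"),
}
# Hypothetical user-defined entities (phrases) of interest.
ENTITIES = ["titin", "myosin"]


def build_text_cube(docs):
    """Group document IDs by category, one cell per category."""
    cube = defaultdict(list)
    for doc_id, (category, _text) in docs.items():
        cube[category].append(doc_id)
    return dict(cube)


def count_mentions(docs, cube, entities):
    """Count entity mentions per cell: counts[category][entity].

    Assumption: whitespace tokenization replaces the indexed search.
    """
    counts = {}
    for category, doc_ids in cube.items():
        cell = {entity: 0 for entity in entities}
        for doc_id in doc_ids:
            tokens = docs[doc_id][1].lower().split()
            for entity in entities:
                cell[entity] += tokens.count(entity)
        counts[category] = cell
    return counts


def toy_scores(counts):
    """Illustrative score, NOT the CaseOLAP formula.

    Multiplies a log-scaled in-cell frequency by the share of the
    entity's mentions that fall in this cell.
    """
    scores = {}
    for category, cell in counts.items():
        scores[category] = {}
        for entity, tf in cell.items():
            total = sum(counts[c][entity] for c in counts)
            popularity = math.log1p(tf)
            distinctiveness = tf / total if total else 0.0
            scores[category][entity] = popularity * distinctiveness
    return scores


if __name__ == "__main__":
    cube = build_text_cube(DOCS)
    counts = count_mentions(DOCS, cube, ENTITIES)
    for category, cell in toy_scores(counts).items():
        print(category, cell)
```

In this toy corpus, "titin" appears only in the cardiomyopathy cell and therefore scores highest there, while "myosin" is split evenly across both cells and scores lower in each. A real deployment would replace the token matching with queries against an Elasticsearch index and the toy score with the CaseOLAP score calculation.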
