Prospecting Information Extraction by Text Mining Based on Convolutional Neural Networks–A Case Study of the Lala Copper Deposit, China

As geological big data becomes a focus of geoscience research, the vast amount of textual geoscience data presents both opportunities and challenges for data analysis and data mining. Traditional manual reading for information extraction and knowledge acquisition is unlikely to meet the demands of the big data age. This paper proposes a workflow that extracts prospecting information by text mining based on convolutional neural networks (CNNs), with the aim of classifying text data and extracting prospecting information automatically. The procedure involves three parts: 1) text data acquisition; 2) text classification based on a CNN; and 3) statistics and visualization. First, a large amount of available text data was acquired using geoscience big data acquisition methodologies. After text preprocessing, the CNN was used to classify the geoscience text data into four categories (geology, geophysics, geochemistry, and remote sensing), with each category covering three text scales (word, sentence, and paragraph). Second, word frequency statistics, co-occurrence matrix statistics, and term frequency–inverse document frequency (TF-IDF) statistics were computed for words, sentences, and paragraphs, respectively, to obtain the key nodes and links derived from the content words. Finally, the deep semantic information mined from the relevant geoscience texts was visualized with word clouds, knowledge graphs (e.g., chord and bigram graphs), and TF-IDF statistical graphs. The Lala copper deposit in Sichuan Province was taken as a test case, and its prospecting information was extracted successfully with the developed text mining methodologies. This paper provides a basis for research into establishing mineral deposit prospecting models based on logical knowledge trees. In addition, it shows the potential of this method for intelligent information extraction from geoscience big data.
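The statistics stage described in the abstract can be illustrated with a minimal sketch. It assumes text that has already been classified and tokenized, and it pairs each text scale with the statistic the abstract names: word frequency for words, co-occurrence counts within sentences, and TF-IDF with paragraphs as documents. The toy corpus, the `STOP_WORDS` set, the function names, and the unsmoothed IDF formula log(N/df) are all assumptions made for illustration; the abstract does not state the exact formulas or preprocessing used in the study, and the CNN classification step is not shown.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list and toy corpus (not data from the study).
# Structure: corpus -> paragraphs -> sentences -> tokens.
STOP_WORDS = {"in", "and", "over", "the"}
paragraphs = [
    [["copper", "ore", "in", "schist"], ["copper", "and", "iron", "anomaly"]],
    [["magnetic", "anomaly", "over", "schist"], ["copper", "ore", "grade"]],
]


def content_words(sentence):
    """Keep only tokens that are not in the (assumed) stop-word list."""
    return [w for w in sentence if w not in STOP_WORDS]


def word_frequency(paragraphs):
    """Word scale: count every content word across the corpus."""
    counts = Counter()
    for paragraph in paragraphs:
        for sentence in paragraph:
            counts.update(content_words(sentence))
    return counts


def cooccurrence(paragraphs):
    """Sentence scale: count unordered pairs of distinct content words
    that appear in the same sentence (a sparse co-occurrence matrix)."""
    pairs = Counter()
    for paragraph in paragraphs:
        for sentence in paragraph:
            unique = sorted(set(content_words(sentence)))
            pairs.update(combinations(unique, 2))
    return pairs


def tf_idf(paragraphs):
    """Paragraph scale: treat each paragraph as one document and score
    each word as (count / paragraph length) * log(N / document frequency)."""
    docs = [[w for s in p for w in content_words(s)] for p in paragraphs]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    n_docs = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {w: (c / len(doc)) * math.log(n_docs / doc_freq[w]) for w, c in tf.items()}
        )
    return scores


freq = word_frequency(paragraphs)
pairs = cooccurrence(paragraphs)
scores = tf_idf(paragraphs)

print(freq.most_common(3))
print(pairs.most_common(3))
print(sorted(scores[0].items(), key=lambda kv: -kv[1])[:3])
```

In this sketch the frequent words would feed a word cloud, the pair counts would supply the nodes and links of a chord or bigram graph, and the per-paragraph TF-IDF scores would highlight terms that distinguish one paragraph from the others. A word that occurs in every paragraph receives a score of zero under the unsmoothed IDF chosen here, which is one reason library implementations often apply smoothing.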
