bigNN: An open-source big data toolkit focused on biomedical sentence classification

Every day, a massive amount of text data is generated by medical data sources such as scientific literature, medical web pages, health-related social media, clinical notes, and drug reviews. Processing this wealth of data is a daunting task that calls for smart and scalable computational strategies, including machine intelligence, big data analytics, and distributed architectures. In this contribution, we designed and developed an open-source big data neural network toolkit, bigNN, which tackles large-scale biomedical text classification efficiently and facilitates fast prototyping and reproducible text analytics research. bigNN scales up a word2vec-based neural network model over Apache Spark 2.10 and Hadoop Distributed File System (HDFS) 2.7.3, allowing for more efficient big data sentence classification. The toolkit supports big data computing and simplifies rapid application development in sentence analysis by letting users configure and examine internal parameters of both Apache Spark and the neural network model. bigNN is fully documented and is publicly and freely available at https://github.com/bircatmcri/bigNN.
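
To make the idea of word2vec-based sentence classification concrete, the following is a minimal single-machine sketch in plain Python. It is not bigNN's implementation: the two-dimensional vectors in `TOY_VECTORS` are invented for illustration, and the averaging-plus-nearest-centroid scheme, the function names, and the label names used with it are all assumptions chosen to keep the example small. A real pipeline would learn the embeddings from a corpus and distribute the work over Spark.

```python
import math

# Toy 2-dimensional "embeddings". The values are invented for illustration;
# a real system would learn them with word2vec on a large corpus.
TOY_VECTORS = {
    "rash": [0.9, 0.1],
    "nausea": [0.8, 0.2],
    "headache": [0.7, 0.3],
    "dose": [0.2, 0.8],
    "tablet": [0.1, 0.9],
    "daily": [0.3, 0.7],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(labeled_sentences, vectors):
    """Compute one mean sentence vector (centroid) per label."""
    sums, counts = {}, {}
    for sentence, label in labeled_sentences:
        vec = sentence_vector(sentence, vectors)
        if vec is None:
            continue  # skip sentences with no known words
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def classify(sentence, centroids, vectors):
    """Return the label whose centroid is most similar to the sentence."""
    vec = sentence_vector(sentence, vectors)
    if vec is None:
        return None
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))
```

Given a few labeled sentences, `train_centroids` builds one centroid per class and `classify` assigns a new sentence to the closest one. Averaging discards word order, which is a deliberate simplification here; it keeps the sketch short at the cost of expressiveness.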
