Limitations of information extraction methods and techniques for heterogeneous unstructured big data

In the era of big data, a huge volume of unstructured data is being produced in the form of audio, video, images, text, and animation. Making effective use of this unstructured big data is laborious and tedious. Information extraction (IE) systems help to extract useful information from this wide variety of unstructured data, and several techniques and methods have been presented for the task. However, numerous studies on IE from unstructured data are each limited to a single data type such as text, image, audio, or video. This article reviews existing IE techniques along with their subtasks, limitations, and challenges across the variety of unstructured data, highlighting the impact of unstructured big data on IE techniques. To the best of our knowledge, no comprehensive study has investigated the limitations of existing IE techniques for the variety of unstructured big data. The objective of the structured review presented in this article is twofold. First, it presents an overview of IE techniques for a variety of unstructured data, such as text, image, audio, and video, in one place. Second, it investigates the limitations of these techniques that arise from the heterogeneity, dimensionality, and volume of unstructured big data. The review finds that organizations urgently need advanced IE techniques, particularly for multifaceted unstructured big data sets, to manage big data and derive strategic information. Potential solutions for improving unstructured big data IE systems are also presented as directions for future research. These solutions will help to increase the efficiency and effectiveness of the data analytics process in terms of context-aware analytics systems, data-driven decision-making, and knowledge management.
