Approximate matching-based unsupervised document indexing approach: application to biomedical domain

Document indexing is a crucial phase in information retrieval because the volume of textual information is constantly increasing. As documents accumulate, satisfying user needs becomes more and more complex, and several information retrieval systems have been designed to respond to user requests. The main contribution of the current work is a novel hybrid approach for biomedical document indexing. We improve the estimation of the correspondence between a document and a given concept using two methods: the vector space model (VSM) and description logics (DL). VSM performs partial matching between documents and the terms of an external resource. DL represents knowledge in a form that supports better matching. The proposed contribution reduces the limitations of exact matching: it indexes documents by exploiting the Medical Subject Headings (MeSH) thesaurus with approximate matching. Approximate matching partially matches document terms against the biomedical vocabulary, which retrieves morphological variants present in that resource but also generates irrelevant concepts. A filtering step solves this problem and selects the most important concepts by exploiting the knowledge provided by MeSH. Experiments carried out on different corpora show encouraging results, with an improvement of around 25% in average accuracy compared to other approaches studied in the literature.
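The partial matching idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that terms are represented as bags of character trigrams and compared by cosine similarity, and the vocabulary `TOY_VOCABULARY`, the function names, and the `threshold` value are all hypothetical choices made for illustration. The DL-based filtering step is not covered here.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return the bag of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def approximate_match(term, vocabulary, threshold=0.5):
    """Return (entry, score) pairs whose similarity to `term` reaches `threshold`.

    The threshold is an assumed cut-off, not a value taken from the paper.
    """
    term_vec = char_ngrams(term)
    scored = [(entry, cosine(term_vec, char_ngrams(entry))) for entry in vocabulary]
    kept = [(entry, score) for entry, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-in for a thesaurus; real MeSH access is not shown.
TOY_VOCABULARY = ["Neoplasms", "Hypertension", "Insulin", "Diabetes Mellitus"]

matches = approximate_match("hypertensive", TOY_VOCABULARY)
print(matches)
```

With this toy vocabulary, the document term "hypertensive" is matched to the entry "Hypertension" even though the two strings are not identical, which is the kind of morphological variant that exact matching would miss. Lowering the threshold would admit more candidates, including irrelevant ones, which is why a later filtering step is needed.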
