Effective Navigation of Query Results Based on Concept Hierarchies

Search queries on biomedical databases, such as PubMed, often return a large number of results, only a small subset of which is relevant to the user. Ranking and categorization, which can also be combined, have been proposed to alleviate this information overload problem. Results categorization for biomedical databases is the focus of this work. A natural way to organize biomedical citations is according to their MeSH annotations. MeSH is a comprehensive concept hierarchy used by PubMed. In this paper, we present the BioNav system, a novel search interface that enables the user to navigate a large number of query results by organizing them using the MeSH concept hierarchy. First, the query results are organized into a navigation tree. At each node expansion step, BioNav reveals only a small subset of the concept nodes, selected such that the expected user navigation cost is minimized. In contrast, previous work expands the hierarchy in a predefined static manner, without modeling navigation cost. We show that the problem of selecting the best concepts to reveal at each node expansion is NP-complete, and we propose an efficient heuristic as well as an optimal algorithm that is feasible for relatively small trees. We show experimentally that BioNav outperforms state-of-the-art categorization systems by up to an order of magnitude with respect to the user navigation cost. BioNav for the MEDLINE database is available at http://db.cse.buffalo.edu/bionav.
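As a rough illustration of the first step only (organizing query results into a navigation tree), the following minimal Python sketch attaches citations to the concepts they are annotated with and keeps just the concepts whose subtree contains at least one result. It is not BioNav's implementation: the function names, the toy hierarchy, and the citation identifiers are hypothetical, the hierarchy is assumed to be a plain tree, and the navigation cost model and the concept selection at each expansion are not shown.

```python
from collections import defaultdict


def build_navigation_tree(children, root, annotations):
    """Map each concept to the set of result ids found in its subtree.

    children:    dict concept -> list of child concepts (assumed to be a tree)
    annotations: dict result id -> list of concepts annotating that result
    Concepts whose subtree holds no result are left out of the returned dict.
    """
    # Results attached directly to each concept.
    direct = defaultdict(set)
    for result_id, concepts in annotations.items():
        for concept in concepts:
            direct[concept].add(result_id)

    subtree = {}

    def visit(node):
        # Collect the node's own results plus those of all its descendants.
        results = set(direct.get(node, ()))
        for child in children.get(node, ()):
            results |= visit(child)
        if results:
            subtree[node] = results
        return results

    visit(root)
    return subtree


def visible_children(children, subtree, node):
    """Children of `node` that survive pruning, in hierarchy order."""
    return [c for c in children.get(node, ()) if c in subtree]


# Toy hierarchy and annotations (hypothetical labels, not real MeSH terms).
toy_children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
toy_annotations = {1: ["A1"], 2: ["A1", "B"], 3: ["A2"]}

tree = build_navigation_tree(toy_children, "root", toy_annotations)
for concept in ["root", "A", "A1", "A2", "B"]:
    print(concept, sorted(tree[concept]))
```

In this toy run, concept `B1` has no results in its subtree and is therefore dropped, while result 2 is counted under both `A` and `B` because it carries annotations from both branches. A system like the one described would then decide which of the surviving concepts to reveal at each expansion; that choice is deliberately outside this sketch.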
