QSEA for fuzzy subgraph querying of KEGG pathways

As biological pathway databases continue to grow in size and availability, efficient tools and techniques for querying them are needed to mine useful biological information. Many existing techniques already support exact or approximate query matching. Despite initial success, powerful techniques developed for XML and RDF query matching have yet to be fully exploited for query matching in the bioinformatics domain. In this paper, we use the transitive closure to match hierarchical queries, i.e., to find pathways or graphs that possess a query's overall hierarchical structure. This approach allows greater latitude in fuzzy matching by focusing on the overall hierarchies of queries and graphs. Because hierarchies are inherent only in directed acyclic graphs, we have also developed a robust heuristic for the minimum feedback arc set problem. Analysis of 53 H. sapiens and 23 S. cerevisiae cyclic KEGG pathways shows that our heuristic performs favorably. We have implemented these techniques in QSEA (Query Structure Enrichment Analysis), an easy-to-use GUI application. Binaries are freely available at http://code.google.com/p/s-e-a/ for Windows and Mac.
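The idea of hierarchical matching via the transitive closure can be illustrated with a minimal Python sketch. It is not the QSEA implementation: the cycle-breaking step simply drops the back edges found by a depth-first search, a common stand-in heuristic for the minimum feedback arc set problem rather than the heuristic developed in the paper, and the function names, the toy pathway, and the assumption that query nodes carry the same labels as pathway nodes are all hypothetical.

```python
def remove_back_edges(graph):
    """Make a directed graph acyclic by dropping DFS back edges.

    Stand-in heuristic (assumption): not the paper's feedback arc set heuristic.
    `graph` maps each node to an iterable of successor nodes.
    Returns (dag, removed_edges).
    """
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in nodes}
    dag = {v: set() for v in nodes}
    removed = []

    def visit(u):
        color[u] = GRAY
        for v in sorted(graph.get(u, ())):
            if color[v] == GRAY:
                # v is still on the DFS stack, so (u, v) closes a cycle.
                removed.append((u, v))
            else:
                dag[u].add(v)
                if color[v] == WHITE:
                    visit(v)
        color[u] = BLACK

    for v in sorted(nodes):
        if color[v] == WHITE:
            visit(v)
    return dag, removed


def transitive_closure(dag):
    """Map every node of a DAG to the set of all its descendants."""
    reach = {}

    def descendants(u):
        if u not in reach:
            reach[u] = set()
            for v in dag[u]:
                reach[u] |= {v} | descendants(v)
        return reach[u]

    for u in dag:
        descendants(u)
    return reach


def hierarchical_match(query_edges, reach):
    """A query matches if each query edge (a, b) is an ancestor-descendant
    pair in the pathway, even when a and b are not directly connected."""
    return all(b in reach.get(a, set()) for a, b in query_edges)


# Toy pathway (hypothetical) with one feedback edge C -> A.
pathway = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
dag, removed = remove_back_edges(pathway)
reach = transitive_closure(dag)

print(removed)                                               # [('C', 'A')]
print(hierarchical_match([("A", "D")], reach))               # True
print(hierarchical_match([("D", "A")], reach))               # False
```

In this sketch the query edge A to D matches although the pathway has no direct A to D edge, which is the latitude that matching on the transitive closure provides; the reversed query D to A fails because it contradicts the pathway's hierarchy.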
