High-performance information extraction with AliBaba

A wealth of information is available only in textual sources such as web pages, patents, and publications. Extracting information from such sources is challenging, both because of the complex language processing steps typically required and because of the potentially large number of texts to be analyzed. Furthermore, integrating the extracted data with other sources of knowledge is often mandatory for subsequent analysis. In this demo, we present the AliBaba system for scalable information extraction from biomedical documents. Unlike many other systems, AliBaba performs both entity extraction and relationship extraction, and it graphically visualizes the resulting network of interconnected objects. It uses the PubMed search engine to select relevant documents. The technical novelty of AliBaba is twofold: (a) its ability to automatically learn language patterns for relationship extraction without an annotated corpus, and (b) its high-performance pattern matching algorithm. We show that a simple yet effective pattern filtering technique drastically reduces the runtime of the system without harming its extraction effectiveness. Although AliBaba has been implemented for biomedical texts, its underlying principles should also be applicable to other domains.
