Information extraction from articles for the elaboration of the regulatory networks involved in Arabidopsis seed development

Seeds are the main vector for the breeding and production of annual field crops, and the accumulation of seed storage compounds (sugars, lipids, proteins) is of primary importance for food, feed and industrial uses. Seed development requires the coordinated growth of different tissues and involves complex genetic and environmental regulation. A comprehensive understanding of the molecular network underlying this regulation remains a major scientific challenge, with important potential impact on agriculture and industry. Knowledge about this regulation is scattered across a large number of scientific articles (e.g. the PubMed query “Arabidopsis seed” returns more than 6000 references) and is difficult to analyze. The molecular and genetic mechanisms are described by complex expressions that involve biological entities linked by various specific semantic relations. The aim of this work is to extract this information (i.e. entities and the relations between them) automatically by developing generic Natural Language Processing and Machine Learning methods. The approach consists of 1) formally annotating examples in a set of documents with respect to an annotation model, 2) training the methods on these examples, and 3) applying the trained methods to new texts to extract knowledge. Finally, we plan to integrate the extracted knowledge into a comprehensive regulatory model, with database and graphical representation tools. We expect these tools to be useful for analyzing other gene regulatory networks.
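
As an illustration of what extracted entities and relations might look like once they are assembled into a network, the following is a minimal Python sketch. It is not the annotation model used in this work: the class names, the entity types ("Gene", "Tissue"), the relation labels ("activates", "occurs_in") and the example sentence are all hypothetical placeholders chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A text span annotated with an entity type (hypothetical schema)."""
    start: int   # character offset of the first character
    end: int     # character offset one past the last character
    label: str   # assumed entity type, e.g. "Gene" or "Tissue"
    text: str


@dataclass(frozen=True)
class Relation:
    """A typed, directed link between two annotated entities."""
    label: str
    source: Entity
    target: Entity


def annotate(sentence, surface, label):
    """Build an Entity from the first occurrence of `surface` in `sentence`."""
    start = sentence.index(surface)
    return Entity(start, start + len(surface), label, surface)


def build_network(relations):
    """Group relations into a map: source text -> [(relation label, target text)]."""
    network = defaultdict(list)
    for rel in relations:
        network[rel.source.text].append((rel.label, rel.target.text))
    return dict(network)


# Invented placeholder sentence, not taken from any article.
sentence = "GENE_A activates GENE_B in the embryo."
gene_a = annotate(sentence, "GENE_A", "Gene")
gene_b = annotate(sentence, "GENE_B", "Gene")
embryo = annotate(sentence, "embryo", "Tissue")

relations = [
    Relation("activates", gene_a, gene_b),
    Relation("occurs_in", gene_b, embryo),
]
network = build_network(relations)
print(network)
```

In this sketch the manual `annotate` calls stand in for step 1 (annotated examples); a trained extractor, which is not shown, would produce the same `Entity` and `Relation` objects for new texts, and `build_network` hints at how they could feed a regulatory model.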