Knowledge extraction from high-throughput discovery proteomics data

This manuscript presents my work supervising research activities over the period 2012-2016, along with perspectives for the next five years. Through the presentation of two of my research projects, I analyze the questions raised by leading a research group whose objective is to develop tools and methods for the automated extraction of knowledge from label-free relative quantification data in proteomics, acquired by high-throughput mass spectrometry. These questions concern in particular (i) the supervision of junior researchers and the promotion of their work; (ii) the search for funding; and above all (iii) the management of difficulties specific to the interdisciplinary context (modes of dissemination and promotion, balance between research and engineering, steering and prioritization of research topics, etc.). The first project presented, ProStaR, is a software tool that facilitates the statistical analysis of proteomics data. Beyond the substantial engineering work its development required, I show that it can support many small, relatively independent yet innovative data science projects. The second project, Reveal-MS, proposes to solve the demultiplexing of peptide spectra using innovative methods for the non-negative factorization of large matrices. In contrast to the first project, and in a complementary spirit, it is motivated less by the day-to-day needs of proteomics than by the long-term possibility of enabling a breakthrough in the state of the art.
