A new ensemble-based algorithm for identifying breath gas marker candidates in liver disease using ion molecule reaction mass spectrometry

MOTIVATION: Alcoholic fatty liver disease (AFLD) and non-alcoholic fatty liver disease (NAFLD) can progress to severe liver diseases such as steatohepatitis, cirrhosis and cancer. Early detection of liver disease is therefore essential; however, minimally invasive diagnostic methods in clinical hepatology still lack specificity.

RESULTS: Ion molecule reaction mass spectrometry (IMR-MS) was applied to a total of 126 human breath gas samples comprising 91 cases (AFLD, NAFLD and cirrhosis) and 35 healthy controls. A new feature selection modality termed Stacked Feature Ranking (SFR) was developed to identify potential liver disease marker candidates in breath gas samples. SFR combines different entropy- and correlation-based feature ranking methods, including statistical hypothesis testing, in a two-level architecture with a suggestion layer and a decision layer. We benchmarked SFR against four single feature selection methods, a wrapper and a recently described ensemble method. The gas compounds selected by SFR showed a significantly higher discriminatory ability, by up to 10-15%, with an area under the ROC curve (AUC) of 0.85-0.95. Using this approach, we were able to identify unexpected breath gas marker candidates of high predictive value in liver disease. A literature study further supports an association of the top-ranked markers with liver disease. We propose SFR as a powerful tool for biomarker search in breath gas and other biological samples using mass spectrometry.

AVAILABILITY: The SFR algorithm and the IMR-MS datasets are available at http://biomed.umit.at/page.cfm?pageid=526.
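The two-level idea described above can be illustrated with a minimal sketch. It is not the published SFR implementation: the choice of rankers (information gain of a median split, absolute Pearson correlation), the voting rule, the Mann-Whitney test, the parameters `top_k`, `min_votes` and `alpha`, and the toy data are all assumptions made for illustration. In the suggestion layer each ranker proposes its top-ranked features; in the decision layer a feature is kept only if enough rankers proposed it and it passes a hypothesis test.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(x, y):
    """Entropy-based score: information gain of a median split of x."""
    cut = sorted(x)[len(x) // 2]
    left = [yi for xi, yi in zip(x, y) if xi < cut]
    right = [yi for xi, yi in zip(x, y) if xi >= cut]
    if not left or not right:
        return 0.0
    n = len(y)
    return entropy(y) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

def abs_pearson(x, y):
    """Correlation-based score: |Pearson r| between feature and class label."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def auc(x, y):
    """Chance that a random case scores above a random control (ties count 0.5)."""
    pos = [a for a, b in zip(x, y) if b == 1]
    neg = [a for a, b in zip(x, y) if b == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def mann_whitney_p(x, y):
    """Two-sided p-value, normal approximation of U without tie correction."""
    n1 = sum(1 for b in y if b == 1)
    n0 = len(y) - n1
    u = auc(x, y) * n1 * n0
    mu = n1 * n0 / 2
    sigma = math.sqrt(n1 * n0 * (n1 + n0 + 1) / 12)
    return math.erfc(abs(u - mu) / sigma / math.sqrt(2))

def stacked_ranking(columns, y, top_k=2, min_votes=2, alpha=0.05):
    # Suggestion layer (assumed form): each ranker proposes its top_k features.
    votes = Counter()
    for score in (info_gain, abs_pearson):
        ranked = sorted(columns, key=lambda name: score(columns[name], y), reverse=True)
        votes.update(ranked[:top_k])
    # Decision layer (assumed form): enough votes AND a significant test result.
    return sorted(
        name for name, v in votes.items()
        if v >= min_votes and mann_whitney_p(columns[name], y) < alpha
    )

# Hypothetical toy data: 6 cases (label 1) followed by 6 controls (label 0).
y = [1] * 6 + [0] * 6
columns = {
    "m1": [9, 8, 7, 9, 8, 7, 1, 2, 3, 1, 2, 3],  # separates the classes
    "m2": [1, 5, 2, 6, 3, 4, 2, 5, 1, 6, 4, 3],  # same distribution in both
}
selected = stacked_ranking(columns, y)
```

On this toy input both features receive two votes because `top_k` equals the number of features, so the hypothesis test in the decision layer is what removes the uninformative feature `m2`. The `auc` helper also shows how the discriminatory ability of a single compound can be expressed as the area under the ROC curve.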
