A Review of Feature Extraction Software for Microarray Gene Expression Data

When gene expression data are too large to be processed directly, they are transformed into a reduced representation set of genes. This transformation of large-scale gene expression data into a smaller set of genes is called feature extraction. If the extracted genes are carefully chosen, the gene set captures the relevant information in the large-scale gene expression data, so that further analysis can use the reduced representation instead of the full-size data. In this paper, we review software applications that can be used for feature extraction. The software reviewed mainly implements Principal Component Analysis (PCA), Independent Component Analysis (ICA), Partial Least Squares (PLS), and Locally Linear Embedding (LLE). For each feature extraction method, a summary and the sources of the software are provided in the last section.
