A high-order representation and classification method for transcription factor binding sites recognition in Escherichia coli

BACKGROUND Identifying transcription factor binding sites (TFBSs) plays an important role in understanding gene regulatory processes. The mechanism underlying the specific binding of transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties. However, when the dimension-to-sample ratio (i.e., the number of dimensions divided by the number of samples) is large, concatenating different physicochemical properties into a one-dimensional vector is not only likely to lose some structural information but also poses significant challenges to recognition methods.

MATERIALS AND METHOD In this paper, we introduce a purely geometric representation, the tensor (also called a multidimensional array), to represent TFs using their physicochemical properties. To accompany the multidimensional array representation, we also develop a tensor-based recognition method, the tensor partial least squares classifier (abbreviated as TPLSC). Intuitively, multidimensional arrays allow more information to be exploited than one-dimensional arrays. The performance of each method is evaluated by the average F-measure on 51 Escherichia coli TFs from the RegulonDB database.

RESULTS In our first experiment, the results show that multiple nucleotide properties yield greater predictive power than dinucleotide properties. In the second experiment, the results demonstrate that our method achieves increased predictive power, roughly a 33% improvement over the best result from existing methods.

CONCLUSION The representation method for TFs is an important step in TFBS recognition. We illustrate the benefits of this representation on real data through a series of experiments. This method can provide further insight into the mechanism of TF binding and be of great use in metabolic engineering applications.
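The contrast drawn in the abstract between a concatenated one-dimensional vector and a multidimensional array, together with the F-measure used for evaluation, can be illustrated with a minimal Python sketch. It is not the TPLSC method itself. The property names (`prop_a`, `prop_b`), their numeric values, and all function names are placeholders assumed for illustration; they are not taken from the paper or from any property database.

```python
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]

# Placeholder property tables: the values are arbitrary and NOT real
# physicochemical measurements; real tables would come from a database.
PROPERTIES = {
    "prop_a": {d: float(i) for i, d in enumerate(DINUCLEOTIDES)},
    "prop_b": {d: float(i % 4) for i, d in enumerate(DINUCLEOTIDES)},
}


def encode_site(site):
    """Encode one binding site as a positions x properties matrix."""
    steps = [site[i:i + 2] for i in range(len(site) - 1)]
    return [[PROPERTIES[name][step] for name in PROPERTIES] for step in steps]


def encode_sites(sites):
    """Stack equal-length sites into a samples x positions x properties array."""
    return [encode_site(site) for site in sites]


def flatten(matrix):
    """Concatenate the matrix rows into the usual one-dimensional vector."""
    return [value for row in matrix for value in row]


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1); 0.0 if nothing was found."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tensor = encode_sites(["ACGT", "TTGA"])
    print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # samples, positions, properties
    print(flatten(tensor[0]))
    print(f_measure(tp=8, fp=2, fn=2))
```

In the array form, each position keeps its property values together as one row, whereas the flattened vector only preserves them implicitly through index order. That is the structural information the abstract suggests may be lost by concatenation.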
