S. K. Nandy | Ranjani Narayan | Anupam Chattopadhyay | Farhad Merchant | Soumyendu Raha | Tarun Vatwani
[1] Kleanthis Psarris, et al. Scalable matrix decompositions with multiple cores on FPGAs, 2013, Microprocess. Microsystems.
[2] Robert A. van de Geijn, et al. Algorithm, Architecture, and Floating-Point Unit Codesign of a Matrix Factorization Accelerator, 2014, IEEE Transactions on Computers.
[3] Kenichi Miura, et al. Performance Analysis of ClearSpeed's CSX600 Interconnects, 2009, 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications.
[4] Samuel Williams, et al. Scientific Computing Kernels on the Cell Processor, 2007, International Journal of Parallel Programming.
[5] S. K. Nandy, et al. Micro-architectural Enhancements in Distributed Memory CGRAs for LU and QR Factorizations, 2015, 2015 28th International Conference on VLSI Design.
[6] Jack Dongarra, et al. Multithreading in the PLASMA Library, 2014.
[7] Robert A. van de Geijn, et al. A Linear Algebra Core Design for Efficient Level-3 BLAS, 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.
[8] Jason N. Dale, et al. Cell Broadband Engine Architecture and its first implementation - A performance view, 2007, IBM J. Res. Dev.
[9] Jack Dongarra, et al. LAPACK: a portable linear algebra library for high-performance computers, 1990, SC.
[10] Stefania Perri, et al. A matrix product accelerator for field programmable systems on chip, 2008, Microprocess. Microsystems.
[11] Rafael C Núñez, et al. LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators, 2009.
[12] Robert H. Halstead, et al. Matrix Computations, 2011, Encyclopedia of Parallel Computing.
[13] Jack J. Dongarra, et al. Automated empirical optimizations of software and the ATLAS project, 2001, Parallel Comput.
[14] S. K. Nandy, et al. Efficient QR Decomposition Using Low Complexity Column-wise Givens Rotation (CGR), 2014, 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems.
[15] S. K. Nandy, et al. Generic routing rules and a scalable access enhancement for the Network-on-Chip RECONNECT, 2009, 2009 IEEE International SOC Conference (SOCC).
[16] Nicholas J. Higham, et al. INVERSE PROBLEMS NEWSLETTER, 1991.
[17] Jack Dongarra, et al. QUARK Users' Guide: QUeueing And Runtime for Kernels, 2011.
[18] S. K. Nandy, et al. Achieving Efficient QR Factorization by Algorithm-Architecture Co-design of Householder Transformation, 2016, 2016 29th International Conference on VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID).
[19] Derek Chiou, et al. On the asymptotic costs of multiplexer-based reconfigurability, 2012, DAC Design Automation Conference 2012.
[20] Robert A. van de Geijn, et al. Deriving dense linear algebra libraries, 2013, Formal Aspects of Computing.
[21] S. K. Nandy, et al. Efficient Realization of Table Look-Up Based Double Precision Floating Point Arithmetic, 2016, 2016 29th International Conference on VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID).
[22] Ahmed Hemani, et al. 39.9 GOPs/watt multi-mode CGRA accelerator for a multi-standard basestation, 2013, 2013 IEEE International Symposium on Circuits and Systems (ISCAS2013).
[23] S. K. Nandy, et al. Scalable and energy-efficient reconfigurable accelerator for column-wise givens rotation, 2014, 2014 22nd International Conference on Very Large Scale Integration (VLSI-SoC).
[24] S. K. Nandy, et al. Efficient and scalable CGRA-based implementation of Column-wise Givens Rotation, 2014, 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors.
[25] S. K. Nandy, et al. RECONNECT: A NoC for polymorphic ASICs using a low overhead single cycle router, 2008, 2008 International Conference on Application-Specific Systems, Architectures and Processors.
[26] S. K. Nandy, et al. Accelerating BLAS and LAPACK via Efficient Floating Point Architecture Design, 2017, Parallel Process. Lett.
[27] Gene H. Golub, et al. Matrix computations (3rd ed.), 1996.
[28] Sasko Ristov, et al. Superlinear Speedup for Matrix Multiplication in GPU Devices, 2012, ICT Innovations.
[29] R. C. Whaley, et al. Minimizing development and maintenance costs in supporting persistently optimized BLAS, 2005, Softw. Pract. Exp.
[30] S. K. Nandy, et al. REDEFINE: Runtime reconfigurable polymorphic ASIC, 2009, TECS.
[31] Jack J. Dongarra, et al. The LINPACK Benchmark: An Explanation, 1988, ICS.
[32] Robert A. van de Geijn, et al. Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures, 2012, IEEE Transactions on Computers.
[33] Yuefan Deng, et al. New trends in high performance computing, 2001, Parallel Computing.
[34] Viktor K. Prasanna, et al. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on Reconfigurable Computing Systems, 2007, IEEE Transactions on Parallel and Distributed Systems.
[35] Yi Yang, et al. BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing, 2015, ICS.
[36] John D. Davis, et al. BLAS Comparison on FPGA, CPU and GPU, 2010, 2010 IEEE Computer Society Annual Symposium on VLSI.
[37] Qian Wang, et al. AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[38] Kenichi Miura, et al. Performance Improvement Methodology for ClearSpeed's CSX600, 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[39] Paolo Bonzini, et al. EGRA: A Coarse Grained Reconfigurable Architectural Template, 2011, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[40] Georgi Gaydadjiev, et al. Architectural Exploration of the ADRES Coarse-Grained Reconfigurable Array, 2007, ARC.