An Energy-Efficient Heterogeneous System for Embedded Learning and Classification

Embedded learning applications in automobiles, surveillance, robotics, and defense are computationally intensive and process large amounts of real-time data. Systems for such workloads must meet stringent performance constraints within limited power budgets. High-performance central processing units (CPUs) and graphics processing units (GPUs) cannot be used in an embedded platform because of their power consumption. In this letter, we propose a low-power heterogeneous system consisting of an Atom processor supported by multiple accelerators that target these workloads, and investigate whether such a system can satisfy performance requirements in an energy-efficient manner. We build our low-power system using an Atom processor, an ION GPU, and a field-programmable gate array (FPGA)-based custom accelerator, and study its performance and power characteristics using four representative workloads. With such a system, we show an energy improvement of 42-85% over a server comprising a 2.27 GHz quad-core Xeon coupled to a 1.3 GHz 240-core Tesla GPU.
