Machine learning concerns forming representations of input observations to facilitate tasks such as classification. A recent insight in deep learning [1] is to use a deep architecture that stacks multiple levels of nonlinear operations in an inference hierarchy to extract successive layers of abstraction. Deep learning is a promising direction and has attained state-of-the-art performance in application areas such as computer vision and speech recognition. In this work, we aim to improve the computational efficiency of deep learning.

A key step in deep learning is to represent input signals as layers of sparse representations. In image processing, this means progressively describing objects using features of larger spatial scales. For example, in the bottom layer, objects can be represented by edges of different widths, lengths, and orientations over small regions. In higher layers, objects may be represented by shapes such as squares and triangles over larger regions. The representations are computed from a designed or learned dictionary. At each layer, an input signal is represented using just a few dictionary atoms.

We consider the orthogonal matching pursuit (OMP) algorithm [5], which forms sparse representations by greedily selecting dictionary atoms. The computational cost of OMP is proportional to the dimensions of the dictionary. Suppose that the input signal is an M×1 vector x, and the dictionary has N atoms d_i, i = 1, ..., N, each an M×1 vector. The bulk of the sparse representation computation amounts to computing the N correlations between x and d_i for all i. Each correlation is an inner product of two length-M vectors and costs O(M), so the total computational cost is O(MN). Note that N is usually governed by the characteristics of the given machine learning task. For example, if the task is to classify objects into a large number of categories, N tends to be large, which increases the chance of representing x with just a few dictionary atoms.
On the other hand, for an easier task such as distinguishing among only a few dissimilar-looking objects, a relatively small N may be sufficient to derive discernible representations. M, however, is driven by the input signal size and the size of the intermediate sparse representations computed in the hierarchy. This means M can be large, especially in higher layers of the learning framework. For example, in image processing, sparse representations for small local regions of an image are formed in the bottom layer. These representations are then aggregated and vectorized over a larger neighborhood to form the input signal for the next layer, which can easily be very long.

We show that this O(MN) cost can be reduced to O(N log N), a complexity independent of the signal or representation size M. This means that the computational cost is dictated only by the desired classification resolution.
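To make the O(MN) correlation step concrete, the following is a minimal pure-Python sketch of greedy OMP atom selection. It is an illustration under stated assumptions, not the implementation used in this work: the names `dot` and `omp_support` and the sparsity level `k` are hypothetical, the atoms are assumed to have unit norm, and the orthogonal projection is carried out by Gram-Schmidt orthogonalization of the selected atoms rather than by an explicit least-squares solve.

```python
import math


def dot(u, v):
    """Inner product of two length-M vectors: O(M) multiply-adds."""
    return sum(a * b for a, b in zip(u, v))


def omp_support(x, atoms, k, tol=1e-12):
    """Greedy OMP selection (hypothetical helper, for illustration only).

    x     : input signal, a list of M floats
    atoms : list of N dictionary atoms, each a list of M floats (unit norm assumed)
    k     : maximum number of atoms to select
    Returns (support, residual): the selected atom indices and the part of x
    that is orthogonal to the span of the selected atoms.
    """
    residual = list(x)
    basis = []    # orthonormal basis for the span of the selected atoms
    support = []  # indices of the selected atoms
    for _ in range(k):
        candidates = [i for i in range(len(atoms)) if i not in support]
        if not candidates:
            break
        # The dominant cost: up to N correlations, each an M-term inner
        # product, which gives O(MN) work per iteration.
        best = max(candidates, key=lambda i: abs(dot(atoms[i], residual)))
        if abs(dot(atoms[best], residual)) < tol:
            break  # nothing left to explain
        # Orthogonalize the chosen atom against the current basis.
        q = list(atoms[best])
        for b in basis:
            c = dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(dot(q, q))
        if norm < tol:
            break  # atom is linearly dependent on those already chosen
        q = [qi / norm for qi in q]
        support.append(best)
        basis.append(q)
        # Remove the component of the residual along the new direction.
        c = dot(residual, q)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return support, residual


# Usage: M = 3, N = 3, with the standard basis as a toy dictionary.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = [3.0, 0.0, -1.0]
support, residual = omp_support(x, atoms, k=2)
```

In this toy case the signal is an exact combination of two atoms, so two iterations drive the residual to zero. Each iteration touches every entry of every candidate atom, which is why the cost grows with both M and N.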
[1] Richard G. Baraniuk, et al. Signal Processing With Compressive Measurements. IEEE Journal of Selected Topics in Signal Processing, 2010.
[2] W. B. Johnson, et al. Extensions of Lipschitz mappings into Hilbert space. 1984.
[3] Joel A. Tropp, et al. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 2004.
[4] Yoshua Bengio, et al. Learning Deep Architectures for AI. Found. Trends Mach. Learn., 2007.
[5] Dieter Fox, et al. Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms. NIPS, 2011.