Technical Perspective: What led computer vision to deep learning?

incorporating convolutional structure. LeCun et al. took the additional step of using backpropagation to train the weights of this network, and what we today call convolutional neural networks were born.

The 1990s and 2000s saw diminished interest in neural networks. Indeed, one of the inside jokes was that having the phrase "neural networks" in the title of a paper was a negative predictor of its chances of getting accepted at the NIPS conference! A few true believers such as Yoshua Bengio, Geoffrey Hinton, Yann LeCun, and Juergen Schmidhuber persisted, with much effort directed toward developing unsupervised techniques. These did not lead to much success on the benchmark problems the field cared about, so they remained a minority interest. There were a few technical innovations (max-pooling, dropout, and the use of half-wave rectification, a.k.a. ReLU, as the activation nonlinearity), but before the publication of the KSH paper in 2012, the mainstream computer vision community did not think that neural network-based techniques could produce results competitive with our hand-designed features and architectures.

I was one of those skeptics, and I recall telling Geoff Hinton that convincing the computer vision community would require results on the real-world datasets that we used. Geoff did take this advice to heart, and I like to think that conversation was one of the inspirations behind KSH.

What was the secret sauce behind KSH's success? Besides the technical innovations (such as the use of ReLUs), we must give a lot of credit to "big data" and "big computation." By big data here I mean the availability of large datasets with category labels, such as ImageNet from Fei-Fei Li's group, which provided the training data for these large, deep networks with millions of parameters. Previous datasets like Caltech-101 or PASCAL VOC did not have enough training data, and MNIST and CIFAR were regarded as "toy datasets" by the computer vision community. This strand of labeling datasets for benchmarking and for extracting image statistics was itself enabled by people's desire to upload their photo collections to the Internet on sites such as Flickr. Big computation proved most helpful through GPUs, a hardware development initially driven by the needs of the video game industry.

Let me turn now to the impact of the KSH paper. As of this writing, it has 10,245 citations on Google Scholar, remarkable for a paper not yet five years old. I was present at the ECCV ImageNet workshop where the KSH results were presented. Everyone was impressed by the results, but there was debate about their generality. Would the success on whole-image classification extend to other tasks, such as object detection? Was the finding a very fragile one, or was it a robust one that other groups would be able to replicate? Stochastic gradient descent (SGD) can only find local minima, so what is the guarantee that the minima we find will be good? In the true spirit of science, many