Pushing Image Recognition in the Real World: Towards Recognizing Millions of Entities

Building a system that can recognize "what," "who," and "where" from arbitrary images has motivated researchers in computer vision, multimedia, and machine learning for decades. Significant progress has been made in recent years, based on distributed computation and/or deep neural network techniques. However, it is still very challenging to realize a general-purpose, real-world image recognition engine that has reasonable recognition accuracy, semantic coverage, and recognition speed. In this talk, we will first review the current status of this area, analyze the difficulties, and discuss potential solutions. We will then introduce two promising schemes to attack this challenge: (1) learning millions of concepts from search engine click logs, and (2) recognizing whatever you want without data labeling. The first work tries to build large-scale recognition models by mining search engine click logs. Challenges in training data selection and model selection will be discussed, and efficient and scalable approaches for model training and prediction will be introduced. The second work aims at building image recognition engines for any set of entities without using any human-labeled training data, which helps generalize image recognition to a wide range of semantic concepts. Automatic training data generation steps will be presented, and techniques for improving recognition accuracy that effectively leverage massive amounts of Internet data will be discussed. Different parallelization strategies for different computation tasks will be introduced, which guarantee the efficiency and scalability of the entire system. Finally, we will discuss possible directions for pushing image recognition in the real world.