Plant identification with noisy web data

One of the main problems in image-based plant identification is the lack of high-quality training images. This paper proposes several methods that address this problem by extracting high-quality plant images from crowd-sourced Web image collections such as Flickr. The methods automatically identify correct and informative training images among Web images, whose metadata (for example, Flickr user tags) is typically very noisy, and use them to augment an existing manually labeled training set. First, for each plant, a set of images is collected by searching Flickr with the plant name as the query. The images are then grouped into visually consistent clusters, so that ideally most images in a cluster are either all relevant or all irrelevant to the plant. A curated plant image data set from ImageCLEF then serves as a reference for automatically selecting the highest-quality cluster for each plant. The quality of the selected clusters is further improved by two algorithms: an iterative method and image-similarity-based ranking. We show that the larger training set selected automatically in this way significantly increases the accuracy of image-based plant identification. In addition, the approach is generic: it applies to almost any image recognition problem for which additional (noisy) training data can be obtained automatically from the Internet.
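The cluster-then-select idea can be illustrated with a minimal sketch. The abstract names neither the image features, the clustering algorithm, nor the similarity measure, so everything concrete here is an assumption made for illustration: images are taken to be already reduced to small numeric feature vectors, clustering is plain k-means with a deterministic farthest-point initialization, and similarity is Euclidean distance to the centroid of the reference set. The function names (`kmeans`, `select_cluster`, `rank_by_similarity`) and the toy data are hypothetical, and the iterative refinement method is not shown.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def init_centroids(points, k):
    """Farthest-point initialization (an assumption, chosen for determinism)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iters=20):
    """Group feature vectors into k visually consistent clusters."""
    centroids = init_centroids(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

def select_cluster(clusters, reference):
    """Pick the cluster whose centroid lies closest to the reference set."""
    ref_centroid = mean(reference)
    candidates = [c for c in clusters if c]
    return min(candidates, key=lambda c: dist(mean(c), ref_centroid))

def rank_by_similarity(cluster, reference):
    """Order images in a cluster from most to least similar to the reference."""
    ref_centroid = mean(reference)
    return sorted(cluster, key=lambda p: dist(p, ref_centroid))

# Toy data: 2-D stand-ins for image features (hypothetical values).
reference = [(0.0, 0.0), (0.1, 0.1)]          # curated, manually labeled images
web_images = [(0.1, 0.0), (0.0, 0.2), (0.3, 0.1),   # likely relevant
              (5.0, 5.0), (5.5, 4.8), (4.8, 5.3)]   # likely irrelevant

clusters = kmeans(web_images, k=2)
selected = select_cluster(clusters, reference)
ranked = rank_by_similarity(selected, reference)
print(ranked)
```

In this toy setting the selected cluster contains only the three vectors near the reference set, and the ranking places the vector closest to the reference centroid first. A real system would replace the 2-D tuples with visual descriptors extracted from the images and would likely need a more careful choice of the number of clusters.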