Mining citizen science data to predict orevalence of wild bird species

The Cornell Laboratory of Ornithology's mission is to interpret and conserve the earth's biological diversity through research, education, and citizen science focused on birds. Over the years, the Lab has accumulated one of the largest and longest-running collections of environmental data sets in existence. The data sets are not only large, but also have many attributes, contain many missing values, and potentially are very noisy. The ecologists are interested in identifying which features have the strongest effect on the distribution and abundance of bird species as well as describing the forms of these relationships. We show how data mining can be successfully applied, enabling the ecologists to discover unanticipated relationships. We compare a variety of methods for measuring attribute importance with respect to the probability of a bird being observed at a feeder and present initial results for the impact of important attributes on bird prevalence.