Using ACM DL paper metadata as an auxiliary source for building educational collections

Some digital libraries build their collections by harvesting metadata records from multiple content providers. However, both the quality and the quantity of such records are limited by what the providers expose for harvesting. To ensure collection growth, and to expand the scope beyond what can be harvested, additional content acquisition methods are needed. Accordingly, we discuss how the Ensemble project (a pathway effort in the NSDL) is broadening its collection with the help of machine learning. Since Ensemble aims to aid computing education, we use ACM Digital Library records as an auxiliary resource for transfer learning. We have built classifiers that determine whether a candidate resource is about computing education. We approached this as a cross-domain text classification problem and developed suitable methods for feature extraction and for bootstrapping classifier training. Our experiments on three datasets of computing education metadata records show that our approach can enhance both the quality and the quantity of records added to Ensemble.
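
To make the idea of bootstrapping a cross-domain classifier concrete, the following is a minimal sketch and not the method used in this work. It assumes a simple bag-of-words Naive Bayes model trained on labeled source-domain records (standing in for ACM DL metadata), which then labels target-domain records it is confident about and retrains on them. All record texts, function names, and the 0.7 confidence threshold are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical feature extraction: lowercase whitespace tokens.
    return text.lower().split()


def train_nb(docs, labels):
    """Fit a two-class multinomial Naive Bayes model (labels are 0 or 1)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def prob_positive(model, doc):
    """Return P(label = 1 | doc) using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for c in (0, 1):
        n_words = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total_docs)
        for tok in tokenize(doc):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[c][tok] + 1) / (n_words + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    return exp[1] / (exp[0] + exp[1])


def bootstrap(source_docs, source_labels, target_docs, threshold=0.7, rounds=3):
    """Self-training: add confidently labeled target records to the training set."""
    docs, labels = list(source_docs), list(source_labels)
    pool = list(target_docs)
    for _ in range(rounds):
        model = train_nb(docs, labels)
        remaining = []
        for doc in pool:
            p = prob_positive(model, doc)
            if p >= threshold:
                docs.append(doc)
                labels.append(1)
            elif p <= 1 - threshold:
                docs.append(doc)
                labels.append(0)
            else:
                remaining.append(doc)
        if len(remaining) == len(pool):  # nothing new was labeled; stop early
            break
        pool = remaining
    return train_nb(docs, labels)


# Toy source-domain records (invented): 1 = computing education, 0 = other.
source_docs = [
    "teaching introductory programming to students",
    "curriculum design for computer science course",
    "routing protocol for wireless network",
    "query optimization in database systems",
]
source_labels = [1, 1, 0, 0]

# Toy unlabeled target-domain records (invented).
target_docs = [
    "students learning programming in a lab course",
    "wireless network latency measurement",
]

base_model = train_nb(source_docs, source_labels)
adapted_model = bootstrap(source_docs, source_labels, target_docs)
```

In this toy setup, the source-only model has never seen a word such as "latency", so it cannot separate a record made up of such words from the positive class. After one round of bootstrapping, the adapted model has absorbed target-domain vocabulary from the records it labeled with confidence. A real pipeline would need a held-out evaluation set to check that the added pseudo-labels do not reinforce early mistakes.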