AuDoLab: Automatic document labelling and classification for extremely unbalanced data

AuDoLab provides a novel approach to one-class document classification for heavily imbalanced datasets, even if labelled training data is not available. Our package enables the user to create specific out-of-domain training data and to classify a heavily underrepresented target class in a document dataset, using a recently developed integration of Web Scraping, Latent Dirichlet Allocation Topic Modelling and One-class Support Vector Machines (Thielmann, Weisser, Krenz, & Säfken, 2021). AuDoLab can achieve high-quality results even on highly specific classification problems without the time- and cost-intensive labelling of training documents by humans. Hence, AuDoLab has a broad range of applications in scientific research and in business.

A few potential use cases illustrate this range across domains. For example, AuDoLab could be used to identify emails on very specific topics, such as fraud or money laundering, that might have an extremely low prevalence. Similarly, AuDoLab could be used in the medical field to classify documents concerned with very specific topics such as heart attacks or dental problems. Furthermore, AuDoLab may be used to identify legal documents on very specific topics such as machine learning.

The only limiting factor is the availability of out-of-domain training data, which can be generated via Web Scraping from IEEEXplore (IEEE Xplore, 2020), ArXiv or PubMed. Given that a broad range of training documents can be obtained from these websites, AuDoLab has a correspondingly broad range of applications. The following section provides an overview of AuDoLab. AuDoLab can be installed conveniently via pip. A detailed description of the package and its installation can be found in the package's repository or on the documentation website.
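The one-class idea described above, training only on documents of the target class and then flagging similar documents in an unlabelled corpus, can be illustrated with a minimal sketch. This is not AuDoLab's API and not the authors' method: to stay dependency-free it swaps the One-class Support Vector Machine for a simple TF-IDF centroid with a cosine-similarity threshold, and it omits the Web Scraping and LDA steps entirely. The class name `CentroidOneClassClassifier`, the parameter `nu` (here: the fraction of training documents allowed to fall below the threshold) and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()


class CentroidOneClassClassifier:
    """Stand-in for a one-class SVM: accept documents close to the training centroid."""

    def __init__(self, nu=0.1):
        self.nu = nu  # assumed meaning: share of training docs treated as outliers

    def _vectorize(self, text):
        # Smoothed TF-IDF; unseen terms still count towards the norm,
        # so off-topic documents end up with a low similarity.
        counts = Counter(tokenize(text))
        vec = {}
        for term, tf in counts.items():
            df = self.df_.get(term, 0)
            vec[term] = tf * (math.log((1 + self.n_train_) / (1 + df)) + 1.0)
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

    def fit(self, train_docs):
        self.n_train_ = len(train_docs)
        self.df_ = Counter()
        for doc in train_docs:
            self.df_.update(set(tokenize(doc)))
        centroid = Counter()
        for doc in train_docs:
            for term, weight in self._vectorize(doc).items():
                centroid[term] += weight
        norm = math.sqrt(sum(w * w for w in centroid.values()))
        if norm == 0:
            raise ValueError("training documents contain no tokens")
        self.centroid_ = {t: w / norm for t, w in centroid.items()}
        # Threshold = nu-quantile of the training similarities.
        scores = sorted(self.score(doc) for doc in train_docs)
        cut = min(int(self.nu * len(scores)), len(scores) - 1)
        self.threshold_ = scores[cut]
        return self

    def score(self, text):
        """Cosine similarity between the document and the training centroid."""
        vec = self._vectorize(text)
        return sum(w * self.centroid_.get(t, 0.0) for t, w in vec.items())

    def predict(self, docs):
        """Return 1 for documents assigned to the target class, else -1."""
        return [1 if self.score(d) >= self.threshold_ else -1 for d in docs]


# Hypothetical out-of-domain training data (e.g. scraped paper abstracts).
train_docs = [
    "Detecting money laundering in transaction networks using graph features",
    "Anti money laundering systems flag suspicious transaction patterns",
    "Fraud detection and money laundering risk scoring for banks",
    "Suspicious transaction monitoring for fraud and laundering prevention",
]
# Hypothetical unlabelled target corpus (e.g. emails).
emails = [
    "Please review this suspicious transaction, it looks like money laundering",
    "Lunch menu: pasta, salad, soup",
]

clf = CentroidOneClassClassifier(nu=0.1).fit(train_docs)
for email in emails:
    print(round(clf.score(email), 3), email)
```

Because the training documents and the target documents come from different sources, their similarity scores are not directly comparable; in practice the threshold (or `nu` in a real one-class SVM) would need tuning on the target corpus rather than being taken from the training data alone.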

[1] Benjamin Säfken, et al. Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling, 2021, Journal of Applied Statistics.

[2] Benjamin Säfken, et al. Learning Deep Textwork, 2021.

[3] Christoph Weisser, et al. One-Class Support Vector Machine and LDA Topic Model Integration—Evidence for AI Patents, 2021.

[4] Gillian Kant, et al. TTLocVis: A Twitter Topic Location Visualization Package, 2020, J. Open Source Softw.

[5] Michelle Wilde. IEEE Xplore Digital Library, 2016.

[6] Learning deep, 2020.

[7] Kenneth E. Shirley, et al. LDAvis: A method for visualizing and interpreting topics, 2014.

[8] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[9] Petr Sojka, et al. Software Framework for Topic Modelling with Large Corpora, 2010.

[10] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[11] Malik Yousef, et al. One-Class SVMs for Document Classification, 2002, J. Mach. Learn. Res.

[12] Bernhard Schölkopf, et al. Estimating the Support of a High-Dimensional Distribution, 2001, Neural Computation.