Clustering writing components from medieval manuscripts

This article explores a minimally supervised method for extracting components, mostly letters, from historical manuscripts, and clustering them into classes capturing linguistic equivalence. The clustering uses the DBSCAN algorithm and an additional classification step. This pipeline gives us cheap, but partial, manuscript transcription in combination with human annotation. Experiments with different parameter settings suggest that a system like this should be tuned separately for different categories, rather than rely on one-pass application of algorithms partitioning the same components into non-overlapping clusters. The method could also be used to extract features for manuscript classification, e.g. dating and scribe attribution, as well as to extract data for further palaeographic analysis.