A CURATEd CATalog: Rethinking the Extraction of Pretraining Corpora for Mid-Resourced Languages