Abstract. Continental- to global-scale hydrologic and land surface models increasingly include representations of the groundwater system, driven by crucial Earth science and sustainability problems. These models are essential for examining, communicating, and understanding the dynamic interactions between the Earth System above and below the land surface, as well as the opportunities and limits of groundwater resources. A key question for this nascent and rapidly developing field is how to evaluate the realism and performance of such large-scale groundwater models given limitations in data availability and commensurability. Our objective is to provide clear recommendations for improving the evaluation of groundwater representation in continental- to global-scale models. We identify three evaluation approaches: comparing model outputs with available observations of groundwater levels or other state or flux variables (observation-based evaluation); comparing several models with each other, with or without reference to actual observations (model-based evaluation); and comparing model behavior with expert expectations of the hydrologic behaviors that should occur in particular regions or at particular times (expert-based evaluation). Based on current and evolving practices in model evaluation as well as innovations in observations, machine learning, and expert elicitation, we argue that combining observation-, model-, and expert-based evaluation approaches may significantly improve the realism of groundwater representation in large-scale models, and thus our quantification, understanding, and prediction of crucial Earth science and sustainability problems. We encourage greater community-level communication and cooperation on these challenges, including among global hydrology and land surface modelers, local to regional hydrogeologists, and hydrologists focused on model development and evaluation.