SCALE: Scaling up the Complexity for Advanced Language Model Evaluation