Multi-Stage Speaker Extraction with Utterance and Frame-Level Reference Signals