Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation