Elo Uncovered: Robustness and Best Practices in Language Model Evaluation