DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM EvaluationEliya HabbaOfir Arvivet al.2025ACL 2025
Navigating the Modern Evaluation Landscape: Considerations in Benchmarks and Frameworks for Large Language Models (LLMs)Leshem ChoshenAriel Geraet al.2024LREC-COLING 2024