Control Flow Operators in PyTorch
Yidi Wu, Thomas Bohnstingl, et al.
ICML 2025
Recent work has found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the choice of delimiters, answer enumerators, instruction wording, and more. This calls into question popular single-prompt evaluation practices. In this work, we present \dataname{}~(\explicitdataname{}), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a \emph{holistic} perspective and assess the joint effects of perturbations along multiple dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against \dataname{}, leading to several findings, including efficient methods for choosing well-performing prompts, the observation that few-shot examples reduce sensitivity, and the identification of instances that are inherently hard across all perturbations. \dataname{} consists of more than \datasetsize{} prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation.
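To make the joint-perturbation setup concrete, the following is a minimal sketch of how prompt variants could be generated by crossing a few formatting dimensions for a single benchmark instance. The specific dimensions, values, and template are illustrative assumptions, not the actual construction pipeline behind \dataname{}.

```python
# Minimal, hypothetical sketch: generate prompt variants for one benchmark
# instance by taking the Cartesian product of a few arbitrary formatting
# dimensions (instruction wording, choice delimiter, answer enumerator).
# All values and the template below are illustrative assumptions.
from itertools import product

instructions = ["Answer the following multiple-choice question.",
                "Choose the correct option."]
delimiters = [". ", ") ", " - "]
enumerators = [["A", "B", "C", "D"], ["1", "2", "3", "4"]]

question = "What is the capital of France?"
choices = ["Paris", "London", "Berlin", "Madrid"]

variants = []
for instruction, delim, enum in product(instructions, delimiters, enumerators):
    options = "\n".join(f"{label}{delim}{choice}"
                        for label, choice in zip(enum, choices))
    variants.append(f"{instruction}\n\n{question}\n{options}\nAnswer:")

# 2 instructions x 3 delimiters x 2 enumerators = 12 variants for this instance;
# adding more dimensions and values quickly scales to thousands per instance.
print(len(variants))
print(variants[0])
```

Even this toy cross of three small dimensions yields a dozen variants per instance, which illustrates how jointly perturbing several formatting axes quickly produces the thousands of perturbations per instance described above.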