C.A. Micchelli, W.L. Miranker
Journal of the ACM
We study contextual bandit algorithms for high-dimensional combinatorial action spaces involving a class of structured discrete constrained optimization. We propose a novel approach based on general-purpose regression oracles, leveraging Inverse Gap Weighting (IGW) for seamless integration within an optimization framework. IGW sampling is efficiently managed by: (a) Column generation reformulation of the underlying Mixed Integer Programming (MIP) model, enabling flexible lower-level predictors, causal coherence, and efficient representation of large action spaces. (b) Diverse solution pool generation to balance the exploration-exploitation trade-off in large action spaces. To address discontinuities in the reward function’s gradient due to optimization constraints, we incorporate a risk-averse phased learning strategy. Through ablation studies, we demonstrate that each component is critical for maximizing cumulative rewards. Finally, we validate the practical viability of our approach on a real-world Auto-Scaling problem in IT automation.