Bounds on the performance gap between greedy policies and MDP-based policies in an MDP environment - overview


While reinforcement learning (RL) algorithms built on the assumption of an underlying contextual multi-armed bandit (CMAB) are appealing because of the simplicity of the underlying model, they may also lead to sub-optimal policies, since they ignore the possibility of feedback from the actions to the contexts; assuming an underlying Markov decision process (MDP), on the other hand, explicitly allows for such feedback. However, although the MDP framework includes CMABs as a special case, it is not advisable to use the more complex framework indiscriminately: the increased flexibility comes with increased sample complexity, slower convergence rates, more involved modelling, and so on. Upper bounds on the performance gap between policies learned under a CMAB assumption and those learned under an MDP assumption that can be computed a priori, i.e. before learning commences, would therefore be a valuable tool for practitioners deploying RL in large-scale environments, where the difference in complexity between CMAB-based and MDP-based learning can be substantial.
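
To fix ideas, the quantity of interest can be written as follows (a minimal sketch; the notation $V^{\pi}$, $\pi^{*}_{\mathrm{MDP}}$ and $\pi_{\mathrm{greedy}}$ is introduced here purely for illustration and is not fixed by the text above). Let $V^{\pi}$ denote the expected (discounted) return of a policy $\pi$ in the true MDP environment, $\pi^{*}_{\mathrm{MDP}}$ an optimal policy learned under the MDP assumption, and $\pi_{\mathrm{greedy}}$ a policy learned under the CMAB assumption. The performance gap is then
\[
  % Illustrative notation only: the gap between the MDP-optimal policy and the
  % greedy, CMAB-style policy, evaluated in the true MDP environment.
  \Delta \;:=\; V^{\pi^{*}_{\mathrm{MDP}}} - V^{\pi_{\mathrm{greedy}}} \;\ge\; 0,
\]
and an a priori upper bound in the sense described above is a quantity $B$, computable from the problem data before any learning takes place, such that $\Delta \le B$. When $B$ is small, the simpler CMAB-based approach sacrifices little performance and may be preferred on complexity grounds.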