My main research interest is in the long-standing goal of Artificial Intelligence (AI): building autonomous systems that can learn to act competently in uncertain environments. The field of machine learning, and especially reinforcement learning (RL), has focused on this goal. RL differs from supervised learning in that correct input/output pairs are never presented to the learner. Further, there is a focus on online performance, which requires balancing exploration (trying actions to increase knowledge about the environment) against exploitation (of current knowledge). To study this trade-off, I investigate the multi-armed bandit problem and the statistical and algorithmic principles behind it.

The multi-armed bandit problem (MAB): This problem models an agent that simultaneously attempts to acquire new knowledge and to optimize its decisions based on existing knowledge. A particularly useful version is the contextual multi-armed bandit problem. In this setting, at each iteration an agent must choose among N arms. Before making the choice, the agent observes a d-dimensional feature (or context) vector associated with each arm. The learner combines these feature vectors with the feature vectors and rewards of the arms it played in the past to choose an arm in the current round. Over time, the learner's aim is to gather enough information about how the feature vectors and rewards relate to each other that it can predict, from the feature vectors alone, which arm is likely to yield the best reward. The problem has many practical applications, such as recommender systems, information retrieval, text mining, e-commerce, advertising, financial trading, and robotics.
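To make the exploration/exploitation trade-off in the contextual setting concrete, here is a minimal sketch of one well-known algorithm for it, disjoint LinUCB, which maintains a ridge-regression reward estimate per arm and adds an upper-confidence exploration bonus. This is an illustration I am adding, not a method claimed in the statement; the class and method names (`LinUCB`, `select`, `update`) and the simulation parameters are my own illustrative choices.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one linear reward model per arm, plus a confidence bonus."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        # Per-arm ridge-regression statistics: A = I + sum(x x^T), b = sum(r x).
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, contexts):
        """contexts[i] is the d-dimensional feature vector of arm i this round."""
        scores = []
        for A, b, x in zip(self.A, self.b, contexts):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                              # estimated reward weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)    # exploration bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        """Fold the observed (context, reward) pair into the chosen arm's statistics."""
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context
```

The bonus term shrinks as an arm accumulates observations, so the policy naturally shifts from exploring poorly understood arms to exploiting the arm whose features predict the highest reward.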