How to make decisions in a bandit game?

Suppose you are faced with a 10-arm bandit. For each arm, it has a distribution of reward. Your goal is to get as much reward as possible. But the problem is, you do not know the distribution (mean, variance, etc.). Your only method is trial-and-error, i.e. learn by trying. Now, I think that I should first evaluate my action and then take it by strategy. How to evaluate an action? 1....

Dynamic Programming for MDP

Before we delve into solving MDP by dynamic programming, let’s review concepts in MDP! Markov Decision Process We can use MDP to describe our problems. It includes Action, State, Reward. Markov Property The next state could be fully derived by only the current state. $$P(s_t,r_t|s_{t-1}, a_{t-1}, \dots s_0) = P(s_t,r_t|s_{t-1},a_{t-1})$$ Reward Hypothesis What we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)....