Eligibility Trace, Andy Barto

Harry Klopf’s hedonistic hypothesis: a neuron maximizes a local analog of pleasure and minimizes a local analog of pain, i.e., each neuron behaves as an RL agent. The specific hypothesis: when a neuron fires an action potential, all of the contributing synapses become eligible to undergo changes in their efficacies, or weights. If the action potential is followed within an appropriate time period by an increase in reward, the efficacies of all eligible synapses increase (or decrease in the case of punishment)....
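As a rough illustration of the mechanism (not Klopf’s or Barto’s actual model), the sketch below keeps a decaying eligibility trace per synapse: synapses that contribute to a firing become eligible, and a later reward signal scales the weight change of exactly those eligible synapses. The firing rule, threshold, and constants are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch of Klopf-style reward-modulated plasticity.
# The firing condition, threshold, and constants are assumptions.
rng = np.random.default_rng(0)
n_inputs = 4
w = rng.uniform(0.0, 1.0, n_inputs)   # synaptic weights (efficacies)
trace = np.zeros(n_inputs)            # eligibility trace, one entry per synapse

alpha = 0.1    # learning rate
decay = 0.9    # how quickly eligibility fades per time step

def step(x, reward):
    """x: presynaptic input activity; reward: scalar reward (negative = punishment)."""
    global trace
    fired = (w @ x) > 0.5             # crude action-potential condition
    trace *= decay                    # eligibility fades with time
    if fired:
        trace += x                    # contributing synapses become eligible
    # A later reward (or punishment) changes all currently eligible synapses.
    w[:] += alpha * reward * trace
```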

Experience Replay and Data Efficiency

To make better use of data, we can use experience replay to increase data efficiency. Experience replay: we store $(s, a, s', r)$ transitions in a replay buffer and update $Q$ with minibatch methods. Averaging the update over several sampled transitions reduces noise, which is why minibatches are used rather than single transitions. Infer-Collect framework
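A minimal sketch of the replay idea above, assuming a tabular $Q$ and discrete states and actions; the buffer size, hyperparameters, and function names are illustrative, not from any particular implementation.

```python
import random
from collections import deque

import numpy as np

# Minimal replay-buffer sketch with a tabular Q for simplicity; the sizes
# and hyperparameters are illustrative assumptions.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
buffer = deque(maxlen=10_000)                 # replay buffer of transitions
alpha, gamma, batch_size = 0.1, 0.99, 32

def store(s, a, r, s_next, done):
    """Put one (s, a, r, s', done) transition into the buffer."""
    buffer.append((s, a, r, s_next, done))

def replay_update():
    """Sample a minibatch from the buffer and apply Q-learning updates to it."""
    if len(buffer) < batch_size:
        return
    for s, a, r, s_next, done in random.sample(buffer, batch_size):
        target = r if done else r + gamma * Q[s_next].max()
        # Updating on many sampled transitions per step averages out the
        # noise of any single transition (the point of using minibatches).
        Q[s, a] += alpha * (target - Q[s, a])
```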

Meta Reinforcement Learning

Meta learning system This part is based on Lilian Weng’s post meta rl. Meta RL aims to adopt fast when new situations come. To make it fast for RL, we introduce inductive bias in the system. In meta-RL, we impose certain types of inductive biases from the task distribution and store them in memory. Which inductive bias to adopt at test time depends on the algorithm. In the training setting, there will be a distribution of environments....
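A deliberately tiny, runnable sketch of the two-loop structure, assuming a family of 3-armed bandits as the task distribution. Here the "inductive bias" learned by the outer loop is just a single scalar (the inner-loop step size), tuned by crude hill climbing; real meta-RL systems use much richer memory (e.g. an RNN state) and gradient-based meta-updates. All names and numbers are illustrative.

```python
import random

def sample_task(n_arms=3):
    """A 'task' is a small bandit: a list of arm reward means (the task distribution)."""
    return [random.random() for _ in range(n_arms)]

def inner_loop(task, step_size, n_steps=100, eps=0.1):
    """Fast within-task adaptation; returns the total reward collected."""
    q = [0.0] * len(task)
    total = 0.0
    for _ in range(n_steps):
        a = random.randrange(len(task)) if random.random() < eps else q.index(max(q))
        r = task[a] + random.gauss(0.0, 0.1)
        q[a] += step_size * (r - q[a])        # adaptation to this task happens here
        total += r
    return total

def outer_loop(n_meta_iters=200):
    """Slow meta-loop: tune the shared step size so that within-task
    adaptation works well across tasks drawn from the distribution."""
    step_size, best = 0.5, float("-inf")
    for _ in range(n_meta_iters):
        candidate = min(1.0, max(0.01, step_size + random.gauss(0.0, 0.05)))
        score = sum(inner_loop(sample_task(), candidate) for _ in range(5))
        if score > best:
            best, step_size = score, candidate
    return step_size

print("meta-learned inner step size:", outer_loop())
```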

Policy Gradient

Now, we want to parameterize the policy $\pi(a|s,\Theta) = \Pr(A_t=a \mid S_t=s, \Theta_t=\Theta)$, as long as it is differentiable. To find the best policy, we optimize the objective $J$ using gradient ascent:
$$ \Theta_{t+1} = \Theta_t + \alpha \nabla J(\Theta_t) $$
Discrete actions: soft-max policy. We compute action preferences $h(s,a,\Theta) = \Theta^T x(s,a)$. One of the most common ways to parameterize the policy is the soft-max:
$$ \pi(a|s,\Theta) = \frac{e^{h(s,a,\Theta)}}{\sum_b e^{h(s,b,\Theta)}} $$
One advantage of the soft-max parameterization is that the policy can approach a deterministic optimal policy....
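A sketch of the soft-max policy over linear preferences $h(s,a,\Theta) = \Theta^T x(s,a)$, together with a REINFORCE-style gradient-ascent step on $\Theta$. The one-hot feature function and all sizes are assumptions made for the example.

```python
import numpy as np

n_states, n_actions = 5, 3
d = n_states * n_actions
theta = np.zeros(d)     # policy parameters
alpha = 0.1             # step size for gradient ascent

def x(s, a):
    """One-hot feature vector for the (s, a) pair (illustrative choice)."""
    v = np.zeros(d)
    v[s * n_actions + a] = 1.0
    return v

def policy(s, theta):
    """pi(a | s, theta): soft-max over the action preferences h(s, a, theta)."""
    h = np.array([theta @ x(s, a) for a in range(n_actions)])
    e = np.exp(h - h.max())                 # subtract max for numerical stability
    return e / e.sum()

def grad_log_pi(s, a, theta):
    """Gradient of log pi(a | s, theta) for linear preferences:
       x(s, a) - sum_b pi(b | s) x(s, b)."""
    pi = policy(s, theta)
    return x(s, a) - sum(pi[b] * x(s, b) for b in range(n_actions))

def reinforce_step(episode, gamma=0.99):
    """episode: list of (s, a, r). One gradient-ascent update per visited step."""
    global theta
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + gamma * G                   # return from time step t
        theta = theta + alpha * (gamma ** t) * G * grad_log_pi(s, a, theta)
```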

RL method overview

There are three kinds of RL methods in general; the first two are model-free.

- Value-based: parameterize the value function; estimate the action-value function $q$ and derive the best policy from it. Examples: Q-learning, SARSA.
- Policy-based: parameterize the policy directly; the objective is to maximize the average return, so the best policy is found directly; typically combined with a value-based method. Example: Actor-Critic.
- Model-based: parameterize the model and improve its accuracy; typically combined with a value-based method. Examples: Dyna-Q & Dyna-Q+ (sketched below)...
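A minimal Dyna-Q sketch showing how a model-based method combines with a value-based one: every real transition triggers (1) a Q-learning update, (2) a model update, and (3) a few planning updates replayed from the learned model. The deterministic dictionary model, sizes, and hyperparameters are illustrative assumptions.

```python
import random

import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
model = {}                                   # (s, a) -> (r, s_next); assumes a deterministic world
alpha, gamma, n_planning = 0.1, 0.95, 10

def dyna_q_update(s, a, r, s_next):
    # (1) Direct RL: ordinary Q-learning update from real experience.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    # (2) Model learning: remember what this action did in this state.
    model[(s, a)] = (r, s_next)
    # (3) Planning: extra Q-learning updates on simulated experience
    #     sampled from the learned model.
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])
```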