Experience Replay and Data Efficiency

To make better use of data, we can use experience replay to increase data efficiency.

Experience replay

We put $(s, a, s', r)$ tuples into a buffer and update $Q$ using mini-batch methods. To decrease noise, we average the update over several samples, which is why mini-batches are used; a short sketch of such a buffer appears below.

Infer-Collect framework
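As a concrete illustration of the experience replay described above, here is a minimal R sketch of a replay buffer with mini-batch sampling feeding a tabular Q update. The names (`add_experience`, `sample_batch`, `update_q`), the buffer size, and the step-size and discount parameters are illustrative assumptions, not something specified in the notes.

```r
# Minimal replay-buffer sketch (names and constants are illustrative).
buffer_size <- 1000
buffer <- list()

# Store one (s, a, s', r) tuple; drop the oldest entry once the buffer is full.
add_experience <- function(buffer, s, a, s_next, r) {
  buffer[[length(buffer) + 1]] <- list(s = s, a = a, s_next = s_next, r = r)
  if (length(buffer) > buffer_size) buffer <- buffer[-1]
  buffer
}

# Draw a random mini-batch of stored transitions.
sample_batch <- function(buffer, batch_size = 32) {
  idx <- sample(seq_along(buffer), min(batch_size, length(buffer)))
  buffer[idx]
}

# Apply a tabular Q-learning update for each transition in the mini-batch;
# updating from several replayed samples at a time smooths out the noise of
# any single transition.
update_q <- function(Q, batch, alpha = 0.1, gamma = 0.99) {
  for (e in batch) {
    target <- e$r + gamma * max(Q[e$s_next, ])
    Q[e$s, e$a] <- Q[e$s, e$a] + alpha * (target - Q[e$s, e$a])
  }
  Q
}
```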

Meta Reinforcement Learning

Meta learning system

This part is based on Lilian Weng's post on meta-RL. Meta-RL aims to adapt quickly when new situations arise. To make adaptation fast in RL, we introduce inductive bias into the system: in meta-RL, we impose certain types of inductive biases drawn from the task distribution and store them in memory. Which inductive bias to use at test time depends on the algorithm. In the training setting, there will be a distribution of environments....

Summer camp: R Day2

Create dataframe

Create variables

## i. name
names <- c("Ada","Robert","Mia")
## ii. age
ages <- c(20,21,22)
## iii. factor, so that you can add levels that do not exist in the data
year <- c("Freshman","Sophomore","Junior")
year <- factor(year, levels=c("Freshman","Sophomore","Junior","Senior"))

Create dataframe

students <- data.frame(names, ages, year)

Query a dataframe

students$names

Set working directory

Get the working directory: getwd()
Set the working directory: setwd(), or go to "Session" and set the working directory there, or create the R file in the directory you want to work in.

Policy Gradient

Now we want to parameterize the policy $\pi(a|s,\Theta) = \Pr(A_t=a \mid S_t=s, \Theta_t=\Theta)$, as long as it is differentiable. To find the best policy, we optimize the objective $J$ using gradient ascent:
$$ \Theta_{t+1} = \Theta_t + \alpha \nabla J(\Theta_t) $$

Discrete action: soft-max policy

We compute a preference $h(s,a,\Theta) = \Theta^T x(s,a)$. One of the most common ways to parameterize the policy is the soft-max:
$$ \pi(a|s,\Theta) = \frac{e^{h(s,a,\Theta)}}{\sum_b e^{h(s,b,\Theta)}} $$
A small sketch of this parameterization follows below. One advantage of the soft-max parameterization is that it can approach the optimal deterministic policy....
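As a rough illustration, here is an R sketch of this linear soft-max parameterization. The feature function `x(s, a)`, the action set, and the zero initialization of $\Theta$ are illustrative assumptions.

```r
# Linear soft-max policy sketch (feature function and action set are made up).
actions <- 1:3
theta <- rep(0, 4)

# Illustrative feature vector x(s, a); in practice it is problem-specific.
x <- function(s, a) c(s, a, s * a, 1)

# Preferences h(s, a, theta) = theta^T x(s, a), turned into action probabilities.
softmax_policy <- function(s, theta) {
  h <- sapply(actions, function(a) sum(theta * x(s, a)))
  h <- h - max(h)  # subtract the max for numerical stability
  exp(h) / sum(exp(h))
}

# Sample an action from pi(.|s, theta) at an example state s = 2.
probs <- softmax_policy(2, theta)
a <- sample(actions, 1, prob = probs)
```

As the preferences come to favor a single action, the soft-max probabilities can get arbitrarily close to deterministic, which is the advantage mentioned above.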

RL method overview

There are three kinds of RL methods in general; the first two are model-free.

Value: parameterize the value function; estimate the action-value function $q$ and derive the best policy from it. Examples: Q-learning and SARSA (their one-step updates are sketched after this list).
Policy: parameterize the policy; the objective is to maximize the average return, so the best policy is found directly; often combined with a value-based method. Example: Actor-Critic.
Model: parameterize the model and improve its accuracy; often combined with a value-based method. Examples: Dyna-Q and Dyna-Q+...
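To make the value-based entry concrete, here is a sketch of the one-step tabular updates for Q-learning and SARSA; `alpha`, `gamma`, and the matrix representation of `Q` (states as rows, actions as columns) are illustrative assumptions.

```r
alpha <- 0.1   # step size (assumed)
gamma <- 0.99  # discount factor (assumed)

# Q-learning (off-policy): bootstrap from the greedy action at the next state.
q_learning_update <- function(Q, s, a, r, s_next) {
  Q[s, a] <- Q[s, a] + alpha * (r + gamma * max(Q[s_next, ]) - Q[s, a])
  Q
}

# SARSA (on-policy): bootstrap from the action actually taken at the next state.
sarsa_update <- function(Q, s, a, r, s_next, a_next) {
  Q[s, a] <- Q[s, a] + alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
  Q
}
```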