Now, we want to parameterize the policy $\pi(a|s,\Theta) = \Pr(A_t=a \mid S_t=s, \Theta_t=\Theta)$; the only requirement is that it is differentiable with respect to $\Theta$. To find the best policy, we optimize the performance measure $J$ using gradient ascent.
$$ \Theta_{t+1} = \Theta_t + \alpha \nabla J(\Theta_t) $$
Discrete action: Soft-max policy
We compute an action preference, e.g. linear in features: $h(s,a,\Theta) = \Theta^T x(s,a)$. One of the most common ways to parameterize the policy is the soft-max:
$$ \pi (a|s,\Theta) = \frac{e^{h(s,a,\Theta)}}{\sum_b e^{h(s,b,\Theta)}} $$
One advantage of the soft-max parameterization is that the policy can approach a deterministic policy, which $\varepsilon$-greedy action selection over action values cannot do.
The second advantage is that it enables action selection with arbitrary probabilities, which matters when the best policy is stochastic.
If the policy is a simpler function to approximate than the action-value function, a policy-based method will typically learn faster.
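As a concrete illustration, here is a minimal NumPy sketch of the linear soft-max policy above; the feature function `x(s, a)` and the action set are hypothetical placeholders, not part of these notes.

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    """Action probabilities pi(a|s, theta) for a linear soft-max policy.

    theta   : parameter vector
    x       : feature function, x(s, a) -> vector of the same length as theta
    actions : list of available actions
    """
    # Preferences h(s, a, theta) = theta^T x(s, a)
    h = np.array([theta @ x(s, a) for a in actions])
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(h - h.max())
    return e / e.sum()

# Example usage: sample an action according to pi(.|s, theta)
# probs = softmax_policy(theta, x, s, actions)
# a = actions[np.random.choice(len(actions), p=probs)]
```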
The Policy Gradient Theorem
What is the objective of policy-gradient methods?
We define the performance measure as the value of the start state, $J(\Theta) = v_{\pi_{\Theta}}(s_0)$. In other words, the performance is the expected return when starting in $s_0$ and following $\pi_{\Theta}$.
If we instead define the objective as the average reward $r(\pi)$ (the continuing-task setting), we can unroll it as the following:
$$ r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[R_t] = \sum_s \mu_\pi(s) \sum_a \pi(a|s,\Theta) \sum_{s',r} p(s',r|s,a)\, r $$
where $\mu_\pi$ is the steady-state distribution of states under $\pi$.
Computing this gradient directly seems to require $\nabla \mu_\pi(s)$, i.e. how the state distribution changes with $\Theta$, which is unknown. After derivation, according to the policy gradient theorem, that term is not needed:
$$ \nabla J(\Theta) \propto \sum_s \mu_\pi(s) \sum_a q_\pi(s,a)\,\nabla \pi(a|s,\Theta) $$
(with equality rather than proportionality in the average-reward case).
To transform it into a stochastic gradient ascent step, replace the sums by expectations under $\pi$ and use $\nabla\pi = \pi\,\nabla\ln\pi$:
$$ \nabla J(\Theta) \propto \mathbb{E}_\pi\!\big[\, q_\pi(S_t,A_t)\,\nabla \ln \pi(A_t|S_t,\Theta) \,\big] = \mathbb{E}_\pi\!\big[\, G_t\,\nabla \ln \pi(A_t|S_t,\Theta) \,\big] $$
We can then take the sampled quantity inside the expectation as the update, giving the REINFORCE update rule:
$$ \Theta_{t+1} = \Theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t|S_t,\Theta_t) $$
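As a sketch (not the only possible implementation), the REINFORCE update for the linear soft-max policy could look like the following; it reuses `softmax_policy` from the earlier snippet, and the episode format, `alpha`, and `gamma` are illustrative assumptions.

```python
import numpy as np

def grad_log_softmax(theta, x, s, a, actions):
    """Gradient of ln pi(a|s, theta) for the linear soft-max policy:
    x(s, a) - sum_b pi(b|s, theta) x(s, b)."""
    probs = softmax_policy(theta, x, s, actions)
    expected_x = sum(p * x(s, b) for p, b in zip(probs, actions))
    return x(s, a) - expected_x

def reinforce(theta, episode, x, actions, alpha=0.01, gamma=1.0):
    """Monte Carlo policy-gradient (REINFORCE) update over one episode.

    episode : list of (state, action, reward) tuples, in time order
    """
    G = 0.0
    # Iterate backwards so the return G_t can be accumulated incrementally
    for (s, a, r) in reversed(episode):
        G = r + gamma * G
        theta = theta + alpha * G * grad_log_softmax(theta, x, s, a, actions)
    return theta
```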
Actor-Critic
Now we have a parameterized actor that selects actions and improves the policy. But how do we compute the value function? We use another function approximator, the critic, to estimate $q$ (or $v$).
For Average Reward Semi-Gradient TD(0)
The critic $\hat v(s,\mathbf{w})$ is learned with semi-gradient TD(0), using the TD error built from an estimate $\bar R_t$ of the average reward $r(\pi)$:
$$ \delta_t = R_{t+1} - \bar R_t + \hat v(S_{t+1},\mathbf{w}_t) - \hat v(S_t,\mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha^{\mathbf{w}}\,\delta_t\,\nabla \hat v(S_t,\mathbf{w}_t) $$
The actor can be updated with the one-step return as the target:
$$ \Theta_{t+1} = \Theta_t + \alpha^{\Theta}\big(R_{t+1} - \bar R_t + \hat v(S_{t+1},\mathbf{w}_t)\big)\nabla \ln \pi(A_t|S_t,\Theta_t) $$
After adding a baseline $\hat v(S_t,\mathbf{w}_t)$, the actor update uses the TD error itself:
$$ \Theta_{t+1} = \Theta_t + \alpha^{\Theta}\,\delta_t\,\nabla \ln \pi(A_t|S_t,\Theta_t) $$
We can prove that the baseline does not affect the expected value of the update, since $\sum_a b(s)\,\nabla\pi(a|s,\Theta) = b(s)\,\nabla \sum_a \pi(a|s,\Theta) = b(s)\,\nabla 1 = 0$. The advantage is that a good baseline greatly reduces the variance of the update.
For Discounted Return Semi-Gradient TD(0)
The same scheme applies with the discounted TD error:
$$ \delta_t = R_{t+1} + \gamma\,\hat v(S_{t+1},\mathbf{w}_t) - \hat v(S_t,\mathbf{w}_t) $$
$$ \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha^{\mathbf{w}}\,\delta_t\,\nabla \hat v(S_t,\mathbf{w}_t), \qquad \Theta_{t+1} = \Theta_t + \alpha^{\Theta}\,\delta_t\,\nabla \ln \pi(A_t|S_t,\Theta_t) $$
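A minimal sketch of one transition's worth of this discounted one-step actor-critic, assuming a linear critic $\hat v(s,\mathbf{w}) = \mathbf{w}^T \phi(s)$ and reusing `grad_log_softmax` from the REINFORCE snippet; the feature functions and step sizes are illustrative.

```python
import numpy as np

def actor_critic_step(theta, w, transition, x, phi, actions,
                      alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """One semi-gradient TD(0) actor-critic update for a single transition.

    transition : (s, a, r, s_next, done)
    phi        : critic feature function, so v_hat(s, w) = w^T phi(s)
    x          : actor feature function for the linear soft-max policy
    """
    s, a, r, s_next, done = transition
    v_s = w @ phi(s)
    v_next = 0.0 if done else w @ phi(s_next)
    delta = r + gamma * v_next - v_s                      # TD error
    w = w + alpha_w * delta * phi(s)                      # critic: semi-gradient TD(0)
    theta = theta + alpha_theta * delta * grad_log_softmax(theta, x, s, a, actions)
    return theta, w
```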
Continuous action: Gaussian Policies
For continuous action spaces, the policy can be a Gaussian whose mean and standard deviation are parameterized functions of the state:
$$ \pi(a|s,\Theta) = \frac{1}{\sigma(s,\Theta)\sqrt{2\pi}} \exp\!\left( -\frac{(a-\mu(s,\Theta))^2}{2\sigma(s,\Theta)^2} \right) $$
for example with $\mu(s,\Theta) = \Theta_{\mu}^T x_{\mu}(s)$ and $\sigma(s,\Theta) = \exp\!\big(\Theta_{\sigma}^T x_{\sigma}(s)\big)$.
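A minimal sketch of sampling from such a Gaussian policy and computing the log-density gradients needed for the actor update; the feature functions `x_mu` and `x_sigma` are hypothetical placeholders.

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_mu, x_sigma, s, rng):
    """Sample a ~ N(mu(s), sigma(s)^2) with mu = theta_mu^T x_mu(s)
    and sigma = exp(theta_sigma^T x_sigma(s)) (exp keeps sigma positive)."""
    mu = theta_mu @ x_mu(s)
    sigma = np.exp(theta_sigma @ x_sigma(s))
    return rng.normal(mu, sigma), mu, sigma

def gaussian_grad_log_pi(a, mu, sigma, x_mu, x_sigma, s):
    """Gradients of ln pi(a|s, Theta) with respect to theta_mu and theta_sigma."""
    g_mu = (a - mu) / sigma**2 * x_mu(s)
    g_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x_sigma(s)
    return g_mu, g_sigma

# Example usage with a random generator:
# rng = np.random.default_rng(0)
# a, mu, sigma = gaussian_policy_sample(theta_mu, theta_sigma, x_mu, x_sigma, s, rng)
```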
Measure how RL agents perform
A common choice in continuing tasks is the exponentially weighted (recency-weighted) average reward,
$$ \bar R_{t+1} = \bar R_t + \beta\,(R_{t+1} - \bar R_t), $$
which tracks recent performance instead of averaging over all time steps.
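A tiny sketch of tracking this measure online; the step size `beta` is an illustrative choice.

```python
def update_avg_reward(avg_reward, reward, beta=0.01):
    """Exponentially weighted (recency-weighted) average of reward:
    R_bar <- R_bar + beta * (R - R_bar)."""
    return avg_reward + beta * (reward - avg_reward)
```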