"Reinforcement Learning: An Introduction": Policy Gradient Methods
Policy Approximation and its Advantages
- The approximate policy can approach a deterministic policy, whereas with ε-greedy action selection over action values there is always an ε probability of selecting a random action.
- In problems with significant function approximation, the best approximate policy may be stochastic.
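The first point can be illustrated with a minimal sketch. It assumes a soft-max over numerical action preferences as the policy parameterization; the function names `softmax_policy` and `epsilon_greedy_policy` are hypothetical helpers for this illustration, not something the notes define.

```python
import math

def softmax_policy(preferences):
    """Action probabilities from numerical preferences h(s, a) via soft-max."""
    m = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def epsilon_greedy_policy(action_values, epsilon):
    """Action probabilities under epsilon-greedy (ties broken by lowest index)."""
    n = len(action_values)
    greedy = max(range(n), key=lambda a: action_values[a])
    probs = [epsilon / n] * n       # exploration mass spread over all actions
    probs[greedy] += 1.0 - epsilon  # remaining mass goes to the greedy action
    return probs

# As the preference gap grows, the soft-max policy approaches a deterministic one.
for gap in (1.0, 5.0, 20.0):
    print(gap, softmax_policy([gap, 0.0]))

# Epsilon-greedy keeps at least epsilon / n probability on every action,
# no matter how large the gap between the action values is.
print(epsilon_greedy_policy([100.0, 0.0], epsilon=0.1))
```

With a large preference gap the soft-max puts almost all probability on one action, while ε-greedy never drops below ε / n for the non-greedy action.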
The Policy Gradient Theorem
Policy parameterization also has an important theoretical advantage over ε-greedy action selection:
With continuous policy parameterization, the action probabilities change smoothly as a function of the learned parameter, whereas in ε-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values.
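A small numerical check of this smoothness claim, as a sketch only. It assumes a two-action problem, a soft-max policy over the learned numbers, and ε = 0.1; the perturbation size `delta` and the helper names are illustrative choices.

```python
import math

def softmax(prefs):
    """Soft-max probabilities over a list of numbers."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def epsilon_greedy(values, epsilon=0.1):
    """Epsilon-greedy probabilities (ties broken by lowest index)."""
    n = len(values)
    best = max(range(n), key=lambda a: values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

delta = 1e-6                  # an arbitrarily small change in the estimates
before = [0.5, 0.5 - delta]   # action 0 is barely better
after = [0.5, 0.5 + delta]    # action 1 is barely better

# Change in the probability of action 0 caused by the tiny perturbation.
eg_change = abs(epsilon_greedy(after)[0] - epsilon_greedy(before)[0])
sm_change = abs(softmax(after)[0] - softmax(before)[0])
print("epsilon-greedy change:", eg_change)
print("soft-max change:", sm_change)
```

The ε-greedy probability of action 0 jumps by about 1 − ε, because the greedy action switches, while the soft-max probability moves by an amount on the order of `delta`.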
Proof of the Policy Gradient Theorem
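A sketch of the standard episodic argument, written under these assumptions: the undiscounted case (γ = 1), performance defined as J(θ) = v_π(s_0), all gradients taken with respect to θ, π written without its θ argument, η(s) denoting the expected number of visits to s in an episode, and μ(s) = η(s) / Σ_{s'} η(s'). The notation Pr(s → x, k, π) for the probability of reaching x from s in k steps under π is assumed here, not taken from these notes.

```latex
\begin{align*}
\nabla v_\pi(s)
  &= \nabla \Big[ \sum_a \pi(a \mid s)\, q_\pi(s,a) \Big] \\
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a)
       + \pi(a \mid s)\, \nabla q_\pi(s,a) \Big] \\
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a)
       + \pi(a \mid s) \sum_{s'} p(s' \mid s,a)\, \nabla v_\pi(s') \Big] \\
  &= \sum_x \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi)
       \sum_a \nabla \pi(a \mid x)\, q_\pi(x,a)
       \qquad \text{(by repeated unrolling)} \\[4pt]
\nabla J(\theta) = \nabla v_\pi(s_0)
  &= \sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s,a) \\
  &\propto \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s,a).
\end{align*}
```

The key point is that the final expression contains no derivative of the state distribution, even though that distribution depends on θ.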
REINFORCE: Monte Carlo Policy Gradient
REINFORCE with Baseline
The baseline can be any function, even a random variable, as long as it does not vary with the action a; the equation remains valid because the subtracted term sums to zero over actions.
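The reason the subtracted term vanishes can be written out in one line. The symbols b(s) for the baseline and α for the step size are assumed notation here.

```latex
\begin{align*}
\sum_a b(s)\, \nabla \pi(a \mid s, \theta)
  &= b(s)\, \nabla \sum_a \pi(a \mid s, \theta)
   = b(s)\, \nabla 1 = 0, \\[4pt]
\theta_{t+1}
  &= \theta_t + \alpha \big( G_t - b(S_t) \big)
     \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}.
\end{align*}
```

The second line is the resulting REINFORCE-with-baseline update: the baseline leaves the expected update unchanged but can reduce its variance.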
One natural choice for the baseline is an estimate of the state value, $\hat{v}(S_t, \mathbf{w})$, where $\mathbf{w}$ is a learned weight vector.
Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic.
REINFORCE with baseline is unbiased and will converge asymptotically to a local minimum, but like all Monte Carlo methods it tends to learn slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing problems.
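A minimal tabular sketch of REINFORCE with a learned state-value baseline. Everything concrete here is an assumption made for illustration: the four-state corridor environment, the reward of −1 per step, the step sizes, the 200-step cap on episode length, and all function names. It is not a reference implementation.

```python
import math
import random

N_STATES = 4        # hypothetical corridor: states 0..3, state 3 is terminal
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right

def policy(theta, s):
    """Soft-max over tabular action preferences theta[s][a]."""
    m = max(theta[s])
    exps = [math.exp(h - m) for h in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def run_episode(theta, rng, max_steps=200):
    """Generate one episode as a list of (state, action_index, reward)."""
    s, traj = 0, []
    while s != N_STATES - 1 and len(traj) < max_steps:
        probs = policy(theta, s)
        a = 0 if rng.random() < probs[0] else 1
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)  # walls clip moves
        traj.append((s, a, -1.0))                           # -1 per step
        s = s_next
    return traj

def reinforce_with_baseline(episodes=500, alpha_theta=0.05, alpha_w=0.1,
                            gamma=1.0, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # policy parameters
    w = [0.0] * N_STATES                           # baseline: tabular v_hat(s, w)
    for _ in range(episodes):
        traj = run_episode(theta, rng)
        # Monte Carlo returns G_t, computed backwards from the episode's end.
        G, returns = 0.0, []
        for (_, _, r) in reversed(traj):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for t, ((s, a, _), G_t) in enumerate(zip(traj, returns)):
            delta = G_t - w[s]       # return minus baseline
            w[s] += alpha_w * delta  # move the baseline toward the return
            probs = policy(theta, s)
            for b in range(len(ACTIONS)):
                # Gradient of ln pi(a|s) w.r.t. theta[s][b] for a soft-max policy.
                grad_ln = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha_theta * (gamma ** t) * delta * grad_ln
    return theta, w

theta, w = reinforce_with_baseline()
print("baseline values:", w)
print("P(right) per state:", [policy(theta, s)[1] for s in range(N_STATES - 1)])
```

Note that every update waits until the episode has finished, which is why this style of method is awkward to run online or on continuing problems.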
First consider one-step actor–critic methods, the analog of the TD methods introduced in Chapter 6 such as TD(0), Sarsa(0), and Q-learning.
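A sketch of a single one-step actor-critic update for the tabular case, to show how the critic's TD error replaces the full Monte Carlo return. The soft-max policy, the tabular value table `w`, the default step sizes, and the function name `actor_critic_step` are assumptions for illustration.

```python
import math

def softmax(prefs):
    """Soft-max probabilities over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(theta, w, s, a, r, s_next, done, I,
                      alpha_theta=0.05, alpha_w=0.1, gamma=1.0):
    """Apply one update in place; return the TD error and the new I = gamma * I."""
    v_next = 0.0 if done else w[s_next]  # terminal states have value 0
    delta = r + gamma * v_next - w[s]    # one-step TD error from the critic
    w[s] += alpha_w * delta              # critic update
    probs = softmax(theta[s])
    for b in range(len(probs)):
        # Gradient of ln pi(a|s) w.r.t. theta[s][b] for a soft-max policy.
        grad_ln = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_theta * I * delta * grad_ln  # actor update
    return delta, gamma * I

# One transition: in state 0, take action 1, receive reward -1, land in state 1.
theta = [[0.0, 0.0] for _ in range(3)]
w = [0.0, 0.0, 0.0]
delta, I = actor_critic_step(theta, w, s=0, a=1, r=-1.0, s_next=1,
                             done=False, I=1.0)
print(delta, w[0], theta[0])
```

Because the update needs only one transition, it can be applied at every step, at the cost of the bias that bootstrapping from the value estimate introduces.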
Policy Gradient for Continuing Problems
μ is the steady-state distribution under π.
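To make the steady-state distribution concrete, here is a sketch that approximates μ by repeatedly applying the state-transition matrix induced by a fixed policy. The two-state matrix `P_pi` is a made-up example, and the chain is assumed to be ergodic so that the iteration converges.

```python
def stationary_distribution(P, iterations=200):
    """Approximate mu satisfying mu = mu P by power iteration from uniform."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iterations):
        # One step of the chain: mu'(s2) = sum_s mu(s) * P[s][s2].
        mu = [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return mu

# Hypothetical policy-induced transition matrix:
# P_pi[s][s2] = sum_a pi(a|s) * p(s2|s, a).
P_pi = [[0.9, 0.1],
        [0.5, 0.5]]

mu = stationary_distribution(P_pi)
print(mu)
```

For this matrix the balance condition 0.1 · μ(0) = 0.5 · μ(1) gives μ = (5/6, 1/6), which the iteration approaches. Selecting actions according to π from a state drawn from μ leaves the state distribution unchanged.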
Proof of the Policy Gradient Theorem (continuing case)
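A sketch of the continuing-case argument, under these assumptions: performance is the average reward, J(θ) = r(θ); v_π and q_π are differential values, so that q_π(s,a) = Σ_{s',r} p(s',r | s,a) [r − r(θ) + v_π(s')]; gradients are with respect to θ; and π is written without its θ argument.

```latex
\begin{align*}
\nabla v_\pi(s)
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a)
       + \pi(a \mid s)\, \nabla q_\pi(s,a) \Big] \\
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a)
       + \pi(a \mid s) \Big( -\nabla r(\theta)
       + \sum_{s'} p(s' \mid s,a)\, \nabla v_\pi(s') \Big) \Big], \\[4pt]
\nabla r(\theta)
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a)
       + \pi(a \mid s) \sum_{s'} p(s' \mid s,a)\, \nabla v_\pi(s') \Big]
     - \nabla v_\pi(s).
\end{align*}
```

The left-hand side of the last equation does not depend on s, so both sides can be weighted by μ(s) and summed over s:

```latex
\begin{align*}
\nabla J(\theta)
  &= \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s,a)
     + \sum_{s'} \underbrace{\sum_s \mu(s) \sum_a \pi(a \mid s)\,
         p(s' \mid s,a)}_{=\, \mu(s')} \nabla v_\pi(s')
     - \sum_s \mu(s)\, \nabla v_\pi(s) \\
  &= \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s,a).
\end{align*}
```

The last two sums cancel because μ is stationary under π, which leaves the same form as in the episodic case, here with equality rather than proportionality.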