The ultimate goal in RL is to learn a policy π:S→Δ(A) that maximises the expected return in an MDP M=(S,A,P,r,ρ0,γ) with reward r:S×A×S→R provided by the environment. At time t the agent samples at∼π(⋅∣st) and the environment transitions to st+1∼P(⋅∣st,at) where at∈A and st∈S.
It’s convenient to bundle one full episode into a trajectory as τ=(s0,a0,…,aT,sT+1) for t∈[0,T], representing T+1 transitions.
In general, the loop looks like this (sketched in code right after the list):
1. Observe the current state (or observation) st (s0∼ρ0(⋅) at episode start)
2. Select the action at
3. Transition to the next state st+1∼P(⋅∣st,at)
4. Retrieve the reward from the environment rt=r(st,at,st+1)
5. Repeat
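To make this concrete, here is a minimal sketch of the interaction loop using a gymnasium-style API; the environment name and the random action choice are placeholders rather than anything specific to this post.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")            # any environment exposing the standard reset/step API
obs, info = env.reset()                  # s_0 ~ rho_0

for t in range(500):
    action = env.action_space.sample()   # stand-in for a_t ~ pi(.|s_t)
    obs, reward, terminated, truncated, info = env.step(action)   # s_{t+1} ~ P, r_t = r(s_t, a_t, s_{t+1})
    if terminated or truncated:          # episode over: start a new one
        obs, info = env.reset()
env.close()
```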
So far I’ve described the stochastic setting. There is also a deterministic special case with the policy at=μ(st) and, when the environment is noise free (rare in practice), deterministic dynamics st+1=f(st,at). In most real-world cases, stochasticity is impossible to ignore.
From the oversimplified RL loop above, step 2 is the chief concern of most RL research. Since this post is titled “Policy Gradients” I’m going to focus on action selection. There’s plenty of room for later posts on reward functions (stay tuned).
Markov Decision Processes
Before we start it’s useful to formally define a Markov Decision Process (MDP): a stochastic system where the future depends only on the present and not the past. Mathematically,
P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, \dots, S_t).
If you’re familiar with greedy algorithms, the flavour is similar: each decision is made from the current state alone, without reference to the history that produced it.
Action Selection
Select the action at
The best action should, by definition, yield the best reward, or more concretely the highest expected return, and it comes from the policy at∼π(⋅∣st). The effectiveness of this policy can be quantified with the value function,
V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s\right].
This expresses how valuable a state s is when actions are sampled directly from the policy at∼π(⋅∣st), assuming the episode starts at t=0.
Eq. 1 employs a reward discounted by γ to determine how myopic (γ≈0) or foresighted (γ≈1) the agent should be; the infinite sum requires γ∈[0,1) to converge. If the present reward isn’t more valuable than a future reward, discounting can be removed in favour of a finite time horizon. Eq. 1 also assumes the episode starts at t=0; more generally, for any t, the return is
R(\tau) =
\begin{cases}
\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} & \text{infinite horizon, discounted} \\
\sum_{k=0}^{T-t} r_{t+k} & \text{finite horizon, undiscounted}
\end{cases}
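As a quick sanity check, the discounted case can be computed from a list of sampled rewards with a short helper; this is just a sketch and the function name is mine.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_k gamma^k * r_{t+k} over a finite list of rewards."""
    ret = 0.0
    for r in reversed(rewards):   # accumulate backwards: r_t + gamma * (r_{t+1} + gamma * (...))
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```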
The big E represents the expected value Eτ∼π[R(τ)] under a given policy π; at its core, RL is basically an optimisation problem around this quantity. It tells us how well the policy performs, and we can use it to formally define the optimal policy,
\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right].
The value of a state Vπ(s) is useful but not the complete picture. We might also be interested in the effect of an arbitrary action a taken in a given state s. The Q-function captures this,
Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_{0} = s,\, a_{0} = a\right].
Unless stated otherwise, Q(s,a) denotes Qπ(s,a) for the current policy; Q∗ denotes the optimal action-value.
If the action space A and state space S are both small and discrete, it’s feasible to estimate Q-values for every pair and store them in a Q-table with ∣S∣∣A∣ entries. The table can then be used to pick the next action,
a = \arg\max_{a \in A} Q^{\pi}(s,a).
In discrete action spaces, argmax_{a∈A} Qπ(s,a) can be computed with a finite O(∣A∣) scan of the Q-table row, which takes one line of numpy.
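For example, with a hypothetical Q-table stored as a 2-D numpy array of shape (∣S∣, ∣A∣):

```python
import numpy as np

Q = np.zeros((1_000_000, 6), dtype=np.float32)   # hypothetical Q-table: 1e6 states, 6 actions
s = 42                                           # index of the current (discretised) state

a = int(np.argmax(Q[s]))                         # greedy action: argmax_a Q(s, a), an O(|A|) scan
```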
In continuous spaces, argmax_{a∈A} Qπ(s,a) is generally only tractable when Qπ(s,⋅) is concave and differentiable in a, so that convex optimisation techniques can find the maximising action.
Most real-world problems (especially robotics) unfortunately suffer from the curse of dimensionality, with extremely large continuous action spaces spanning torque commands, actuator controls and gripper commands. State spaces are likewise high dimensional and continuous, spanning joint angles and velocities, torques, motor currents and other sensor readings.
It’s always easier to grok compute limits with real examples. A Q-table takes up ∣S∣∣A∣×d_size bytes of memory, where d_size=4 bytes if you’re using single-precision floats. An indoor mobile navigator operating in a 100 x 100 m workspace with a 0.1 m discretisation per cell has a state space of 1000×1000=10^6 cells. Even before adding orientation and speed, the Q-table already needs 10^6 rows. Now add 6 actions and your total memory footprint is 6×10^6×4 bytes=24 MB. What happens when we add a z-axis (discretised the same) for a new 100 x 100 x 100 m workspace for a drone? That’s 24 GB. As you can see this does not scale well, especially on embedded hardware that often has significant compute and memory limits.
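The same arithmetic in a few lines of Python, mirroring the numbers above:

```python
bytes_per_entry = 4                   # single-precision float
n_actions = 6

cells_2d = round(100 / 0.1) ** 2      # 1000 x 1000 grid -> 1e6 states
cells_3d = round(100 / 0.1) ** 3      # add a z-axis     -> 1e9 states

print(cells_2d * n_actions * bytes_per_entry / 1e6, "MB")   # 24.0 MB
print(cells_3d * n_actions * bytes_per_entry / 1e9, "GB")   # 24.0 GB
```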
This is why a pure argmax_a Qπ(s,a) policy is often impractical, motivating other methods like actor-critic and deterministic policy gradients. I’ll cover when to use value-based methods (SARSA, Q-learning, DQN) in a separate post; they’re still useful.
However, we’re here to talk about policy gradients.
Policy Gradients
The real world is high dimensional and continuous, so we cannot sample V(s) densely enough to find a globally best action without risking catastrophe, like a robot arm swinging at you or a drone flying through a wall.
Instead we can borrow concepts from gradient-based supervised learning to optimise a policy parametrised by θ∈R^d directly from interaction, updating πθ with gradients of the expected return estimated from rollouts in the real environment. In short, policy gradients learn by doing.
Gradient updates in supervised learning typically minimise a loss via descent, θ←θ−η∇θJ(θ), whereas in policy gradients we want to maximise an objective,
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[R(\tau)\right],
therefore we’re interested in performing gradient ascent,
\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta).
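In practice most autodiff libraries only expose minimisation, so ascent on J is implemented as descent on −J. A toy PyTorch-style sketch of one update; the quadratic objective here is a stand-in, not an actual return estimate:

```python
import torch

theta = torch.randn(8, requires_grad=True)     # toy policy parameters
optimizer = torch.optim.SGD([theta], lr=1e-2)  # plain gradient step

def objective(theta):
    # Stand-in for a sampled, differentiable estimate of J(theta)
    return -(theta ** 2).sum()

optimizer.zero_grad()
loss = -objective(theta)    # minimising -J is equivalent to maximising J
loss.backward()             # populates theta.grad with the gradient of -J
optimizer.step()            # theta <- theta - lr * grad(-J) = theta + lr * grad(J)
```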
Beyond gradient-based non-convex optimisation, little carries over from supervised learning; techniques like dropout and very deep networks do not transfer cleanly into RL and can even decrease learning stability (way more where that came from in this talk by John Schulman).
RL gets exciting once we attempt to compute ∇θJ(θ).
A big issue here is the intractable trajectory integral ∫τ P(τ∣θ) R(τ) dτ. In continuous MDPs the trajectory space τ∈T is often enormous and uncountable, the dynamics P(st+1∣st,at) are unknown, and the integral has no closed form. The log-derivative trick can help substitute ∇θP(τ∣θ) for something tractable.
We can expand the definition of P(τ∣θ), where ρ0 is the initial state distribution, and apply the log-derivative trick to remove the terms with unknown dynamics,