Advantage Actor Critic
\[\nabla_\theta J_\theta \simeq \underset{t=0}{\overset{\infty}{\sum}} \int_{s_t,a_t} \nabla_\theta \ln P_\theta (a_t \vert s_t)\, Q(s_t, a_t)\, P_\theta(s_t, a_t) \, ds_t\, da_t\]In the expression above, let us replace \(Q\) with a baseline \(b(s_t)\) that is not a function of the action.
\[\underset{t=0}{\overset{\infty}{\sum}} \int_{s_t,a_t} \nabla_\theta \ln P_\theta (a_t \vert s_t)\, b(s_t)\, P_\theta(s_t, a_t) \, ds_t\, da_t\] \[=\underset{t=0}{\overset{\infty}{\sum}} \int_{s_t,a_t} \nabla_\theta \ln P_\theta (a_t \vert s_t)\, b(s_t)\, P_\theta(a_t \vert s_t) P(s_t) \, ds_t\, da_t\] \[=\underset{t=0}{\overset{\infty}{\sum}} \int_{s_t,a_t} \frac{\nabla_\theta P_\theta (a_t \vert s_t)}{P_\theta (a_t \vert s_t)}\, b(s_t)\, P_\theta(a_t \vert s_t) P(s_t) \, ds_t\, da_t\] \[=\underset{t=0}{\overset{\infty}{\sum}} \int_{s_t,a_t} \nabla_\theta P_\theta (a_t \vert s_t)\, b(s_t)\, P(s_t) \, ds_t\, da_t\] \[=\underset{t=0}{\overset{\infty}{\sum}} \int_{s_t} b(s_t)\, P(s_t)\, \nabla_\theta \int_{a_t} P_\theta (a_t \vert s_t) \, da_t \, ds_t\] \[=\underset{t=0}{\overset{\infty}{\sum}} \int_{s_t} b(s_t)\, P(s_t)\, \nabla_\theta 1 \, ds_t = 0\]In other words, this term is zero whenever \(b(s_t)\) does not depend on the action, so replacing \(Q\) with \(Q - b\) leaves the policy gradient unchanged.
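That the baseline term really has zero mean can be checked numerically. The following is a minimal sketch (my own illustration, not from the original derivation) using NumPy, a tabular softmax policy, and an arbitrary state-dependent baseline; the Monte Carlo average of \(\nabla_\theta \ln P_\theta(a_t \vert s_t)\, b(s_t)\) shrinks toward zero as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))   # logits of a tabular softmax policy
b = rng.normal(size=n_states)                    # arbitrary baseline b(s), no dependence on the action

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def grad_log_pi(s, a):
    # grad of ln pi(a|s) w.r.t. the logits: one-hot(a) - pi(.|s), nonzero only in row s
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

# Monte Carlo estimate of E[ grad ln pi(a|s) * b(s) ] with s uniform and a ~ pi(.|s)
acc = np.zeros_like(theta)
N = 100_000
for _ in range(N):
    s = rng.integers(n_states)
    a = rng.choice(n_actions, p=policy(s))
    acc += grad_log_pi(s, a) * b[s]

print(np.abs(acc / N).max())   # close to 0, and shrinks further as N grows
```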
For a constant \(a\), \(\mathrm{Var}[X] = \mathrm{Var}[X - a]\); however, the baseline enters the estimator multiplied by \(\nabla_\theta \ln P_\theta(a_t \vert s_t)\) rather than as a constant shift, so the variance does change: \(\mathrm{Var}[X_Q] \neq \mathrm{Var}[X_{Q-V}]\).
\(Q - V\) is not the optimal (variance-minimizing) baseline, but it does lower the variance.
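The variance claim can be illustrated the same way (again my own sketch, with a made-up \(Q\) and \(V(s) = E_{a \sim P_\theta}[Q(s,a)]\)): the estimators built from \(Q\) and from \(Q - V\) agree in their mean but not in their variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))      # softmax policy logits
Q = rng.normal(size=(n_states, n_actions)) + 5.0    # toy action values (the offset inflates variance)

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

V = np.array([policy(s) @ Q[s] for s in range(n_states)])   # V(s) = E_{a~pi}[Q(s,a)]

def score(s, a):
    # grad_theta ln pi(a|s) for a tabular softmax, flattened to a vector
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g.ravel()

def samples(weight, N=100_000):
    out = np.empty((N, theta.size))
    for i in range(N):
        s = rng.integers(n_states)
        a = rng.choice(n_actions, p=policy(s))
        out[i] = score(s, a) * weight(s, a)
    return out

with_q   = samples(lambda s, a: Q[s, a])
with_adv = samples(lambda s, a: Q[s, a] - V[s])

print("max mean difference:", np.abs(with_q.mean(0) - with_adv.mean(0)).max())  # ~0
print("total variance with Q  :", with_q.var(0).sum())
print("total variance with Q-V:", with_adv.var(0).sum())                        # noticeably smaller
```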
- \[\text{Advantage} = Q(s_t, a_t) - V(s_t)\]
Using the advantage directly would require three networks (Q, V, and the actor), so we express Q in terms of V.
\[Q(s_t,a_t) = E_{s_{t+1}}[R_t + \gamma V(s_{t+1}) \vert s_t, a_t]\]Since \(E[X] - a = E[X - a]\) for a constant \(a\), and \(V(s_t)\) is fixed once \(s_t\) is given, it can be moved inside the expectation over \(s_{t+1}\).
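Concretely, the advantage at a sampled transition can then be estimated from the value network alone as the one-step TD error. A minimal sketch (the names `value_fn` and `gamma` are illustrative, not from the original notes):

```python
def td_error(value_fn, s, r, s_next, gamma=0.99):
    """One-sample estimate of the advantage Q(s,a) - V(s) using only the value function:
    delta = r + gamma * V(s') - V(s)."""
    return r + gamma * value_fn(s_next) - value_fn(s)

# usage: advantage = td_error(V_w, s_t, R_t, s_t_plus_1)
```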
\[Q-V = E_{s_{t+1}}[R_t + \gamma V(s_{t+1}) - V(s_t) \vert s_t, a_t]\] \[\nabla_\theta J_\theta \simeq \underset{t=0}{\overset{\infty}{\sum}}E_{s_t,a_t}[\nabla_\theta \ln P_\theta (a_t \vert s_t) (Q-V)]\] \[=\underset{t=0}{\overset{\infty}{\sum}} E_{s_t, a_t} [\nabla_\theta \ln P_\theta (a_t \vert s_t) E_{s_{t+1}}[R_t + \gamma V(s_{t+1}) - V(s_t) \vert s_t, a_t]]\] \[= \underset{t=0}{\overset{\infty}{\sum}} \int_{s_t,a_t} \nabla_\theta \ln P_\theta(a_t \vert s_t) \int_{s_{t+1}}(R_t + \gamma V(s_{t+1}) - V(s_t)) P(s_{t+1} \vert s_t, a_t) \, ds_{t+1}\, P_\theta (s_t,a_t) \, ds_t\, da_t\] \[= \underset{t=0}{\overset{\infty}{\sum}} \int_{s_t, a_t, s_{t+1}} \nabla_\theta \ln P_\theta (a_t \vert s_t) (R_t + \gamma V(s_{t+1}) - V(s_t)) P_\theta(s_t) P_\theta(a_t \vert s_t) P(s_{t+1} \vert s_t, a_t) \, ds_t\, da_t\, ds_{t+1}\]Since \(P_\theta(s_t)\) appeared through marginalization over the earlier part of the trajectory, undoing that marginalization means we could equally condition on the previous step and write \(P(s_t \vert s_{t-1}, a_{t-1})\) in its place.
\[= \underset{t=0}{\overset{\infty}{\sum}} \int_{s_t, a_t, s_{t+1}} \nabla_\theta \ln P_\theta (a_t \vert s_t) (R_t + \gamma V(s_{t+1}) - V(s_t)) P(s_t \vert s_{t-1}, a_{t-1}) P_\theta(a_t \vert s_t) P(s_{t+1} \vert s_t, a_t) \, ds_t\, da_t\, ds_{t+1}\]
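Equivalently, combining the two expectations above, each term of the sum is an expectation over a single transition \((s_t, a_t, s_{t+1})\), which is what lets the algorithm below estimate the gradient from sampled tuples:

\[\nabla_\theta J_\theta \simeq \underset{t=0}{\overset{\infty}{\sum}} E_{s_t, a_t, s_{t+1}}[\nabla_\theta \ln P_\theta (a_t \vert s_t)\,(R_t + \gamma V(s_{t+1}) - V(s_t))]\]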
Algorithm

Initialize \(\theta, w\)
1. Collect N samples (one sample: \(\{s_i, a_i, R_i, s_{i+1}\}\))
2. Actor update: \(\theta \gets \theta + \alpha \underset{i=t-N+1}{\overset{t}{\sum}} \nabla_\theta \ln P_\theta (a_i \vert s_i) (R_i + \gamma V_w(s_{i+1}) - V_w(s_i))\)
3. Critic update: \(w \gets w - \beta \nabla_w \underset{i=t-N+1}{\overset{t}{\sum}} (R_i + \gamma V_w (s_{i+1}) - V_w(s_i))^2\)
4. Clear the batch
Repeat steps 1 to 4.
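Putting the four steps together, here is a compact end-to-end sketch. The chain environment, tabular softmax actor, and linear (one-hot) critic are my own choices for illustration, and the critic uses the usual semi-gradient of the squared TD error (the bootstrap target is held fixed), which is how the update above is typically implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.95

def step(s, a):
    """Toy chain MDP (illustrative): action 1 moves right, action 0 moves left;
    reaching the rightmost state gives reward 1 and resets to state 0."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    if s_next == n_states - 1:
        return 1.0, 0
    return 0.0, s_next

theta = np.zeros((n_states, n_actions))   # actor: softmax over the logits theta[s]
w = np.zeros(n_states)                    # critic: V_w(s) = w[s] (one-hot state features)
alpha, beta, N = 0.1, 0.1, 16

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for _ in range(2000):
    # 1. collect N samples {s_i, a_i, R_i, s_{i+1}}
    batch = []
    for _ in range(N):
        a = rng.choice(n_actions, p=policy(s))
        r, s_next = step(s, a)
        batch.append((s, a, r, s_next))
        s = s_next

    # 2.-3. accumulate the actor and critic gradients over the batch
    g_theta = np.zeros_like(theta)
    g_w = np.zeros_like(w)
    for (si, ai, ri, sn) in batch:
        delta = ri + gamma * w[sn] - w[si]     # TD error = advantage estimate
        g_theta[si] += -policy(si) * delta     # softmax score: (one-hot(a_i) - pi(.|s_i)) * delta
        g_theta[si, ai] += delta
        g_w[si] += -2.0 * delta                # semi-gradient of (target - w[s_i])^2, target held fixed
    theta += alpha * g_theta                   # actor ascent step
    w -= beta * g_w                            # critic descent step

    # 4. the batch is cleared at the top of the next iteration; repeat 1-4
print("learned V :", np.round(w, 2))
print("P(a=right):", np.round([policy(si)[1] for si in range(n_states)], 2))
```

After a few thousand batches the critic's values increase toward the rewarding end of the chain and the policy puts most of its probability on moving right, which is the expected behavior for this toy setup.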