
13. Advantage Actor-Critic (A2C)

Table of Contents

  1. Advantage Actor-Critic
  2. Algorithm
  3. Reference

Advantage Actor-Critic

$$\nabla_\theta J_\theta \approx \sum_{t=0}^{\infty}\int_{s_t,a_t}\nabla_\theta \ln P_\theta(a_t|s_t)\,Q(s_t,a_t)\,P_\theta(s_t,a_t)\,ds_t\,da_t$$

In the equation above, let us substitute a baseline $b(s_t)$, which is not a function of the action, in place of $Q$:

$$\begin{aligned}
\sum_{t=0}^{\infty}\int_{s_t,a_t}\nabla_\theta \ln P_\theta(a_t|s_t)\,b(s_t)\,P_\theta(s_t,a_t)\,ds_t\,da_t
&= \sum_{t=0}^{\infty}\int_{s_t,a_t}\nabla_\theta \ln P_\theta(a_t|s_t)\,b(s_t)\,P_\theta(a_t|s_t)\,P(s_t)\,ds_t\,da_t \\
&= \sum_{t=0}^{\infty}\int_{s_t,a_t}\frac{\nabla_\theta P_\theta(a_t|s_t)}{P_\theta(a_t|s_t)}\,b(s_t)\,P_\theta(a_t|s_t)\,P(s_t)\,ds_t\,da_t \\
&= \sum_{t=0}^{\infty}\int_{s_t,a_t}\nabla_\theta P_\theta(a_t|s_t)\,b(s_t)\,P(s_t)\,ds_t\,da_t \\
&= \sum_{t=0}^{\infty}\nabla_\theta\int_{s_t,a_t} P_\theta(a_t|s_t)\,b(s_t)\,P(s_t)\,ds_t\,da_t \\
&= \sum_{t=0}^{\infty}\int_{s_t}\nabla_\theta\!\left(\int_{a_t} P_\theta(a_t|s_t)\,da_t\right) b(s_t)\,P(s_t)\,ds_t \\
&= \sum_{t=0}^{\infty}\int_{s_t}\nabla_\theta 1 \cdot b(s_t)\,P(s_t)\,ds_t = 0
\end{aligned}$$

In other words, since this term is zero whenever $b(s_t)$ is not a function of the action, substituting $Q - b$ for $Q$ leaves the policy gradient unchanged.
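A quick numerical sanity check of this fact (my own sketch, not from the post): for a softmax policy over a few discrete actions in a single state, the Monte Carlo estimate of $\mathbb{E}_a\left[\nabla_\theta \ln P_\theta(a|s)\,b(s)\right]$ should be close to zero for any fixed baseline value.

```python
# Sanity check: E_a[ grad_theta log pi_theta(a|s) * b(s) ] = 0 for a
# baseline that does not depend on the action. Single state, softmax
# policy over 4 actions; b = 3.7 is an arbitrary baseline value.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)                   # logits for 4 actions

def grad_log_pi(theta, a):
    """grad_theta log softmax(theta)[a] = one_hot(a) - softmax(theta)."""
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    g = -pi
    g[a] += 1.0
    return g

pi = np.exp(theta - theta.max()); pi /= pi.sum()
actions = rng.choice(4, size=100_000, p=pi)  # a ~ pi_theta(.|s)
est = np.mean([grad_log_pi(theta, a) * 3.7 for a in actions], axis=0)
print(est)                                   # each entry is close to 0
```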

  • $\mathrm{Var}[X] = \mathrm{Var}[X - a]$ for a constant $a$, but $\mathrm{Var}[X[Q]] \neq \mathrm{Var}[X[Q-V]]$, since $V(s_t)$ depends on the state and is not a constant.

  • $Q-V$ is not the optimal baseline, but it lowers the variance.

  • Advantage $= Q(s_t,a_t) - V(s_t)$
$$\nabla_\theta J_\theta \approx \sum_{t=0}^{\infty}\mathbb{E}_{s_t,a_t}\!\left[\nabla_\theta \ln P_\theta(a_t|s_t)\,(Q-V)\right]$$
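To see the variance reduction concretely, here is a toy one-state comparison (my own sketch, not from the post; the action values $Q$ are simply assumed known). The score-function estimator $g = \nabla_\theta \ln P_\theta(a)\,(Q(a) - b)$ has the same mean for $b = 0$ and $b = V$, but a much smaller variance with the $V$ baseline when the $Q$ values share a large common offset.

```python
# Variance of the policy-gradient estimator with and without the V baseline.
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, -0.2, 0.1])      # softmax logits, one state
Q = np.array([10.0, 12.0, 9.0])         # assumed known action values

pi = np.exp(theta - theta.max()); pi /= pi.sum()
V = pi @ Q                              # V(s) = E_{a~pi}[Q(s,a)]

def estimator(a, baseline):
    g = -pi.copy(); g[a] += 1.0         # grad_theta log pi_theta(a)
    return g * (Q[a] - baseline)

samples = rng.choice(3, size=100_000, p=pi)
for name, b in [("b=0", 0.0), ("b=V", V)]:
    gs = np.array([estimator(a, b) for a in samples])
    print(name, "mean:", gs.mean(axis=0).round(3),
          "total var:", gs.var(axis=0).sum().round(2))
```

Both runs print approximately the same mean gradient, while the summed per-component variance drops sharply with $b = V$.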

Using $Q$, $V$, and the actor would require three separate networks, so we express $Q$ in terms of $V$:

$$Q(s_t,a_t) = \mathbb{E}_{s_{t+1}}\!\left[R_t + \gamma V(s_{t+1}) \mid s_t, a_t\right]$$

Using $\mathbb{E}[X] - a = \mathbb{E}[X - a]$ ($a$ is a constant), we get

$$Q - V = \mathbb{E}_{s_{t+1}}\!\left[R_t + \gamma V(s_{t+1}) - V(s_t) \mid s_t, a_t\right]$$

$$\begin{aligned}
\nabla_\theta J_\theta &\approx \sum_{t=0}^{\infty}\mathbb{E}_{s_t,a_t}\!\left[\nabla_\theta \ln P_\theta(a_t|s_t)\,(Q-V)\right] \\
&= \sum_{t=0}^{\infty}\mathbb{E}_{s_t,a_t}\!\left[\nabla_\theta \ln P_\theta(a_t|s_t)\,\mathbb{E}_{s_{t+1}}\!\left[R_t + \gamma V(s_{t+1}) - V(s_t) \mid s_t, a_t\right]\right] \\
&= \sum_{t=0}^{\infty}\int_{s_t,a_t}\nabla_\theta \ln P_\theta(a_t|s_t)\int_{s_{t+1}}\!\left(R_t + \gamma V(s_{t+1}) - V(s_t)\right)P(s_{t+1}|s_t,a_t)\,ds_{t+1}\,P_\theta(s_t,a_t)\,ds_t\,da_t \\
&= \int_{s_t,a_t,s_{t+1}}\sum_{t=0}^{\infty}\nabla_\theta \ln P_\theta(a_t|s_t)\left(R_t + \gamma V(s_{t+1}) - V(s_t)\right)P_\theta(s_t)\,P_\theta(a_t|s_t)\,P(s_{t+1}|s_t,a_t)\,ds_t\,da_t\,ds_{t+1}
\end{aligned}$$

Since $P_\theta(s_t)$ appeared through marginalization, reverting it suggests that $P(s_t|s_{t+1},a_{t+1})$ could be used as well:

$$= \int_{s_t,a_t,s_{t+1}}\sum_{t=0}^{\infty}\nabla_\theta \ln P_\theta(a_t|s_t)\left(R_t + \gamma V(s_{t+1}) - V(s_t)\right)P(s_t|s_{t+1},a_{t+1})\,P_\theta(a_t|s_t)\,P(s_{t+1}|s_t,a_t)\,ds_t\,da_t\,ds_{t+1}$$
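Since the final expression is an expectation over the triple $(s_t, a_t, s_{t+1})$, each observed transition yields a one-sample estimate of the advantage, which is exactly what the algorithm below exploits. A minimal sketch (the function name and signature are my own):

```python
def td_advantage(r_t, v_s, v_s_next, gamma=0.99):
    """One-sample (TD) estimate of Q - V: the inner expectation over
    s_{t+1} is replaced by the single observed next state."""
    return r_t + gamma * v_s_next - v_s
```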

Algorithm

  1. Initialize θ,w

  2. Collect N samples (one sample: $\{s_i, a_i, R_i, s_{i+1}\}$)

  3. Actor update: $\theta \leftarrow \theta + \alpha \sum_{i=t-N+1}^{t} \nabla_\theta \ln P_\theta(a_i|s_i)\left(R_i + \gamma V_w(s_{i+1}) - V_w(s_i)\right)$

  4. Critic update: $w \leftarrow w - \beta \nabla_w \sum_{i=t-N+1}^{t} \left(R_i + \gamma V_w(s_{i+1}) - V_w(s_i)\right)^2$

  5. Clear Batch

  6. Repeat steps 2-5 (a runnable sketch of this loop follows below)
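Below is a minimal runnable sketch of this loop (my own illustration, not the post's code), assuming PyTorch. The 5-state chain environment, the network sizes, and the hyperparameters GAMMA, N, ALPHA, BETA are all invented for self-containedness, and terminal states are assigned $V = 0$.

```python
# A2C training loop sketch: one actor network (policy logits) and one
# critic network V_w, updated every N collected samples as in steps 2-5.
import torch
import torch.nn as nn
from torch.distributions import Categorical

GAMMA, N, ALPHA, BETA = 0.99, 8, 1e-3, 1e-3   # hypothetical hyperparameters

class ToyChainEnv:
    """Hypothetical 5-state chain: actions {0: left, 1: right}, reward 1
    (and episode end) on reaching the rightmost state."""
    def reset(self):
        self.s = 2
        return self.s
    def step(self, a):
        self.s = max(0, min(4, self.s + (1 if a == 1 else -1)))
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done

def one_hot(s):
    x = torch.zeros(5); x[s] = 1.0
    return x

actor = nn.Sequential(nn.Linear(5, 32), nn.Tanh(), nn.Linear(32, 2))   # logits
critic = nn.Sequential(nn.Linear(5, 32), nn.Tanh(), nn.Linear(32, 1))  # V_w(s)
opt_actor = torch.optim.SGD(actor.parameters(), lr=ALPHA)   # step 3
opt_critic = torch.optim.SGD(critic.parameters(), lr=BETA)  # step 4

env = ToyChainEnv()
s, batch = env.reset(), []
for _ in range(5000):
    # step 2: collect a sample {s_i, a_i, R_i, s_{i+1}} with the current policy
    a = Categorical(logits=actor(one_hot(s))).sample()
    s_next, r, done = env.step(a.item())
    batch.append((s, a, r, s_next, done))
    s = env.reset() if done else s_next

    if len(batch) == N:
        actor_loss, critic_loss = 0.0, 0.0
        for si, ai, ri, si1, di in batch:
            v_s = critic(one_hot(si)).squeeze()
            v_s1 = torch.tensor(0.0) if di else critic(one_hot(si1)).squeeze()
            td_err = ri + GAMMA * v_s1 - v_s  # R_i + gamma*V_w(s_{i+1}) - V_w(s_i)
            logp = Categorical(logits=actor(one_hot(si))).log_prob(ai)
            # step 3: gradient ascent on log pi * advantage
            # (detach: the actor gradient must not flow into the critic)
            actor_loss = actor_loss - logp * td_err.detach()
            # step 4: gradient descent on the full squared TD error,
            # matching grad_w of (R + gamma*V_w(s') - V_w(s))^2 above
            critic_loss = critic_loss + td_err.pow(2)
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
        opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
        batch = []                            # step 5: clear the batch
```

Each batch is collected and discarded on-policy, matching steps 2-5; the outer loop corresponds to step 6.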

Reference

  1. 혁펜하임 YouTube