Posts 2. Bellman Equation
Post
Cancel

2. Bellman Equation

목차

  1. Bellman Equation
  2. Reference

Bellman Equation

  • 벨만 방정식이란 어떤 상태의 값과 그 후에 이어지는 상태의 값들 사이의 관계를 표현한다.
  • 그래서 state value function이 reward가 재귀적으로 사용되므로 \(V(s_t)\)를 \(V(S_{t+1})\)로 표현할 수 있다.

  • \(Q(s_t,a_t)\)로 \(V(s_t)\) 표현하기
    \(V(s_t) \doteq \int_{a_{t}:a_{\infty}} G_tP(a_{t+1},s_{t+2},a_{t+2} \cdots |s_t) \, da_t:a_{\infty}\)
    \(=\int_{a_t}\int_{s_{t+1}:a_{\infty}} G_t P(s_{t+1},a_{t+1}, \cdots|s_t,a_t)\,da_t:a_{\infty}P(a_t|s_t)\,da_t\)
    \(=\int_{a_t}Q(s_t,a_t)P(a_t|s_t)\,da_t\)
\[V(s_t) = \sum_{a_t} \pi (a_t|s_t) Q(s_t,a_t)\]
  • \(V(s_{t+1})\)로 \(V(s_t)\) 표현하기
    \(V(s_t) \doteq \int_{a_{t}:a_{\infty}} G_tP(a_{t+1},s_{t+2},a_{t+2} \cdots |s_t) \, da_t:a_{\infty}\)
    \(=\int_{a_t,s_{t+1}}\int_{a_{t+1}:a_{\infty}} (R_t + \gamma G_{t+1}) P(s_{t+1},a_{t+1}, \cdots|s_t,a_t)\,da_{t+1}:a_{\infty}P(a_t|s_t)\,da_t\)
    \(=\int_{a_t,s_{t+1}} (R_t +\gamma V(s_{t+1})) P(a_t,s_{t+1}|s_t) \, da_t,s_{t+1}\)
\[V(s_t) = \sum_a \pi (a_t|s_t) (r_{s_t}^{a_t}+ \gamma \sum_{s_{t+1}} P_{s_t s_{t+1}}^{a_t} V(s_{t+1}))\] \[=E_{\pi}[r_{t+1} + \gamma V(s_{t+1})]\]
  • \(V(s_{t+1})\)로 \(Q(s_t,a_t)\)표현하기
    \(Q(s_t,a_t) \doteq \int_{s_{t+1}:a_{\infty}} G_tP(s_{t+1},a_{t+1},s_{t+2} \cdots |s_t,a_t) \, ds_{t+1}:a_{\infty}\)
    \(=\int_{s_{t+1}} \int_{a_{t+1}:a_{\infty}} (R+\gamma G_{t+1}) P(a_{t+1}, s_{t+2} \cdots |s_t, a_t, s_{t+1}) \, da_{t+1}:a_{\infty} P(s_{t+1}|s_t, a_t) \,ds_{t+1}\)
    \(=\int_{s_{t+1}} (R_t + \gamma V(S_{t+1})) P(s_{t+1}|s_t,a_t) \,ds_{t+1}\)
\[Q(s_t,a_t) = r_{s_t}^{a_t}+ \gamma \sum_{s_{t+1}} P_{s_t s_{t+1}}^{a_t} V(s_{t+1})\]
  • \(Q(s_{t+1},a_{t+1})\)로 \(Q(s_t,a_t)\)표현하기
    \(Q(s_t,a_t) \doteq \int_{s_{t+1}:a_{\infty}} G_tP(s_{t+1},a_{t+1},s_{t+2} \cdots |s_t,a_t) \, ds_{t+1}:a_{\infty}\)
    \(=\int_{s_{t+1},a_{t+1}} \int_{s_{t+2}:a_{\infty}} (R_t + \gamma G_{t+1}) P(s_{t+2},a_{t+2} \cdots |s_t, a_t, a_{t+1}) \,ds_{t+2}:a_{\infty} P(s_{t+1},a_{t+1}|s_t,a_t) \, ds_{t+1}, a_{t+1}\)
    \(=\int_{s_{t+1},a_{t+1}} (R_t + Q(s_{t+1}, a_{t+1})) P(s_{t+1}, a_{t+1} | s_t, a_t) \,ds_{t+1},a_{t+1}\)
\[Q(s_t,a_t) = r_{s_t}^{a_t}+ \gamma \sum_{s_{t+1}} P_{s_t s_{t+1}}^{a_t} \sum_{a_{t+1}} \pi(a_{t+1}|s_{t+1})Q(s_{t+1},a_{t+1})\] \[=E_{\pi}[r_{t+1} + \gamma Q(s_{t+1},a_{t+1})]\]
  • 수식이 어떻게 변화되고 그 과정을 이해하고 Q, V 형태 변환을 외우면 된다.
  • 앞으로 벨만 방정식을 굉장히 많이 사용하기 때문에 Q, V 관계를 이해하고 그 관계식을 외우는 것이 도움이 될 것이다.

Reference

  1. 혁펜하임 유투브
This post is licensed under CC BY 4.0 by the author.