[Reinforcement Learning] SARSA and DQN Concepts


 

 

์‹ฌ์ธต ๊ฐ•ํ™”ํ•™์Šต์˜ ์ฃผ์š” ํฌ์ธํŠธ๋ฅผ ํ™•์‹คํ•˜๊ฒŒ ํŒŒ์•…ํ•˜๊ณ  ๋„˜์–ด๊ฐ„๋‹ค.

๊ทธ ํ›„, SARSA์™€ DQN์˜ ๊ฐœ๋…์„ ์ •๋ฆฌํ•˜๊ณ  ๋‘˜์˜ ์ฐจ์ด์ ์„ ๋น„๊ตํ•œ๋‹ค.

 

 

* ๋ณธ ํฌ์ŠคํŒ…์€ ์ฑ… 'Foundations of Deep Reinforcement Learning: Theory and Practice in Python'์„ ์ฐธ๊ณ ํ•˜๊ณ  ์ •๋ฆฌํ•œ ๊ฒƒ์ž„์„ ๋ฐํž™๋‹ˆ๋‹ค. ํฌ์ŠคํŒ… ๋‚ด์— ์“ฐ์ธ ์ˆ˜์‹๊ณผ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ทธ๋ฆผ์€ ์ฑ…์—์„œ ๊ฐ€์ ธ์˜จ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

 

1. Value-based Algorithm

 

$V^{\pi}(s)$ or $Q^{\pi}(s, a)$

 

์ด์ „ ํฌ์ŠคํŒ…์—์„œ ๋‹ค๋ฃฌ Model-based ๊ธฐ๋ฐ˜ REINFORCE ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ agent๊ฐ€ ์ง์ ‘ policy๋ฅผ ํ•™์Šตํ•ด ๊ฐ€๋Š” ๋ฐฉ์‹์ด์—ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋ฒˆ Value-based ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ state-action ์Œ์„ ์ง์ ‘ ํ‰๊ฐ€ํ•˜๋ฉด์„œ actions์„ ๊ฒฐ์ •ํ•œ๋‹ค.

 

 

Q(s, a) takes both the state and the action into account, while V(s) depends only on the state. V can in fact be derived from Q, because V(s) is the expected value of Q(s, a) over the actions the policy can take in that state.
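
Written out, this is the standard identity relating the two value functions:

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi}(s, a) \right]$$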

 

 

Because $V^{\pi}(s)$ carries no information about specific actions, acting from it inevitably makes the cost of restarting from any given state grow: the agent would have to try every action available in that state. $Q^{\pi}(s, a)$ avoids this problem, which is why it is the one used more often in practice.

 

 

 

2. Temporal Difference Learning (TD)

: A family of model-free reinforcement learning methods. It learns by bootstrapping from the current estimate of the value function.

 

TD์˜ Q ์ˆ˜์‹

 

SARSA ๋ฌธ์ œ์—์„  trajectories๋ฅผ ์ƒ์„ฑํ•˜๊ณ , Q-value for each (s, a) pair๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค. trajectories๋Š” $Q_{tar}$๋ฅผ ์ƒ์„ฑํ•  ๋•Œ ํ™œ์šฉํ•œ๋‹ค. ๊ฒฐ๊ตญ Q-value์™€ $Q_{tar}$ ์‚ฌ์ด์˜ ๊ฐ„๊ฒฉ์„ ์ค„์—ฌ๊ฐ€๋Š” ๋ฌธ์ œ๋‹ค. TD learning์€ $Q_{tar}$๋ฅผ ์–ผ๋งˆ๋‚˜ ์ž˜ ์ƒ์„ฑํ•  ๊ฒƒ์ธ๊ฐ€์— ๋Œ€ํ•œ ์ด์•ผ๊ธฐ๋‹ค.

 

So why use TD?

 

2-1. $Q_{tar:MC}^{\pi}$ denotes the value function estimated via Monte Carlo (MC); as its equation shows, it is simply the average of the returns over all trajectories.

MC ๋ฐฉ์‹์˜ Q ์ˆ˜์‹

 

MC ๋ฐฉ์‹์„ ํ†ตํ•  ๊ฒฝ์šฐ, agent๋Š” Q๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด episode๋“ค์ด ๋๋‚  ๋•Œ๊นŒ์ง€ ๊ธฐ๋‹ค๋ ค์•ผ ํ•œ๋‹ค. ์œ„์˜ ์ˆ˜์‹์„ ๋ณด์•„๋„, Q๋Š” trajectories๋“ค์˜ returns ํ•ฉ์ด ๋‚˜์˜ฌ ๋•Œ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค. ๋ฐ˜๋ฉด์—, TD๋Š” time-step ๋ณ„๋กœ ์ด์ „์˜ value function ๊ฐ’๋“ค์„ ํ† ๋Œ€๋กœ ํ˜„์žฌ value function์„ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด์— ๋”ฐ๋ผ, trajectories๊ฐ€ ๋๋‚  ๋•Œ๊นŒ์ง€ ๊ธฐ๋‹ค๋ฆด ํ•„์š”๊ฐ€ ์—†๊ฒŒ ๋œ๋‹ค.

 

2-2. DP (Dynamic Programming) also works step by step: it updates at every time step, estimating the current value from the results computed before it. However, DP performs full-width backups, sweeping over every possible case, which makes it a poor fit for reinforcement learning problems where we want to reach good behavior as quickly as possible. In short, TD can be seen as borrowing MC's random-sample backups and DP's per-time-step, bootstrapped updates.

 

 

|    | Model-free     | Bootstrapping | Update         |
|----|----------------|---------------|----------------|
| DP | X (full-width) | O             | Time-step      |
| MC | O (sample)     | X             | End of episode |
| TD | O (sample)     | O             | Time-step      |

(Summary table)

 

 

 

3. SARSA: Concept and Training Process

: One of the value-based reinforcement learning algorithms

 

Key characteristics
1) Uses TD learning on the Q-function.
2) ε-greedy policy (a method for generating actions): one way of deciding the next action. By leaving room to select an action at random, it keeps the agent exploring instead of committing to a biased early estimate of Q.
3) On-policy: besides algorithms that learn the policy directly (such as REINFORCE), an algorithm is also considered on-policy when its target value depends on the policy that generates the experience. Looking at line 8 of the algorithm, the target value $y_{i}$ depends on a', and a' comes from the experience-generating policy (ε-greedy), so SARSA can be considered on-policy.

 

 

SARSA algorithm์„ ๊ฐ„๋žตํ•˜๊ฒŒ ์‚ดํŽด๋ณด์ž. (1) ์ฒ˜์Œ์— ์ž…์‹ค๋ก ๊ณผ ์„ธํƒ€๋ฅผ ์ดˆ๊ธฐํ™”ํ•˜๊ณ , Max step์„ ์ง€์ •ํ•˜๊ณ  (2) for๋ฌธ์„ ๋Œ๋ฆฐ๋‹ค. (3) ε-greedy๋ฅผ ํ†ตํ•ด trajectories๋ฅผ ์ƒ์„ฑ, (4) ๊ทธ ์•ˆ์—์„œ ๋‹ค์‹œ Q ๊ฐ’์„ ๊ตฌํ•˜๋ฉฐ target value๋ฅผ ์—…๋ฐ์ดํŠธํ•œ๋‹ค. (5) target value์™€ $Q^{\pi}$์˜ ์ฐจ์ด๋ฅผ ๊ตฌํ•˜๊ณ , (6) ์„ธํƒ€๋ฅผ ์—…๋ฐ์ดํŠธํ•œ๋‹ค.  

 

An algorithm often mentioned alongside SARSA is Q-learning (its deep-neural-network version is DQN). The overall procedure is the same, but the way $Q^{\pi}$ is updated differs slightly: SARSA bootstraps from the state and action that are actually taken next, while Q-learning bootstraps from the largest $Q^{\pi}(s', a')$ over the possible next actions.
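
Side by side, the two targets (written in their standard forms, not copied from the book's figure) are:

$$y_{SARSA} = r + \gamma \, Q^{\pi}(s', a'), \quad a' \text{ drawn from the } \varepsilon\text{-greedy policy}$$

$$y_{Q\text{-}learning} = r + \gamma \, \max_{a'} Q^{\pi}(s', a')$$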

 

 

 

 

 

4. DQN: Concept and Training Process

: ๋”ฅ๋Ÿฌ๋‹๊ณผ ๊ฐ•ํ™”ํ•™์Šต์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์ธ๊ฐ„ ์ˆ˜์ค€์˜ ๋†’์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•œ ์ฒซ ๋ฒˆ์งธ ์•Œ๊ณ ๋ฆฌ์ฆ˜(from DeepMind)

 

๋จผ์ € Q-learning์ด๋ž€?

SARSA์™€ ๊ฑฐ์˜ ๋น„์Šทํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค. ํ˜„์žฌ ํ–‰๋™์„ ์ƒ์„ฑํ•  ๋•Œ๋Š” ε-greedy๋ฅผ ํ™œ์šฉํ•˜์ง€๋งŒ, ํ›„์— ๋‹ค์Œ ํ–‰๋™์„ ์˜ˆ์ธกํ•˜๊ฑฐ๋‚˜ ์ƒ์„ฑํ•  ๋•Œ๋Š” ε-greedy๊ฐ€ ์•„๋‹ˆ๋ผ ์ตœ๊ณ ์˜ return์„ ์ฃผ๋Š” action์„ ์„ ํƒํ•˜๋Š” ๋ฐฉ์‹์ด ํ™œ์šฉ๋˜์—ˆ๋‹ค. ํ–‰๋™ํ•˜๋Š” ์ •์ฑ…๊ณผ ํ•™์Šตํ•  ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ์ •์ฑ…์ด ๋‹ค๋ฅด๊ธฐ์—, Q-learning์€ off-policy ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค. 

 

Q-learning์˜ ๋‹จ์ : state-action(s, a)์— ํ•ด๋‹นํ•˜๋Š” Q-value๋ฅผ ํ…Œ์ด๋ธ” ํ˜•์‹์œผ๋กœ ์ €์žฅํ•˜๊ณ  ํ•™์Šตํ•œ๋‹ค. ์ด ๊ฒฝ์šฐ, state/action space๊ฐ€ ์ปค์ง€๋ฉด memory/exploration time ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒ

 

ํ•ด๊ฒฐ์ : ๋”ฅ๋Ÿฌ๋‹์„ ํ™œ์šฉํ•˜์—ฌ Q-table์— ํ•ด๋‹นํ•˜๋Š” Q-function์„ ๋น„์„ ํ˜• ํ•จ์ˆ˜๋กœ ๊ทผ์‚ฌ ์‹œํ‚จ๋‹ค๋ฉด ์–ด๋–จ๊นŒ? ๊ตณ์ด ์ €์žฅํ•  ํ•„์š”๊ฐ€ ์—†์–ด์ง„๋‹ค.

 

 

์ถœ์ฒ˜: ์ด๊ฒƒ์ €๊ฒƒ ํ…Œํฌ๋ธ”๋กœ๊ทธ

 

 

DQN์˜ 3๋Œ€ ์š”์†Œ

 

4-1) CNN Architecture

: The Q-function is approximated not with a linear function but with a CNN architecture.
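
As a rough illustration (not the book's or DeepMind's exact code), a PyTorch Q-network in the spirit of the DQN Atari architecture might look like this, assuming stacks of four 84x84 grayscale frames as input:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN Q-network in the spirit of the DQN Atari architecture (illustrative sizes)."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map assumes 84x84 input frames
            nn.Linear(512, n_actions),              # one Q-value per action
        )

    def forward(self, x):
        # Scale raw pixel values to [0, 1] before the convolutional stack.
        return self.head(self.features(x / 255.0))
```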

 

 

4-2) Experience Replay

: At every step, the sampled transition $e_{t} = (s_{t}, a_{t}, r_{t}, s_{t+1})$ is stored in a replay memory D. Samples stored in D are then drawn uniformly at random and used for the Q-update (a minimal sketch follows below).

Advantage 1: learning from samples in the order they arrive means training on consecutive, strongly correlated samples, which prevents proper learning. Random sampling lets the model largely ignore the dependency between samples.

Advantage 2: when a Q-update changes the behavior policy, the distribution of the training data generated by that policy can also shift abruptly, and sudden shifts in the training-data distribution hurt learning stability. Because each update uses samples drawn at random from different points in time, the data does not end up skewed toward a single distribution.

 

 

4-3) Target Network

: A structure with two networks, a main Q-network and a target network. The process is as follows. (1) Use the target network to compute y. (2) Use the main Q-network to compute Q(s, a). (3) Use the loss between them to update the main Q-network. (4) Every C steps, update the target network with the main Q-network's weights.

 

Main Q-network: takes the state and action and produces the resulting Q(s, a); its parameters are updated at every step.
Target network: an identical copy of the Q-network, used for the max Q(s', a') term when computing the target value y; it is updated with the main Q-network's weights every C steps.

 

์žฅ์  1: ๊ธฐ์กด Q-network๋Š” $\theta$๊ฐ€ ์—…๋ฐ์ดํŠธ๋˜๋ฉด, ๊ฒฐ๊ด๊ฐ’์ธ action-value์™€ target value(y)๊ฐ€ ๋™์‹œ์— ์›€์ง์ด๊ฒŒ ๋œ๋‹ค. ํ•˜์ง€๋งŒ C ์Šคํ…๋™์•ˆ Target value๋ฅผ ๊ณ ์ •ํ•ด ๋‘๋ฉด, ์›ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์—…๋ฐ์ดํŠธ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

(1) line 9์—์„œ Boltzmann policy๋ฅผ ํ™œ์šฉํ•˜์—ฌ generating.
(2) line 10-11์—์„œ ์ƒ์„ฑ๋œ ๋ฐ์ดํ„ฐ๋“ค์„ batch๋กœ sampling.
(3) line 12์—์„œ U๋ฒˆ์˜ epoch.
(4) line 13์€ ๊ฐ action๋“ค์„ ๋งํ•˜๊ณ , ์ด ์•ˆ์—์„œ y ์—…๋ฐ์ดํŠธ๊ฐ€ ์ด๋ค„์ง„๋‹ค.
(5) line 18-20์—์„œ $\theta$ ์—…๋ฐ์ดํŠธ.

 

 

What is the Boltzmann policy?
: An alternative to ε-greedy. It improves on leaving exploration entirely to chance: if a particular action is associated with higher Q-values, it can be selected more often.

 

ํŒŒ๋ผ๋ฏธํ„ฐ(temperature)๊ฐ€ ํฌ๋ฉด, distribution์ด uniform ํ•ด์ง„๋‹ค. ๋ฐ˜๋Œ€๋กœ ์ž‘์œผ๋ฉด, concentrated ๋  ์ˆ˜ ์žˆ๋‹ค. ๋” ๋‚˜์€ ํ•™์Šต์„ ์œ„ํ•ด ์ธ๊ฐ„์ด ์ง์ ‘ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ์ด๊ธฐ์—, epilon-greedy์— ๋น„ํ•ด local minima์— ๋น ์ง€๊ธฐ ์‰ฝ๋‹ค.