[Reinforcement Learning] The REINFORCE Algorithm: Concepts and Equations


I have been studying reinforcement learning, and I want to organize the equations and code for the various algorithms.

์ด ํฌ์ŠคํŒ…์€ ์ฒซ ๋ฐœ๊ฑธ์Œ์ด๋ฉฐ, REINFORCE ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋Œ€ํ•ด ๋‹ค๋ฃจ๊ฒ ๋‹ค.

ํŒŒ์ดํŒ…!

 

 

๋ณธ ํฌ์ŠคํŒ…์€ ์ฑ… Foundation of Deep Reinforcement Learning / laura.G์—์„œ ์ˆ˜์‹ ๋ฐ ๋‚ด์šฉ์„ ์ฐธ๊ณ ํ•˜์—ฌ ์“ฐ์ธ ๊ธ€์ž…๋‹ˆ๋‹ค.

 

 

 

(Figure source: OpenAI Spinning Up)

 

 

 

1. REINFORCE Concepts

 

1.1. Model-free vs Model-Based

 

๊ฐ•ํ™”ํ•™์Šต์€ ํฌ๊ฒŒ Model-free, Model-Based๋กœ ๋‚˜๋‰œ๋‹ค. Model-Based ๊ฐ•ํ™”ํ•™์Šต์€ trajectory๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šตํ•œ๋‹ค. ๋‹ค์–‘ํ•œ ๊ฐ€์ง“์ˆ˜์˜ action์— ๋Œ€ํ•œ ํ™•๋ฅ  ๋†’์€ trajectory๋ฅผ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์ด ๊ฒฝ์šฐ ๋ฌด์ž‘์œ„๋กœ sampling ๋œ trajectory๋ฅผ ํ•™์Šตํ•˜๋Š” Model-free๋ณด๋‹ค ๋” ๋น ๋ฅด๊ฒŒ ๋ชฉํ‘œ๋ฌผ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๊ณ  ์ŠนํŒจ๊ฐ€ ๋ถ„๋ช…ํ•˜๊ฒŒ ์กด์žฌํ•˜๋Š” ์ƒํ™ฉ์—์„  ํšจ์œจ์ ์œผ๋กœ ์ž‘์šฉํ•œ๋‹ค. 

However, many real-world situations are stochastic and often cannot be captured by a single equation or model. The criteria for what counts as a good model are also unclear, so model-free methods are studied more actively, and REINFORCE likewise learns in a model-free way.

 

 

1.2. Policy-optimization 

 

Policy-optimization์ด REINFORCE์˜ ํ•ต์‹ฌ์ด๋‹ค. Policy๋ฅผ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฌธ์ œ์ด๋‹ค. ๊ทธ์ค‘, policy-gradient์— ๋Œ€ํ•ด ์„ค๋ช…ํ•˜๊ฒ ๋‹ค. ์•„๋ž˜ ๊ทธ๋ฆผ์„ ์‚ดํŽด๋ณด์ž. State(t)๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, Agent๋Š” Action(t)๋ฅผ ํ–‰ํ•œ๋‹ค. ์ด์— ๋Œ€ํ•ด Reward(t)๊ฐ€ ์ฃผ์–ด์ง„๋‹ค. ํ•œ Trajectorty๋Š” ๋‹ค์–‘ํ•œ ์‹œ๊ฐ„๋Œ€์˜ timestep์„ ๋‹ด๊ณ  ์žˆ์œผ๋ฉฐ, ๊ฐ ์‹œ๊ฐ„๋Œ€์˜ Reward๋ฅผ Discount Sumํ•˜์—ฌ ๊ธฐ๋Œ€์น˜๋ฅผ ๋‚˜ํƒ€๋‚ธ ๊ฒƒ์ด J(t) ๋ชฉ์ ํ•จ์ˆ˜์ด๋‹ค. ์ด ๋ชฉ์ ํ•จ์ˆ˜๋ฅผ maximizeํ•˜๋Š” ์ •์ฑ…์„ ๋งŒ๋“ค์–ด ๊ฐ€๋Š” ๊ฒƒ์ด Policy-optimization์ด๋ผ๊ณ  ์ดํ•ดํ•˜๋ฉด ๋˜๊ฒ ๋‹ค. ๊ทธ ์ค‘ ๋ชฉ์ ํ•จ์ˆ˜ ๊ฒฐ๊ณผ๋ฅผ policy์— ์ ์šฉํ•˜์—ฌ ์—…๋ฐ์ดํŠธํ•ด๋‚˜๊ฐ€๋Š” ๊ฒƒ์„ policy-gradient ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋ผ๊ณ  ํ•œ๋‹ค.

 

 

 

1.3. Policy-gradient needs three components

 

1) A parametrized policy

2) An objective to be maximized

3) A method for updating the policy parameters

์œ„์—์„œ ์†Œ๊ฐœํ•œ polilcy-gradient ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ 3๊ฐœ์˜ ๊ตฌ์„ฑ์š”์†Œ๋ฅผ ํ•„์š”๋กœ ํ•œ๋‹ค. ์—…๋ฐ์ดํŠธ๋ฅผ ํ•  policy, policy๋ฅผ ์กฐ์ •ํ•  objective function, ์กฐ์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•(policy-gradient) ๋“ฑ์ด๋‹ค. policy๋Š” states์— ๋Œ€ํ•˜์—ฌ action probabilities๋ฅผ ๋งคํ•‘ํ•œ ํ•จ์ˆ˜์ด๋‹ค.

์—…๋ฐ์ดํŠธ๊ฐ€ ๋ ์ˆ˜๋ก ๋†’์€ reward๋ฅผ ๋„์ถœํ•˜๋Š” > trajectory๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” > action์˜ probabilities๊ฐ€ ๋†’๊ฒŒ ๋‚˜์˜ค๋Š” policy๊ฐ€ ๋  ๊ฒƒ์ด๋‹ค.

 

 

 

2. The Equations in REINFORCE

2.1. Policy

 

$\pi_{\theta} = \text{policy}$

$\pi$ is the policy function itself, and $\theta$ denotes its learnable parameters.

'We say that the policy is parameterized by $\theta$ ' 

 

Naturally,

$\pi_{\theta_{1}} \neq \pi_{\theta_{2}}$

holds: because the parameters themselves differ, the two are entirely different policies.

 

 

2.2. The Objective Function

 

 

$R_{t}(\tau)$ is the discounted sum of the rewards received in a particular trajectory, and the objective function is the expectation of the reward sum computed over all timesteps of a trajectory.
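Written out with discount factor $\gamma$ and final timestep $T$ (standard notation; $R_{t}(\tau)$ starts the sum at timestep $t$):

$R_{t}(\tau) = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$

$J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ \sum_{t=0}^{T} \gamma^{t} r_{t} \right]$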

 

 

2.3. The Policy Gradient

 

 

policy์˜ parameter์ธ $\theta$๋Š” ๋ชฉ์ ํ•จ์ˆ˜์˜ ๋ฏธ๋ถ„ ๊ฒฐ๊ณผ ์ฆ‰, ๋ชฉ์ ํ•จ์ˆ˜์˜ ์ฆ๊ฐ์†Œ๋ถ„์œผ๋กœ ์—…๋ฐ์ดํŠธ๋œ๋‹ค.

$\alpha$ is the learning rate.

๋ชฉ์ ํ•จ์ˆ˜์˜ ๋ฏธ๋ถ„์€ ์‚ฌ์‹ค ์กฐ๊ธˆ ๋” ๋ณต์žกํ•œ ์—ฐ์‚ฐ์— ์˜ํ•ด ์œ ๋„๋˜์ง€๋งŒ, ๊ตณ์ด ์„ค๋ช…ํ•˜์ง€๋Š” ์•Š๊ฒ ๋‹ค. ์ตœ์ข… ํ˜•ํƒœ๋งŒ ์ดํ•ดํ•˜๊ณ  ๋„˜์–ด๊ฐ€๊ฒ ๋‹ค.

 

 

2.4. Monte Carlo Sampling

 

Monte Carlo ์ƒ˜ํ”Œ๋ง์€ ๋ฌด์ˆ˜ํžˆ ๋งŽ์€ ์ƒ˜ํ”Œ์„ ๋ฝ‘์•„, ๊ฐ’์„ ํ‰๊ท ๋‚ด๋ฉด ์ตœ์ข… ๊ฐ’์— ๊ทผ์‚ฌํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ด๋ก ์ด๋‹ค.

๊ธธ์ด๊ฐ€ 2์ธ ์ •์‚ฌ๊ฐํ˜• ์•ˆ์— ๋ฐ˜์ง€๋ฆ„์ด 1์ธ ์›์ด ๋‚ด์ ‘ํ•ด ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜์ž. ์ด๋“ค์˜ ๋„“์ด ๋น„๋Š” ์•„๋ž˜ ์‹๊ณผ ๊ฐ™์ด ๊ณ„์‚ฐ๋œ๋‹ค.

 

But suppose we did not know the radius. How could we compute the area ratio? If we scatter a very large number of points uniformly at random inside the square and divide the number of points that land inside the circle by the total number of points, we obtain an approximation of the ratio.
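A minimal sketch of this estimate in plain Python (the circle of radius 1 centered at the origin, points drawn uniformly from the enclosing square):

```python
import random

def estimate_area_ratio(n_samples=1_000_000):
    """Monte Carlo estimate of (circle area) / (square area) = pi / 4."""
    inside = 0
    for _ in range(n_samples):
        x = random.uniform(-1.0, 1.0)   # point inside the 2x2 square
        y = random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:        # point landed inside the inscribed circle
            inside += 1
    return inside / n_samples

ratio = estimate_area_ratio()
print(ratio, ratio * 4)  # ~0.785 and ~3.14 (an estimate of pi)
```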

 

 

์ด๊ฒƒ์ด ๋ฐ”๋กœ Monte Carlo Sampling ๊ธฐ์—…์ด๋ฉฐ, $\tau$ (trajectory)๋ฅผ sampling ํ•  ๋•Œ ์ด์šฉ๋œ๋‹ค.

 

 

3. REINFORCE Algorithms

 

3.1. Basic

 

REINFORCE ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฝ”๋“œ๋ฅผ ๋ค๋‹ค.

 

 

An episode is one complete run of the task. If there are 100 episodes, the agent goes through 100 rounds of learning. In supervised-learning terms, it is loosely analogous to an epoch.

 

(1) In each episode, sample a trajectory $\tau$ by running the current policy (Monte Carlo sampling).

(2) With the gradient of the objective initialized to zero,

(3) compute the reward and the objective term at each timestep; inside this loop the gradient of the objective keeps being accumulated.

(4) When the episode ends, update $\theta$ (the parameters of the policy) with the accumulated gradient.

(5) When all episodes are finished, training is complete.
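Below is a minimal REINFORCE sketch that follows these steps. It assumes PyTorch and Gymnasium with the CartPole-v1 environment; the network architecture and the hyperparameters (learning rate, discount factor, number of episodes) are illustrative choices, not values from the book.

```python
import gymnasium as gym
import torch
import torch.nn as nn

gamma, alpha, n_episodes = 0.99, 1e-3, 500      # illustrative hyperparameters

env = gym.make("CartPole-v1")
policy = nn.Sequential(                          # pi_theta: state -> action logits
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=alpha)

for episode in range(n_episodes):
    # (1) Sample one trajectory tau by running the current policy.
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        probs = torch.softmax(policy(torch.as_tensor(state, dtype=torch.float32)), dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))  # log pi_theta(a_t | s_t)
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # (2)-(3) Compute the discounted return R_t(tau) for every timestep.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Policy-gradient objective: maximize sum_t R_t(tau) * log pi_theta(a_t | s_t),
    # i.e. minimize its negative.
    loss = -(torch.stack(log_probs) * returns).sum()

    # (4) Update theta once per episode.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (episode + 1) % 50 == 0:
        print(f"episode {episode + 1}: return = {sum(rewards):.1f}")
```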

 

 

3.2. Improving

 

However, there is a problem. When trajectories $\tau$ are drawn with Monte Carlo sampling, the rewards can differ greatly from trajectory to trajectory, which creates a high-variance problem. To address this, we apply reward normalization, one form of reward scaling. A representative form is the following.
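One common way to write it is to subtract a baseline $b$ from the return inside the policy-gradient estimate, with $b$ chosen as the mean return (matching the description below):

$\nabla_{\theta} J(\pi_{\theta}) \approx \sum_{t=0}^{T} \big(R_{t}(\tau) - b\big)\, \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})$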

 

 

The idea is simple: subtract a particular value from the reward. Commonly $E\big(\sum R_{t}(\tau)\big)$ is used; subtracting this mean return centers the returns of each trajectory around 0.

์ด ๊ณผ์ •์„ ํ†ตํ•ด, high variance ๋ฌธ์ œ๋ฅผ ์–ด๋Š ์ •๋„ ํ•ด์†Œํ•  ์ˆ˜ ์žˆ๋‹ค.