[Paper review] SAC: Soft Actor Critic

2025. 4. 18. 14:30 · 🧪 Data Science/Paper review

 

 

I am currently interning at an AI lab at UNIST.

We give seminar presentations every Thursday, and for mine I read and reviewed the SAC paper.

This post is a brief summary of that review.

 

 

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor


https://arxiv.org/abs/1801.01290

 


 

 

 

[1] Key Contributions of the SAC Paper

1-1. Applies the entropy-maximization framework to an off-policy algorithm, aiming to achieve sample efficiency and robustness at the same time.

1-2. Builds on soft policy iteration, replacing the tabular Q with function approximation so that SAC can be used in continuous environments.

1-3. Rather than deriving the policy from Q via a Boltzmann distribution and KL divergence, the policy is parameterized and updated explicitly.

1-4. The result is a Critic (Q-network) / Actor (policy-network) architecture.

 

 

[2] Shortcomings of the SAC Paper

2-1. Does not directly resolve the deadly triad (no convergence proof when off-policy learning, function approximation, and bootstrapping are combined at once): convergence is proven only for soft policy iteration (off-policy, tabular, bootstrapping). So the convergence proof the paper claims as a contribution is not delivered in the full sense (according to the PhD students in the lab, one was provided in follow-up work).

2-2. The paper claims low hyperparameter sensitivity as a contribution. Demonstrating this would require a sensitivity study, but no direct experiment of that kind appears in the paper; instead, it seems to be argued indirectly from the fact that, compared with other models, the average score rises relatively stably.

 

 

[3] Background Worth Knowing Before Reading the SAC Paper

3-1. Entropy definition and Entropy Maximization framework in RL

3-2. Likelihood ratio and Reparameterization trick

3-3. Importance Sampling

3-4. Policy iteration and A2C

3-5. KL divergence and how it enters the objective function
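As a quick illustration of 3-1: the differential entropy of a Gaussian (the distribution SAC's actor outputs) has a closed form, and it matches the Monte-Carlo estimate −E[log p(a)] used implicitly in the objective. A small sketch with illustrative numbers:

```python
import math
import random

# Differential entropy of a 1-D Gaussian: H = 0.5 * log(2 * pi * e * sigma^2)
sigma = 0.5
analytic = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def log_prob(a, sigma):
    # log-density of N(0, sigma^2) at a
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - a ** 2 / (2 * sigma ** 2)

# Monte-Carlo estimate: H = -E[log p(a)], a ~ N(0, sigma^2)
random.seed(0)
samples = [random.gauss(0.0, sigma) for _ in range(100_000)]
mc = -sum(log_prob(a, sigma) for a in samples) / len(samples)

print(round(analytic, 3), round(mc, 3))  # the two values agree closely
```

This is the sense in which "taking the log of the probability" (section [4] below) yields an entropy term: averaging −log π over sampled actions estimates the policy's entropy.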

 

 

[4] The SAC Algorithm and the Objective Functions of the Q/Value/Policy Networks

 

 

๋…ผ๋ฌธ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
4-1. ํ•™์Šต ์‹œ์ž‘ ์ „์— ๋ชจ๋“  ํŒŒ๋ผ๋ฏธํ„ฐ ์ดˆ๊ธฐํ™”
4-2. Step์„ ๋ฐŸ์œผ๋ฉด์„œ ํ˜„์žฌ policy๋กœ Replay Buffer์— ๊ฒฝํ—˜ ์ถ•์ 

4-3. Network ๋ณ„๋กœ ๊ฐ Gradient Step์— ๋”ฐ๋ผ ์—…๋ฐ์ดํŠธ ์ง„ํ–‰
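The loop in 4-1 to 4-3 can be sketched as follows. This is a minimal skeleton, not the authors' code: the environment dynamics and the per-network gradient steps are hypothetical placeholders so that only the control flow is visible.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience store for off-policy updates (step 4-2)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks temporal correlation between transitions
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Outer loop (4-1 to 4-3); env and updates are toy stand-ins
random.seed(0)
buffer = ReplayBuffer(capacity=1000)        # 4-1: initialize (params would go here too)
state = 0.0
for step in range(100):
    action = random.uniform(-1, 1)          # stand-in for sampling from the policy
    next_state, reward, done = state + action, -abs(action), False
    buffer.push(state, action, reward, next_state, done)  # 4-2: accumulate experience
    state = next_state
    if len(buffer) >= 32:
        batch = buffer.sample(32)           # 4-3: one gradient step per network would use this batch
```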

 

 

The objective functions of the Q/value/policy networks are as follows.

Taking the log of the action probability turns it into an entropy term; the objective functions are designed so that this term is maximized together with the reward.
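For reference, in the paper's notation the maximum-entropy objective and the three network losses are (transcribed from the SAC paper; $V_{\bar\psi}$ denotes the target value network):

```latex
J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[\, r(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t)) \,\big]

J_V(\psi) = \mathbb{E}_{s_t\sim\mathcal{D}}\Big[ \tfrac{1}{2}\big( V_\psi(s_t) - \mathbb{E}_{a_t\sim\pi_\phi}\!\left[ Q_\theta(s_t,a_t) - \log\pi_\phi(a_t\mid s_t) \right] \big)^2 \Big]

J_Q(\theta) = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\Big[ \tfrac{1}{2}\big( Q_\theta(s_t,a_t) - r(s_t,a_t) - \gamma\,\mathbb{E}_{s_{t+1}\sim p}\!\left[ V_{\bar\psi}(s_{t+1}) \right] \big)^2 \Big]

J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D},\,\epsilon_t\sim\mathcal{N}}\big[ \log\pi_\phi\!\left( f_\phi(\epsilon_t; s_t) \mid s_t \right) - Q_\theta\!\left( s_t, f_\phi(\epsilon_t; s_t) \right) \big]
```

The $-\log\pi_\phi$ terms are exactly the entropy contribution described above, and $f_\phi(\epsilon_t; s_t)$ is the reparameterized action (background item 3-2).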

 

 

 

[5] SAC Experiments and Ablation Studies

5-1. Environments: continuous-control tasks from MuJoCo and rllab

5-2. Comparative experiment results: SAC performs well in every environment except Walker.

 

 

 

5-3. Ablation study: stochastic vs. deterministic

SAC's policy network estimates a mean and standard deviation and samples actions from the resulting distribution.

If a deterministic (non-stochastic) policy is used instead, its performance occasionally exceeds the stochastic one, but averaged over training runs (across seeds) it is actually worse. The training curves also show that the stochastic policy is more stable.
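The stochastic policy in 5-3 is typically implemented as a squashed Gaussian: sample with the reparameterization trick (background item 3-2), squash through tanh, and correct the log-probability for the change of variables. A plain-Python sketch, not the authors' implementation:

```python
import math
import random

def sample_action(mean, log_std, eps=None):
    """Reparameterized tanh-Gaussian sample with change-of-variables log-prob."""
    std = math.exp(log_std)
    if eps is None:
        eps = random.gauss(0.0, 1.0)  # noise sampled independently of the parameters
    u = mean + std * eps              # reparameterization: u = f(eps; mean, std)
    a = math.tanh(u)                  # squash into the bounded action range (-1, 1)
    # log pi(a) = log N(u; mean, std) - log(1 - tanh(u)^2)   (tanh Jacobian)
    log_prob_u = -0.5 * math.log(2 * math.pi * std ** 2) - (u - mean) ** 2 / (2 * std ** 2)
    log_prob = log_prob_u - math.log(1 - a ** 2 + 1e-6)
    return a, log_prob

a, lp = sample_action(mean=0.2, log_std=-1.0)

# The deterministic variant simply uses the mean: a = tanh(mean), no sampling
a_det = math.tanh(0.2)
```

Because the noise `eps` is drawn outside the parameters, gradients can flow through `mean` and `log_std`, which is what makes the policy loss in section [4] trainable.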

 

 

5-4. Ablation study: evaluation, reward scale, target smoothing

When testing performance during training, stochastically sampling the action from the policy performs worse than using the policy network's mean (deterministic evaluation). Reward scaling and the target network's exponential-moving-average coefficient both need to be set appropriately to get good performance.
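The "exponential-moving-average coefficient of the target network" in 5-4 is the Polyak update. A sketch with plain lists standing in for network weights (τ is the smoothing coefficient; small values make the target track slowly, which is what stabilizes the value targets):

```python
def polyak_update(target, source, tau):
    """Exponentially smooth target weights: target <- tau*source + (1-tau)*target."""
    return [tau * s + (1 - tau) * t for t, s in zip(target, source)]

# Toy weights: the target drifts toward the source over repeated updates
target = [0.0, 0.0]
source = [1.0, 2.0]
for _ in range(3):
    target = polyak_update(target, source, tau=0.5)
```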

 

Note that reward scaling is also tied to how much the entropy term counts: if rewards are normalized by a large value, the entropy term's influence grows.