[Paper Review] Sample-Efficient Multi-agent RL with Reset Replay

2024. 11. 11. 10:41ใ†๐Ÿงช Data Science/Paper review

 

 

๋‹ค์Œ ์ฃผ, ์—ฐ๊ตฌ์‹ค ๋…ผ๋ฌธ ์„ธ๋ฏธ๋‚˜์—์„œ ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ ๋ฐœํ‘œ๋ฅผ ํ•œ๋‹ค.

๋…ผ๋ฌธ ์ œ๋ชฉ์€ "Sample-Efficient Multi-agent Reinforcement learning with Reset Replay (Yaodong Yang, 2024, ICML)"

 

๋…ผ๋ฌธ ํ‚ค์›Œ๋“œ์˜ ๊ธฐ๋ณธ ๊ฐœ๋…๋“ค์„ ํ›‘๊ณ , ์„ธ๋ถ€์ ์ธ ๋‚ด์šฉ์„ ์ดํ•ดํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ง„ํ–‰ํ•˜๊ณ ์ž ํ•œ๋‹ค. 

 

Keywords: Multi-agent Reinforcement Learning, Sample Efficiency, Reset Replay, Replay Buffer

 

 

https://openreview.net/forum?id=w8ei1o9U5y&referrer=%5Bthe%20profile%20of%20Pheng-Ann%20Heng%5D(%2Fprofile%3Fid%3D~Pheng-Ann_Heng1)

 

Sample-Efficient Multiagent Reinforcement Learning with Reset Replay


 

 

 

1. Introduction/Background: MARL & Related Concepts

Unlike single-agent reinforcement learning, MARL (Multi-Agent Reinforcement Learning) involves multiple agents. Each agent optimizes its own policy while taking into account both the environment and the other agents' actions. Because the agents must account for one another, cooperation, competition, and interaction dynamics become key elements of MARL. Since the interactions among agents have to be folded into policy learning, sample efficiency becomes critical. The paper proposes a way to raise sample efficiency while also improving learning performance.

 

 

1.1. MDP (Markov Decision Process) ➡️ Markov Games

๋ฉ€ํ‹ฐ ์—์ด์ „ํŠธ๋Š” ๋‹จ์ผ ์—์ด์ „ํŠธ๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ชจ๋ธ๋ง ๋ฐฉ๋ฒ•์ธ MDP์˜ ํ™•์žฅ๋œ ๋ฒ„์ „์ธ Markov Games๋ฅผ ์“ด๋‹ค. ๋‘ ๋ชจ๋ธ๋ง ๋ฐฉ๋ฒ• ๋ชจ๋‘ ๊ตฌ์„ฑ์š”์†Œ๋Š” ๊ฐ™๋‹ค. ์ƒํƒœ ๊ณต๊ฐ„(State), ํ–‰๋™ ๊ณต๊ฐ„(Action), ์ƒํƒœ ์ „์ด ํ•จ์ˆ˜(P(s|a)), ๋ณด์ƒ ํ•จ์ˆ˜(R(s, a)), ํ• ์ธ ์ธ์ž(gamma).

 

However, because a Markov game takes into account the other agents as well as the environment, its formulation looks like this:

 

(1) ์ƒํƒœ ์ „์ด ํ•จ์ˆ˜: $$ P(s' \mid s, a_1, a_2, \dots, a_n) $$

(2) ๊ฐ ์—์ด์ „ํŠธ i์— ๋Œ€ํ•œ ๋ณด์ƒ ํ•จ์ˆ˜: $$ R_i(s, a_1, a_2, \dots, a_n) $$

(3) Objective (discounted return) for agent i: $$ R_i = \sum_{t=0}^{\infty} \gamma^t r_i(s_t, a_{1,t}, a_{2,t}, \dots, a_{n,t}) $$

 

์•ž์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด, Markov Games์—์„  ์„œ๋กœ์˜ ํ–‰๋™๋“ค์ด ๋ชจ๋‘ ๊ณ ๋ ค๋จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. State๋„ ์ด์ „ ํ™˜๊ฒฝ์€ ๋ฌผ๋ก , ๊ฐ ์—์ด์ „ํŠธ์˜ ํ–‰๋™๋“ค์— ์˜ํ–ฅ์„ ๋ฐ›๋Š”๋‹ค. ๋™์  ํ™˜๊ฒฝ, ์—์ด์ „ํŠธ๋ผ๋ฆฌ์˜ ์ƒํ˜ธ์ž‘์šฉ์ด ์กด์žฌํ•˜๊ธฐ์— ์ƒ๋Œ€์ ์œผ๋กœ ๋ณต์žก๋„๊ฐ€ ๋†’๊ณ , ๋งŽ์€ ํ•™์Šต์„ ํ•„์š”๋กœ ํ•œ๋‹ค. 

 

Series(์ง๋ ฌ ๋ฐฉ์‹): ๊ฐ ์—์ด์ „ํŠธ๊ฐ€ ์ˆœ์ฐจ์ ์œผ๋กœ ํ•™์Šตํ•œ๋‹ค. ํ•œ ๋ฒˆ์— ํ•œ ์—์ด์ „ํŠธ๋งŒ ํ•™์Šตํ•˜๊ณ  ๋‚˜๋จธ์ง„ ๋Œ€๊ธฐ ์ƒํƒœ. ์ƒํ˜ธ์ž‘์šฉ ํšจ๊ณผ ๋‚ฎ์Œ.
Parrallel(๋ณ‘๋ ฌ ๋ฐฉ์‹): ์—ฌ๋Ÿฌ ์—์ด์ „ํŠธ๊ฐ€ ๋™์‹œ ํ•™์Šต. ์ •์ (์ผ์ • ๊ธฐ๊ฐ„ ์ •์ฑ… ์œ ์ง€), ๋™์ (๊ฒฐ๊ณผ ์ฆ‰์‹œ ๋ฐ˜์˜) ์ •์ฑ… ์—…๋ฐ์ดํŠธ๊ฐ€ ์กด์žฌํ•œ๋‹ค. ๋ฉ€ํ‹ฐ ์—์ด์ „ํŠธ๋Š” ๋ณดํ†ต ํšจ์œจ์„ฑ๊ณผ ํ•™์Šต ์†๋„๋ฅผ ์œ„ํ•ด Parrallel ๋ฐฉ์‹์„ ์„ ํ˜ธํ•œ๋‹ค.

 

 

1.2. Replay Ratio and Sample Efficiency

๋ฉ€ํ‹ฐ ์—์ด์ „ํŠธ๋Š” ์ƒํƒœ ๊ณต๊ฐ„์ด ๊ณ ์ฐจ์›์ ์ด๋ฉฐ Single Agent์— ๋น„ํ•ด ํ›จ์”ฌ ๋‹ค์–‘ํ•ด์ง„ ์ƒํ™ฉ์„ ์ •์ฑ…์œผ๋กœ ์ผ๋ฐ˜ํ™”ํ•ด์•ผ ํ•˜๊ธฐ์—, ๋งŽ์€ ์ƒ˜ํ”Œ์„ ํ•„์š”๋กœ ํ•œ๋‹ค. Sample Efficient๋ฅผ ๋”์šฑ ๊ฐ–์ถ”๊ธฐ ์œ„ํ•œ ๋…ธ๋ ฅ์˜ ๊ณผ์ •์ด ๋…ผ๋ฌธ์˜ ์ฃผ์ œ์ž„์„ ์ฐธ๊ณ .

 

The replay ratio is the number of parameter updates an agent performs per interaction with the environment. Interacting with the environment is costly, so to keep that cost down, several updates are performed for each single environment interaction. Doing so increases sample efficiency.

 

์ƒ˜ํ”Œ ํšจ์œจ์„ฑ์ด๋ž€? ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์–ผ๋งˆ๋‚˜ ํšจ๊ณผ์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฐœ๋…์ด๋‹ค. Sample Efficient๋ฅผ ๊ฐ€์กŒ๋‹ค๋ฉด ์ ์€ ๋ฐ์ดํ„ฐ๋กœ ๋น ๋ฅด๊ฒŒ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒํ•  ์ˆ˜ ์žˆ์Œ์„ ์˜๋ฏธํ•˜๋Š”๋ฐ, Replay Ratio๋ฅผ ์˜ฌ๋ฆฌ๋ฉด ์ง๊ด€์ ์œผ๋ก  ์ ์€ ์ƒํ˜ธ์ž‘์šฉ(์ ์€ ๋ฐ์ดํ„ฐ, ์ ์€ ๋น„์šฉ)์œผ๋กœ ๋‹ค์ˆ˜ ํ•™์Šต์„ ํ†ตํ•ด ์›ํ•˜๋Š” ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑ์‹œํ‚จ๋‹ค. ๋”ฐ๋ผ์„œ Replay Ratio๋Š” Sample Efficient์™€ ๊ฐ•ํ•œ ์—ฐ๊ด€์„ฑ์„ ๊ฐ€์กŒ๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

Sample Efficient ์ˆ˜์‹: Replay Buffer์—์„œ ์ƒ˜ํ”Œ๋ง๋˜๋Š” ๋ฐ์ดํ„ฐ ๊ฒฝํ—˜์˜ ํšŸ์ˆ˜์— ๋Œ€ํ•œ ๊ธฐ๋Œ“๊ฐ’์„ ๋‚˜ํƒ€๋‚ธ๋‹ค. 

$$
\mathbb{E}[N_{\text{sampled}}] = \frac{N_{\text{RR}} \cdot N_{\text{B}}}{V \cdot T_{\text{U}}}
$$

(1) $$ N_{\text{RR}} $$ : Replay ratio. The number of updates performed per interaction.


(2) $$ N_{\text{B}} $$ : Batch size. The number of experiences sampled from the replay buffer per update.


(3) $$ V $$ : Data acquisition speed. The number of experiences collected per unit of time while interacting with the environment.

 

(4) $$ T_{\text{U}} $$ : Update interval. The time between updates.

 

If $$ N_{\text{RR}} = 4 \quad | \quad N_{\text{B}} = 32 \quad | \quad V = 2 \quad | \quad T_{\text{U}} = 4 $$, the expected reuse count is 16. An update cycle happens every 4 steps; each cycle runs 4 updates, and each update consumes 32 sampled experiences. Over 30 steps there are about 7 to 8 update cycles (roughly 30 parameter updates in total), while only 60 experiences (30 × 2) are collected.
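Plugging the numbers above into the formula, as a quick check of the arithmetic:

```python
N_RR, N_B, V, T_U = 4, 32, 2, 4

expected_reuse = (N_RR * N_B) / (V * T_U)   # 128 / 8 = 16.0 samplings per experience

steps = 30
update_cycles = steps / T_U                 # 7.5 -> about 7-8 update cycles
gradient_updates = update_cycles * N_RR     # ~30 parameter updates in total
collected = steps * V                       # 60 experiences collected
print(expected_reuse, gradient_updates, collected)
```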

 

์ˆ˜์‹๋งŒ ๋ณธ๋‹ค๋ฉด, ๋‹ค๋ฅธ ๊ฐ’๋“ค์ด ๊ณ ์ •๋˜์–ด ์žˆ๋‹ค๋Š” ๊ฐ€์ •ํ•˜์— Replay Ratio๋งŒ ์ฆ๊ฐ€์‹œํ‚ค๋ฉด Sample Efficient๋ฅผ ๊ทน๋Œ€ํ™”ํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค.

 

 

1.3. Plasticity Loss in Reinforcement Learning

However, indiscriminately raising the replay ratio causes plasticity loss, which can hurt learning.

Plasticity is a concept borrowed from neuroscience: the ability of an agent to change and adapt its policy or behavior as it interacts with the environment. In other words, if the same data from a single interaction is replayed too many times, the model loses its ability to adapt.

 

Plasticity๋ฅผ ์ˆ˜์‹์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. ์ดˆ๊ธฐ ์„ฑ๋Šฅ(baseline b)๊ณผ ์ตœ๊ทผ ์—…๋ฐ์ดํŠธ๋œ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ loss ๊ธฐ๋Œ“๊ฐ’์„ ๋น„๊ตํ•œ๋‹ค. ์ด ์ฐจ์ด๊ฐ€ ํด์ˆ˜๋ก ๊ฐ€์†Œ์„ฑ์ด ๋†’์Œ์„ ์˜๋ฏธํ•œ๋‹ค.

$$ P(\theta_t) = b - \mathbb{E}_{l \sim L}[l(\theta^*_t)], \quad \text{where} \quad \theta^*_t = \text{OPT}(\theta_t, l) $$

 

Plasticity loss is then defined as follows. It measures how much the network's ability to adapt has decreased during training, computed as the difference between the plasticity at the final and initial time steps of training.

$$ P(\theta_{t=K}) - P(\theta_{t=0}) $$

 

์œ„์˜ ๊ฐ€์†Œ์„ฑ์ด ํ•™์Šต ์‹œ์ž‘๋ถ€ํ„ฐ ๋๊นŒ์ง€ ์ผ์ •ํ•˜๊ฒŒ ์œ ์ง€๋œ๋‹ค๋ฉด, ๋ชจ๋ธ์ด ํ•™์Šต ํ›„์—๋„ ์ƒˆ๋กœ์šด ํ™˜๊ฒฝ์— ๋Œ€ํ•œ ์ ์‘ ๋Šฅ๋ ฅ์„ ์žƒ์ง€ ์•Š์•˜์Œ์„ ์˜๋ฏธํ•œ๋‹ค.

 

๋…ผ๋ฌธ์€ Sample Efficient๋ฅผ ์œ„ํ•ด Replay Ratio๋ฅผ ์˜ฌ๋ฆฌ๋ฉด์„œ๋„  Plasticity Loss๊ฐ€ ๋ฐœ์ƒํ•˜์ง€ ์•Š๋Š” 2๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค.

 

 

 

2. Method & Algorithm

2.1. Method 1: Shrink & Perturb

์œ„์—์„œ ์–ธ๊ธ‰ํ–ˆ๋˜ ๋Œ€๋กœ, Replay ratio๋งŒ ๋†’์ด๋ฉด ๊ฐ€์†Œ์„ฑ ์†์‹ค์ด ๋ฐœ์ƒํ•œ๋‹ค. ์ด๋Š” ํ•™์Šต์„ ๋ถˆ์•ˆ์ •ํ•˜๊ฒŒ ํ•˜๊ณ , agent๊ฐ€ ํ•™์Šตํ•˜๋Š” ์ •์ฑ…์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ๋–จ์–ด๋œจ๋ฆฐ๋‹ค. ์ฆ‰, ์˜ค๋ฒ„ํ”ผํŒ…์ด ๋ฐœ์ƒํ•˜๊ฒŒ ๋œ๋‹ค. ์ด๋ฅผ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์ด Shrink&Perturb(์ค„์ด๊ธฐ+๋ณ€ํ™”)์ด๋‹ค. ๊ฐ€์†Œ์„ฑ ์œ ์ง€๋ฅผ ์œ„ํ•ด ๋„คํŠธ์›Œํฌ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ผ์ • ์ฃผ๊ธฐ์— ๋”ฐ๋ผ reset ์‹œํ‚ค๋Š” ๊ฒƒ์ด๋ผ ๋ณด๋ฉด ๋œ๋‹ค.

 

MARL์— ์ค‘์š”ํ•œ ์š”์†Œ์ธ ์ค‘์•™ ์ง‘์ค‘์‹ ๋น„ํ‰๊ฐ€(Centralized critic network)์™€ ์—์ด์ „ํŠธ ์ •์ฑ… ํ˜น์€ Q-value network์— ๊ฐ€์†Œ์„ฑ์„ ์ฃผ์ž…ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. 

"MARR performs Shrink & Perturb to inject plasticity into both the centralized critic network and each agent's policy or Q-value network to recover learning abiliity of these networks" -paper-

 

 

Formulation of Shrink & Perturb in MARR (= MARL + the scheme proposed in the paper)

 

(1) Agent's policy parameters

$$ \theta_i^t \leftarrow \alpha \theta_i^t + (1 - \alpha) \theta_i^0, \quad \text{for } i = 1, 2, \ldots, N  $$

 

(2) Centralized critic network parameters

$$ \phi^t \leftarrow \alpha \phi^t + (1 - \alpha) \phi^0 $$

 

The interpolation factor α decides how much of the recent parameters to keep. As the equations show, a portion of the initial parameters is periodically mixed back in, so the plasticity of the freshly initialized network keeps being injected into the current model.
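A minimal PyTorch sketch of the interpolation above (my own illustration; it assumes a snapshot of each network's initial parameters was saved right after initialization):

```python
import torch

@torch.no_grad()
def shrink_and_perturb(net, init_params, alpha=0.8):
    """theta <- alpha * theta + (1 - alpha) * theta_0

    net:         current policy / Q / centralized critic network
    init_params: list of parameter tensors saved at initialization (theta_0)
    alpha:       interpolation factor (how much of the trained weights to keep)
    """
    for p, p0 in zip(net.parameters(), init_params):
        p.mul_(alpha).add_((1.0 - alpha) * p0)

# Applied periodically to every agent's policy/Q network and to the
# centralized critic, e.g. at each reset interval during training:
# for agent_net, agent_init in zip(agent_nets, agent_inits):
#     shrink_and_perturb(agent_net, agent_init, alpha)
# shrink_and_perturb(critic_net, critic_init, alpha)
```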

 

์ด๋ ‡๊ฒŒ ๋˜๋ฉด high-replay-ratio๊ฐ€ ๊ฐ€๋Šฅํ•ด์ง€์ง€๋งŒ, same transition experience๋ฅผ ์—…๋ฐ์ดํŠธํ•  ํ™•๋ฅ ์ด ๋†’์•„์ง€๊ฒŒ ๋œ๋‹ค. ์ด๋•Œ ๋ฐ์ดํ„ฐ์˜ ๋‹ค์–‘์„ฑ์„ ์œ„ํ•ด ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ด Data Augmentation์ด๋‹ค. 

 

 

2.2. Method 2: Random Amplitude Scaling (Data Augmentation)

์ƒ˜ํ”Œ ๋ฐ์ดํ„ฐ์˜ ๋‹ค์–‘์„ฑ์„ ์œ„ํ•ด์„œ ์ƒ˜ํ”Œ๋ง๋œ transition batch B์— data augmentation์„ ์ ์šฉํ•œ๋‹ค. ๋ฐฉ์‹์€ random amplitude scale ๋ฐฉ์‹์ด๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ๊ธฐ์กด์˜ ๋ฐฉํ–ฅ์„ฑ์„ ์œ ์ง€ํ•œ ์ƒํƒœ์—์„œ ๋žœ๋ค์œผ๋กœ ์ง„ํญ์„ ๋ฐ”๊พธ๋Š” ์‹์œผ๋กœ ์ง„ํ–‰ํ•œ๋‹ค.

 

A sampled transition experience:

$$ (s, o, a_t, r, s', o') $$

 

The observations and states the agents experience are scaled as follows:

$$ o_i \leftarrow o_i \times z, \quad o'_i \leftarrow o'_i \times z, \quad \text{for } i = 1, 2, \ldots, N $$

$$  s \leftarrow s \times z, \quad s' \leftarrow s' \times z $$

z is a random value drawn from a uniform distribution over the range [a, b], i.e., z ~ U(a, b).

Random amplitude scale์€ ๋ฐฐ์น˜๋ณ„๋กœ ๋žœ๋ค ํ•˜๊ฒŒ ์ ์šฉ๋˜์ง€๋งŒ, ์‹œ๊ฐ„์— ๋Œ€ํ•ด์„  ์ผ๊ด€์„ฑ์„ ๊ฐ€์ง„๋‹ค.

 

 

 

3. Experiment & Ablation Study                                   

3.1. MARR Performance Comparison

SMAC(์Šคํƒ€ํฌ๋ž˜ํ”„ํŠธ) ํ™˜๊ฒฝ์—์„œ ์œ ๋ช…ํ•œ ๋ชจ๋ธ๋กœ๋Š” QMIX, QPLEX, ATM ๋“ฑ์ด ์žˆ๋‹ค. ์ด ๋ชจ๋ธ์— ๋Œ€ํ•˜์—ฌ MARR์„ ์ ์šฉํ–ˆ์„ ๋•Œ, ์„ฑ๋Šฅ์ด ์–ด๋–ป๊ฒŒ ํ–ฅ์ƒ๋˜์—ˆ๋Š”์ง€ ๋น„๊ตํ•˜๋Š” ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, ๋ชจ๋“  ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ MARR์„ ์ ์šฉํ•œ ๋ชจ๋ธ์ด ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 

s: Stalker
sc: Spine Crawler
z: Zealot
m: Marine

 

 

 

3.2. Parallel Setting and Series Setting with Replay Ratio

 

 

 

The first ablation study checks whether parallel training suits a high replay ratio better than series (sequential) training. The figure shows the performance of each scheme when only the replay ratio is varied. Parallel sustains good performance at a higher replay ratio than series.

 

The second ablation examines MARR's performance across replay ratios. In both scenarios it outperforms plain series and parallel training, but it likewise has an optimal replay ratio, and that optimum is higher than for the other two.

 

 

 

 

 

 

 

3.3. Components of MARR: Shrink & Perturb and Data Augmentation

 

 

MARR์˜ ๊ฐ Component๋ฅผ ์ ์šฉํ–ˆ์„ ๋•Œ์™€ ์•ˆ ํ–ˆ์„ ๋•Œ๋ฅผ ๋น„๊ตํ•œ ์‹คํ—˜๋„ ์ง„ํ–‰ํ–ˆ๋‹ค. ์‹คํ—˜์— ๋”ฐ๋ฅด๋ฉด, Baseline์—์„œ Data augmentation๋งŒ ์ง„ํ–‰ํ–ˆ์„ ๋•Œ๋Š” ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ํฌ์ง€ ์•Š์•˜์ง€๋งŒ, Baseline์—์„œ S&P๋ฅผ ์ ์šฉํ–ˆ์„ ๋• ํฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์ด๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. 

 

 

 

 

 

 

 

 

 

3.4. Experiment Analysis of Network Plasticity

๋…ผ๋ฌธ์˜ ์ฃผ์š” Contribution์ธ Plasticity loss ๋ฐฉ์ง€ ์„ฑ๋Šฅ์„ ํ™•์ธํ•œ๋‹ค. L2 gap๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ loss ์œ ๋ฌด๋ฅผ ํ™•์ธํ–ˆ๋‹ค. 

 

L2 gap์€ ๋‘ ๋ฒกํ„ฐ ๊ฐ„์˜ ์ฐจ์ด๋ฅผ ์ธก์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. L2 gap์€ ํ›ˆ๋ จ๋œ ๋„คํŠธ์›Œํฌ ํŒŒ๋ผ๋ฏธํ„ฐ์™€ ์ดˆ๊ธฐ ๋„คํŠธ์›Œํฌ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ„์˜ ์ฐจ์ด๋ฅผ ์ธก์ •ํ•˜๋Š” ๊ฐ’์œผ๋กœ, ์ˆ˜์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:
$$
\| \theta_{\text{trained}} - \theta_{\text{initial}} \|_2 = \sqrt{\sum_{i} (\theta_{\text{trained}, i} - \theta_{\text{initial}, i})^2}
$$
where:

$$ \theta_{\text{trained}} $$ is the parameter vector of the trained network, and
$$ \theta_{\text{initial}} $$ is the parameter vector of the initial network.
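For reference, a small PyTorch sketch of this quantity (assuming the initial network, or a copy of its parameters, was kept around):

```python
import torch

def l2_gap(trained_net, initial_net):
    """L2 norm of the difference between trained and initial parameter vectors."""
    diffs = [
        (p_t - p_0).flatten()
        for p_t, p_0 in zip(trained_net.parameters(), initial_net.parameters())
    ]
    return torch.linalg.vector_norm(torch.cat(diffs)).item()
```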

 

 

 

 

3.5. Analysis of Running Time

MARR์€ ํ•™์Šต ํšจ์œจ์„ ๋†’์ด๋Š”๋ฐ ์ง‘์ค‘ํ•œ๋‹ค. ๋ณ‘๋ ฌ ํ•™์Šต๊ณผ ๋†’์€ ์žฌ์‚ฌ์šฉ ๋น„์œจ์„ ์ ์šฉํ•˜์—ฌ ํ™˜๊ฒฝ๊ณผ์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ์ค„์ด๋Š” ๋Œ€์‹  ๋„คํŠธ์›Œํฌ ํ•™์Šต์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ–ˆ๋‹ค. 

 

Analyzing the running-time formula confirms this.

์ „์ฒด ์‹คํ–‰ ์‹œ๊ฐ„์€ ์ „์ฒด ์ƒํ˜ธ์ž‘์šฉ์— ๋“  ์‹œ๊ฐ„ ํ•ญ๊ณผ ๋„คํŠธ์›Œํฌ ์—…๋ฐ์ดํŠธ ์‹œ๊ฐ„ ํ•ญ์œผ๋กœ ์ด๋ค„์ ธ ์žˆ๋‹ค. 

$$
h_{\text{tot}} = \frac{T_{\text{tot}} \cdot h_{\text{env}}}{P_{\text{env}}} + \frac{T_{\text{tot}} \cdot N_{\text{RR}} \cdot h_{\text{upt}}}{T_{\text{U}}} + h_{\text{rst}}
$$

 

First term: (total number of interactions × time per interaction) / number of parallel environments

Second term: (total number of interactions × $$ N_{\text{RR}} $$ × time per update) / network update interval

Third term: time spent on resets and other auxiliary work

 

MARR์„ ํ™œ์šฉํ•˜๋ฉด ์ „์ฒด ์ƒํ˜ธ์ž‘์šฉ ํšŸ์ˆ˜ ๋ฐ ์‹œ๊ฐ„์„ ์ค„์ด๋ฉด์„œ NRR์„ ๋Š˜๋ฆฌ๊ธฐ์—, ๋„คํŠธ์›Œํฌ ์—…๋ฐ์ดํŠธ์— ์‹œ๊ฐ„์„ ์ง‘์ค‘์‹œํ‚ค๋Š” ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง„๋‹ค.

 

 

4. Limitations 

(1) Poor fit for on-policy methods

๋ณธ ๋…ผ๋ฌธ์€ off-policy์—์„œ ๊ณผ๊ฑฐ ๋ฐ์ดํ„ฐ๋ฅผ buffer์—์„œ ๊บผ๋‚ด์™€์„œ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณต ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ํ•ต์‹ฌ์ด๋‹ค. ํ•˜์ง€๋งŒ on-policy๋Š” ํ˜„์žฌ ์ˆ˜์ง‘ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฆ‰์‹œ ํ™œ์šฉํ•˜์—ฌ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ž‘๋™ํ•˜๊ธฐ์— ์ ์ ˆํ•˜์ง€ ์•Š๋‹ค.

 

(2) ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ๋ฌธ์ œ

MARR์„ ์‹ค์ œ ์ƒํ™ฉ์— ์ ์šฉํ•˜๋Š”๋ฐ ํ•œ๊ณ„๊ฐ€ ์กด์žฌํ•œ๋‹ค. ๋˜‘๊ฐ™์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŽ์ด ํ™œ์šฉํ•˜๊ธฐ์—, ์›๋ณธ ๋ฐ์ดํ„ฐ์˜ ํ€„๋ฆฌํ‹ฐ๊ฐ€ ๋”์šฑ ์ค‘์š”ํ•ด์ง„๋‹ค. ํ•˜์ง€๋งŒ ์‹ค์ œ ๋ฐ์ดํ„ฐ๋Š” ๋…ธ์ด์ฆˆ๊ฐ€ ๊ปด์žˆ๋Š” ๊ฒฝ์šฐ๊ฐ€ ๊ต‰์žฅํžˆ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์—, ๋…ผ๋ฌธ์˜ ์„ฑ๋Šฅ์ด ๋ฐœํ˜„๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค.

 

(3) ํŠน์ • ์ƒํ™ฉ์—์„œ์˜ ์ง„ํญ ์ ์šฉ ๋ฌธ์ œ

์„ ํ˜•์ ์œผ๋กœ ์Šค์ผ€์ผ๋ง๋œ ํŠน์„ฑ๋“ค์— ๋Œ€ํ•ด์„ , ๋ณ€ํ™”๋ฅผ ์‹œํ‚ค๋”๋ผ๋„ ์„ ํ˜•์ ์ธ ํŠน์„ฑ์ด ์œ ์ง€๋˜์–ด์•ผ ํ•œ๋‹ค. ์ด๋ ‡๋“ฏ ๋žœ๋ค ์ง„ํญ ์Šค์ผ€์ผ๋ง์ด ์–ด๋ ค์šด ์ƒํ™ฉ์ด ์กด์žฌํ•œ๋‹ค.