[Reinforcement Learning] REINFORCE Algorithm: Code Implementation




 

์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„  REINFORCE ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ Pytorch๋กœ ๊ฐ„๋‹จํ•˜๊ฒŒ ๊ตฌํ˜„์„ ํ•ด๋ณด๊ณ ์ž ํ•œ๋‹ค.

๋‹ค๋ฅธ RL Algorithms๊ณผ๋Š” ๋‹ฌ๋ฆฌ ์•„์ฃผ ๊ฐ„๋‹จํ•˜๊ฒŒ ์˜ˆ์ œ ํ‘œํ˜„์ด ๊ฐ€๋Šฅํ•˜์—ฌ ์–ด๋ ต์ง€ ์•Š๋‹ค.

๋ณธ ํฌ์ŠคํŒ…์„ ๋ณด๊ธฐ ์ „, REINFORCE ๊ฐœ๋…์€ ํ™•์‹คํ•˜๊ฒŒ ์ธ์ง€ํ•˜๊ณ  ์žˆ์–ด์•ผ ํ•จ์„ ์•Œ๋ฆฐ๋‹ค.

 

 

Last posting
Topic: We looked at the concept and the equations of the REINFORCE reinforcement learning algorithm.
Summary: An algorithm whose goal is to find the optimal policy that maximizes the objective function, the sum of rewards.

 

https://mengu.tistory.com/136

 


1. Problem to solve : CartPole

 

(CartPole environment image. Source: OpenAI)

 

Description: In the CartPole game, you move the cart at the bottom left and right so that the pole on top does not fall over.
Reward: The agent earns reward for every timestep it keeps the pole from falling. Alternatively, one could design a reward scheme that pays more the farther the tip of the pole is from the ground.

2. Full Code

The code is based on the book Foundations of Deep Reinforcement Learning (Laura Graesser and Wah Loon Keng).

from torch.distributions import Categorical
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

gamma = 0.99

class Pi(nn.Module):
  def __init__(self, in_dim, out_dim):
    super(Pi, self).__init__()
    layers = [
        nn.Linear(in_dim, 64),
        nn.ReLU(),
        nn.Linear(64, out_dim),
        ]
    self.model = nn.Sequential(*layers)
    self.onpolicy_reset()
    self.train() # set training mode
  
  def onpolicy_reset(self):
    self.log_probs = []
    self.rewards = []
  
  def forward(self, x):
    pdparam = self.model(x)
    return pdparam
    
  def act(self, state):
    x = torch.from_numpy(state.astype(np.float32)) # to tensor
    pdparam = self.forward(x) # forward pass
    pd = Categorical(logits=pdparam) # probability distribution
    action = pd.sample() # pi(a|s) in action via pd
    log_prob = pd.log_prob(action) # log_prob of pi(a|s)
    self.log_probs.append(log_prob) # store for training
    return action.item()
  
def train(pi, optimizer):
  # Inner gradient-ascent loop of REINFORCE algorithm
  T = len(pi.rewards)
  rets = np.empty(T, dtype=np.float32) # the returns
  future_ret = 0.0
  # compute the returns efficiently
  for t in reversed(range(T)):
    future_ret = pi.rewards[t] + gamma * future_ret
    rets[t] = future_ret

  rets = torch.tensor(rets)
  log_probs = torch.stack(pi.log_probs)
  loss = - log_probs * rets # gradient term; Negative for maximizing
  loss = torch.sum(loss)
  optimizer.zero_grad()
  loss.backward() # backpropagate, compute gradients
  optimizer.step() # gradient-ascent, update the weights
  return loss

def main():
  env = gym.make('CartPole-v0')
  in_dim = env.observation_space.shape[0] # 4
  out_dim = env.action_space.n # 2
  pi = Pi(in_dim, out_dim) # policy pi_theta for REINFORCE
  optimizer = optim.Adam(pi.parameters(), lr=0.01)

  for epi in range(300):
    state = env.reset()
    for t in range(200): # cartpole max timestep is 200
      action = pi.act(state)
      state, reward, done, _ = env.step(action)
      pi.rewards.append(reward)
      env.render()
      if done:
        break
    loss = train(pi, optimizer) # train per episode
    total_reward = sum(pi.rewards)
    solved = total_reward > 195.0
    pi.onpolicy_reset() # onpolicy: clear memory after training
    
    print(f'Episode {epi}, loss: {loss}, '
          f'total_reward: {total_reward}, solved: {solved}')
    
if __name__ == '__main__':
  main()

 

It is only about 80 lines of code; from here on, I will break it down and explain it function by function.

์œ„์˜ ์ฝ”๋“œ๋„ ๋‹ค์Œ๊ณผ ๊ฐ™์€ Algorithm์„ ๋”ฐ๋ฅด๋ฉฐ, ์ตœ์ ์˜ ((\pi)) ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์ž„์„ ์žŠ์ง€ ์•Š๋Š”๋‹ค๋ฉด ์‰ฝ๊ฒŒ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.


2-1. Code Explanation: class Pi(nn.Module)

The class Pi is \(\pi_{\theta}\) expressed in code; in other words, it builds the neural network to be trained.

It inherits from PyTorch's nn.Module, and the basic setup is done in __init__().

The layers are [a linear layer with 64 hidden units - ReLU - a linear output layer].

class Pi(nn.Module):
  def __init__(self, in_dim, out_dim):
    super(Pi, self).__init__()
    layers = [
        nn.Linear(in_dim, 64),
        nn.ReLU(),
        nn.Linear(64, out_dim),
        ]
    self.model = nn.Sequential(*layers)
    self.onpolicy_reset()
    self.train() # set training mode
  
  def onpolicy_reset(self):
    self.log_probs = []
    self.rewards = []
  
  def forward(self, x):
    pdparam = self.model(x)
    return pdparam
    
  def act(self, state):
    x = torch.from_numpy(state.astype(np.float32)) # to tensor
    pdparam = self.forward(x) # forward pass
    pd = Categorical(logits=pdparam) # probability distribution
    action = pd.sample() # pi(a|s) in action via pd
    log_prob = pd.log_prob(action) # log_prob of pi(a|s)
    self.log_probs.append(log_prob) # store for training
    return action.item()

Below that, the onpolicy_reset method re-initializes log_probs and rewards after training,

and forward runs the forward pass on the agent's state.

act takes the state as input, calls forward(), builds a probability distribution from the resulting logits, and samples an action from it. It also computes \(\log \pi_{\theta}(a_t|s_t)\) for the sampled action and stores it for training.

* log_prob = \(\log \pi_{\theta}(a_t|s_t)\); its gradient \(\nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\) is obtained later by autograd when loss.backward() is called.
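For intuition, here is a standalone sketch of what act does with the network's output (the logit values are made up for illustration, not taken from a trained model):

import torch
from torch.distributions import Categorical

# Hypothetical logits for CartPole's two actions (push left / push right),
# i.e. what Pi.forward would return for some state.
pdparam = torch.tensor([0.2, -0.1])

pd = Categorical(logits=pdparam)   # softmax over logits -> action probabilities
action = pd.sample()               # sample a ~ pi(a|s)
log_prob = pd.log_prob(action)     # log pi(a|s), still differentiable w.r.t. the logits

print(action.item(), log_prob.item())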


2-2. Code Explanation: train(pi, optimizer)

ํ•จ์ˆ˜ train()์€ ((\pi))์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์ธ ((\theta))๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๊ธฐ ์œ„ํ•œ loss๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
((\pi_{\theta}))์— ๊ณ„์‚ฐ๋˜์–ด ์žˆ๋Š” reward๋ฅผ ๊ฐ€์ ธ์™€์„œ gamma๋ฅผ ํ†ตํ•ด discounted reward๋“ค์„ ๊ตฌํ•œ๋‹ค.
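Concretely, with \(T\) = len(pi.rewards), the backward loop computes for every timestep \(t\):

\[ R_t = r_t + \gamma R_{t+1}, \qquad R_T = 0 \quad\Longleftrightarrow\quad R_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} r_k \]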

def train(pi, optimizer):
  # Inner gradient-ascent loop of REINFORCE algorithm
  T = len(pi.rewards)
  rets = np.empty(T, dtype=np.float32) # the returns
  future_ret = 0.0
  # compute the returns efficiently
  for t in reversed(range(T)):
    future_ret = pi.rewards[t] + gamma * future_ret
    rets[t] = future_ret

  rets = torch.tensor(rets)
  log_probs = torch.stack(pi.log_probs)
  loss = - log_probs * rets # gradient term; Negative for maximizing
  loss = torch.sum(loss)
  optimizer.zero_grad()
  loss.backward() # backpropagate, compute gradients
  optimizer.step() # gradient-ascent, update the weights
  return loss

๊ทธ๋‹ค์Œ, reward์™€ log_prob๋ฅผ ๊ณฑํ•˜๊ณ , negative ํ•˜์—ฌ loss sum์„ ๊ณ„์‚ฐํ•œ๋‹ค. 

๊ณ„์‚ฐ๋œ loss๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ loss.backward()๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , opimizer๋ฅผ ๋ถˆ๋Ÿฌ์™€ ์ตœ์ ํ™”๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค.


2-3. Code Explanation: main()

ํ•จ์ˆ˜ main์€ ํ•™์Šต ๊ณผ์ • ์ž์ฒด๋ฅผ ๋‚˜ํƒ€๋‚ธ ๊ฒƒ์ด๋‹ค.

์ฒ˜์Œ gym ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ†ตํ•ด 'CartPole-v0'์„ ์ˆ˜ํ–‰ํ•  ํ™˜๊ฒฝ์„ ๋ถˆ๋Ÿฌ์˜จ๋‹ค. ๊ทธ ํ›„ ((\pi)) ์‹ ๊ฒฝ๋ง๊ณผ optimizer ์„ธํŒ…์„ ํ•ด์ค€๋‹ค.

* in_dim, out_dim์€ ๊ฐ๊ฐ ๋“ค์–ด๊ฐˆ ๋•Œ์˜ ์ •๋ณด ์ฐจ์›(4), ๋‚˜์˜ฌ ๋•Œ์˜ ์ •๋ณด ์ฐจ์›(2)์„ ๋œปํ•œ๋‹ค.

 

def main():
  env = gym.make('CartPole-v0')
  in_dim = env.observation_space.shape[0] # 4
  out_dim = env.action_space.n # 2
  pi = Pi(in_dim, out_dim) # policy pi_theta for REINFORCE
  optimizer = optim.Adam(pi.parameters(), lr=0.01)

  for epi in range(300):
    state = env.reset()
    for t in range(200): # cartpole max timestep is 200
      action = pi.act(state)
      state, reward, done, _ = env.step(action)
      pi.rewards.append(reward)
      env.render()
      if done:
        break
    loss = train(pi, optimizer) # train per episode
    total_reward = sum(pi.rewards)
    solved = total_reward > 195.0
    pi.onpolicy_reset() # onpolicy: clear memory after training
    
    print(f'Episode {epi}, loss: {loss}, '
          f'total_reward: {total_reward}, solved: {solved}')

There are 300 episodes (used here like epochs). The state is reset at the start of every episode, and if the agent keeps the pole up through the 200-timestep limit (total_reward > 195 in the code), the episode counts as solved. At every timestep, an action is sampled through the \(\pi\) network, and the resulting next state and reward are stored.

When the timesteps end, train() computes the loss and updates the parameters. The rewards stored in \(\pi\) are also summed to check whether the episode was solved. Then, before the next episode starts, all log_probs and rewards* are cleared.

* This is because REINFORCE is an on-policy algorithm: each update must come only from data generated by the most recent policy. For example, the update after episode 105 must be based on the trajectory collected with the policy produced by episode 104's update.
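As a side note, this loop targets the classic gym API, where reset() returns only the state and step() returns a 4-tuple. If you run it with the newer gymnasium package instead (an assumption about your setup, not what the post's code was written for), the episode loop would look roughly like this, with pi being the policy defined above:

import gymnasium as gym  # assumption: gymnasium is installed instead of classic gym

env = gym.make('CartPole-v1')    # v1 raises the timestep limit to 500
for epi in range(300):
  state, _ = env.reset()         # gymnasium's reset() returns (observation, info)
  for t in range(500):
    action = pi.act(state)
    # gymnasium's step() returns a 5-tuple instead of a 4-tuple
    state, reward, terminated, truncated, _ = env.step(action)
    pi.rewards.append(reward)
    if terminated or truncated:
      break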


3. Results

 

The first successful episode was episode 98. Since that success was largely luck, the agent failed again in the episodes that followed, but success experiences like this likely contributed a great deal to improving the reward.


์ด 300๋ฒˆ์˜ Episode๋ฅผ ์ง„ํ–‰ํ–ˆ๊ณ , total_reward๋Š” ์ ์ฐจ 200์— ์ˆ˜๋ ดํ–ˆ๋‹ค.


So far, we have looked at the concept and the code of the REINFORCE reinforcement learning algorithm. The next posting will cover a different algorithm.