In this post, I implement the REINFORCE algorithm in a simple way with PyTorch.
Unlike other RL algorithms, it can be expressed with a very simple formulation, so it is not hard to follow.
Note that before reading this post, you should already have a firm grasp of the REINFORCE concept.
Previous post
Content: we looked at the concept and equations of the REINFORCE reinforcement learning algorithm.
Summary: an algorithm whose goal is to find the optimal policy that maximizes the objective function, the sum of rewards.
1. Problem to Solve: CartPole
Description: CartPole is a game in which you move the cart at the bottom so that the pole standing on it does not fall over.
Reward: the agent earns reward for every timestep the pole stays upright. Alternatively, one could design a reward scheme that pays more the farther the tip of the pole is from the floor.
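As a quick illustration of this reward structure, here is a minimal sketch that runs one episode with random actions (it assumes the same classic gym API used in the code below, where step() returns four values):

import gym

env = gym.make('CartPole-v0')
state = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()         # random action: 0 (push left) or 1 (push right)
    state, reward, done, _ = env.step(action)  # CartPole-v0 returns reward = +1 for every surviving timestep
    total_reward += reward
print('timesteps survived / total reward:', total_reward)
env.close()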
2. Full Code
The code follows the book Foundations of Deep Reinforcement Learning (Laura Graesser).
from torch.distributions import Categorical
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

gamma = 0.99

class Pi(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(Pi, self).__init__()
        layers = [
            nn.Linear(in_dim, 64),
            nn.ReLU(),
            nn.Linear(64, out_dim),
        ]
        self.model = nn.Sequential(*layers)
        self.onpolicy_reset()
        self.train()  # set training mode

    def onpolicy_reset(self):
        self.log_probs = []
        self.rewards = []

    def forward(self, x):
        pdparam = self.model(x)
        return pdparam

    def act(self, state):
        x = torch.from_numpy(state.astype(np.float32))  # to tensor
        pdparam = self.forward(x)  # forward pass
        pd = Categorical(logits=pdparam)  # probability distribution
        action = pd.sample()  # pi(a|s) in action via pd
        log_prob = pd.log_prob(action)  # log_prob of pi(a|s)
        self.log_probs.append(log_prob)  # store for training
        return action.item()

def train(pi, optimizer):
    # Inner gradient-ascent loop of REINFORCE algorithm
    T = len(pi.rewards)
    rets = np.empty(T, dtype=np.float32)  # the returns
    future_ret = 0.0
    # compute the returns efficiently
    for t in reversed(range(T)):
        future_ret = pi.rewards[t] + gamma * future_ret
        rets[t] = future_ret
    rets = torch.tensor(rets)
    log_probs = torch.stack(pi.log_probs)
    loss = - log_probs * rets  # gradient term; negative for maximizing
    loss = torch.sum(loss)
    optimizer.zero_grad()
    loss.backward()  # backpropagate, compute gradients
    optimizer.step()  # gradient-ascent, update the weights
    return loss

def main():
    env = gym.make('CartPole-v0')
    in_dim = env.observation_space.shape[0]  # 4
    out_dim = env.action_space.n  # 2
    pi = Pi(in_dim, out_dim)  # policy pi_theta for REINFORCE
    optimizer = optim.Adam(pi.parameters(), lr=0.01)
    for epi in range(300):
        state = env.reset()
        for t in range(200):  # cartpole max timestep is 200
            action = pi.act(state)
            state, reward, done, _ = env.step(action)
            pi.rewards.append(reward)
            env.render()
            if done:
                break
        loss = train(pi, optimizer)  # train per episode
        total_reward = sum(pi.rewards)
        solved = total_reward > 195.0
        pi.onpolicy_reset()  # onpolicy: clear memory after training
        print(f'Episode {epi}, loss: {loss}, total_reward: {total_reward}, solved: {solved}')

if __name__ == '__main__':
    main()
The whole thing is only about 80 lines; from here on I will break it down and explain it function by function.
The code above follows the REINFORCE algorithm; as long as you keep in mind that the goal is to find the optimal parameters \(\theta\) of the policy \(\pi_{\theta}\), it should be easy to understand.
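For reference, this is the policy-gradient estimate from the previous post that the code implements; the loss built in train() is simply its negated, single-trajectory Monte Carlo version:

\[
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} R_{t}(\tau)\, \nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t})\right],
\qquad
R_{t}(\tau) = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}
\]

Minimizing \(-\sum_{t} R_{t} \log \pi_{\theta}(a_{t}|s_{t})\) with the optimizer is therefore gradient ascent on \(J(\theta)\).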
2-1. Code Explanation: Class Pi(nn.Module)
The class Pi is \(\pi_{\theta}\) expressed in code; in other words, it builds the neural network to be trained.
It inherits from PyTorch's nn.Module, and the basic setup is done in __init__().
The layers are [a linear layer with 64 units - ReLU - a linear output layer].
class Pi(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(Pi, self).__init__()
        layers = [
            nn.Linear(in_dim, 64),
            nn.ReLU(),
            nn.Linear(64, out_dim),
        ]
        self.model = nn.Sequential(*layers)
        self.onpolicy_reset()
        self.train()  # set training mode

    def onpolicy_reset(self):
        self.log_probs = []
        self.rewards = []

    def forward(self, x):
        pdparam = self.model(x)
        return pdparam

    def act(self, state):
        x = torch.from_numpy(state.astype(np.float32))  # to tensor
        pdparam = self.forward(x)  # forward pass
        pd = Categorical(logits=pdparam)  # probability distribution
        action = pd.sample()  # pi(a|s) in action via pd
        log_prob = pd.log_prob(action)  # log_prob of pi(a|s)
        self.log_probs.append(log_prob)  # store for training
        return action.item()
Below that, the onpolicy_reset() method re-initializes log_probs and rewards after each training step,
and forward() performs the forward pass on the agent's state.
The act() method takes the state, calls forward(), and samples an action from the resulting distribution. It also computes and stores \(\log \pi_{\theta}(a_{t}|s_{t})\) for the sampled action.
* log_prob = \(\log \pi_{\theta}(a_{t}|s_{t})\); its gradient \(\nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t})\) is obtained later by autograd when loss.backward() is called.
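To see what Categorical does with the network output, here is a standalone sketch using made-up logits for a two-action problem:

import torch
from torch.distributions import Categorical

pdparam = torch.tensor([2.0, 0.5])  # hypothetical logits, as if returned by pi.forward(state)
pd = Categorical(logits=pdparam)    # softmax over the logits -> action probabilities
print(pd.probs)                     # tensor([0.8176, 0.1824]), approximately
action = pd.sample()                # sample 0 or 1 according to those probabilities
print(pd.log_prob(action))          # log pi_theta(a|s); stays in the autograd graph for backward()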
2-2. Code Explanation: train(pi, optimizer)
The train() function computes the loss used to update \(\theta\), the parameters of \(\pi\).
It takes the rewards stored in \(\pi_{\theta}\) and computes the discounted returns using gamma.
def train(pi, optimizer):
    # Inner gradient-ascent loop of REINFORCE algorithm
    T = len(pi.rewards)
    rets = np.empty(T, dtype=np.float32)  # the returns
    future_ret = 0.0
    # compute the returns efficiently
    for t in reversed(range(T)):
        future_ret = pi.rewards[t] + gamma * future_ret
        rets[t] = future_ret
    rets = torch.tensor(rets)
    log_probs = torch.stack(pi.log_probs)
    loss = - log_probs * rets  # gradient term; negative for maximizing
    loss = torch.sum(loss)
    optimizer.zero_grad()
    loss.backward()  # backpropagate, compute gradients
    optimizer.step()  # gradient-ascent, update the weights
    return loss
Next, the returns are multiplied by the log_probs, negated, and summed to form the loss.
With this loss, loss.backward() computes the gradients, and then the optimizer is called to perform the update step.
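As a quick numeric check of the return computation, take a hypothetical 3-step episode with rewards [1, 1, 1] and gamma = 0.99; the backward recursion in train() gives rets = [2.9701, 1.99, 1.0]:

import numpy as np

gamma = 0.99
rewards = [1.0, 1.0, 1.0]             # hypothetical 3-step episode (CartPole gives +1 per step)
T = len(rewards)
rets = np.empty(T, dtype=np.float32)
future_ret = 0.0
for t in reversed(range(T)):          # same backward recursion as in train()
    future_ret = rewards[t] + gamma * future_ret
    rets[t] = future_ret
print(rets)                           # [2.9701 1.99   1.    ]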
2-3. Code Explanation: main()
The main() function is the training loop itself.
First, the 'CartPole-v0' environment is created through the gym library. Then the \(\pi\) network and the optimizer are set up.
* in_dim and out_dim are the dimension of the observation going into the network (4) and the number of actions coming out (2), respectively; a quick way to verify these values is shown right below, followed by the full main() code.
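A minimal check of those two values (the exact printed format may vary slightly with the gym version):

import gym

env = gym.make('CartPole-v0')
print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)           # 2: push the cart left (0) or right (1)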
def main():
    env = gym.make('CartPole-v0')
    in_dim = env.observation_space.shape[0]  # 4
    out_dim = env.action_space.n  # 2
    pi = Pi(in_dim, out_dim)  # policy pi_theta for REINFORCE
    optimizer = optim.Adam(pi.parameters(), lr=0.01)
    for epi in range(300):
        state = env.reset()
        for t in range(200):  # cartpole max timestep is 200
            action = pi.act(state)
            state, reward, done, _ = env.step(action)
            pi.rewards.append(reward)
            env.render()
            if done:
                break
        loss = train(pi, optimizer)  # train per episode
        total_reward = sum(pi.rewards)
        solved = total_reward > 195.0
        pi.onpolicy_reset()  # onpolicy: clear memory after training
        print(f'Episode {epi}, loss: {loss}, total_reward: {total_reward}, solved: {solved}')
There are 300 episodes (= epochs). At the start of each episode the state is reset, and an episode counts as a success if the agent does not fail before reaching timestep 200 (in the code, solved is total_reward > 195.0). At every timestep an action is sampled through the \(\pi\) network, and the resulting next state and reward are stored.
When the episode ends, train() computes the loss and updates the parameters. The rewards stored in \(\pi\) are summed to judge whether the episode was solved. After that, all log_probs and rewards* are cleared before the next episode starts.
* This is because REINFORCE is an on-policy algorithm: updates must come only from trajectories generated by the most recent policy. For example, the update at episode 105 must be based on the policy that came out of episode 104's update, not on data collected under older policies.
3. Results
The first successful episode was episode 98. However, that success was largely by chance, and the agent failed again in the episodes that followed. Still, such early successes likely contributed a great deal to the later improvement in reward.
A total of 300 episodes were run, and total_reward gradually converged to 200.
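If you want to reproduce this kind of convergence curve, one simple option (a sketch, not part of the book's code) is to have main() append each episode's total_reward to a list, say reward_history, and then plot a moving average:

import numpy as np
import matplotlib.pyplot as plt

# reward_history would be collected inside main(); random placeholder data is used here for illustration.
reward_history = np.random.randint(10, 201, size=300).astype(float)
window = 20
moving_avg = np.convolve(reward_history, np.ones(window) / window, mode='valid')
plt.plot(reward_history, alpha=0.3, label='total_reward per episode')
plt.plot(range(window - 1, len(reward_history)), moving_avg, label=f'{window}-episode moving average')
plt.xlabel('episode')
plt.ylabel('total reward')
plt.legend()
plt.show()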
So far we have looked at the concept and the code of the REINFORCE reinforcement learning algorithm. The next post will cover another algorithm.