For the basic concepts of reinforcement learning, see 《强化学习之一二》 [2].
Algorithm introduction
The DQN algorithm is quite simple. The $Q$-value form of the Bellman equation is:

$$Q^\star(s, a) = \mathbb{E}_{s^\prime \sim \varepsilon}\left[ r + \gamma \max_{a^\prime} Q^\star(s^\prime, a^\prime) \,\middle\vert\, s, a \right]$$
In practice we usually fit the $Q$ function with a neural network, i.e. $Q(s, a; \theta) \approx Q^\star(s, a)$, where $\theta$ are the network parameters. We can train the $Q$ network with the following loss function:

$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right]$$
where $y_i = \mathbb{E}_{s^\prime \sim \varepsilon}\left[ r + \gamma \max_{a^\prime} Q(s^\prime, a^\prime; \theta_{i-1}) \,\middle\vert\, s, a \right]$ is the target that the $Q$ network tries to approach in iteration $i$, and $\gamma$ is the discount factor. $\theta_{i-1}$ are the parameters of the model used to generate the sample trajectories in the previous iteration; they are held fixed. $s, a$ follow the behaviour distribution $\rho(\cdot)$ (in practice, the sample trajectories we generate).
From the loss function, the gradient with respect to the parameters is:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s^\prime \sim \varepsilon}\left[ \left( r + \gamma \max_{a^\prime} Q(s^\prime, a^\prime; \theta_{i-1}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right]$$
In a PyTorch implementation we do not need to compute this gradient by hand; it suffices to directly minimize the MSE loss between $y_i$ and $Q(s, a; \theta_i)$.
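For instance, a minimal sketch of that loss computation (the names here, such as q_net and target_net, are placeholders rather than part of the implementation further below; dones is assumed to be a 0/1 float tensor):

import torch
from torch import nn


def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones,
             gamma=0.95):
    # TD target y_i, computed with the frozen previous-iteration network
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=-1).values
        targets = rewards + gamma * next_q * (1 - dones)
    # Q(s, a; theta_i) for the actions that were actually taken
    q_pred = q_net(states).gather(1, actions.view(-1, 1)).view(-1)
    return nn.functional.mse_loss(q_pred, targets)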
Implementation tips
- Use an experience replay buffer to improve sample efficiency
- Sample $N$ transitions from the buffer each time to improve training stability
In the implementation below, we keep the data of the most recent 32 episodes by default, following a first-in-first-out policy: once the buffer exceeds its capacity, the oldest data is dropped. In addition, after each training round we copy the newly trained model into the behaviour model (the frozen copy used to compute the TD targets, i.e. $\theta_{i-1}$ above). A small sketch of such a FIFO buffer follows below.
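For illustration, one way to get this FIFO behaviour is collections.deque with a maxlen; the actual implementation below achieves the same effect with a plain Python list and explicit slicing:

from collections import deque

# Keep at most 32 episodes; appending beyond that silently drops the oldest one.
data_buffer = deque(maxlen=32)

def store_episode(episode):
    data_buffer.append(episode)

# Training then iterates over the buffered episodes, oldest first:
# for episode in data_buffer:
#     train(episode)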
A simple PyTorch implementation of DQN
We first implement a class for storing sample trajectories:
import numpy as np
import torch


class EpisodeData(object):
def __init__(self):
self.fields = [
'states', 'actions', 'rewards', 'dones', 'log_probs', 'next_states'
]
for f in self.fields:
setattr(self, f, [])
self.total_rewards = 0
def add_record(self,
state,
action,
reward,
done,
log_prob=None,
next_state=None):
self.states.append(state)
self.actions.append(action)
self.log_probs.append(log_prob)
self.dones.append(done)
self.rewards.append(reward)
self.next_states.append(next_state)
self.total_rewards += reward
def get_states(self):
return np.array(self.states)
def get_actions(self):
return np.array(self.actions)
def steps(self):
return len(self.states)
def calc_qs(self, pre_model, gamma):
next_states = torch.tensor(np.array(self.next_states)).float()
next_qs = pre_model(next_states).max(dim=-1).values
masks = torch.tensor(np.array(self.dones) == 0)
rewards = torch.tensor(np.array(self.rewards)).view(-1)
qs = rewards + gamma * next_qs * masks
return qs.detach().float()
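A small, hypothetical usage example of this class (CartPole-v1 and the tiny q_net are placeholders, and we assume the classic Gym reset/step API used throughout this post):

import gym
from torch import nn

env = gym.make('CartPole-v1')
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

episode = EpisodeData()
state = env.reset()
for _ in range(10):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    episode.add_record(state, action, reward, 1 if done else 0,
                       next_state=next_state)
    state = env.reset() if done else next_state

# TD targets computed with a (frozen) Q network
targets = episode.calc_qs(q_net, gamma=0.95)
print(episode.steps(), targets.shape)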
Next we implement the DQN algorithm itself:
from math import ceil

import numpy as np
import torch
from torch import nn, optim
class DQN(object):
def __init__(self,
env,
model,
lr=1e-5,
optimizer='adam',
device='cpu',
deterministic=False,
gamma=0.95,
n_replays=4,
batch_size=200,
model_kwargs=None,
exploring=None,
n_trained_times=1,
n_buffers=32,
model_prefix="dqn"):
self.env = env
self.model = model
self.lr = lr
self.optimizer = optimizer
self.device = device
self.deterministic = deterministic
self.gamma = gamma
self.n_replays = n_replays
self.batch_size = batch_size
self.model_kwargs = model_kwargs
if optimizer == 'adam':
self.optimizer = optim.Adam(self.model.parameters(), lr=self.lr)
elif optimizer == 'sgd':
self.optimizer = optim.SGD(self.model.parameters(), lr=self.lr)
self.exploring = exploring
self.n_trained_times = n_trained_times
if self.model_kwargs:
self.pre_model = self.model.__class__(**self.model_kwargs)
else:
self.pre_model = self.model.__class__()
self.data_buffer = []
self.n_buffers = n_buffers
self.model_prefix = model_prefix
self.copy_model()
def gen_epoch_data(self, n_steps=1024, exploring=0., done_penalty=0):
state = self.env.reset()
done = False
epoch_data = EpisodeData()
self.model.eval()
steps = 0
for _ in range(n_steps):
steps += 1
qs = self.model(torch.tensor(state[np.newaxis, :]).float())
if exploring and np.random.rand() <= exploring:
action = self.env.action_space.sample()
else:
action = qs[0].argmax().item()
next_state, reward, done, _ = self.env.step(int(action))
if done and done_penalty:
reward -= done_penalty
epoch_data.add_record(state,
action,
reward,
1 if done else 0,
next_state=next_state)
state = next_state
if done:
state = self.env.reset()
return epoch_data
def get_exploring(self, need_exploring=False, mexp=0.1):
if need_exploring:
return max(mexp, self.n_trained_times**(-0.5))
if isinstance(self.exploring, float):
return self.exploring
elif self.exploring == 'quadratic_decrease':
return max(0.01, self.n_trained_times**(-0.5))
return 0.01
def copy_model(self):
self.pre_model.load_state_dict(self.model.state_dict())
self.pre_model.eval()
def train(self, epoch_data):
total_loss = 0.
        qs = epoch_data.calc_qs(self.pre_model, gamma=self.gamma).to(self.device)
states = torch.tensor(epoch_data.get_states()).float().to(self.device)
actions = torch.tensor(epoch_data.get_actions()[:, np.newaxis]).to(
self.device)
n_batches = ceil(len(epoch_data.states) / self.batch_size)
indices = torch.randperm(len(epoch_data.states)).to(self.device)
for b in range(n_batches):
batch_indices = indices[b * self.batch_size:(b + 1) *
self.batch_size]
batch_states = states[batch_indices]
batch_actions = actions[batch_indices]
batch_qs = qs[batch_indices]
qs_pred = self.model(batch_states).gather(1,
batch_actions).view(-1)
loss_func = nn.MSELoss()
loss = loss_func(batch_qs, qs_pred)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
total_loss += loss.item()
return total_loss / n_batches
def learning(self, n_epoches=100, n_steps=1024):
self.model.train()
max_reward = -10000.
decay_reward = 0
decay = 0.95
for n in range(n_epoches):
# generate new data
new_data = self.gen_epoch_data(n_steps=n_steps,
exploring=self.get_exploring()
if not self.deterministic else 0.)
self.data_buffer.insert(0, new_data)
if len(self.data_buffer) > self.n_buffers:
self.data_buffer = self.data_buffer[:self.n_buffers]
# training
for data in self.data_buffer[::-1]:
loss = self.train(data)
# update static model
            self.copy_model()
            self.n_trained_times += 1
# show training information
decay_reward = new_data.total_rewards if decay_reward == 0 else (
decay_reward * decay + new_data.total_rewards * (1 - decay))
if max_reward < decay_reward:
max_reward = decay_reward
torch.save(self.model.state_dict(),
f'./models/{self.model_prefix}-success-v{n}.pt')
if n % 10 == 0:
print(
f'round: {n:>3d} | loss: {loss:>5.3f} | '
f'pre reward: {decay_reward:>5.2f}',
flush=True)
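Before moving on to Atari, a quick sanity check of the class on a small environment might look like the sketch below (CartPole-v1 and MLPQNet are placeholders, and ./models must exist because learning saves checkpoints there):

import os

import gym
from torch import nn


class MLPQNet(nn.Module):
    """Tiny Q network for CartPole: 4 state dims -> 2 action values."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):
        return self.fc(x)


os.makedirs('./models', exist_ok=True)
env = gym.make('CartPole-v1')
agent = DQN(env, MLPQNet(), lr=1e-3, exploring=0.1, model_prefix='dqn-cartpole')
agent.learning(n_epoches=50, n_steps=512)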
Experiments on the Atari game DemonAttack
Training Atari games with reinforcement learning is notoriously sample-inefficient and takes a long time. Here we train on the DemonAttack-ram-v0 environment from OpenAI Gym.
The Atari environments are rarely used as-is; here we make the following adjustments:
- Frame skipping: Atari games were designed to run at 60 frames per second, and the game reads the player's input every frame. To speed up training, it is common to repeat each action for several frames; in our code each action is repeated 8 times by default.
- Avoiding idle runs: early in training, the model is likely to go a long time without earning any reward, producing long stretches of transitions that contribute nothing to training. We therefore cap the number of consecutive steps without a reward, which we set to 60 in our experiments.
- Death and idle penalties: when a life is lost, or when the idle cap is reached, a preset penalty is subtracted from the reward.
- The RAM observation consists of values in 0~255 by default; we normalize it to 0~1.
- Rewards are divided by 10 (i.e. multiplied by $0.1$) by default.
The final modified environment looks like this:
# -*- coding: utf-8 -*-
from gym import Wrapper
import numpy as np
class SkipframeWrapper(Wrapper):
def __init__(self,
env,
n_skip=8,
n_max_nops=0,
done_penalty=50,
reward_scale=0.1,
lives_penalty=50):
        super().__init__(env)
        self.n_skip = n_skip
        # maximum number of consecutive no-reward ("idle") steps
        self.n_max_nops = n_max_nops
        self.n_nops = 0
        # penalty applied when the game ends
        self.done_penalty = done_penalty
        # reward scaling factor
        self.reward_scale = reward_scale
        # penalty applied when a life is lost
        self.lives_penalty = lives_penalty
    def reset(self):
        self.n_nops = 0
        self.n_pre_lives = None
        # normalize the initial RAM observation the same way as in step()
        return self.env.reset().astype(np.float32) / 256.
def step(self, action):
n = self.n_skip
total_reward = 0
current_lives = None
while n > 0:
n -= 1
state, _reward, done, info = self.env.step(action)
total_reward += _reward
if 'lives' in info:
current_lives = info['lives']
if done:
break
if current_lives is not None:
if self.n_pre_lives is not None and current_lives < self.n_pre_lives:
total_reward -= self.lives_penalty
self.n_pre_lives = current_lives
        state = state.astype(np.float32) / 256.
if total_reward == 0:
self.n_nops += 1
if self.n_max_nops and self.n_nops >= self.n_max_nops:
done = True
if done:
total_reward -= self.done_penalty
total_reward *= self.reward_scale
return state, total_reward, done, info
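A quick smoke test of the wrapper (a sketch; it requires the Gym Atari extras to be installed):

import gym

env = SkipframeWrapper(gym.make('DemonAttack-ram-v0'), n_skip=8, n_max_nops=60)
state = env.reset()
state, reward, done, info = env.step(env.action_space.sample())
print(state.min(), state.max())  # RAM bytes normalized into [0, 1)
print(reward, done)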
Our model is a simple MLP with three hidden layers:
import numpy as np
import torch
from torch import nn
class DARModel(nn.Module):
def __init__(self, device='cpu') -> None:
super().__init__()
self.fc = nn.Sequential(
nn.Linear(128, 128),
nn.ReLU(),
nn.Linear(128, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 6),
)
self.device = device
self._initialize_weights()
def _initialize_weights(self):
for module in self.modules():
if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
nn.init.normal_(module.weight, 0, 0.05)
nn.init.normal_(module.bias, 0, 0.1)
def forward(self, x):
if isinstance(x, np.ndarray):
x = torch.tensor(x).float()
x = x.to(self.device)
return self.fc(x)
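Putting the pieces together, training on DemonAttack could be launched roughly as in the sketch below; the hyperparameters are illustrative only, and, as before, ./models must exist because learning saves checkpoints there:

import os

import gym

os.makedirs('./models', exist_ok=True)

env = SkipframeWrapper(gym.make('DemonAttack-ram-v0'),
                       n_skip=8,
                       n_max_nops=60,
                       done_penalty=50,
                       reward_scale=0.1,
                       lives_penalty=50)
model = DARModel(device='cpu')
agent = DQN(env,
            model,
            lr=1e-4,
            gamma=0.95,
            batch_size=200,
            n_buffers=32,
            exploring='quadratic_decrease',
            model_kwargs={'device': 'cpu'},
            model_prefix='dqn-demonattack')
agent.learning(n_epoches=2000, n_steps=1024)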
Some experimental results
Training is very slow; here we show the result after a few hours of training (we picked a run that looked reasonably decent).
References
[1] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
[2] 强化学习之一二:https://paperexplained.cn/articles/article/sdetail/ed046429-1b20-458f-9483-9089f2ae5acb/