Applying simple policies to the cartpole game
So far, we have picked an action at random and applied it. Now let's apply some logic to pick the action instead of leaving it to chance. The third observation value refers to the pole angle. If the angle is greater than zero, the pole is leaning to the right, so we move the cart to the right (action 1). Otherwise, we move the cart to the left (action 0). Let's look at an example:
- We define two policy functions as follows:
def policy_logic(env, obs):
    # push the cart right (1) if the pole leans right, else left (0)
    return 1 if obs[2] > 0 else 0

def policy_random(env, obs):
    # sample a random action from the environment's action space
    return env.action_space.sample()
- Next, we define an experiment function that runs for a specific number of episodes; each episode lasts until the game is lost, that is, until done is True. We use rewards_max to indicate when to break out of the loop, because we do not want the experiment to run forever:
import gym
import numpy as np

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
- We run the experiment for 100 episodes; each episode also breaks out of its loop early if its reward exceeds rewards_max, which is set to 10,000:
n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
We can see that the logic-based policy selects better actions than the random policy, but not by much:
Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81
Now let's modify the action-selection process further so that it is based on parameters. The parameters are multiplied by the observation vector, and the action is selected based on whether the result of this multiplication is negative or not. Let's modify the random search method so that the parameters are initialized randomly. The code is as follows:
def policy_logic(theta, obs):
    # just ignore theta and use the pole angle directly
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    # the weighted sum of the observations decides the action
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        # draw random weights in the range [-1, 1)
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward
def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
        # print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
We can see that random search does improve the results:
Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03
With random search, we improved the results and reached the maximum reward of 200. On average, the rewards for random search are lower, because random search also tries many bad parameter vectors, which drags the overall results down. However, we can select the best parameters from all the runs and then use those best parameters later, for example in production (a small sketch of reusing saved parameters follows the results below). Let's modify the code to train the parameters first:
def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max, theta):
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0
    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
    return reward_best, theta_best
def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
        # print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
We train for 100 episodes and then run the experiment for the random search policy with the best parameters:
n_episodes = 100
rewards_max = 10000
reward,theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta,reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)
We find that the trained parameters give the best result of 200:
trained theta: [-0.14779543 0.93269603 0.70896423 0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94
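As mentioned above, the best parameters found during training could be kept and reused later. The following is only a minimal sketch of that idea, not code from the book or the notebook; the file name theta_best.npy and the single-rollout loop are illustrative assumptions:

# illustrative only: persist the trained parameters and reuse them later
np.save('theta_best.npy', theta)        # hypothetical file name

theta_loaded = np.load('theta_best.npy')
env = gym.make('CartPole-v0')
obs = env.reset()
done = False
total_reward = 0
while not done:
    # act greedily with the saved parameters
    obs, reward, done, info = env.step(policy_random(theta_loaded, obs))
    total_reward += reward
print('reward with saved theta: {}'.format(total_reward))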
We can optimize the training code to keep training until we reach the maximum reward. The code for this optimization is provided in the ch-13a_Reinforcement_Learning_NN notebook; a rough sketch of the idea is shown below.
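As an illustration only (this is not the notebook's code), one way to write such a loop is to keep sampling random parameter vectors until an episode reaches the maximum reward; the function name train_until_max and the max_iterations safety cap below are assumptions, and it reuses the episode and policy_random functions defined above:

def train_until_max(policy, rewards_max, max_iterations=10000):
    # keep sampling random parameter vectors until one episode
    # reaches rewards_max, or give up after max_iterations tries
    env = gym.make('CartPole-v0')
    theta_best = None
    reward_best = 0
    for i in range(max_iterations):
        theta = np.random.rand(4) * 2 - 1
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
        if reward_best >= rewards_max:
            break
    return reward_best, theta_best

# CartPole-v0 caps an episode at 200 steps, so 200 is the maximum reward
reward, theta = train_until_max(policy_random, rewards_max=200)
print('trained theta: {}, rewards: {}'.format(theta, reward))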
Now that we have learned the basics of OpenAI Gym, let's learn about reinforcement learning.