Applying simple policies to the cartpole game

So far, we have picked an action at random and applied it. Now let's apply some logic to pick the action rather than leaving it to chance. The third observation value refers to the pole angle: if the angle is greater than zero, the pole is leaning to the right, so we move the cart to the right (action 1); otherwise, we move the cart to the left (action 0).
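Before writing the policies, it helps to confirm what the observation actually contains. The quick check below is not part of the recipe's code; it simply prints a sample observation from CartPole-v0, whose four values are the cart position, cart velocity, pole angle, and pole angular velocity, which is why obs[2] is the value we test:

import gym

# Quick check of the observation layout (not part of the recipe itself):
# CartPole-v0 observations are [cart position, cart velocity, pole angle, pole angular velocity]
env = gym.make('CartPole-v0')
obs = env.reset()
print(env.observation_space)   # the 4-dimensional Box space
print(obs)                     # e.g. [ 0.03 -0.01  0.02  0.04]

With the observation layout in mind, let's look at an example: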

  1. We define two policy functions as follows:
def policy_logic(env, obs):
    # push the cart right (1) if the pole leans right, otherwise push it left (0)
    return 1 if obs[2] > 0 else 0

def policy_random(env, obs):
    return env.action_space.sample()
  2. Next, we define an experiment function that runs for a specified number of episodes; each episode lasts until the game is lost, that is, until done is True. We use rewards_max to indicate when to break out of the loop, because we do not want to run the experiment forever:
import gym
import numpy as np

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')

    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
  3. We run the experiment for 100 episodes, with rewards_max set to 10,000 so that no single episode keeps accumulating reward beyond that limit:
n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that the logically selected actions do better than the randomly selected ones, but not by a large margin:

Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81

Now let's modify the process of selecting the action further, making it parameter-based. The parameters will be multiplied by the observations, and the action will be chosen based on whether the result of that multiplication is negative or non-negative. Let's modify the random search method so that the parameters are initialized randomly.
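To see this decision rule in isolation, here is a tiny worked example with hand-picked numbers; the values are purely illustrative and simply stand in for the theta and obs used below:

import numpy as np

# Illustrative values only: theta stands in for np.random.rand(4) * 2 - 1,
# obs for a CartPole observation.
theta = np.array([0.1, -0.5, 0.8, 0.2])
obs = np.array([0.02, 0.1, -0.04, -0.3])
print(np.matmul(theta, obs))   # about -0.14: negative, so the policy returns action 0 (push left)

The modified code is as follows: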

def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')

    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
        #print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that random search does improve the results:

Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03

With random search, we improved the results and reached a maximum reward of 200. On average, the rewards of random search are lower, because random search tries out various bad parameters that drag down the overall results. However, we can select the best parameters from all of the runs and then use those best parameters in production. Let's modify the code to train the parameters first:

def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max, theta):
    obs = env.reset()
    done = False
    episode_reward = 0

    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0

    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None

        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
    return reward_best, theta_best

def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')

    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
        #print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

We train for 100 episodes and then use the best parameters to run the experiment for the random search policy:

n_episodes = 100
rewards_max = 10000

reward,theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta,reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)

We find that the trained parameters give the best result of 200:

trained theta: [-0.14779543  0.93269603  0.70896423  0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94

We can optimize the training code to keep training until we reach the maximum reward. The code for this optimization is provided in the notebook ch-13a_Reinforcement_Learning_NN.
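The notebook's exact implementation is not reproduced here. As a rough sketch, reusing the episode() and policy_random() functions defined above (the name train_until_max, the reward_target argument, and the max_iterations cap are illustrative choices, not the notebook's), such a loop might look like this:

def train_until_max(policy, rewards_max, reward_target=200, max_iterations=10000):
    # Keep sampling random parameters until an episode reaches the target reward;
    # 200 is the per-episode cap that CartPole-v0 imposes through its time limit.
    env = gym.make('CartPole-v0')
    theta_best = None
    reward_best = 0
    for i in range(max_iterations):
        theta = np.random.rand(4) * 2 - 1
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
        if reward_best >= reward_target:
            break
    return reward_best, theta_best

reward, theta = train_until_max(policy_random, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))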

Now that we have learned the basics of OpenAI Gym, let's move on to learning about reinforcement learning.
