Initialization and discretization for Q-Learning
The observations returned by the CartPole environment describe the condition of the cart and pole. This state is made up of continuous values, so we need to discretize it before we can index a Q-table with it.
If we discretize these values into a small state space, the agent trains faster, but at the risk of not converging to the optimal policy.
We use the following helper functions to discretize the state space of the CartPole environment:
import numpy as np

# discretize a continuous value into one of n_states buckets
def discretize(val, bounds, n_states):
    discrete_val = 0
    if val <= bounds[0]:
        discrete_val = 0
    elif val >= bounds[1]:
        discrete_val = n_states - 1
    else:
        discrete_val = int(round((n_states - 1) *
                                 ((val - bounds[0]) /
                                  (bounds[1] - bounds[0]))
                                 ))
    return discrete_val

# discretize every dimension of an observation vector
def discretize_state(vals, s_bounds, n_s):
    discrete_vals = []
    for i in range(len(n_s)):
        discrete_vals.append(discretize(vals[i], s_bounds[i], n_s[i]))
    return np.array(discrete_vals, dtype=np.int32)
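As a quick sanity check, the helper can be exercised on a few hand-picked values; the bounds of (-0.5, 0.5) below are illustrative only and are not taken from the environment:

print(discretize(0.05, (-0.5, 0.5), 10))   # -> 5, maps to the middle bucket
print(discretize(-2.0, (-0.5, 0.5), 10))   # -> 0, clipped to the lowest bucket
print(discretize(2.0, (-0.5, 0.5), 10))    # -> 9, clipped to the highest bucket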
We discretize each observation dimension into 10 buckets; you may want to experiment with different discretization sizes. After that, we take the lower and upper bounds of the observation space, and change the bounds for velocity and angular velocity to be between -1 and +1 instead of -Inf and +Inf. The code is as follows:
import gym

env = gym.make('CartPole-v0')
n_a = env.action_space.n
# number of discrete states for each observation dimension
n_s = np.array([10, 10, 10, 10])  # position, velocity, angle, angular velocity
s_bounds = np.array(list(zip(env.observation_space.low, env.observation_space.high)))
# the velocity and angular velocity bounds from the environment are
# too high, so we bound them between -1 and +1
s_bounds[1] = (-1.0, 1.0)
s_bounds[3] = (-1.0, 1.0)
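To show how the discretization and the initialization fit together, here is a minimal sketch; the variable name q_table and the use of env.reset() from the classic gym API are assumptions rather than part of the original code. The first observation is discretized, and a Q-table is created with one entry per discrete state and action:

# minimal sketch: discretize the first observation and initialize a Q-table
obs = env.reset()                                # classic gym API: reset() returns the observation
s = discretize_state(obs, s_bounds, n_s)         # e.g. array([4, 5, 4, 5])
q_table = np.zeros(shape=np.append(n_s, n_a))    # shape (10, 10, 10, 10, 2), all Q-values start at 0
print(q_table[tuple(s)])                         # Q-values of both actions in the discretized state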