强化学习技巧

Reinforcement learning techniques can be categorized on the basis of the availability of the model as follows:

模型可用：如果模型可用，则智能体可以通过迭代策略或值函数来离线计划，以找到提供最大奖励的最优策略。
- 值迭代学习：在值迭代学习方法中，智能体通过将V(s)初始化为随机值开始，然后重复更新V(s)直到找到最大奖励。
- 策略迭代学习 ：在策略迭代学习方法中，智能体通过初始化随机策略p开始，然后重复更新策略，直到找到最大奖励。
模型不可用：如果模型不可用，则智能体只能通过观察其动作的结果来学习。因此，从观察，行动和奖励的历史来看，智能体会尝试估计模型或尝试直接推导出最优策略：
- 基于模型的学习：在基于模型的学习中，智能体首先从历史中估计模型，然后使用策略或基于价值的方法来找到最优策略。
- 无模型学习：在无模型学习中，智能体不会估计模型，而是直接从历史中估计最优策略。 Q-Learning 是无模型学习的一个例子。

作为示例，值迭代学习的算法如下：

initialize V(s) to random values for all states
Repeat
    for s in states
        for a in actions
            compute Q[s,a]
        V(s) = max(Q[s])   # maximum of Q for all actions for that state
Until optimal value of V(s) is found for all states

策略迭代学习的算法如下：

initialize a policy P_new to random sequence of actions for all states
Repeat
    P = P_new
    for s in states
        compute V(s) with P[s]
        P_new[s] = policy of optimal V(s)
Until P == P_new

强化学习技巧

强化学习技巧

results matching ""

No results matching ""