The Multi-Armed Bandit Problem

Original article: https://oneraynyday.github.io/ml/2018/05/03/Reinforcement-Learning-Bandit/

The bandit problem is a subset of tabular solution methods; they are called tabular because every state can be looked up in a table.

K-armed Bandit Problem:

One-armed bandit: a slot machine operated by pulling a long handle at the side.

There are K different actions, and each action pays out a sum of money sampled from a distribution conditioned on the action. Given T time steps in total, how do we maximize the total payout?

At is the action taken at time t, and Rt is the reward received. The value of an action is defined as

$$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$$

It means "the value of an action a is the expected value of the reward of the action (at any time)."

Qt(a) is the estimate of q*(a) at time t.

How do we compute Qt(a)?

Value*: in this case, value is a different concept from reward. Value is the long-run metric, while reward is the immediate metric.

Action-value Methods: computed in two steps. First estimate the value of each action, then select a concrete action.

1. Estimating Action Values

Approximate q*(a) by the sample average of the rewards received when a was chosen:

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$

By the law of large numbers, this entails that Qt(a) converges almost surely to q*(a).
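Below is a minimal sketch of this estimate, assuming a hypothetical 10-armed Gaussian testbed; the incremental form Q_{n+1} = Q_n + (R_n − Q_n)/n avoids storing the reward history:

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = rng.normal(size=10)       # true action values (hypothetical testbed)

Q = np.zeros(10)                   # sample-average estimates Q_t(a)
N = np.zeros(10)                   # visit counts N_t(a)

for t in range(1000):
    a = rng.integers(10)           # pick actions at random, just to gather samples
    R = rng.normal(q_star[a])      # reward ~ N(q*(a), 1)
    N[a] += 1
    Q[a] += (R - Q[a]) / N[a]      # incremental mean: Q_{n+1} = Q_n + (R_n - Q_n)/n
```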

2. Selecting Actions

(1) Action Selection Rule: Greedy

Choose the action with the largest estimated value: $A_t = \arg\max_a Q_t(a)$.

(2) Action Selection Rule: ε-Greedy. With probability ε, choose uniformly at random from all actions; otherwise act greedily (see the sketch below).
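A sketch of ε-greedy selection; setting epsilon to 0 recovers the pure greedy rule:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(Q, epsilon):
    """epsilon-greedy: explore uniformly with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # uniform random action
    return int(np.argmax(Q))               # greedy action

# Example: greedy is the epsilon = 0 special case.
Q = np.array([0.1, 0.5, 0.2])
print(select_action(Q, epsilon=0.1))
```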

Exponential recency-weighted average: the estimate can be updated as $Q_{n+1} = Q_n + \alpha\,[R_n - Q_n]$, where α is a constant step-size coefficient. α can also be replaced by a function αn(a) that determines the weight given to the reward at each time step.

Two conditions on αn(a) guarantee that the update above converges with probability 1:

$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$$

The first ensures the steps are large enough to overcome initial conditions and noise; the second ensures they eventually become small enough for convergence. A constant step size α violates the second condition, so the estimates never converge to a specific value and instead keep tracking the most recent rewards.
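A sketch of the constant step-size update; the function name and the α value are illustrative. With fixed α, the reward received k steps ago carries weight proportional to α(1 − α)^k, hence "exponential recency-weighted average":

```python
def update(Q_a, R, alpha=0.1):
    """Exponential recency-weighted average:
    Q_{n+1} = Q_n + alpha * (R_n - Q_n).
    A constant alpha violates sum(alpha_n^2) < inf, so Q never fully
    converges and keeps tracking recent rewards (good for nonstationarity)."""
    return Q_a + alpha * (R - Q_a)
```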

(3) Action Selection Rule: Optimistic Initial Values

The initial value Q1(a) is essentially arbitrary and acts as a hyperparameter. One trick is to set the initial values Q1(a) = C ∀a, where C > q*(a) ∀a. Because every actual reward then falls short of the estimate, the learner is driven to explore all actions early on.
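A sketch of optimistic initialization; the constant C = 5 is a hypothetical choice assuming all q*(a) are smaller:

```python
import numpy as np

K = 10
C = 5.0               # hypothetical optimistic constant, C > q*(a) for all a
Q = np.full(K, C)     # Q_1(a) = C for every action

# Every real reward falls below C, so even a purely greedy rule
# cycles through all the arms before the estimates settle.
a = int(np.argmax(Q))
```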

(4) Action Selection Rule: Upper-Confidence-Bound Selection

UCB picks the action that maximizes the estimated value plus an uncertainty bonus:

$$A_t = \operatorname*{argmax}_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

where Nt(a) is the number of times action a has been selected before time t, and c > 0 controls the degree of exploration. Less-tried actions get a larger bonus, so exploration is directed toward uncertain actions rather than being uniformly random.
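A sketch of UCB selection implementing the formula above; treating untried actions as maximizing is the usual convention:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ].
    Actions with N_t(a) = 0 are considered maximizing actions."""
    untried = np.where(N == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / N)
    return int(np.argmax(Q + bonus))
```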
Double Bandit: video link: https://www.youtube.com/watch?feature=player_embedded&v=2M7mv4-BPCg
