Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation

Note: the RL Algorithms section has not been covered in detail yet; to be continued...

[Image: Snip20190226_32.png]

Video: https://goo.gl/pyMd6p

Environment code: https://goo.gl/jAESt9

Demo code: https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/baselines/enjoy_kuka_diverse_object_grasping.py

Abstract

Question--the proliferation of algorithms makes it difficult to discern which particular approach would be best suited for a rich, diverse task like grasping.
Goal--propose a simulated benchmark for robotic grasping that emphasizes off-policy learning and generalization to unseen objects.
Method--evaluate the benchmark tasks against a variety of Q-function estimation methods, a method previously proposed for robotic grasping with deep neural network models, and a novel approach based on a combination of Monte Carlo return estimation and an off-policy correction.
Results--several simple methods provide a surprisingly strong competitor to popular algorithms such as double Q-learning, and our analysis of stability sheds light on the relative tradeoffs between the algorithms.

I. INTRODUCTION

There are many approaches to the grasping problem, for example:

  1. analytic grasp metrics [43], [36]
  2. learning-based approaches [2]

Although vision-based learning methods have achieved good performance in recent years [22], they do not address the sequential aspect of the grasping task.

They either select a single grasp pose [33], or repeatedly select the next most promising grasp greedily [24].

RL was later introduced as a framework for robotic grasping in a sequential decision-making context, but prior work has been limited to:

  1. a single object [34]
  2. simple geometric shapes such as cubes [40]

In this paper, a realistic simulated benchmark is used to compare a variety of RL methods.
Since successful generalization typically requires training on a large number of objects and scenes [33], [24], collected across many viewpoints and controllers, on-policy learning is impractical for diverse grasping scenarios, whereas off-policy reinforcement learning methods are a good fit.

Aim: to understand which off-policy RL algorithms are best suited for vision-based robotic grasping.
Contributions:

  1. a simulated grasping benchmark for a robotic arm with a two-finger parallel jaw gripper, grasping random objects from a bin.
  2. present an empirical evaluation of off-policy deep RL algorithms on vision-based robotic grasping tasks, covering the following six algorithms (a rough comparison of their regression targets is sketched after this list):
     1. the grasp success prediction approach proposed by [24],
     2. Q-learning [28],
     3. path consistency learning (PCL) [29],
     4. deep deterministic policy gradient (DDPG) [25],
     5. Monte Carlo policy evaluation [39],
     6. Corrected Monte Carlo, a novel off-policy algorithm that extends Monte Carlo policy evaluation for unbiased off-policy learning.
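
As a rough orientation (standard textbook forms, not the paper's exact formulations, and omitting the off-policy correction of Corrected Monte Carlo), the value-based methods in this list differ mainly in the regression target used for Q(s_t, a_t):

    % Standard regression targets, shown for orientation only.
    \begin{align*}
      \text{Q-learning:}    \quad & y_t = r_t + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, a') \\
      \text{Monte Carlo:}   \quad & y_t = \sum_{t'=t}^{T} \gamma^{t'-t}\, r_{t'} \\
      \text{DDPG (critic):} \quad & y_t = r_t + \gamma\, Q_{\theta'}\bigl(s_{t+1}, \pi_{\phi}(s_{t+1})\bigr)
    \end{align*}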

Results show that deep RL can successfully learn grasping of diverse objects from raw pixels, and can grasp previously unseen objects in our simulator with an average success rate of 90%.

II. RELATED WORK

The two main families of model-free algorithms for deep RL:

  1. policy gradient methods [44], [38], [27], [45]
  2. value-based methods [35], [28], [25], [15], [16], with actor-critic algorithms combining the two classes [29], [31], [14].

However, a common drawback of model-free algorithms is that they are difficult to tune.

Moreover, related work on model-free algorithms, including popular benchmarks [7], [1], [3], has mainly focused on:

  • applications in video games
  • relatively simple simulated robot locomotion tasks

rather than what is needed here: diverse tasks with generalization to new environments.

Many RL methods have been applied to real robotic tasks, for example:

  1. guided policy search methods for manipulation tasks: contact-rich, vision-based skills [23], non-prehensile manipulation [10], and tasks involving significant discontinuities [5], [4]
  2. model-free algorithms applied directly to learning robot skills: fitted Q-iteration [21], Monte Carlo return estimates [37], deep deterministic policy gradient [13], trust-region policy optimization [11], and deep Q-networks [46]

These successful RL applications typically tackle only individual skills and do not generalize to skills the robot was not trained on.
All of this build-up emphasizes that the goal of this work is to provide a systematic comparison of deep RL approaches to robotic grasping, with generalization to new objects in a cluttered environment where objects may be obscured and the environment dynamics are complex (unlike [40], [34], and [19], which only consider grasping simply-shaped objects).

There are also many learning strategies other than reinforcement learning for grasping diverse sets of objects; the authors refer readers to the following survey:

[2] J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis - a survey. IEEE Transactions on Robotics, 2014.

Previous methods mainly rely on these three sources of supervision:

  1. human labels [17], [22],
  2. geometric criteria for grasp success computed offline [12],
  3. robot self-supervision, measuring grasp success using sensors on the robot’s gripper [33]
Deep learning based approaches also appeared later: [20], [22], [24], [26], [32].

III. PRELIMINARIES

This section introduces the paper's notation; details are omitted here.


[Image: Snip20190226_33.png]

IV. PROBLEM SETUP

  • Simulation environment: Bullet simulator
  • Timesteps: T = 15
  • A binary reward is given at the final step
  • Reward for a successful grasp:

    [Image: reward for a successful grasp (Snip20190226_34.png)]
  • Reward for a failed grasp: 0
  • The current state st consists of the RGB image from the current viewpoint and the current timestep t; the timestep lets the policy know how many steps remain before the episode ends, e.g. to decide whether there is still time for a pre-grasp manipulation or whether it should move immediately to a good grasping position.
  • The arm is controlled via position control of the vertically-oriented gripper.
    The continuous action is expressed as a Cartesian displacement, where φ is the rotation of the wrist around the z-axis:

    [Image: action representation (Snip20190226_36.png)]
  • When the gripper moves below a fixed height threshold, it closes automatically
  • At the start of each new episode, the positions and orientations of the objects in the bin are randomized (a minimal interaction sketch follows the image below)
[Image: Snip20190226_35.png]
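
A minimal interaction sketch with the simulated environment, assuming the KukaDiverseObjectEnv gym wrapper that the demo code linked above is built on (old gym step API); the constructor arguments shown are assumptions and may differ between pybullet versions. It rolls out one episode with random actions:

    # Minimal random-policy rollout; assumes the KukaDiverseObjectEnv wrapper
    # used by the linked demo code (constructor arguments may vary by version).
    from pybullet_envs.bullet.kuka_diverse_object_gym_env import KukaDiverseObjectEnv

    env = KukaDiverseObjectEnv(renders=False,     # no GUI window
                               isDiscrete=False,  # continuous Cartesian displacement actions
                               maxSteps=15,       # T = 15 as in the problem setup
                               numObjects=5)      # 5 objects in the bin per episode

    obs = env.reset()                  # obs is the RGB image from the camera
    done, episode_return = False, 0.0
    while not done:
        action = env.action_space.sample()          # random displacement (exact dims depend on wrapper options)
        obs, reward, done, info = env.step(action)
        episode_return += reward                    # binary reward arrives at the final step
    print("grasp succeeded" if episode_return > 0 else "grasp failed")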

1) Regular grasping.

900 objects for the training set, 100 for the test set
Each episode has 5 objects in the bin
Objects are swapped out every 20 episodes


[Image: Snip20190226_37.png]

2) Targeted grasping in clutter

All episodes use the same objects
3 target objects are chosen from the 7 objects
The arm is rewarded only when it grasps a target object


[Image: Snip20190226_38.png]

V. REINFORCEMENT LEARNING ALGORITHMS

A. Learning to Grasp with Supervised Learning

This is the grasp success prediction approach of Levine et al. [24]. It does not consider long-horizon returns, but instead uses a greedy controller to choose the action with the highest predicted probability of producing a successful grasp (a hypothetical sketch of this selection rule follows).
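
A hypothetical sketch of this greedy selection rule: sample candidate actions, score each with a learned grasp-success predictor, and execute the best one. The predictor g_theta and the random-sampling maximization are placeholders, standing in for the more sophisticated stochastic optimizer used in the original work:

    # Hypothetical greedy controller over a learned grasp-success predictor
    # g_theta(image, action) -> P(successful grasp); not the exact optimizer of [24].
    import numpy as np

    def select_action(g_theta, image, action_dim=4, num_candidates=64):
        """Sample random candidate actions, score them with the predictor,
        and return the highest-scoring one (greedy, no long-horizon return)."""
        candidates = np.random.uniform(-1.0, 1.0, size=(num_candidates, action_dim))
        scores = [g_theta(image, a) for a in candidates]
        return candidates[int(np.argmax(scores))]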

B. Off-Policy Q-Learning

C. Regression with Monte Carlo Return Estimates

D. Corrected Monte Carlo Evaluation

E. Deep Deterministic Policy Gradient

F. Path Consistency Learning

G. Summary and Unified View

VI. EXPERIMENTS

Four key criteria for evaluating the RL algorithms:

  1. overall performance
  2. data-efficiency
  3. robustness to off-policy data
  4. hyperparameter sensitivity

All algorithms use variants of the deep neural network architecture shown in Figure 3 to represent the Q-function.


[Figure 3: Q-function network architecture (Snip20190226_39.png)]
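
The Figure 3 architecture itself is not reproduced in these notes. As a loose, hypothetical stand-in (not the paper's network), a Q-function for this task maps the RGB observation together with a candidate action to a scalar value, roughly along these lines:

    # Hypothetical Q(s, a) network: small CNN over the RGB image, action vector
    # merged in before the value head. This is NOT the Figure 3 architecture,
    # only an illustration of the input/output interface shared by the algorithms.
    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        def __init__(self, action_dim=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.head = nn.Sequential(
                nn.Linear(64 + action_dim, 128), nn.ReLU(),
                nn.Linear(128, 1))

        def forward(self, image, action):
            # image: (B, 3, H, W) RGB observation; action: (B, action_dim)
            features = self.conv(image)
            return self.head(torch.cat([features, action], dim=1)).squeeze(-1)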

A. Data Efficiency and Performance
