Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision. Traditional task and motion planning (TAMP) methods can solve complex tasks but require full state observability and are not adapted to dynamic scene changes. Recent learning methods can operate directly on visual inputs but typically require many demonstrations and/or task-specific reward engineering. In this work we aim to overcome these limitations and propose RLBC, a reinforcement learning (RL) approach to task planning that learns to combine primitive skills trained with behavioral cloning (BC).
First, compared to previous learning methods, our approach requires
neither intermediate rewards nor complete task demonstrations
during training. Second, we demonstrate the versatility of
our vision-based task planning in challenging settings with
temporary occlusions and dynamic scene changes. Third, we
propose efficient training of basic skills from few synthetic demonstrations by exploring recent CNN architectures and data augmentation. Notably, while all of our policies are learned from visual inputs in simulated environments, we demonstrate successful transfer and high success rates when applying such policies to manipulation tasks on a real UR5 robotic arm.
The policies were trained using MImE, a simple interface based on the PyBullet simulator that provides tools to create complex manipulation tasks with a UR5 robotic arm.
It comprises three environments ranging from simple to more complex tasks: UR5-Pick, UR5-Bowl, and UR5-Breakfast.
Illustration of our approach. (a): The master policy is executed at a coarse interval of n time-steps to select among K skill policies. Each skill policy generates control for a primitive action such as grasping or pouring. (b): The FiLM CNN architecture conditions the network on the skill being performed, which permits learning a shared representation across all skills, along with visual features that can be reused for master task planning.
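The two-level control scheme in (a) can be sketched as a simple loop: the master policy acts only every n time-steps to pick one of K skills, while the selected skill produces low-level control at every step. The sketch below is purely illustrative; the policy functions, constants, and the dictionary observation are stand-ins, not the paper's actual code or interfaces.

```python
K = 3          # number of skill policies (e.g. grasp, lift, pour) - illustrative
N_MASTER = 5   # master acts at a coarse interval of n = 5 time-steps
HORIZON = 20   # total episode length

def master_policy(observation):
    """Toy master: cycles through skills based on the time-step.
    A learned master would instead map the camera image to a skill index."""
    t = observation["t"]
    return (t // N_MASTER) % K

def skill_policy(skill_id, observation):
    """Toy skill: a constant end-effector velocity per skill.
    A learned BC skill would map the image to continuous control."""
    velocities = [(0.0, 0.0, -0.1), (0.0, 0.0, 0.1), (0.1, 0.0, 0.0)]
    return velocities[skill_id]

def run_episode():
    trace = []
    skill = None
    for t in range(HORIZON):
        obs = {"t": t}
        if t % N_MASTER == 0:              # coarse master interval
            skill = master_policy(obs)     # select among K skills
        action = skill_policy(skill, obs)  # fine-grained control every step
        trace.append((t, skill, action))
    return trace

trace = run_episode()
```

The key design choice mirrored here is the time-scale separation: the master reasons about *which* primitive to run, while each skill handles *how* to execute it at every control step.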
MImE is composed of three robotic environments for manipulation. In each environment, the agent controls the robot's end-effector and observes the scene through a camera placed in front of the robot. The goal of the agent is to output the correct sequence of actions to perform the task at hand. In UR5-Pick, a cube lies on the table and the goal is to grasp it and lift it in the air. In UR5-Bowl, a cube and a bowl are on the table, and the agent has to put the cube into the bowl. In UR5-Breakfast, the goal is to prepare a simple breakfast: a bottle and a cup are on the table, and the agent must pour the two containers into the bowl without spilling any drops.
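The interaction pattern described above (camera observation in, end-effector action out, sparse task success) can be sketched with a minimal Gym-style loop. The environment class below is a stub standing in for something like UR5-Bowl; the actual MImE class names and method signatures are not shown in this page and may differ.

```python
class StubUR5Env:
    """Minimal stand-in for a MImE-like environment:
    image-like observations, end-effector actions, sparse reward."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self._observe()

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        reward = 1.0 if done else 0.0  # sparse success signal, no shaping
        return self._observe(), reward, done

    def _observe(self):
        # A real environment would render a frontal camera image here.
        return [[0.0] * 4 for _ in range(4)]

env = StubUR5Env()
obs = env.reset()
done = False
steps = 0
while not done:
    action = (0.0, 0.0, -0.05)  # placeholder end-effector velocity command
    obs, reward, done = env.step(action)
    steps += 1
```

The sparse reward in the stub reflects the setting the abstract emphasizes: no intermediate rewards are available during training, only eventual task success.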
@inproceedings{rlbc2020,
author = {Robin Strudel and Alexander Pashevich and Igor Kalevatykh and Ivan Laptev and Josef Sivic and Cordelia Schmid},
title = {Learning to combine primitive skills: A step towards versatile robotic manipulation},
booktitle = {ICRA},
year = {2020},
}