Learning to combine primitive skills

A step towards versatile robotic manipulation

Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision. Traditional task and motion planning (TAMP) methods can solve complex tasks but require full state observability and are not adapted to dynamic scene changes. Recent learning methods can operate directly on visual inputs but typically require many demonstrations and/or task-specific reward engineering. In this work we aim to overcome previous limitations and propose RLBC, a reinforcement learning (RL) approach to task planning that learns to combine primitive skills trained with behavioral cloning (BC). First, compared to previous learning methods, our approach requires neither intermediate rewards nor complete task demonstrations during training. Second, we demonstrate the versatility of our vision-based task planning in challenging settings with temporary occlusions and dynamic scene changes. Third, we propose efficient training of basic skills from few synthetic demonstrations by exploring recent CNN architectures and data augmentation. Notably, while all of our policies are learned from visual inputs in simulated environments, we demonstrate successful transfer and high success rates when applying them to manipulation tasks on a real UR5 robotic arm.

The policies were trained using MImE, a simple interface built on the PyBullet simulator that provides tools to create complex manipulation tasks with a UR5 robotic arm. It comprises three environments of increasing complexity: UR5-Pick, UR5-Bowl and UR5-Breakfast.

RLBC Overview

(a) Temporal hierarchy of master and skill policies
(b) FiLM CNN architecture used for the skill and master policies

Illustration of our approach. (a): The master policy is executed at a coarse interval of n time-steps to select among K skill policies. Each skill policy generates control for a primitive action such as grasping or pouring. (b): The FiLM CNN architecture conditions the network on a given skill, which permits learning a representation shared across all skills along with visual features that can be reused for master task planning.
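To make the conditioning in (b) concrete, below is a minimal NumPy sketch of a FiLM layer: each channel of a shared feature map is scaled and shifted by per-skill parameters. The shapes, the `film` function, and the randomly drawn `gammas`/`betas` are illustrative assumptions; in the paper these parameters are learned per skill inside the CNN.

```python
# Hedged sketch of FiLM conditioning, NOT the paper's exact implementation.
# Assumption: the shared backbone outputs a feature map of shape (C, H, W),
# and each of the K skills owns a (gamma, beta) pair of per-channel vectors.
import numpy as np

def film(features: np.ndarray, gamma: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Feature-wise linear modulation: scale and shift each channel."""
    # Broadcast the (C,) parameters over the (C, H, W) feature map.
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
K, C, H, W = 4, 8, 5, 5                      # toy sizes, chosen arbitrarily
features = rng.standard_normal((C, H, W))    # shared backbone features
gammas = rng.standard_normal((K, C))         # one (gamma, beta) pair per skill
betas = rng.standard_normal((K, C))

skill_id = 2                                 # condition on one chosen skill
out = film(features, gammas[skill_id], betas[skill_id])
assert out.shape == features.shape           # modulation preserves the shape
```

Because only the small (gamma, beta) vectors differ between skills, all K skills share one set of convolutional features, which is what lets the master policy reuse the same visual representation.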

MImE Environments


MImE comprises three robotic manipulation environments. In every environment, the agent controls the robot end-effector and observes the scene through a camera placed in front of the robot. The goal of the agent is to output the correct sequence of actions to perform the task at hand. In UR5-Pick, a cube lies on the table and the goal is to grasp it and lift it in the air. In UR5-Bowl, a cube and a bowl are on the table, and the agent has to put the cube into the bowl. In UR5-Breakfast, the goal is to prepare a simple breakfast: a bottle and a cup are on the table, and the agent has to pour the contents of both containers into the bowl without spilling.
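The temporal hierarchy from the overview figure can be sketched as a simple control loop: the master policy is queried only every n time-steps to pick a skill, while the selected skill emits low-level control at every step. Everything below (`master_policy`, `make_skill`, the scalar "environment") is a hypothetical toy stand-in, not the MImE or RLBC API.

```python
# Toy sketch of the master/skill temporal hierarchy; all names are illustrative.
N_MASTER = 5   # the master acts at a coarse interval of n = 5 time-steps
HORIZON = 20   # episode length
K = 3          # number of primitive skills

def master_policy(obs: int) -> int:
    """Hypothetical master: maps the observation to a skill index in [0, K)."""
    return obs % K

def make_skill(delta: int):
    """Hypothetical skill: emits a constant end-effector displacement per step."""
    def skill(obs: int) -> int:
        return delta
    return skill

# e.g. three primitives standing in for reach / grasp / lift
skill_policies = [make_skill(d) for d in (-1, 0, +1)]

obs = 0
trace = []
for t in range(HORIZON):
    if t % N_MASTER == 0:                    # coarse master decision
        skill_id = master_policy(obs)
    action = skill_policies[skill_id](obs)   # fine-grained skill control
    obs += action                            # toy scalar dynamics
    trace.append((t, skill_id, action))
```

The key property is that the master never outputs motor commands itself; it only re-selects which skill is active every `N_MASTER` steps, which is what removes the need for intermediate rewards or full task demonstrations.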




To cite us, please use:

@inproceedings{strudel2020learning,
  author    = {Robin Strudel and Alexander Pashevich and Igor Kalevatykh and Ivan Laptev and Josef Sivic and Cordelia Schmid},
  title     = {Learning to combine primitive skills: A step towards versatile robotic manipulation},
  booktitle = {ICRA},
  year      = {2020},
}