Training robots to execute tasks in the real world requires data — the more, the better. The problem is that creating these datasets takes a lot of time and effort, and methods don’t scale well. That’s where Robot Learning with Semantically Imagined Experience (ROSIE) comes in.

The basic concept is straightforward: enhance training data with hallucinated elements to change details, add variations, or introduce novel distractions. Studies show a robot additionally trained on this data performs tasks better than one without.

Supplementary Video for Scaling Robot Learning with Semantically Imagined Experience 5 20 screenshot
This robot is able to deposit an object into a metal sink it has never seen before, thanks to hallucinating a sink in place of an open drawer in its original training data.

Suppose one has a dataset consisting of a robot arm picking up a coke can and placing it into an orange lunchbox. That training data is used to teach the arm how to do the task. But in the real world, maybe there is distracting clutter on the countertop. Or, the lunchbox in the training data was empty, but the one on the counter right now already has a sandwich inside it. The further a real-world task differs from the training dataset, the less capable and accurate the robot becomes.

ROSIE aims to alleviate this problem by using image diffusion models (such as Imagen) to enhance the training data in targeted and direct ways. In one example, a robot has been trained to deposit an object into a drawer. ROSIE augments this training by inpainting the drawer in the training data, replacing it with a metal sink. A robot trained on both datasets competently performs the task of placing an object into a metal sink, despite the fact that a sink never actually appears in the original training data, nor has the robot ever seen this particular real-world sink. A robot without the benefit of ROSIE fails the task.

Here is a link to the team’s paper, and embedded below is a video demonstrating ROSIE both in concept and in action. This is also in a way a bit reminiscent of a plug-in we recently saw for Blender, which uses an AI image generator to texture entire 3D scenes with a simple text prompt.