Teaching Robots with Show and Tell: Using Foundation Models to Synthesize Robot Policies from Language and Visual Demonstration

University of Washington

Abstract

We introduce a modular, neuro-symbolic framework for teaching robots new skills through language and visual demonstration. Our approach, ShowTell, composes a mixture of foundation models to synthesize robot manipulation programs that are easy to interpret and that generalize across a wide range of tasks and environments. ShowTell is designed to handle complex demonstrations involving high-level logic such as loops and conditionals while remaining intuitive and natural for end users. We validate this approach through a series of real-world robot experiments, showing that ShowTell outperforms a state-of-the-art GPT-4V-based baseline on a variety of tasks and that it generalizes to unseen environments and within-category objects.

Approach Overview

An overview of the ShowTell framework. First, the visual and spoken components of the demonstration are processed. An LLM then synthesizes a modular program that jointly reasons about the provided demonstration and novel observations to transfer the demonstrated skill to new scenes.
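To make this pipeline concrete, the sketch below shows one plausible way the pieces could be composed, assuming the demonstration has already been reduced to per-keyframe captions (e.g., from a vision-language model) and a speech transcript. The helper names, the prompt structure, and the robot API string are illustrative assumptions, not the actual ShowTell implementation.

```python
# A minimal sketch of a ShowTell-style pipeline: process the demonstration,
# build a prompt, and ask an LLM to synthesize a reusable program.
# All helper names and the prompt format are hypothetical.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Demonstration:
    """Paired visual and spoken components of a user demonstration."""
    keyframe_captions: List[str]  # e.g., vision-language model output per keyframe
    narration: str                # e.g., speech-to-text output


def build_prompt(demo: Demonstration, api_docs: str) -> str:
    """Assemble an LLM prompt from the processed demonstration and the robot API."""
    steps = "\n".join(f"- {c}" for c in demo.keyframe_captions)
    return (
        f"Robot API:\n{api_docs}\n\n"
        f"Narration: {demo.narration}\n"
        f"Observed keyframes:\n{steps}\n\n"
        "Write a Python function run(scene) that reproduces this skill "
        "on a new scene, using only the Robot API above."
    )


def synthesize_program(demo: Demonstration, llm: Callable[[str], str], api_docs: str) -> str:
    """Query the LLM once and return the synthesized program as source code."""
    return llm(build_prompt(demo, api_docs))


if __name__ == "__main__":
    # Stub LLM for illustration; a real system would call a foundation model here.
    fake_llm = lambda prompt: "def run(scene):\n    pass  # synthesized steps go here"
    demo = Demonstration(
        keyframe_captions=["gripper picks up red block", "red block placed in bin"],
        narration="Put all the red blocks into the bin.",
    )
    print(synthesize_program(demo, fake_llm,
                             api_docs="pick(obj), place(obj, target), detect(name)"))
```

Because the output is an explicit program rather than a policy's raw actions, it can be inspected before execution and re-run on new scenes by swapping in fresh observations.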

Example Synthesized Programs

Qualitative examples of code synthesized by ShowTell for a set of representative demonstrations, showing the ability to follow high-level logic including (a) conditionals, (b) iteration, and (c) segmentation.
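As an illustration of the kind of program such a system might emit, the sketch below combines iteration and a conditional for a demonstration like "put every red block in the bin." The perception and action primitives (detect, pick, place) and their signatures are hypothetical stand-ins, stubbed so the example runs on its own; they are not the paper's actual robot API.

```python
# Illustrative example of a synthesized manipulation program that uses
# iteration (loop over detected blocks) and a conditional (act only on red ones).
from typing import List, NamedTuple


class Obj(NamedTuple):
    name: str
    color: str


# Hypothetical robot/perception primitives, stubbed for a self-contained sketch.
def detect(scene: List[Obj], category: str) -> List[Obj]:
    return [o for o in scene if o.name == category]


def pick(obj: Obj) -> None:
    print(f"pick {obj.color} {obj.name}")


def place(obj: Obj, target: str) -> None:
    print(f"place {obj.color} {obj.name} in {target}")


def run(scene: List[Obj]) -> None:
    """Loop over detected blocks and move only the red ones into the bin."""
    for block in detect(scene, "block"):
        if block.color == "red":
            pick(block)
            place(block, "bin")


if __name__ == "__main__":
    run([Obj("block", "red"), Obj("block", "blue"), Obj("mug", "white")])
```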

Generalization to Unseen Objects, Viewpoints, and Scenes

Wide Variety of Manipulation Skills