CoPa: General Robotic Manipulation through
Spatial Constraints of Parts with Foundation Models

Haoxu Huang2,3,4*, Fanqi Lin1,2,4*, Yingdong Hu1,2,4, Shengjie Wang1,2,4, Yang Gao1,2,4
1Institute of Interdisciplinary Information Sciences, Tsinghua University. 2Shanghai Qi Zhi Institute. 3Shanghai Jiao Tong University. 4Shanghai Artificial Intelligence Laboratory.
*The first two authors contributed equally.


We propose Robotic Manipulation through Spatial Constraints of Parts (CoPa), a novel framework that incorporates the common sense knowledge embedded within foundation vision-language models (VLMs), such as GPT-4V, into low-level robotic manipulation tasks.


CoPa is capable of handling diverse open-set instructions and objects in a training-free manner.

Abstract

Foundation models pre-trained on web-scale data are shown to encapsulate extensive world knowledge beneficial for robotic manipulation in the form of task planning. However, the actual physical implementation of these plans often relies on task-specific learning methods, which require significant data collection and struggle with generalizability. In this work, we introduce Robotic Manipulation through Spatial Constraints of Parts (CoPa), a novel framework that leverages the common sense knowledge embedded within foundation models to generate a sequence of 6-DoF end-effector poses for open-world robotic manipulation. Specifically, we decompose the manipulation process into two phases: task-oriented grasping and task-aware motion planning. In the task-oriented grasping phase, we employ foundation vision-language models (VLMs) to select the object’s grasping part through a novel coarse-to-fine grounding mechanism. During the task-aware motion planning phase, VLMs are utilized again to identify the spatial geometry constraints of task-relevant object parts, which are then used to derive post-grasp poses. We also demonstrate how CoPa can be seamlessly integrated with existing robotic planning algorithms to accomplish complex, long-horizon tasks. Our comprehensive real-world experiments show that CoPa possesses a fine-grained physical understanding of scenes, capable of handling open-set instructions and objects with minimal prompt engineering and without additional training.

CoPa

Left: Our pipeline. Given an instruction and a scene observation, CoPa first generates a grasp pose through the Task-Oriented Grasping Module. A Task-Aware Motion Planning Module is then used to obtain post-grasp poses. Right: Examples of real-world experiments. Boasting a fine-grained physical understanding of scenes, CoPa can generalize to open-world scenarios, handling open-set instructions and objects with minimal prompt engineering and without the need for additional training.

Grounding Module

This module is used to identify the grasping part for task-oriented grasping, or the task-relevant parts for task-aware motion planning. The grounding process is divided into two stages: coarse-grained object grounding and fine-grained part grounding. Specifically, we first segment and label the objects within the scene using SoM. Then, in conjunction with the instruction, we employ GPT-4V to select the grasping/task-relevant objects. Finally, a similar fine-grained part grounding stage is applied within the selected objects to locate the specific grasping/task-relevant parts.
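As a rough illustration of this coarse-to-fine mechanism, the grounding could be organized along the lines of the Python sketch below. The helper names (som_segment, query_gpt4v, crop_to_object) are hypothetical placeholders rather than CoPa's actual implementation.

  # Hypothetical sketch of the coarse-to-fine grounding mechanism.
  # som_segment, query_gpt4v, and crop_to_object are assumed helpers,
  # not part of any released CoPa code.

  def ground_part(image, instruction):
      # Stage 1: coarse-grained object grounding.
      object_masks, annotated_img = som_segment(image)      # SoM: segment and number objects
      object_id = query_gpt4v(
          annotated_img,
          f"Instruction: {instruction}. Which numbered object is relevant?")
      object_mask = object_masks[object_id]

      # Stage 2: fine-grained part grounding on the cropped object.
      cropped = crop_to_object(image, object_mask)
      part_masks, annotated_crop = som_segment(cropped)      # SoM again, now at part level
      part_id = query_gpt4v(
          annotated_crop,
          f"Instruction: {instruction}. Which numbered part should be grasped or constrained?")
      return part_masks[part_id]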

Task-Oriented Grasping Module

This module is employed to generate grasp poses. Initially, grasp pose candidates are generated from the scene point cloud using GraspNet. Concurrently, given the instruction and the scene image, the grasping part is identified by a grounding module. Ultimately, the final grasp pose is selected by filtering candidates based on the grasping part mask and GraspNet scores.
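The final filtering step might look roughly like the sketch below; the candidate fields (point, score) and the pinhole intrinsics K are assumptions for illustration, not GraspNet's actual interface.

  import numpy as np

  # Rough sketch of grasp selection: keep GraspNet candidates whose grasp
  # points project inside the grounded part mask, then pick the highest score.
  # Candidate attributes and the camera intrinsics K are assumed here.

  def select_grasp(candidates, part_mask, K):
      best, best_score = None, -np.inf
      for grasp in candidates:
          u, v, w = K @ grasp.point                 # project the 3D grasp point (camera frame)
          u, v = int(u / w), int(v / w)
          inside = 0 <= v < part_mask.shape[0] and 0 <= u < part_mask.shape[1]
          if inside and part_mask[v, u] and grasp.score > best_score:
              best, best_score = grasp, grasp.score
      return best                                   # highest-scoring grasp on the target part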

Task-Aware Motion Planning Module

This module is used to obtain a series of post-grasp poses. Given the instruction and the current observation, we first employ a grounding module to identify task-relevant parts within the scene. Subsequently, these parts are modeled in 3D, and are then projected and annotated onto the scene image. Following this, VLMs are utilized to generate spatial constraints for these parts. Finally, a solver is applied to calculate the post-grasp poses based on these constraints.
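A minimal sketch of how such constraints could be turned into a post-grasp pose with an off-the-shelf optimizer is shown below; the constraint representation and cost terms are assumptions for illustration, not CoPa's exact solver.

  import numpy as np
  from scipy.optimize import minimize

  # Minimal sketch: spatial constraints are expressed as costs over the
  # end-effector pose (here a 3D position plus an approach direction), and a
  # generic optimizer searches for a pose that satisfies them. The constraint
  # format is an assumption for illustration.

  def solve_post_grasp_pose(constraints, init_pose):
      def cost(x):
          pos = x[:3]
          direction = x[3:] / np.linalg.norm(x[3:])
          total = 0.0
          for c in constraints:
              if c["type"] == "distance":           # e.g. keep a spout a fixed distance from a target
                  total += (np.linalg.norm(pos - c["target"]) - c["value"]) ** 2
              elif c["type"] == "parallel":         # e.g. align an axis with the table normal
                  total += 1.0 - abs(direction @ c["axis"])
          return total

      res = minimize(cost, init_pose, method="Nelder-Mead")
      return res.x                                  # position + approach direction

For instance, a pouring instruction might reduce to a distance constraint between the container's spout and the target opening, together with an orientation constraint that tilts the container.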



Visualization of Spatial Constraints


Visualization of spatial constraints and the post-grasp end-effector trajectory.

Integration with High-Level Planning

We demonstrate how CoPa can be seamlessly integrated with ViLa, a high-level planner, to accomplish long-horizon tasks. The high-level planner generates a sequence of sub-goals, each of which is then executed by CoPa. These results show that CoPa can be readily combined with existing high-level planning algorithms to accomplish complex, long-horizon tasks.

User: "Make me a cup of pour-over coffee."

ViLa:
  1. “Scoop coffee beans”
  2. “Pour beans into coffee machine”
  3. “Turn on coffee machine”
  4. “Put funnel onto carafe”
  5. “Pour powder into funnel”
  6. “Pour water to funnel”
User: "Set up table for a romantic dinner."

ViLa:
  1. “Put flowers into vase”
  2. “Right fallen bottle”
  3. “Place fork and knife”
  4. “Pour wine”
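To make the integration described above concrete, the interplay between the high-level planner and CoPa can be pictured as a short control loop; vila_plan, copa_execute, and get_observation are hypothetical names, not the actual APIs.

  # Hypothetical sketch of the ViLa + CoPa loop: ViLa decomposes the instruction
  # into sub-goals, and CoPa executes each one as low-level 6-DoF end-effector
  # poses. All function names here are placeholders.

  def run_long_horizon_task(instruction, get_observation):
      while True:
          obs = get_observation()
          subgoals = vila_plan(instruction, obs)    # e.g. ["Scoop coffee beans", ...]
          if not subgoals:                          # planner reports the task is complete
              break
          copa_execute(subgoals[0], obs)            # task-oriented grasp + task-aware motion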