Foundation models pre-trained on web-scale data have been shown to encapsulate extensive world knowledge that benefits robotic manipulation in the form of task planning. However, the actual physical execution of these plans often relies on task-specific learning methods, which require significant data collection and struggle to generalize. In this work, we introduce Robotic Manipulation through Spatial Constraints of Parts (CoPa), a novel framework that leverages the common sense knowledge embedded within foundation models to generate a sequence of 6-DoF end-effector poses for open-world robotic manipulation. Specifically, we decompose the manipulation process into two phases: task-oriented grasping and task-aware motion planning. In the task-oriented grasping phase, we employ foundation vision-language models (VLMs) to select the object's grasping part through a novel coarse-to-fine grounding mechanism. During the task-aware motion planning phase, VLMs are utilized again to identify the spatial geometry constraints of task-relevant object parts, which are then used to derive post-grasp poses. We also demonstrate how CoPa can be seamlessly integrated with existing robotic planning algorithms to accomplish complex, long-horizon tasks. Our comprehensive real-world experiments show that CoPa possesses a fine-grained physical understanding of scenes, enabling it to handle open-set instructions and objects with minimal prompt engineering and without additional training.
Left: Our pipeline. Given an instruction and a scene observation, CoPa first generates a grasp pose through the Task-Oriented Grasping Module. Subsequently, the Task-Aware Motion Planning Module is utilized to obtain post-grasp poses. Right: Examples of real-world experiments. Boasting a fine-grained physical understanding of scenes, CoPa can generalize to open-world scenarios, handling open-set instructions and objects with minimal prompt engineering and without the need for additional training.
This module is utilized to identify the grasping part for task-oriented grasping, or the task-relevant parts for task-aware motion planning. The grounding process is divided into two stages: coarse-grained object grounding and fine-grained part grounding. Specifically, we first segment and label objects within the scene using SoM. Then, in conjunction with the instruction, we employ GPT-4V to select the grasping/task-relevant objects. Finally, fine-grained part grounding is applied in the same manner to locate the specific grasping/task-relevant parts.
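Below is a minimal Python sketch of this coarse-to-fine procedure. The callables segment_and_label (a SoM-style mark annotator) and query_vlm (a GPT-4V query returning the index of the selected mark) are hypothetical placeholders supplied by the caller, not the actual implementation.

```python
from typing import Callable, List, Tuple
import numpy as np

def ground_part(
    image: np.ndarray,
    instruction: str,
    segment_and_label: Callable[[np.ndarray], Tuple[List[np.ndarray], np.ndarray]],
    query_vlm: Callable[[np.ndarray, str], int],
) -> np.ndarray:
    """Two-stage coarse-to-fine grounding: pick an object, then a part of it."""
    # Stage 1: coarse-grained object grounding. Segment the scene, overlay
    # numeric marks (SoM-style), and let the VLM pick the relevant object.
    object_masks, annotated_scene = segment_and_label(image)
    obj_id = query_vlm(
        annotated_scene,
        f"Instruction: {instruction}\nWhich numbered object is relevant?",
    )

    # Crop the image to the chosen object's bounding box.
    ys, xs = np.where(object_masks[obj_id])
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # Stage 2: fine-grained part grounding, applied the same way on the crop.
    part_masks, annotated_crop = segment_and_label(crop)
    part_id = query_vlm(
        annotated_crop,
        f"Instruction: {instruction}\nWhich numbered part is relevant?",
    )
    return part_masks[part_id]
```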
This module is employed to generate grasp poses. First, grasp pose candidates are generated from the scene point cloud using GraspNet. Concurrently, given the instruction and the scene image, the grasping part is identified by the grounding module. Finally, the grasp pose is selected by filtering candidates with the grasping part mask and ranking the remainder by their GraspNet scores.
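The selection step can be sketched as follows, assuming each GraspNet candidate exposes a 3D grasp center, a 6-DoF pose, and a confidence score; the names and the exact filtering rule are illustrative rather than the authors' code.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class GraspCandidate:
    center: np.ndarray   # (3,) grasp center in the camera frame
    pose: np.ndarray     # (4, 4) end-effector pose
    score: float         # GraspNet confidence

def project_to_image(point: np.ndarray, intrinsics: np.ndarray) -> Tuple[int, int]:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    u = intrinsics[0, 0] * x / z + intrinsics[0, 2]
    v = intrinsics[1, 1] * y / z + intrinsics[1, 2]
    return int(round(u)), int(round(v))

def select_grasp(
    candidates: List[GraspCandidate],
    part_mask: np.ndarray,       # boolean mask of the grasping part
    intrinsics: np.ndarray,      # (3, 3) camera intrinsics
) -> np.ndarray:
    """Keep candidates whose grasp center projects inside the grasping-part
    mask, then return the pose with the highest GraspNet score."""
    kept = []
    for c in candidates:
        u, v = project_to_image(c.center, intrinsics)
        if 0 <= v < part_mask.shape[0] and 0 <= u < part_mask.shape[1] and part_mask[v, u]:
            kept.append(c)
    if not kept:                 # fall back to all candidates if none project onto the part
        kept = list(candidates)
    return max(kept, key=lambda c: c.score).pose
```

(The snippet also imports `Tuple` from `typing` in practice; it is omitted above only where already shown in the previous sketch.)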
This module is used to obtain a series of post-grasp poses. Given the instruction and the current observation, we first employ the grounding module to identify task-relevant parts within the scene. Subsequently, these parts are modeled in 3D and then projected and annotated onto the scene image. Following this, VLMs are utilized to generate spatial constraints for these parts. Finally, a solver is applied to calculate the post-grasp poses that satisfy these constraints.
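As a rough illustration, the sketch below solves one simple alignment constraint (the grasped part's axis anti-parallel to a target direction, with its key point placed a given distance away along that direction) using a generic nonlinear optimizer. The constraint forms, parameterization, and solver are assumptions for illustration; CoPa's actual solver may differ.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation as R

def solve_post_grasp_pose(
    grasped_vec: np.ndarray,    # unit axis of the grasped part (e.g. a spout direction)
    grasped_point: np.ndarray,  # key point on the grasped part
    target_vec: np.ndarray,     # unit direction of the task-relevant part (e.g. surface normal)
    target_point: np.ndarray,   # key point on the task-relevant part
    distance: float = 0.0,      # desired offset along target_vec
) -> np.ndarray:
    """Find a rigid transform of the grasped part so its axis is anti-parallel
    to target_vec and its key point lies `distance` metres from target_point."""
    grasped_vec = grasped_vec / np.linalg.norm(grasped_vec)
    target_vec = target_vec / np.linalg.norm(target_vec)

    def cost(x):
        rot, trans = R.from_rotvec(x[:3]), x[3:]
        v = rot.apply(grasped_vec)
        p = rot.apply(grasped_point) + trans
        align = 1.0 + np.dot(v, target_vec)            # 0 when exactly anti-parallel
        goal = target_point + distance * target_vec
        return align + np.linalg.norm(p - goal) ** 2

    res = minimize(cost, np.zeros(6), method="L-BFGS-B")
    rot, trans = R.from_rotvec(res.x[:3]), res.x[3:]
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = rot.as_matrix(), trans
    return T  # apply this transform to the current end-effector pose
```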
Visualization of spatial constraints & post-grasp end-effector trajectory
We demonstrate seamless integration with ViLa to accomplish long-horizon tasks: the high-level planner generates a sequence of sub-goals, which are then executed by CoPa. The results show that CoPa can be easily combined with existing high-level planning algorithms to accomplish complex, long-horizon tasks, as sketched below.
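A minimal sketch of this planner-executor loop follows; high_level_planner (a ViLa-style VLM planner) and copa_execute are hypothetical callables standing in for the actual modules.

```python
from typing import Callable, List
import numpy as np

def run_long_horizon_task(
    instruction: str,
    get_observation: Callable[[], np.ndarray],
    high_level_planner: Callable[[str, np.ndarray], List[str]],  # e.g. a ViLa-style planner
    copa_execute: Callable[[str, np.ndarray], bool],             # grasping + motion planning + execution
) -> bool:
    """Decompose the instruction into sub-goals and execute each with CoPa."""
    sub_goals = high_level_planner(instruction, get_observation())
    for goal in sub_goals:
        # Each sub-goal passes through CoPa's task-oriented grasping and
        # task-aware motion planning before being executed on the robot.
        if not copa_execute(goal, get_observation()):
            return False  # abort if a sub-goal fails
    return True
```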