Large Models / World Models
Overview
- How should a task be described?
- Quantitatively, through physical quantities
- Through an image of the desired end state: (visual) goal-conditioning
- Through language describing the task process: language-conditioning
- How should dexterous manipulation be described?
- Potential issues?
- Compute requirements?
- Reliability? Especially for language-based models: are the answers consistent across repeated queries?
- Generalization to unseen new tasks?
- Adaptability to different scenes? (different textures? visually-diverse environments)
- Can dexterous manipulation tasks be completed?
- Modality fusion across multiple cameras/sensors?
- Can goal-conditioning and language-conditioning be combined?
2025
RDT-1B: A DIFFUSION FOUNDATION MODEL FOR BIMANUAL MANIPULATION
Songming Liu et al., Beijing National Research Center for Information Science and Technology (BNRist), ICLR 2025
- Main contribution: a diffusion foundation model for bimanual manipulation
- Trained on a large amount of data
pre-trained on large-scale datasets. Specifically, our collection of pre-training datasets includes 46 datasets of various robots, with a total size of 1M+ trajectories and 21TB.
- Then fine-tuned on the target bimanual robot:
we have collected 6K+ trajectories, making our dataset one of the largest bimanual datasets nowadays
- Open problems
- Multimodal inputs:
- Low-dimensional inputs: physical quantities
low-dimensional vectors that represent physical quantities of the robot, including the proprioception, the action chunk, and the control frequency.
- Image inputs
- Language inputs
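A minimal numpy sketch of how the low-dimensional inputs listed above (proprioception, action chunk, control frequency) might be projected to a shared width and packed into one token sequence. All sizes and the random linear maps are illustrative assumptions, not RDT-1B's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions -- assumptions, not RDT-1B's configuration.
D = 64            # shared token embedding width
PROPRIO_DIM = 14  # e.g. joint positions of a bimanual robot
CHUNK = 8         # actions predicted per forward pass (the "action chunk")

def embed_low_dim(x, d=D, seed=1):
    """Project a low-dimensional physical quantity to the token width
    with a fixed random linear map (a stand-in for a learned layer)."""
    w = np.random.default_rng(seed).normal(size=(x.shape[-1], d))
    return x @ (w / np.sqrt(x.shape[-1]))

proprio = rng.normal(size=(1, PROPRIO_DIM))           # robot state
action_chunk = rng.normal(size=(CHUNK, PROPRIO_DIM))  # noised actions (diffusion target)
ctrl_freq = np.array([[25.0]])                        # control frequency in Hz

# Pack everything into one token sequence; image and language tokens
# from their own encoders would be appended before the transformer.
tokens = np.concatenate([
    embed_low_dim(proprio, seed=1),
    embed_low_dim(action_chunk, seed=2),
    embed_low_dim(ctrl_freq, seed=3),
], axis=0)
print(tokens.shape)  # (10, 64): 1 proprio + 8 action + 1 frequency tokens
```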
2024
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim et al., Stanford and others, arXiv 2024
- End-to-end large model trained on a large amount of data; built by combining and fine-tuning Llama, DINOv2, and SigLIP
trained on a diverse collection of 970k real-world robot demonstrations
builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP
- Open problems:
- Supports only single-image observations; multi-image or multi-sensor fusion is not considered
only supports single-image observations. In reality, real-world robot setups are heterogeneous, with a wide range of possible sensory inputs
- Improving inference throughput, to enable high-frequency control tasks
improving the inference throughput
- Better performance: raising task success rates
- Other questions: how does the size of the base VLM affect performance? Does co-training on robot action prediction data and Internet-scale vision-language data substantially improve VLA performance? What visual features are best-suited for VLA models?
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
Wenlong Huang et al., Stanford & Columbia, CoRL 2024
- Method: a large vision model (LVM, DINOv2) extracts keypoints; a vision-language model (VLM, GPT-4o) takes the keypoints and visual input and outputs Python functions
- Keypoint proposal (automatically selecting the keypoints for interacting with an object): extract features with DINOv2, add masks with SAM, cluster the masked features with k-means, and take the cluster centers as keypoint candidates
- Open problems (the appendix of this paper discusses some interesting ones):
- In the main text:
- Relies on a forward model of keypoints that assumes rigid (non-deforming) objects
First, the optimization framework relies on a forward model of keypoints based on rigidity assumption, albeit a high-frequency feedback loop that relaxes the accuracy requirement of the model
- Closed-loop operation requires accurate keypoint tracking
Second, ReKep relies on accurate point tracking to correctly optimize actions in closed-loop, which is itself a challenging 3D vision task due to heavy intermittent occlusions.
- The task's stage sequence is fixed
Lastly, the current formulation assumes a fixed sequence of stages (i.e., skeletons) for each task.
- In the appendix:
- Prompting and robustness (consistency of the generated outputs):
we have observed that when dealing with tasks that span many stages with several temporally dependent constraints (A.14), the VLMs lack enough robustness to obtain consistent success
- Task-space planning: planning only in task space may run into kinematic issues
- Articulated object manipulation
- Bimanual coordination
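The keypoint proposal step described above (DINOv2 features, SAM masks, k-means cluster centers) can be sketched roughly as follows. The feature map, mask, and all dimensions are toy stand-ins, not ReKep's actual tensors or hyperparameters.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means on the row vectors of x; returns centers and labels."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        d = np.linalg.norm(x[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers, labels

# Toy stand-ins (assumptions, not ReKep's code):
# `features` imitates a DINOv2 per-pixel feature map; `mask` imitates a SAM mask.
H, W, C = 32, 32, 8
rng = np.random.default_rng(1)
features = rng.normal(size=(H, W, C))
mask = np.zeros((H, W), dtype=bool)
mask[8:24, 8:24] = True

# Cluster the masked pixels on position-augmented features; the mean pixel
# coordinate of each cluster serves as a keypoint candidate.
ys, xs = np.nonzero(mask)
feats = features[ys, xs]
pts = np.stack([ys, xs], axis=1).astype(float)
data = np.concatenate([feats, 0.1 * pts], axis=1)
K = 4
_, labels = kmeans(data, K)
keypoints = np.array([pts[labels == j].mean(axis=0) for j in range(K)])
print(keypoints.shape)  # (4, 2): keypoint candidates in pixel coordinates
```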
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen et al., Google, CVPR 2024
- Main contribution:
We endow VLMs quantitative spatial reasoning capability
- Method: train VLMs on a large-scale spatial VQA dataset
- “generalist robot policies” (GRPs)
Octo: An Open-Source Generalist Robot Policy
Dibya Ghosh et al., UCB/CMU/Google, RSS 2024
- Main contribution: a large transformer-based policy, trained on 800k trajectories
- Open problems:
- Trouble fusing information from multiple cameras
current Octo model struggles with adequately processing wrist camera information. Often finetuning results were stronger when using only a third person camera instead of combining third person and wrist camera.
- Language-conditioning vs. goal-conditioning
we notice a large difference between language-conditioned policy performance and goal-conditioned policy performance
- The Open X-Embodiment dataset involves trade-offs; the training data needs expansion
Expanding the data used to train Octo is a natural avenue of improvement. Since the Open X-Embodiment dataset is comprised of optimal robot demonstrations, the current model trains via imitation; future work may consider learning from sub-optimal or online interaction data that require alternative objectives.
- Only single-arm and bimanual setups were tested; no navigation tasks or mobile robots
- What is a language-conditioned policy? What is a goal-conditioned policy?
- The policy is conditioned on language commands or on a goal image
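To make the distinction above concrete, here is a toy numpy sketch: the same policy network consumes a task token that is either a text embedding (language-conditioning) or a goal-image embedding (goal-conditioning). The encoders are stand-ins (random vectors), and the pooling "policy" is only a placeholder for a transformer like Octo's.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # hypothetical shared embedding width

def policy(obs_tokens, task_token):
    """Toy policy: pool over the [task, obs] token sequence, then apply
    a linear head to get a 7-DoF action. Stands in for a transformer."""
    w = np.random.default_rng(42).normal(size=(D, 7)) / np.sqrt(D)
    ctx = np.concatenate([task_token[None], obs_tokens], axis=0).mean(axis=0)
    return ctx @ w

obs = rng.normal(size=(4, D))    # current camera-image tokens

# Language-conditioning: the task token comes from a text encoder.
lang_token = rng.normal(size=D)  # stand-in for encode_text("pick up the cup")
a1 = policy(obs, lang_token)

# Goal-conditioning: the task token comes from encoding a goal image.
goal_token = rng.normal(size=D)  # stand-in for encode_image(goal_image)
a2 = policy(obs, goal_token)

print(a1.shape, a2.shape)  # both (7,): same policy, different conditioning
```

The point of the sketch is that the policy interface is identical in both cases; only the source of the task token changes, which is why combining the two conditionings is a natural question.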
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Abby O’Neill et al., many organizations, ICRA 2024
- Main contributions: the dataset, and RT-X, a high-capacity model; two variants were trained, based on RT-1 and RT-2 respectively
- Open problems:
- Multi-sensor modality fusion is not considered
- Can it be applied to more robots?
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
Xiaoqi Li et al., PKU, CVPR 2024
- Main contribution: object-centric robot manipulation; chain-of-thought fine-tuning and inference give the end-effector pose predictions stability and interpretability (distinguishing direction and position)
a chain-of-thought fine-tuning and inference strategy that exploits MLLMs’ reasoning ability to enable robust and explainable end-effector’s pose predictions.
- Implementation: CLIP as the visual front end, LLaMA-Adapter for task decision-making
- Open problems
- Decision generation again struggles under scene diversity
In contrast, position predictions are susceptible to domain gaps caused by factors like lighting and texture. even though there might be imprecise directions
- A writing issue in this paper: Table 1 is never referenced or analyzed in detail? Some tasks have low success rates; why?
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
An-Chieh Cheng et al., UCSD/NVidia, NeurIPS 2024
- Strengthens VLMs' spatial perception and reasoning
enhance VLMs’ spatial perception and reasoning capabilities
- Open problems:
- Uses AABBs (axis-aligned bounding boxes), which makes the label representation imprecise; switching to OBBs (oriented bounding boxes) could help
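A small numpy sketch of why the AABB label representation can be loose: for an elongated, rotated point cloud, the axis-aligned box is much larger than a PCA-derived oriented box. The point cloud and the PCA-based OBB construction are illustrative assumptions, not SpatialRGPT's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# A thin, rotated 2D point cloud: a worst case for axis-aligned boxes.
pts = rng.normal(size=(200, 2)) * np.array([3.0, 0.2])
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
pts = pts @ rot.T

# AABB: extents along the fixed world axes.
aabb_size = pts.max(axis=0) - pts.min(axis=0)

# OBB via PCA: extents along the point cloud's principal axes.
centered = pts - pts.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt.T
obb_size = proj.max(axis=0) - proj.min(axis=0)

# The OBB is much tighter for elongated, rotated objects, which is why
# it gives a more faithful label representation than an AABB.
print(np.prod(aabb_size) > np.prod(obb_size))  # True
```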
ManiFoundation Model for General-Purpose Robotic Manipulation of Contact Synthesis with Arbitrary Objects and Robots
Zhixuan Xu et al., NUS, IROS 2024
- Main contribution
- Open problems:
- Handles only quasi-static tasks; should be extended to highly dynamic tasks whose full paths consist of multiple steps
- The current model focuses on single-fingertip contact; should be extended to multi-hand and surface-contact scenarios
- Scale up the current network's parameters
2023
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan et al., Google, RSS 2023
- End-to-end large model, trained from scratch on a large amount of data
Trained on a large and broad dataset; can execute very long-horizon tasks (50 stages).
Our model is built on a Transformer architecture and takes a history of images and task description as input and directly outputs tokenized actions
Our primary dataset consists of ∼130k robot demonstrations, collected with a fleet of 13 robots over the course of 17 months.
- Open problems:
- Uses imitation learning, so it cannot surpass the performance of the demonstrations
(it may not be able to surpass the performance of the demonstrators)
- Cannot generate motions it has not seen
the generalization to new instructions is limited to the combinations of previously seen concepts and RT-1 is not yet able to generalize to a completely new motion that has not been seen before
- Adaptability across different environments needs improvement
expand the generalization capabilities of these models to generalize to much more diverse environments
- Lacks dexterous manipulation tasks
method is presented on a large but not very dexterous set of manipulation tasks
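RT-1's "tokenized actions" can be illustrated with per-dimension uniform binning, the scheme the paper describes (256 bins per action dimension). The 7-DoF action vector and its bounds below are made-up examples, not RT-1's actual action space.

```python
import numpy as np

BINS = 256  # RT-1 discretizes each action dimension into 256 uniform bins

def tokenize_action(a, low, high, bins=BINS):
    """Map each continuous action dimension to an integer token id."""
    a = np.clip(a, low, high)
    t = np.floor((a - low) / (high - low) * bins).astype(int)
    return np.clip(t, 0, bins - 1)

def detokenize_action(t, low, high, bins=BINS):
    """Map token ids back to bin-center continuous values."""
    return low + (t + 0.5) / bins * (high - low)

# Hypothetical 7-DoF action and bounds (illustrative only).
low = np.full(7, -1.0)
high = np.full(7, 1.0)
action = np.array([0.3, -0.7, 0.0, 0.99, -1.0, 1.0, 0.5])

tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
# Round-trip error is bounded by one bin width.
print(np.max(np.abs(recovered - action)) < (high[0] - low[0]) / BINS)  # True
```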
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan et al., Google
- Approach:
fine-tuning large VLMs, to directly perform closed-loop robot control
- Incorporates VLMs into end-to-end robot control, realizing a VLA (vision-language-action) model
enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web
- Uses a large amount of data
Representing the knowledge necessary to perform such a wide range of tasks requires large models and webscale datasets.
- Open problems:
- Ability to produce new motions
the robot does not acquire any ability to perform new motions by virtue of including this additional experience. The model’s physical skills are still limited to the distribution of skills seen in the robot data, but it learns to deploy those skills in new ways.
- Compute requirements are too high
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Konstantinos Bousmalis et al., Google, TMLR2023
- Main contributions:
- Validates that a large transformer sequence model is feasible for robot tasks
- Examines the ability to learn new tasks from a small number of demonstrations
- Approach: a visual goal-conditioned decision transformer / large transformer sequence model
- Trained on a large amount of data
we trained RoboCat on a very large dataset of diverse manipulation behaviours
- Open problems:
- Multimodal tasks
- Combining language-conditioning with (visual) goal-conditioning
- Fine-tuning with RL?
- Adapting to more scenes
We hope that next-generation foundation agents will demonstrate robustness to different basket textures and operate in more visually-diverse environments in the wild
2022
PERCEIVER-ACTOR: A Multi-Task Transformer for Robotic Manipulation
Mohit Shridhar et al., UW, CoRL 2022
- Arguably learning from demonstration? Tasks are acquired from a few demos; this paper's references and its analysis of limitations are worth reading (partly discussed in the supplemental material as well)
- Transformer-based: takes a voxel observation and a language goal, and outputs discretized end-effector position, rotation, and gripper open/close; uses CLIP's language encoder (a pre-trained language model)
The language encodings are finetuned with a linear layer and then appended with the voxel encodings to form the input sequence.
- Comment: the method combines CLIP with a transformer; training the agent seems to take a long time, but the amount of data is not emphasized, only that, once trained, the agent can acquire a task from a few demonstrations
- Open problems:
- Sampling-based motion planner: an issue if the task is sensitive to the planned path (e.g., pouring water requires a smooth path)
- Dynamic manipulation: lacks real-time closed-loop maneuvering
- Dexterous manipulation:
Using discretized actions with N-DoF robots like multi-fingered hands is also non-trivial.
- Generalization to novel instances and objects:
We observe that changing the shape of the handles does not affect performance. However, handles with randomized textures and colors confuse the agent since it has only seen one type of drawer color and texture during training.
- Scope of language grounding:
PERACT’s understanding of verb-noun phrases is closely grounded in demonstrations and tasks
- Predicting task completion: extra external information is needed to confirm that a task is done
- History and partial observability:
relies purely on the current observation to predict the next action.
- Data augmentation with kinematic feasibility: training data must account for whether the arm can actually execute the path
- Balanced datasets:
For instance, PERACT might have a tendency to always “place blue blocks on yellow blocks” if such an example is over-represented in the training data.
- Multi-task optimization: similar to the previous point; training one task can degrade performance on another, so data volumes need balancing
Since all tasks are weighted equally, optimizing for certain tasks with common elements (e.g., moving blocks), might adversarially affect the performance on other dissimilar tasks (e.g., turning taps)
- Deployment risks: safety is not guaranteed
might contain harmful biases
- Is the training data actually large?
- Training the PERACT agent:
The agent was trained with a batch-size of 16 on 8 NVIDIA V100 GPUs for 16 days (600K iterations)
- Once the agent is trained, the learning-from-demonstration data is small
PERACT is trained with just a few demonstrations:
We report average success rates on 25 evaluation episodes per task (25x18 = 450 total episodes) for agents trained with n = 10, 100 demonstrations per task
Without any sim-to-real transfer or pre-training, we trained a multi-task PERACT agent from scratch on 7 tasks (with 18 unique variations) from a total of just 53 demonstrations.
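PERACT's input construction quoted earlier ("The language encodings are finetuned with a linear layer and then appended with the voxel encodings to form the input sequence") can be sketched as below. The token counts, widths, and random weights are illustrative assumptions, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hypothetical transformer token width

# Stand-ins for the real encoder outputs (shapes are assumptions):
lang_tokens = rng.normal(size=(77, 512))    # CLIP language encoding
voxel_tokens = rng.normal(size=(8**3, D))   # encoded voxel patches

# Finetuned linear layer on the language encodings, then append them
# to the voxel encodings to form the transformer's input sequence.
w = rng.normal(size=(512, D)) / np.sqrt(512)
lang_proj = lang_tokens @ w

input_seq = np.concatenate([lang_proj, voxel_tokens], axis=0)
print(input_seq.shape)  # (589, 64): 77 language tokens + 512 voxel tokens
```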