Large Models / World Models
Overview
- How should a task be described?
- Quantitatively, through physical quantities
- Through an image of the desired end state: (visual) goal-conditioning
- Through language describing the task process: language-conditioning
- How should dexterous manipulation be described?
- Potential issues?
- Compute requirements?
- Reliability? Especially for language-based models: are the answers consistent across repeated queries?
- Generalization to unseen new tasks?
- Adaptability to different scenes? (different textures? visually-diverse environments)
- Can dexterous manipulation tasks be completed?
- Modality fusion across multiple cameras/sensors?
- Can goal-conditioning and language-conditioning be combined?
2025
RDT-1B: A DIFFUSION FOUNDATION MODEL FOR BIMANUAL MANIPULATION
Songming Liu et al., Beijing National Research Center for Information Science and Technology (BNRist), ICLR 2025
- Main contribution: a diffusion foundation model for bimanual manipulation
- Trained on a large amount of data
pre-trained on large-scale datasets. Specifically, our collection of pre-training datasets includes 46 datasets of various robots, with a total size of 1M+ trajectories and 21TB.
- Then fine-tuned on the target bimanual robot:
we have collected 6K+ trajectories, making our dataset one of the largest bimanual datasets nowadays
- Open problems
- Multimodal inputs:
- Low-dimensional inputs: physical quantities
low-dimensional vectors that represent physical quantities of the robot, including the proprioception, the action chunk, and the control frequency.
- Image inputs
- Language inputs
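A minimal numpy sketch of how the low-dimensional inputs listed above (proprioception, action chunk, control frequency) might be projected to a shared width and packed into one token sequence. All sizes and the random linear maps are illustrative assumptions, not RDT-1B's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions -- assumptions, not RDT-1B's configuration.
D = 64            # shared token embedding width
PROPRIO_DIM = 14  # e.g. joint positions of a bimanual robot
CHUNK = 8         # actions predicted per forward pass (the "action chunk")

def embed_low_dim(x, d=D, seed=1):
    """Project a low-dimensional physical quantity to the token width
    with a fixed random linear map (a stand-in for a learned layer)."""
    w = np.random.default_rng(seed).normal(size=(x.shape[-1], d))
    return x @ (w / np.sqrt(x.shape[-1]))

proprio = rng.normal(size=(1, PROPRIO_DIM))           # robot state
action_chunk = rng.normal(size=(CHUNK, PROPRIO_DIM))  # noised actions (diffusion target)
ctrl_freq = np.array([[25.0]])                        # control frequency in Hz

# Pack everything into one token sequence; image and language tokens
# from their own encoders would be appended before the transformer.
tokens = np.concatenate([
    embed_low_dim(proprio, seed=1),
    embed_low_dim(action_chunk, seed=2),
    embed_low_dim(ctrl_freq, seed=3),
], axis=0)
print(tokens.shape)  # (10, 64): 1 proprio + 8 action + 1 frequency tokens
```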
2024
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim et al., Stanford and others, arXiv 2024
- End-to-end large model trained on a large amount of data; built by combining and fine-tuning Llama, DINOv2, and SigLIP
trained on a diverse collection of 970k real-world robot demonstrations
builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP
- Open problems:
- Supports only single-image observations; multi-image or multi-sensor fusion is not considered
only supports single-image observations. In reality, real-world robot setups are heterogeneous, with a wide range of possible sensory inputs
- Improving inference throughput, to enable high-frequency control tasks
improving the inference throughput
- Better performance: raising task success rates
- Other questions: how does the size of the base VLM affect performance? Does co-training on robot action prediction data and Internet-scale vision-language data substantially improve VLA performance? What visual features are best-suited for VLA models?
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
Wenlong Huang et al., Stanford & Columbia, CoRL 2024
- Method: a large vision model (LVM, DINOv2) extracts keypoints; a vision-language model (VLM, GPT-4o) takes the keypoints and visual input and outputs Python functions
- Keypoint proposal (automatically selecting the keypoints for interacting with an object): extract features with DINOv2, add masks with SAM, cluster the masked features with k-means, and take the cluster centers as keypoint candidates
- Open problems (the appendix of this paper discusses some interesting ones):
- In the main text:
- Relies on a forward model of keypoints that assumes rigid (non-deforming) objects
First, the optimization framework relies on a forward model of keypoints based on rigidity assumption, albeit a high-frequency feedback loop that relaxes the accuracy requirement of the model
- Closed-loop operation requires accurate keypoint tracking
Second, ReKep relies on accurate point tracking to correctly optimize actions in closed-loop, which is itself a challenging 3D vision task due to heavy intermittent occlusions.
- The task's stage sequence is fixed
Lastly, the current formulation assumes a fixed sequence of stages (i.e., skeletons) for each task.
- In the appendix:
- Prompting and robustness (consistency of the generated outputs):
we have observed that when dealing with tasks that span many stages with several temporally dependent constraints (A.14), the VLMs lack enough robustness to obtain consistent success
- Task-space planning: planning only in task space may run into kinematic issues
- Articulated object manipulation
- Bimanual coordination
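The keypoint proposal step described above (DINOv2 features, SAM masks, k-means cluster centers) can be sketched roughly as follows. The feature map, mask, and all dimensions are toy stand-ins, not ReKep's actual tensors or hyperparameters.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means on the row vectors of x; returns centers and labels."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        d = np.linalg.norm(x[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers, labels

# Toy stand-ins (assumptions, not ReKep's code):
# `features` imitates a DINOv2 per-pixel feature map; `mask` imitates a SAM mask.
H, W, C = 32, 32, 8
rng = np.random.default_rng(1)
features = rng.normal(size=(H, W, C))
mask = np.zeros((H, W), dtype=bool)
mask[8:24, 8:24] = True

# Cluster the masked pixels on position-augmented features; the mean pixel
# coordinate of each cluster serves as a keypoint candidate.
ys, xs = np.nonzero(mask)
feats = features[ys, xs]
pts = np.stack([ys, xs], axis=1).astype(float)
data = np.concatenate([feats, 0.1 * pts], axis=1)
K = 4
_, labels = kmeans(data, K)
keypoints = np.array([pts[labels == j].mean(axis=0) for j in range(K)])
print(keypoints.shape)  # (4, 2): keypoint candidates in pixel coordinates
```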
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen et al., Google, CVPR 2024
- Main contribution:
We endow VLMs quantitative spatial reasoning capability
- Method: train VLMs on a large-scale spatial VQA dataset
- “generalist robot policies” (GRPs)
Octo: An Open-Source Generalist Robot Policy
Dibya Ghosh et al., UCB/CMU/Google, RSS 2024
- Main contribution: a large transformer-based policy, trained on 800k trajectories
- Open problems:
- Trouble fusing information from multiple cameras
current Octo model struggles with adequately processing wrist camera information. Often finetuning results were stronger when using only a third person camera instead of combining third person and wrist camera.
- Language-conditioning vs. goal-conditioning
we notice a large difference between language-conditioned policy performance and goal-conditioned policy performance
- The Open X-Embodiment dataset involves trade-offs; the training data needs expansion
Expanding the data used to train Octo is a natural avenue of improvement. Since the Open X-Embodiment dataset is comprised of optimal robot demonstrations, the current model trains via imitation; future work may consider learning from sub-optimal or online interaction data that require alternative objectives.
- Only single-arm and bimanual setups were tested; no navigation tasks or mobile robots
- What is a language-conditioned policy? What is a goal-conditioned policy?
- The policy is conditioned on language commands or on a goal image
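To make the distinction above concrete, here is a toy numpy sketch: the same policy network consumes a task token that is either a text embedding (language-conditioning) or a goal-image embedding (goal-conditioning). The encoders are stand-ins (random vectors), and the pooling "policy" is only a placeholder for a transformer like Octo's.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # hypothetical shared embedding width

def policy(obs_tokens, task_token):
    """Toy policy: pool over the [task, obs] token sequence, then apply
    a linear head to get a 7-DoF action. Stands in for a transformer."""
    w = np.random.default_rng(42).normal(size=(D, 7)) / np.sqrt(D)
    ctx = np.concatenate([task_token[None], obs_tokens], axis=0).mean(axis=0)
    return ctx @ w

obs = rng.normal(size=(4, D))    # current camera-image tokens

# Language-conditioning: the task token comes from a text encoder.
lang_token = rng.normal(size=D)  # stand-in for encode_text("pick up the cup")
a1 = policy(obs, lang_token)

# Goal-conditioning: the task token comes from encoding a goal image.
goal_token = rng.normal(size=D)  # stand-in for encode_image(goal_image)
a2 = policy(obs, goal_token)

print(a1.shape, a2.shape)  # both (7,): same policy, different conditioning
```

The point of the sketch is that the policy interface is identical in both cases; only the source of the task token changes, which is why combining the two conditionings is a natural question.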
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Abby O’Neill et al., many organizations, ICRA 2024
- Main contributions: the dataset, and RT-X, a high-capacity model; two variants were trained, based on RT-1 and RT-2 respectively
- Open problems:
- Multi-sensor modality fusion is not considered
- Can it be applied to more robots?
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
Xiaoqi Li et al., PKU, CVPR 2024
- Main contribution: object-centric robot manipulation; chain-of-thought fine-tuning and inference give the end-effector pose predictions stability and interpretability (distinguishing direction and position)
a chain-of-thought fine-tuning and inference strategy that exploits MLLMs’ reasoning ability to enable robust and explainable end-effector’s pose predictions.
- Implementation: CLIP as the visual front end, LLaMA-Adapter for task decision-making
- Open problems
- Decision generation again struggles under scene diversity
In contrast, position predictions are susceptible to domain gaps caused by factors like lighting and texture. even though there might be imprecise directions
- A writing issue in this paper: Table 1 is never referenced or analyzed in detail? Some tasks have low success rates; why?
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
An-Chieh Cheng et al., UCSD/NVidia, NeurIPS 2024
- Strengthens VLMs' spatial perception and reasoning
enhance VLMs’ spatial perception and reasoning capabilities
- Open problems:
- Uses AABBs (axis-aligned bounding boxes), which makes the label representation imprecise; switching to OBBs (oriented bounding boxes) could help
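A small numpy sketch of why the AABB label representation can be loose: for an elongated, rotated point cloud, the axis-aligned box is much larger than a PCA-derived oriented box. The point cloud and the PCA-based OBB construction are illustrative assumptions, not SpatialRGPT's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# A thin, rotated 2D point cloud: a worst case for axis-aligned boxes.
pts = rng.normal(size=(200, 2)) * np.array([3.0, 0.2])
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
pts = pts @ rot.T

# AABB: extents along the fixed world axes.
aabb_size = pts.max(axis=0) - pts.min(axis=0)

# OBB via PCA: extents along the point cloud's principal axes.
centered = pts - pts.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt.T
obb_size = proj.max(axis=0) - proj.min(axis=0)

# The OBB is much tighter for elongated, rotated objects, which is why
# it gives a more faithful label representation than an AABB.
print(np.prod(aabb_size) > np.prod(obb_size))  # True
```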
ManiFoundation Model for General-Purpose Robotic Manipulation of Contact Synthesis with Arbitrary Objects and Robots
Zhixuan Xu et al., NUS, IROS 2024
- Main contribution
- Open problems:
- Handles only quasi-static tasks; should be extended to highly dynamic tasks whose full paths consist of multiple steps
- The current model focuses on single-fingertip contact; should be extended to multi-hand and surface-contact scenarios
- Scale up the current network's parameters
2023
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan et al., Google, RSS 2023
- End-to-end large model, trained from scratch on a large amount of data
Trained on a large and broad dataset; can execute very long-horizon tasks (50 stages).
Our model is built on a Transformer architecture and takes a history of images and task description as input and directly outputs tokenized actions
Our primary dataset consists of ∼130k robot demonstrations, collected with a fleet of 13 robots over the course of 17 months.
- Open problems:
- Uses imitation learning, so it cannot surpass the performance of the demonstrations
(it may not be able to surpass the performance of the demonstrators)
- Cannot generate motions it has not seen
the generalization to new instructions is limited to the combinations of previously seen concepts and RT-1 is not yet able to generalize to a completely new motion that has not been seen before
- Adaptability across different environments needs improvement
expand the generalization capabilities of these models to generalize to much more diverse environments
- Lacks dexterous manipulation tasks
method is presented on a large but not very dexterous set of manipulation tasks
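RT-1's "tokenized actions" can be illustrated with per-dimension uniform binning, the scheme the paper describes (256 bins per action dimension). The 7-DoF action vector and its bounds below are made-up examples, not RT-1's actual action space.

```python
import numpy as np

BINS = 256  # RT-1 discretizes each action dimension into 256 uniform bins

def tokenize_action(a, low, high, bins=BINS):
    """Map each continuous action dimension to an integer token id."""
    a = np.clip(a, low, high)
    t = np.floor((a - low) / (high - low) * bins).astype(int)
    return np.clip(t, 0, bins - 1)

def detokenize_action(t, low, high, bins=BINS):
    """Map token ids back to bin-center continuous values."""
    return low + (t + 0.5) / bins * (high - low)

# Hypothetical 7-DoF action and bounds (illustrative only).
low = np.full(7, -1.0)
high = np.full(7, 1.0)
action = np.array([0.3, -0.7, 0.0, 0.99, -1.0, 1.0, 0.5])

tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
# Round-trip error is bounded by one bin width.
print(np.max(np.abs(recovered - action)) < (high[0] - low[0]) / BINS)  # True
```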
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan et al., Google
- Approach:
fine-tuning large VLMs, to directly perform closed-loop robot control
- Incorporates VLMs into end-to-end robot control, realizing a VLA (vision-language-action) model
enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web
- Uses a large amount of data
Representing the knowledge necessary to perform such a wide range of tasks requires large models and webscale datasets.
- Open problems:
- Ability to produce new motions
the robot does not acquire any ability to perform new motions by virtue of including this additional experience. The model’s physical skills are still limited to the distribution of skills seen in the robot data, but it learns to deploy those skills in new ways.
- Compute requirements are too high
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Konstantinos Bousmalis et al., Google, TMLR2023
- Main contributions:
- Validates that a large transformer sequence model is feasible for robot tasks
- Examines the ability to learn new tasks from a small number of demonstrations
- Approach: a visual goal-conditioned decision transformer / large transformer sequence model
- Trained on a large amount of data
we trained RoboCat on a very large dataset of diverse manipulation behaviours
- Open problems:
- Multimodal tasks
- Combining language-conditioning with (visual) goal-conditioning
- Fine-tuning with RL?
- Adapting to more scenes
We hope that next-generation foundation agents will demonstrate robustness to different basket textures and operate in more visually-diverse environments in the wild
2022
PERCEIVER-ACTOR: A Multi-Task Transformer for Robotic Manipulation
Mohit Shridhar et al., UW, CoRL 2022
- Arguably learning from demonstration? Tasks are acquired from a few demos; this paper's references and its analysis of limitations are worth reading (partly discussed in the supplemental material as well)
- Transformer-based: takes a voxel observation and a language goal, and outputs discretized end-effector position, rotation, and gripper open/close; uses CLIP's language encoder (a pre-trained language model)
The language encodings are finetuned with a linear layer and then appended with the voxel encodings to form the input sequence.
- Comment: the method combines CLIP with a transformer; training the agent seems to take a long time, but the amount of data is not emphasized, only that, once trained, the agent can acquire a task from a few demonstrations
- Open problems:
- Sampling-based motion planner: an issue if the task is sensitive to the planned path (e.g., pouring water requires a smooth path)
- Dynamic manipulation: lacks real-time closed-loop maneuvering
- Dexterous manipulation:
Using discretized actions with N-DoF robots like multi-fingered hands is also non-trivial.
- Generalization to novel instances and objects:
We observe that changing the shape of the handles does not affect performance. However, handles with randomized textures and colors confuse the agent since it has only seen one type of drawer color and texture during training.
- Scope of language grounding:
PERACT’s understanding of verb-noun phrases is closely grounded in demonstrations and tasks
- Predicting task completion: extra external information is needed to confirm that a task is done
- History and partial observability:
relies purely on the current observation to predict the next action.
- Data augmentation with kinematic feasibility: training data must account for whether the arm can actually execute the path
- Balanced datasets:
For instance, PERACT might have a tendency to always “place blue blocks on yellow blocks” if such an example is over-represented in the training data.
- Multi-task optimization: similar to the previous point; training one task can degrade performance on another, so data volumes need balancing
Since all tasks are weighted equally, optimizing for certain tasks with common elements (e.g., moving blocks), might adversarially affect the performance on other dissimilar tasks (e.g., turning taps)
- Deployment risks: safety is not guaranteed
might contain harmful biases
- Is the training data actually large?
- Training the PERACT agent:
The agent was trained with a batch-size of 16 on 8 NVIDIA V100 GPUs for 16 days (600K iterations)
- Once the agent is trained, the learning-from-demonstration data is small
PERACT is trained with just a few demonstrations:
We report average success rates on 25 evaluation episodes per task (25x18 = 450 total episodes) for agents trained with n = 10, 100 demonstrations per task
Without any sim-to-real transfer or pre-training, we trained a multi-task PERACT agent from scratch on 7 tasks (with 18 unique variations) from a total of just 53 demonstrations.
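PERACT's input construction quoted earlier ("The language encodings are finetuned with a linear layer and then appended with the voxel encodings to form the input sequence") can be sketched as below. The token counts, widths, and random weights are illustrative assumptions, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hypothetical transformer token width

# Stand-ins for the real encoder outputs (shapes are assumptions):
lang_tokens = rng.normal(size=(77, 512))    # CLIP language encoding
voxel_tokens = rng.normal(size=(8**3, D))   # encoded voxel patches

# Finetuned linear layer on the language encodings, then append them
# to the voxel encodings to form the transformer's input sequence.
w = rng.normal(size=(512, D)) / np.sqrt(512)
lang_proj = lang_tokens @ w

input_seq = np.concatenate([lang_proj, voxel_tokens], axis=0)
print(input_seq.shape)  # (589, 64): 77 language tokens + 512 voxel tokens
```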