Intro
- Elements of RL:

Element | Note |
---|---|
Reward | How is it specified? When is it given: feedback at every step, or only at the end of an episode? (Can the feedback be given in different ways? Are they equivalent? Do they affect training time? See reward shaping.) |
Termination | What is the termination condition? |
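To make the per-step vs. end-of-episode question concrete, here is a minimal, hypothetical Python sketch contrasting a sparse terminal reward with a shaped per-step reward; the array-based states and goals are invented for illustration only.

```python
import numpy as np

def sparse_reward(state, goal, done):
    """Feedback only at the end of the episode: +1 on success, 0 otherwise."""
    return 1.0 if done and np.array_equal(state, goal) else 0.0

def shaped_reward(state, next_state, goal):
    """Per-step feedback that rewards progress toward the goal; with an undiscounted
    return this is a potential-based shaping term, so it can speed up learning
    without changing which policy is optimal."""
    return np.linalg.norm(goal - state) - np.linalg.norm(goal - next_state)

# toy usage: moving one step toward the goal earns a positive shaped reward
s, s_next, g = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([3.0, 0.0])
print(sparse_reward(s_next, g, done=False), shaped_reward(s, s_next, g))
```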
A framework for efficient robotic manipulation
Albert Zhan et al., arXiv 2020
Method
- Procedure
- A small number of demonstrations are collected and stored in a replay buffer
- The convolutional encoder weights are initialized with unsupervised contrastive pre-training on the demonstration data.
- An off-policy model-free RL algorithm is trained (online policy learning) with augmented images on both data collected online during training and the initial demonstrations.
- Benefits:
- Data efficiency: 15-50 minutes of total training time
- A simple unified framework:
- General & lightweight setup: 1 robot, 1 GPU, 2 cameras, a handful of demonstrations, a sparse reward function
Front-end | NN middleware | RL agent |
---|---|---|
Pre-train: expert demonstrations are stored in the replay buffer for pre-training. Refine: two RGB images as the input. | Convolutional encoder (query encoder?) that generates query-key pairs; only pre-trained offline. | SAC (soft actor-critic) agent, trained online. |
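The table can be read as a data path. Below is a rough NumPy sketch of two front-end ingredients the procedure implies: a replay buffer that mixes demonstration transitions with online data, and a random-crop augmentation applied to sampled images. Class and function names are invented for illustration, not taken from the authors' code.

```python
import numpy as np

class MixedReplayBuffer:
    """Holds both demonstration transitions and transitions collected online,
    so every RL update can sample from both sources."""
    def __init__(self, capacity=100_000):
        self.storage, self.capacity = [], capacity

    def add(self, obs, action, reward, next_obs, done):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size, rng=np.random):
        idx = rng.choice(len(self.storage), size=batch_size, replace=True)
        obs, act, rew, nxt, done = zip(*(self.storage[i] for i in idx))
        return (np.stack(obs), np.stack(act), np.asarray(rew),
                np.stack(nxt), np.asarray(done))

def random_crop(images, out_size=84):
    """Random-crop augmentation over a batch of (C, H, W) images, a common
    image augmentation for sample-efficient image-based RL."""
    n, c, h, w = images.shape
    tops = np.random.randint(0, h - out_size + 1, size=n)
    lefts = np.random.randint(0, w - out_size + 1, size=n)
    return np.stack([img[:, t:t + out_size, l:l + out_size]
                     for img, t, l in zip(images, tops, lefts)])
```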
Experimental setup
- Real-world: xArm robot
- Simulation: OpenAI Gym Fetch environment (PyTorch, MuJoCo, Gym)
CURL: contrastive unsupervised representations for reinforcement learning
Michael Laskin et al., PMLR 2020
- Progress: , link, Github
- CURL: combines contrastive learning with RL. One could use any RL algorithm in the CURL pipeline, be it on-policy or off-policy.
- Two approaches for addressing the sample inefficiency of RL algorithms:
- Auxiliary tasks on the agent's sensory observations: use auxiliary self-supervision tasks to accelerate the learning progress of model-free RL methods.
- World models that predict the future.
- Related learning approaches
- Self-supervised learning
- Contrastive learning
- Self-supervised learning for RL
- World models for sample-efficiency
- Sample-efficient RL for image-based control
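To make the contrastive objective concrete, here is a small PyTorch sketch of a CURL-style InfoNCE loss over query-key pairs: the bilinear similarity and the use of other batch elements as negatives follow the paper's description, but tensor names and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def curl_infonce_loss(q, k, W):
    """InfoNCE loss in the style of CURL: q are query embeddings, k are key
    embeddings of augmented views of the same observations (typically produced
    by a momentum encoder and detached), W is a learned bilinear matrix.
    The positive pair for each query sits on the diagonal of the logits."""
    logits = q @ W @ k.t()                                     # (B, B) similarities
    logits = logits - logits.max(dim=1, keepdim=True).values   # numerical stability
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# toy usage with random embeddings (batch of 8, latent dimension 16)
q = torch.randn(8, 16)
k = torch.randn(8, 16)
W = torch.randn(16, 16, requires_grad=True)
loss = curl_infonce_loss(q, k, W)
loss.backward()
```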
Autonomous inverted helicopter flight via reinforcement learning
Andrew Y. Ng et al., Experimental Robotics IX 2006
- Problem formulation:

Problem | State set | Action set | Solution | Solution improvement |
---|---|---|---|---|
Control a helicopter along a given (encoded) trajectory (yaw is given) | The helicopter state in the helicopter body coordinates | Tilting the rotor plane forwards/backwards or sideways; the main rotor's thrust; the tail rotor's thrust | The PEGASUS algorithm | An innovative method |
- A helicopter has a 12-dimensional state and 4-dimensional control inputs in total.
- 12-dimensional state:
    - (Not being used) The spatial/world coordinates:
        - Position
        - Roll, pitch, yaw
        - Velocities and angular velocities
    - (Being used) The state expressed in the helicopter body coordinates (the superscript b stands for body): quantities are measured forward, sideways, and down relative to the current position of the helicopter. These are natural symmetries, encoded directly into the model rather than forcing an algorithm to learn them. The position and the yaw are given directly along the trajectory; the remaining 8 parameters are learned via RL.
- 4-dimensional control:
    - The longitudinal (front-back) and latitudinal (left-right) cyclic pitch controls (tilting the rotor plane forwards/backwards or sideways).
    - The (main rotor) collective pitch control (affects the main rotor's thrust).
    - The tail rotor collective pitch control (affects the tail rotor's thrust).
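Written out with generic symbols (the notation below is assumed for illustration, not copied from the paper), the 12 state dimensions are position, attitude, linear velocities, and angular rates, and the 4 controls are the two cyclic pitch inputs plus the main and tail rotor collective pitch inputs:

$$
s = (x,\ y,\ z,\ \phi,\ \theta,\ \omega,\ \dot{x},\ \dot{y},\ \dot{z},\ \dot{\phi},\ \dot{\theta},\ \dot{\omega}) \in \mathbb{R}^{12},
\qquad
a = (a_1,\ a_2,\ a_3,\ a_4) \in \mathbb{R}^{4}
$$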
- Method: the PEGASUS algorithm
    - Given the system state at a point on the trajectory, move through the particular trajectory by slowly varying the target state.
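A minimal sketch of the PEGASUS idea: the simulator's randomness is fixed to a set of scenarios, so every policy is evaluated on exactly the same random draws and value estimates become deterministic, low-variance functions of the policy. The simulator and policy below are toy stand-ins, not the paper's helicopter model.

```python
import numpy as np

def pegasus_value_estimate(policy, simulate, seeds, horizon=100):
    """Reuse the same fixed seeds ("scenarios") for every policy so that
    different policies can be compared on identical randomness."""
    returns = []
    for seed in seeds:
        rng = np.random.default_rng(seed)   # fixed randomness for this scenario
        returns.append(simulate(policy, rng, horizon))
    return float(np.mean(returns))

def toy_simulate(policy, rng, horizon):
    """Toy 1-D stochastic system, included only to make the sketch runnable."""
    x, total = 0.0, 0.0
    for _ in range(horizon):
        x += policy(x) + 0.1 * rng.normal()   # noisy dynamics
        total += -abs(x)                      # reward: stay near the origin
    return total

value = pegasus_value_estimate(lambda x: -0.5 * x, toy_simulate,
                               seeds=range(10), horizon=50)
print(value)
```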
Continuous reinforcement learning to adapt multi-objective optimization online for robot motion
Kai Zhang et al., IJARS 2020
- Progress 60%, link
- Problems to solve:
- How to automatically determine the values of the coefficients of a compound optimization function
- How to make the coefficients self-adapt to environmental changes.
- Elements of RL:

Element | Note |
---|---|
Reward | How is it specified? When is it given: feedback at every step, or only at the end of an episode? (See reward shaping.) Here the agent can only receive a positive reward when it succeeds in arriving at the goal region within the time limit. |
Termination | What is the termination condition? |
- State/observation: motion trajectories (no CNN is needed to process raw sensor data)
- The agent does not learn specific coefficient values but a mapping from motion trajectories to coefficient changes. This enables the robot to adapt to different numbers of obstacles moving in unknown ways during navigation. (?) Learning specific coefficients directly imposes an upper and lower limit on the coefficients; learning the changes does not have this problem.
- Why the learned action is a change of the coefficients: the goal is for the RL agent to learn a time-dependent function that reflects how the coefficients vary over time.
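A minimal sketch of this coefficient-adaptation idea under the stated interpretation: the policy's action is a change of the objective weights, which is applied before re-evaluating the compound cost. Function names and the concrete objectives are invented for illustration.

```python
import numpy as np

def apply_coefficient_change(weights, delta):
    """The RL action is a change (delta) of the objective coefficients rather than
    the coefficient values themselves; weights are kept non-negative (an assumption)."""
    return np.maximum(weights + delta, 0.0)

def compound_cost(weights, objectives):
    """Weighted sum of the individual motion objectives (e.g. path length,
    obstacle clearance, smoothness); the objective values here are placeholders."""
    return float(np.dot(weights, objectives))

weights = np.array([1.0, 1.0, 1.0])    # current coefficients
delta = np.array([0.2, -0.1, 0.0])     # would come from the learned policy
weights = apply_coefficient_change(weights, delta)
print(compound_cost(weights, np.array([3.0, 0.5, 1.2])))
```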
Visual reinforcement learning with imagined goals
Ashvin Nair et al., NeurIPS 2018
- Progress 80%, link, Github: env, implementation. Environment setup instruction: to see the original implementation, check out version v0.1.2 of this repo.
- Terminology:
- β-VAE: automated discovery of interpretable factorised latent representations from raw image data in a completely unsupervised manner (ICLR 2017 paper). β is an adjustable hyperparameter that balances latent channel capacity and independence constraints with reconstruction accuracy.
- HER: hindsight experience replay
- TD3: twin delayed deep deterministic policy gradients
- Main contributions:
- Implements an agent that trains itself without user intervention and, after training, can accomplish general-purpose tasks; no hand-specified reward function is required.
    - Unsupervised training: the agent selects feasible goals on its own (arbitrary reachable states are sampled at random as goals, which can be understood as sampling the whole task space) and trains itself to reach the selected goal from any state. All goals and states are abstracted with a β-VAE; the extracted latent representations serve as the state representation to improve generality. The reward is obtained by comparing the distance between the latent representations of the state reached after an action and of the goal, from which the Q-value can be computed.
    - Application after training: the user specifies a goal, and the agent applies what it learned during self-training to the task of reaching the user-specified goal. The user-specified goal is related to the sampled goals (their abstract representations) through the latent distribution, so the existing representations can be used to fit the data (HER turns the discrete samples into continuous coverage). A policy suitable for the current task is then obtained from the policies learned by the RL agent.
- How the individual modules divide the work:
- raw input -> abstract representation: β-VAE
- tabular domains -> continuous domains: first sample in the task space, then move to the continuous space via HER
- RL agent: any value-based method works; the paper uses TD3
- Problems addressed: how to do unsupervised representation learning? How to do relabeling?
- Elements of RL:

Element | Note |
---|---|
State and goal | Observations are encoded into a latent embedding by the β-VAE encoder: a latent state and a latent goal |
Model | Model-free, with a Q-function |
Reward | Distance in the latent space, e.g. the Euclidean distance in latent space from the current state to the goal. (How is it specified? When is it given: at every step, or only at the end? See reward shaping.) |
Training method | TD3; any value-based RL algorithm could be used |
Termination | What is the termination condition? |
- How it is implemented:
- How goals are selected (4.4 automated goal-generation for exploration): the agent samples states from the state space on its own and uses them as goals: a self-supervised "practice" phase during training, where the algorithm proposes its own goals and then practices how to reach them.
- Relabeling (see the sketch after this list): latent goals can be generated directly from the VAE's output space, without any actual raw input. Once an actual raw input yields a state, it is compared against different abstract goals to obtain the corresponding rewards, which multiplies the number of training samples. Sampling is discrete at first; the step from the tabular domain to the continuous domain is handled by HER.
- Algorithm outline:
    - Collect data with a simple exploration policy, though any exploration strategy could be used for this stage.
    - Train a VAE latent variable model on state observations and fine-tune it over the course of training.
    - Use this latent variable model for multiple purposes:
        - Sample a latent goal from the model and condition the policy on this goal.
        - Embed all states and goals using the model's encoder.
        - When training the goal-conditioned value function, resample goals from the prior and compute rewards in the latent space.
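A small sketch of the relabeling step described above: one stored transition is paired with several alternative goals and the reward is recomputed for each pairing, multiplying the usable training data. The function names and the latent-distance reward are illustrative assumptions.

```python
import numpy as np

def relabel_with_goals(transition, alternative_goals, reward_fn):
    """HER-style relabeling: reuse a stored (s, a, s') transition with several
    alternative goals, recomputing the goal-conditioned reward for each one."""
    s, a, s_next = transition
    return [(s, a, s_next, g, reward_fn(s_next, g)) for g in alternative_goals]

# toy usage with 4-dimensional latent states/goals and a latent-distance reward
latent_distance_reward = lambda z, zg: -float(np.linalg.norm(z - zg))
transition = (np.zeros(4), np.array([0.1]), np.ones(4))
goals = [np.ones(4), np.full(4, 2.0)]
relabeled = relabel_with_goals(transition, goals, latent_distance_reward)
print(relabeled)
```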
- Technical details:
- PyTorch + MuJoCo (simulation environment)
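A compact sketch of the latent-reward computation under these assumptions: a toy encoder stands in for the trained β-VAE, the reward is the negative Euclidean distance in latent space, and relabelled goals are simply drawn from the prior. Module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the beta-VAE encoder: maps a flattened observation to the mean
    of a latent Gaussian. The real model is trained with reconstruction + KL terms."""
    def __init__(self, obs_dim=64, latent_dim=4):
        super().__init__()
        self.mean = nn.Linear(obs_dim, latent_dim)

    def forward(self, obs):
        return self.mean(obs)

def latent_reward(encoder, obs, goal_latent):
    """Reward = negative Euclidean distance between the encoded state and the
    latent goal, computed entirely in latent space (no hand-designed reward)."""
    with torch.no_grad():
        z = encoder(obs)
    return -torch.norm(z - goal_latent, dim=-1)

encoder = ToyEncoder()
obs = torch.randn(5, 64)                 # batch of flattened observations
goal = torch.randn(5, 4)                 # latent goals, e.g. sampled from the prior
relabeled_goal = torch.randn_like(goal)  # resampled goals for relabeling
print(latent_reward(encoder, obs, goal), latent_reward(encoder, obs, relabeled_goal))
```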
Contextual imagined goals for self-supervised robotic learning
Ashvin Nair et al., CoRL 2019
- Main contribution: how to autonomously set goals that are feasible but diverse. Proposes a conditional goal-setting model that aims to propose goals that are feasible from the robot's current state.
- Improvement over "Visual reinforcement learning with imagined goals": β-VAE -> context-conditioned VAE

VAE version | Note |
---|---|
Variational Auto-Encoder | A probabilistic generative model that has been shown to learn structured representations of high-dimensional data. The encoder maps a state to a latent representation (a normal distribution); the decoder maps the latent representation back to a state. |
Conditional VAE | Generates samples based on structured input: it conditions the output on some input variable and samples from the corresponding conditional distribution. |
Context-conditioned VAE | Uses the initial state of a rollout as the conditioning input. |
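To illustrate what conditioning on the initial state means mechanically, here is a small, hypothetical context-conditioned VAE with toy linear layers in place of the convolutional networks; it is a sketch of the idea, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ContextConditionedVAE(nn.Module):
    """Both encoder and decoder receive the rollout's initial observation s0, so
    latent goals sampled from the prior decode to states that are plausible from
    the current scene. Dimensions are toy values."""
    def __init__(self, obs_dim=64, latent_dim=8):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Linear(obs_dim * 2, latent_dim * 2)   # outputs mean and log-var
        self.decoder = nn.Linear(latent_dim + obs_dim, obs_dim)

    def forward(self, s, s0):
        stats = self.encoder(torch.cat([s, s0], dim=-1))
        mean, logvar = stats.chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()   # reparameterisation
        recon = self.decoder(torch.cat([z, s0], dim=-1))
        return recon, mean, logvar

    def imagine_goal(self, s0):
        """Sample a latent from the prior and decode it conditioned on s0."""
        z = torch.randn(s0.size(0), self.latent_dim)
        return self.decoder(torch.cat([z, s0], dim=-1))

vae = ContextConditionedVAE()
s, s0 = torch.randn(2, 64), torch.randn(2, 64)
recon, mean, logvar = vae(s, s0)
goal_image = vae.imagine_goal(s0)
```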
Unsupervised visuomotor control through distributional planning networks
Tianhe Yu et al., RSS 2019
- Progress %, link, Github
- Terminology:
- DPN: distributional planning networks
- UPN: universal planning networks
- Main contribution: aims to learn an unsupervised embedding space under which the robot can measure progress towards a goal for itself. Builds upon universal planning networks (UPN), which learn abstract representations for visuomotor control tasks using expert demonstration trajectories.
- The representation learned by UPN serves as a metric for the reward function. To learn such a representation, the UPN is constructed as a model-based planner that performs gradient-based planning in a latent space using a learned forward dynamics model.
- Technical implementation:
- Given the initial and goal observations, the model uses an encoder to encode the images into latent embeddings, where the encoder is a convolutional neural network.
- Training the UPN (see the sketch below):
    - First, the initial and goal observations are given and encoded by the CNN.
    - A random plan is initialized and then iteratively refined by gradient descent (GDP, gradient descent planner) to obtain the predicted plan.
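A minimal sketch of gradient-based planning in a latent space, in the spirit of the GDP step described above: toy linear modules stand in for the learned CNN encoder and forward dynamics model, and the action sequence itself is optimized by gradient descent. All names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the learned modules (the real encoder is a CNN over images,
# and the dynamics model is learned jointly with it).
encoder = nn.Linear(32, 8)        # observation -> latent embedding
dynamics = nn.Linear(8 + 2, 8)    # (latent, action) -> next latent

def gradient_descent_planner(o_init, o_goal, horizon=5, steps=50, lr=0.1):
    """Start from a random action sequence and iteratively refine it so the
    predicted final latent state reaches the goal embedding."""
    with torch.no_grad():
        z0, zg = encoder(o_init), encoder(o_goal)
    actions = (0.1 * torch.randn(horizon, 2)).requires_grad_()   # random initial plan
    opt = torch.optim.SGD([actions], lr=lr)
    for _ in range(steps):
        z = z0
        for a in actions:                        # roll the latent dynamics forward
            z = dynamics(torch.cat([z, a]))
        loss = torch.norm(z - zg) ** 2           # distance to the goal embedding
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actions.detach()

plan = gradient_descent_planner(torch.randn(32), torch.randn(32))
print(plan)
```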