Reinforcement learning methods/theory
Reinforcement Learning for MDPs with Constraints
Peter Geibel, ECML 2006
- Terms:
  - S: a finite state set
  - A: a finite action set
  - P(s' | s, a): state transition probabilities
  - r: reward obtained
  - V^π: value of a policy
  - c: an additional second reward function
  - V_c^π: constrained value function
  - CMDP: constrained MDP
- Main contributions:
  - Considers MDPs with two criteria / two kinds of constraints:
    - CMDP (constrained Markov decision process): maximize the expected value of the infinite-horizon cumulative return, subject to a constraint on the second criterion function itself, i.e. on the expected value of its return
    - CPMDP (MDP with constrained probability of constraint violation): constrain the probability that the return, considered as a random variable, violates an inequality constraint; a maximum allowable violation probability is specified
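The CPMDP quantity can be estimated empirically. The sketch below (a hypothetical illustration, not from the paper; the toy cost model and all parameters are assumptions) estimates by Monte Carlo the probability that an episode's cost return violates an inequality constraint:

```python
import random

def rollout_cost(rng, horizon=50, step_cost_p=0.1):
    """Cumulative cost of one simulated episode: each step incurs a
    unit cost with probability step_cost_p (toy stochastic model)."""
    return sum(1 for _ in range(horizon) if rng.random() < step_cost_p)

def violation_probability(threshold, n_episodes=10_000, seed=0):
    """Estimate P(cost return > threshold) -- the quantity a CPMDP
    requires to stay below a maximum allowable probability."""
    rng = random.Random(seed)
    violations = sum(rollout_cost(rng) > threshold for _ in range(n_episodes))
    return violations / n_episodes
```

A policy is CPMDP-feasible when this estimated probability stays below the specified maximum.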
  - Reinforcement learning methods for solving this control problem:
    - LinMDP: linear programming
      - CMDP: extend the linear program of the unconstrained MDP with an additional constraint (i.e. require that the expected value achieved under the policy exceeds a given threshold)
      - CPMDP: transform the CPMDP into a CMDP via a mapping. Drawback: LinMDP might be suboptimal for solving CPMDP
        - Remedy: vary the
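The LP idea above can be sketched with occupancy measures. This is a minimal illustration under assumed notation, not the paper's formulation verbatim: the 2-state/2-action MDP, rewards, costs, and threshold are made up, and the second criterion is imposed as one extra linear inequality on the discounted state-action occupancy x[s, a].

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
n_s, n_a = 2, 2
P = np.zeros((n_s, n_a, n_s))           # P[s, a, s'] transition probabilities
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.8, 0.2]; P[1, 1] = [0.1, 0.9]
r = np.array([[1.0, 0.0], [0.0, 2.0]])  # primary reward r[s, a]
c = np.array([[1.0, 0.0], [1.0, 0.0]])  # second criterion c[s, a]
mu0 = np.array([0.5, 0.5])              # initial state distribution
threshold = 1.0                         # require expected c-return >= threshold

# Flow equalities: sum_a x[s',a] - gamma * sum_{s,a} P[s,a,s'] x[s,a] = mu0[s']
A_eq = np.zeros((n_s, n_s * n_a))
for sp in range(n_s):
    for s in range(n_s):
        for a in range(n_a):
            A_eq[sp, s * n_a + a] = (sp == s) - gamma * P[s, a, sp]

# Extra CMDP constraint on the expected c-return, rewritten as <=
A_ub = -c.reshape(1, -1)
b_ub = np.array([-threshold])

res = linprog(-r.ravel(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=mu0,
              bounds=[(0, None)] * (n_s * n_a))
x = res.x.reshape(n_s, n_a)
policy = x / x.sum(axis=1, keepdims=True)  # pi(a|s) recovered from x
```

Without the extra `A_ub` row this is exactly the LP of the unconstrained MDP, which is the sense in which LinMDP only "adds one constraint".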
- WeiMDP: A weighted approach
- Expressed the probability of entering an undesirable state as an (undiscounted) second value function
      - Introduce a weight and a corresponding weighted reward function
      - The resulting MDP can be solved with Q-learning
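The weighted idea can be sketched as follows: fold the second criterion into a single scalar reward r_w = w·r + (1−w)·c and run ordinary tabular Q-learning on it. The tiny chain environment, its dynamics, and all hyperparameters below are illustrative assumptions, not the paper's setup.

```python
import random

def q_learning_weighted(w, episodes=2000, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    n_s, n_a = 3, 2
    Q = [[0.0] * n_a for _ in range(n_s)]

    def step(s, a):
        # Toy dynamics: action 1 moves right, action 0 stays; being in
        # state 2 yields reward 1 but cost -1 (an "undesirable" criterion).
        s2 = min(s + 1, n_s - 1) if a == 1 else s
        r = 1.0 if s2 == n_s - 1 else 0.0
        c = -1.0 if s2 == n_s - 1 else 0.0
        return s2, w * r + (1.0 - w) * c   # weighted scalar reward

    for _ in range(episodes):
        s = 0
        for _ in range(20):
            if rng.random() < eps:
                a = rng.randrange(n_a)
            else:
                a = max(range(n_a), key=lambda a_: Q[s][a_])
            s2, rw = step(s, a)
            Q[s][a] += alpha * (rw + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

With w near 1 the agent chases the reward and moves toward state 2; with w near 0 the cost dominates and it learns to stay away.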
- AugMDP: State space extension
      - When the cost accumulated after entering a state falls below a threshold, an additional negative reward is applied; its high absolute value deters the agent from entering that state
      - Limitation: requires the problem's maximum cost to be bounded
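The state-space extension above can be sketched in a few lines. This is an assumed, illustrative shape (names, threshold, and penalty values are made up): the augmented state carries the accumulated cost, and crossing the threshold triggers a large negative reward.

```python
def augmented_step(state, acc_cost, reward, cost,
                   threshold=-3.0, penalty=-100.0):
    """One transition in the augmented MDP.

    The augmented state is (state, acc_cost); when the updated
    accumulated cost drops below `threshold`, `penalty` is added,
    and its high absolute value deters the agent. This bookkeeping
    only stays finite if costs are bounded (the AugMDP limitation)."""
    new_acc = acc_cost + cost
    shaped = reward + (penalty if new_acc < threshold else 0.0)
    return (state, new_acc), shaped
```

Any standard solver (e.g. Q-learning over the augmented states) can then be applied unchanged.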
- RecMDP: Recursive reformulation of the constraint
- Develop a new value function
Rainbow: Combining Improvements in Deep Reinforcement Learning
Matteo Hessel et al., AAAI 2018
- A collection of DQN extensions:
| Extension | Problem to solve | Method |
| --- | --- | --- |
| Double DQN (DDQN) | overestimation bias, due to the maximization step in Q-learning | decoupling: select the action separately from its evaluation (online network selects, target network evaluates) |
| A3C (multi-step) | slow propagation of newly observed rewards | learning from multi-step bootstrap targets |
| Distributional Q-learning | the expected return discards information about the return's distribution | learn a distribution over returns instead of the expectation |
| Noisy DQN | ε-greedy exploration ignores state | noisy linear layers providing learned, state-conditional exploration |
| Prioritized DDQN | uniform replay sampling ignores how much can be learned from each transition | sample transitions with probability related to their TD error |
| Dueling DDQN | in many states the choice of action barely matters | separate streams for state value and action advantages |
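The DDQN decoupling from the table can be shown concretely. A minimal sketch with tabular stand-ins for the two networks (the Q-values below are illustrative):

```python
def ddqn_target(q_online, q_target, s_next, reward, gamma=0.99, done=False):
    """Double-DQN bootstrap target for one transition:
    the online values select the action, the target values evaluate it."""
    if done:
        return reward
    a_star = max(range(len(q_online[s_next])), key=lambda a: q_online[s_next][a])
    return reward + gamma * q_target[s_next][a_star]
```

Vanilla DQN would instead use `reward + gamma * max(q_target[s_next])`; when the two maximizers disagree, the decoupled target is lower, which counters the overestimation bias.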
- Combines all six extensions into one agent; ablations measure the contribution of each component.