Reinforcement learning methods/theory

tags: #rl #mdp

Reinforcement Learning for MDPs with Constraints

Peter Geibel, ECML 2006

  • Terms:
    • S: a finite state set
    • A: a finite action set
    • P: state transition probabilities
    • r: reward obtained
    • V: value of a policy
    • c: an additional second reward function
    • V_c: constrained value function
    • CMDP: constrained MDP
  • Main contributions:
    • Considers MDPs with two criteria / two kinds of constraints:
      1. CMDP (constrained Markov decision process): constrain the second criterion function itself, i.e. the expected value of its infinite-horizon cumulative return.
      2. CPMDP (MDP with a constrained probability of constraint violation): constrain the probability that the return, considered as a random variable, violates an inequality constraint; i.e. there is a maximum allowable probability that individual returns violate the constraint.
    • Reinforcement learning methods for solving such constrained control problems:
      • LinMDP: linear programming for CMDPs: augment the linear program of the unconstrained MDP with an additional constraint expressing that the second value under the policy must exceed a given threshold. CPMDPs are handled by mapping them to CMDPs. Drawback: LinMDP might be suboptimal for solving CPMDPs.
        • Workaround: vary the
      • WeiMDP: a weighted approach
        • Expresses the probability of entering an undesirable state as an (undiscounted) second value function.
        • Introduces a weight, giving a weighted reward function.
        • Can be solved with Q-learning.
      • AugMDP: state space extension
        • When a state is entered after the accumulated costs drop below the allowed bound, an additional negative reward with a high absolute value is imposed, which prevents entering that state.
        • Limitation: requires the problem's maximum cost to be bounded from above.
      • RecMDP: recursive reformulation of the constraint
        • Develops a new value function.
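The weighted approach can be sketched as tabular Q-learning on the combined reward r − λ·c. The toy one-state MDP, its transition encoding, and the function name below are illustrative assumptions, not taken from the paper:

```python
import random

def weighted_q_learning(transitions, lam=1.0, gamma=0.9,
                        alpha=0.1, episodes=2000, seed=0):
    """Tabular Q-learning on the weighted reward r - lam * c.

    `transitions[(s, a)] = (s_next, r, c, done)` encodes a small
    deterministic MDP; `c` is the second (cost) reward function.
    """
    rng = random.Random(seed)
    states = sorted({s for (s, _) in transitions})
    actions = sorted({a for (_, a) in transitions})
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s, done = 0, False          # every episode starts in state 0
        while not done:
            # epsilon-greedy action selection
            if rng.random() < 0.1:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, c, done = transitions[(s, a)]
            target = r - lam * c    # weighted reward combines both criteria
            if not done:
                target += gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q

# Toy MDP: one state, a safe action (r=1, c=0) and a risky one (r=2, c=5).
toy = {
    (0, 0): (0, 1.0, 0.0, True),   # safe: lower reward, no cost
    (0, 1): (0, 2.0, 5.0, True),   # risky: higher reward, high cost
}
Q = weighted_q_learning(toy, lam=1.0)
```

With lam=1 the safe action wins (1 − 0 > 2 − 5); with lam=0 the risky one does, so the weight trades the primary return off against the cost criterion.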

Rainbow: Combining Improvements in Deep Reinforcement Learning

Matteo Hessel et al., AAAI 2018

  • DQN collection

    Extension | Problem to solve | Method
    --- | --- | ---
    Double DQN (DDQN) | overestimation bias, due to the maximization step | decouple selection of the action from its evaluation
    A3C | learning from multi-step bootstrap targets |
    Distributional Q-learning | |
    Noisy DQN | |
    Prioritized DDQN | |
    Dueling DDQN | |
  • Combines all 6 extensions; what is the contribution of each component?
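The DDQN decoupling can be illustrated with a minimal sketch: the online network selects the greedy next action, while the target network evaluates it. The function names and toy Q-tables are assumptions for illustration, not from the paper:

```python
def ddqn_target(q_online, q_target, reward, next_state, gamma=0.99):
    # Double DQN target: the online network *selects* the greedy action,
    # the target network *evaluates* it. Decoupling selection from
    # evaluation reduces the overestimation bias of the max operator.
    a_star = max(range(len(q_online[next_state])),
                 key=lambda a: q_online[next_state][a])
    return reward + gamma * q_target[next_state][a_star]

def dqn_target(q_target, reward, next_state, gamma=0.99):
    # Vanilla DQN target: the same network both selects and evaluates,
    # so upward noise in any action's estimate inflates the target.
    return reward + gamma * max(q_target[next_state])

# Toy tables: the online net prefers action 1, but the target net's
# (noisy) estimate for action 0 is inflated.
q_online = {0: [1.0, 2.0]}
q_target = {0: [5.0, 0.5]}

print(ddqn_target(q_online, q_target, 0.0, 0, gamma=1.0))  # 0.5
print(dqn_target(q_target, 0.0, 0, gamma=1.0))             # 5.0
```

The vanilla target latches onto the inflated 5.0 estimate, while the double estimator evaluates the online network's chosen action with the target network and returns the smaller 0.5.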

Playing Atari with deep reinforcement learning

Volodymyr Mnih et al., NIPS Deep Learning Workshop 2013