A collection of reviews on MPC vs. RL

1. Relations between model predictive control and reinforcement learning

Daniel Gorges, IFAC 2017

  • LQR
    1. LQR based on Q-learning, with convergence proved.
    2. LQR based on Q-learning vs. LQR based on system identification.
    3. LQR based on actor-critic structures.
  • Adding constraints:
    1. Constraint handling has been addressed indirectly, by introducing penalties for constraint violations in the cost function (Ernst et al., 2009; Riedmiller, 2012), or directly (He and Jagannathan, 2007; Zhang et al., 2009?)
    2. Direct constraint handling has only been investigated for input constraints (ref?).
  • Feasibility, as studied for MPC, has not been studied for reinforcement learning
  • Synergies between MPC and RL for discrete-time linear time-invariant systems with state and input constraints and a quadratic cost function, exploiting knowledge from both fields (Sutton et al., 1992). Table 1 gives a comparison:
| Property            | MPC                                         | RL                                  |
|---------------------|---------------------------------------------|-------------------------------------|
| Model               | required                                    | not required                        |
| Convexity           | required (usually)                          | not required                        |
| Adaptivity          | immature (usually based on robustness)      | mature (inherent)                   |
| Online complexity   | high (except explicit and neural MPC)       | low                                 |
| Offline complexity  | low (except explicit and neural MPC)        | high                                |
| Stability theory    | mature (e.g. based on terminal cost)        | immature                            |
| Feasibility theory  | mature (e.g. based on terminal constraints) | immature                            |
| Robustness theory   | mature (e.g. based on tubes or ISS)         | immature                            |
| Constraint handling | mature (inherent)                           | immature (except input constraints) |
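The LQR/Q-learning connection listed above can be made concrete: for linear dynamics and a quadratic cost, the Q-function is itself quadratic, so an estimated Q-matrix recovers the same gain as the Riccati solution. A minimal numpy sketch, with assumed illustrative system matrices (not from the paper):

```python
import numpy as np

# Illustrative double-integrator system (assumed values, not from the paper)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Qc = np.eye(2)           # state weight
Rc = np.array([[0.1]])   # input weight

# Riccati fixed-point iteration for the infinite-horizon LQR
P = np.eye(2)
for _ in range(500):
    K = np.linalg.solve(Rc + B.T @ P @ B, B.T @ P @ A)  # u = -K x
    P = Qc + A.T @ P @ (A - B @ K)

# For LQR the Q-function is quadratic: Q(x, u) = [x; u]^T H [x; u].
# Q-learning estimates H from data; minimizing over u recovers the same gain.
H = np.block([[Qc + A.T @ P @ A, A.T @ P @ B],
              [B.T @ P @ A,      Rc + B.T @ P @ B]])
K_from_H = np.linalg.solve(H[2:, 2:], H[2:, :2])
print(np.allclose(K, K_from_H))  # True
```

The point of the model-free route is that H can be fitted from transition data alone, without knowing A and B; here H is built analytically only to show the two gains coincide.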

1.1. MPC

  • The cost consists of a stage cost and a terminal cost
  • The terminal weight matrix is given by the solution of the algebraic Riccati equation
  • Finite-horizon optimal control problem (FHOCP): solved by quadratic programming (QP) or multi-parametric QP (mp-QP)
  • Infinite-horizon optimal control problem (IHOCP)
  • Stability and feasibility: guaranteed inherently (IHOCP) or by imposing terminal constraints (FHOCP)
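The FHOCP-with-Riccati-terminal-weight setup can be sketched in numpy: condense the horizon into one unconstrained QP and check that, with the Riccati solution as terminal weight, the first move matches the infinite-horizon LQR law. System matrices are assumed illustrative values; with state/input constraints the last step would become a QP solver call instead of a linear solve.

```python
import numpy as np

# Assumed double-integrator system (illustrative, not from the paper)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Qc, Rc, N, n, m = np.eye(2), np.array([[0.1]]), 10, 2, 1

# Terminal weight P: solution of the discrete algebraic Riccati equation
P = np.eye(2)
for _ in range(500):
    K = np.linalg.solve(Rc + B.T @ P @ B, B.T @ P @ A)
    P = Qc + A.T @ P @ (A - B @ K)

# Condense the FHOCP: stack x_k = Sx @ x0 + Su @ U over the horizon
Sx = np.vstack([np.linalg.matrix_power(A, k) for k in range(N + 1)])
Su = np.zeros(((N + 1) * n, N * m))
for k in range(1, N + 1):
    for j in range(k):
        Su[k*n:(k+1)*n, j*m:(j+1)*m] = np.linalg.matrix_power(A, k-1-j) @ B
Qbar = np.kron(np.eye(N + 1), Qc)
Qbar[-n:, -n:] = P                   # terminal cost uses the Riccati weight
Rbar = np.kron(np.eye(N), Rc)

# Unconstrained QP solution for the stacked input sequence U
x0 = np.array([1.0, 0.0])
H = Su.T @ Qbar @ Su + Rbar
U = -np.linalg.solve(H, Su.T @ Qbar @ Sx @ x0)

# With terminal weight P, the first move equals the infinite-horizon LQR law
print(np.allclose(U[0], -K @ x0))  # True
```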

1.2. RL

  • Dynamic programming: value iteration, policy iteration, policy search
  • Actor-critic structure with critic neural network and actor neural network
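Value iteration, the simplest of these dynamic-programming schemes, can be sketched on a toy chain MDP (all numbers here are illustrative assumptions, not from the paper):

```python
import numpy as np

# Toy 5-state chain MDP: action 0 moves left, action 1 moves right,
# reward 1 for being in the rightmost state.
n_states, n_actions, gamma = 5, 2, 0.9
P = np.zeros((n_actions, n_states, n_states))   # P[a, s, s'] transition probs
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, n_states - 1)] = 1.0
R = np.zeros((n_actions, n_states))
R[:, n_states - 1] = 1.0

# Value iteration: repeated Bellman optimality backups
V = np.zeros(n_states)
for _ in range(200):
    Q = R + gamma * P @ V      # Q[a, s]
    V = Q.max(axis=0)

policy = Q.argmax(axis=0)       # greedy policy: always move right
print(policy)  # [1 1 1 1 1]
```

Policy iteration alternates full policy evaluation with greedy improvement instead of the single max-backup per sweep shown here.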

2. Reinforcement learning with model predictive control - recent development

Tri Tran et al., ICOCTA 2019

  1. Gaussian-process (GP) learning-based MPC: use an offline-trained GP model; GP regression estimates the mean and covariance, which serve as the uncertainty sets for a robust MPC
  2. Using the stage cost of the MPC value function to approximate the Q-function (Gros' paper)
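The GP step in item 1 can be sketched in a few lines of numpy: given noisy samples of some residual dynamics, GP regression returns a posterior mean and covariance at query points. The kernel, data, and noise level below are assumed for illustration; the paper's offline-trained model would be fit to real plant data.

```python
import numpy as np

def rbf(X1, X2, ell=1.0, sf=1.0):
    """Squared-exponential (RBF) kernel for 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 20)                      # training inputs
y = np.sin(X) + 0.05 * rng.standard_normal(20)  # noisy observations
Xs = np.array([0.5, 2.0])                       # query points

K = rbf(X, X) + 1e-4 * np.eye(20)               # noise variance on the diagonal
Ks = rbf(X, Xs)
mean = Ks.T @ np.linalg.solve(K, y)             # posterior mean
cov = rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks)  # posterior covariance
```

A robust MPC would then inflate its constraint tightening by, e.g., a multiple of `sqrt(diag(cov))` around the predicted mean.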
| Algorithm | Policy                          | Return estimation      | Update constraints     | Data distribution             |
|-----------|---------------------------------|------------------------|------------------------|-------------------------------|
| NFQ       | Discrete, deterministic         | 1-step Q               | Bootstrap with old Q   | Off-policy, fixed a priori    |
| (D)DQN    | Discrete, deterministic         | 1-step Q               | Bootstrap with old Q   | Off-policy, experience replay |
| DDPG      | Continuous, deterministic       | 1-step Q               | Bootstrap with old Q   | Off-policy, experience replay |
| TRPO      | Discrete/continuous, stochastic | n-step Q               | Policy constraint      | On-policy                     |
| PPO       | Discrete/continuous, stochastic | n-step advantage (GAE) | Clipped objective      | On-policy                     |
| A3C       | Discrete/continuous, stochastic | n-step advantage (GAE) | –                      | On-policy                     |
| ACER      | Discrete/continuous, stochastic | n-step advantage (GAE) | Average policy network | On-policy + off-policy        |
| Abbreviation | Full name                            |
|--------------|--------------------------------------|
| NFQ          | Neural Fitted Q iteration            |
| DDQN         | Double Deep Q-network                |
| DDPG         | Deep Deterministic Policy Gradient   |
| GAE          | Generalized Advantage Estimation     |
| TRPO         | Trust Region Policy Optimisation     |
| PPO          | Proximal Policy Optimisation         |
| A3C          | Asynchronous Advantage Actor Critic  |
| ACER         | Actor Critic with Experience Replay  |
  • MPC is used to deliver an (almost) optimal policy; RL is then considered an effective tuning tool for MPC
  • Using MPC instead of a DNN as the function approximator
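The "MPC as function approximator" idea can be sketched as a toy: a parameterized MPC-style cost-to-go plays the role of Q_theta, and a Q-learning temporal-difference step adapts the cost parameter. Everything below (system, fixed inner policy, step sizes, rollout horizon) is an illustrative assumption, not Gros' actual constrained-MPC formulation:

```python
import numpy as np

# Assumed toy system and a fixed stabilizing inner policy for the rollout
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([0.0, 0.1])
K = np.array([1.0, 1.5])
gamma, N = 0.95, 8

def q_theta(x, u, theta):
    """N-step MPC-style rollout: first input u, then u_k = -K x_k.
    The stage cost theta * x'x + 0.1 u^2 is weighted by the learned theta."""
    cost, xk = 0.0, x
    for k in range(N):
        uk = u if k == 0 else float(-K @ xk)
        cost += gamma**k * (theta * (xk @ xk) + 0.1 * uk**2)
        xk = A @ xk + B * uk
    return cost

# Plain Q-learning on theta; the TD target minimizes over a coarse action grid
rng = np.random.default_rng(0)
theta, alpha, U = 0.2, 1e-3, np.linspace(-1.0, 1.0, 9)
for _ in range(300):
    x = rng.standard_normal(2)
    u = float(rng.choice(U))                  # exploratory action
    r = x @ x + 0.1 * u**2                    # "true" stage cost of the task
    xn = A @ x + B * u
    target = r + gamma * min(q_theta(xn, un, theta) for un in U)
    grad = (q_theta(x, u, theta + 1e-4) - q_theta(x, u, theta)) / 1e-4
    theta += alpha * (target - q_theta(x, u, theta)) * grad
```

Because the rollout cost enters Q_theta linearly in theta, the TD update nudges theta toward consistency with the true stage cost; in Gros' formulation each Q-evaluation solves a full constrained MPC problem rather than this fixed-policy rollout.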