1. Relations between model predictive control and reinforcement learning
Daniel Görges, IFAC 2017
LQR:
LQR based on Q-learning, with convergence proved.
LQR based on Q-learning vs. LQR based on system identification.
LQR based on actor-critic structures.
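The LQR-via-Q-learning line above can be sketched as least-squares policy iteration in the spirit of Bradtke et al.: fit the quadratic Q-function of the current gain from transition data, then improve the gain from the fitted Q. The system matrices, sample counts, and exploration noise below are illustrative assumptions, not from the paper.

```python
import numpy as np

# Illustrative stable system x+ = A x + B u with quadratic cost (assumed, not from the paper).
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # stable, so K = 0 is an admissible initial policy
B = np.array([[0.0], [0.1]])
Qc, R = np.eye(2), 0.1 * np.eye(1)
n, m = 2, 1
rng = np.random.default_rng(0)

def phi(z):
    # Quadratic features: upper-triangular entries of z z^T (off-diagonals doubled),
    # so that phi(z) @ theta == z @ H @ z for the symmetric H built from theta.
    iu = np.triu_indices(len(z))
    scale = np.where(iu[0] == iu[1], 1.0, 2.0)
    return np.outer(z, z)[iu] * scale

K = np.zeros((m, n))                      # policy u = -K x
for _ in range(8):
    # Policy evaluation via the Bellman identity Q_K(x,u) = c(x,u) + Q_K(x', -K x').
    rows, costs = [], []
    for _ in range(200):
        x = rng.standard_normal(n)
        u = -K @ x + 0.5 * rng.standard_normal(m)   # exploratory input
        xn = A @ x + B @ u
        z, zn = np.concatenate([x, u]), np.concatenate([xn, -K @ xn])
        rows.append(phi(z) - phi(zn))
        costs.append(x @ Qc @ x + u @ R @ u)
    theta = np.linalg.lstsq(np.array(rows), np.array(costs), rcond=None)[0]
    H = np.zeros((n + m, n + m))
    H[np.triu_indices(n + m)] = theta
    H = H + H.T - np.diag(np.diag(H))
    # Policy improvement from the fitted Q: u* = -Huu^{-1} Hux x.
    K = np.linalg.solve(H[n:, n:], H[n:, :n])
```

Note that the dynamics (A, B) are only used to generate data; the gain update itself uses nothing but the fitted H, which is the model-free aspect the note refers to.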
Adding constraints:
Constraint handling has been addressed indirectly, by introducing penalties for constraint violations in the cost function (Ernst et al., 2009; Riedmiller, 2012; He and Jagannathan, 2007; Zhang et al., 2009. ?)
Direct constraint handling has only been investigated for input constraints (ref?).
Feasibility of MPC has not been studied for reinforcement learning.
Synergies between MPC and RL for discrete-time linear time-invariant systems with state and input constraints and a quadratic cost function, exploiting model knowledge (Sutton et al., 1992).
Table 1: a comparison of RL algorithms (below).
Gaussian-process (GP) learning-based MPC uses an offline-trained GP model.
GP regression estimates the mean and covariance, which define the uncertainty sets for a robust MPC.
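A minimal sketch of the offline-trained-GP idea: fit a GP to scalar model-error data, then read off the posterior mean and standard deviation along query points to form a disturbance set for robust constraint tightening. The kernel, data, and 3-sigma bound are illustrative assumptions, not from any specific paper.

```python
import numpy as np

def rbf(X1, X2, ell=0.5, sf=1.0):
    # Squared-exponential kernel with length scale ell and signal std sf.
    d = X1[:, None] - X2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, 30)                      # training inputs (e.g. states)
y = np.sin(X) + 0.05 * rng.standard_normal(30)  # observed model error (toy data)
sn = 0.05                                       # observation noise std

# Offline training: Cholesky factor of the kernel matrix and the weight vector alpha.
L = np.linalg.cholesky(rbf(X, X) + sn**2 * np.eye(30))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

def gp_posterior(xq):
    # Standard GP posterior mean and standard deviation at query points xq.
    Ks = rbf(xq, X)
    mu = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = rbf(xq, xq).diagonal() - np.sum(v**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

# Along a predicted MPC trajectory, the posterior yields an uncertainty set
# W(x) = [mu - 3 sigma, mu + 3 sigma] handed to the robust MPC (3-sigma is an assumed choice).
xq = np.linspace(-2, 2, 5)
mu, sigma = gp_posterior(xq)
lower, upper = mu - 3 * sigma, mu + 3 * sigma
```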
Using the MPC optimal cost (stage costs plus terminal cost) to approximate the Q-function (Gros' paper).
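The MPC-as-Q-function idea can be sketched on a scalar linear-quadratic example: fix the first input to a, optimize the remaining moves, and take the resulting finite-horizon cost as Q(x, a). The system, horizon, and weights below are illustrative assumptions; in the Gros line of work the MPC parameters would then be tuned by RL updates.

```python
import numpy as np

# Illustrative scalar system x+ = a x + b u with stage cost q x^2 + r u^2 (assumed values).
a_dyn, b_dyn = 1.2, 0.5     # unstable open loop
q, r = 1.0, 0.1
N = 10                      # MPC horizon

# Backward Riccati recursion: after N-1 steps, p x^2 is the optimal cost of the
# remaining N-1 moves (with terminal weight q).
p = q
for _ in range(N - 1):
    p = q + a_dyn**2 * p - (a_dyn * b_dyn * p) ** 2 / (r + b_dyn**2 * p)

def q_mpc(x, u):
    # Q(x,u): first input fixed to u, remaining N-1 moves optimal.
    x_next = a_dyn * x + b_dyn * u
    return q * x**2 + r * u**2 + p * x_next**2

def pi_mpc(x):
    # The MPC policy is the minimizer of the quadratic Q over u, in closed form.
    return -a_dyn * b_dyn * p / (r + b_dyn**2 * p) * x
```

Evaluating q_mpc at the policy pi_mpc(x) gives the MPC value function at x; perturbing u away from pi_mpc(x) can only increase the cost.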
Algorithm | Policy                          | Return estimation      | Update constraints     | Data distribution
NFQ       | Discrete, deterministic         | 1-step Q               | Bootstrap with old     | Off-policy, fixed a priori
(D)DQN    | Discrete, deterministic         | 1-step Q               | Bootstrap with old     | Off-policy, experience replay
DDPG      | Continuous, deterministic       | 1-step Q               | Bootstrap with old     | Off-policy, experience replay
TRPO      | Discrete/continuous, stochastic | n-step Q               | Policy constraint      | On-policy
PPO       | Discrete/continuous, stochastic | n-step advantage (GAE) | Clipped objective      | On-policy
A3C       | Discrete/continuous, stochastic | n-step advantage (GAE) | -                      | On-policy
ACER      | Discrete/continuous, stochastic | n-step advantage (GAE) | Average policy network | On-policy + off-policy
Abbreviation | Full name
NFQ          | Neural Fitted Q iteration
DDQN         | Double Deep Q-Network
DDPG         | Deep Deterministic Policy Gradient
GAE          | Generalized Advantage Estimation
TRPO         | Trust Region Policy Optimisation
PPO          | Proximal Policy Optimisation
A3C          | Asynchronous Advantage Actor-Critic
ACER         | Actor-Critic with Experience Replay
MPC is used to deliver an (almost) optimal policy; RL is then considered an effective tool for tuning the MPC.