1. Relations between model predictive control and reinforcement learning
Daniel Görges, IFAC 2017
LQR:
LQR based on Q-learning, with convergence proved.
LQR based on Q-learning vs. LQR based on system identification.
LQR based on actor-critic structures.
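The LQR-via-Q-learning line above can be sketched as least-squares policy iteration in the spirit of Bradtke et al.: fit the quadratic Q-function of the current gain from transition data, then improve the gain from the fitted Q. The system matrices, sample counts, and exploration noise below are illustrative assumptions, not from the paper.

```python
import numpy as np

# Illustrative stable system x+ = A x + B u with quadratic cost (assumed, not from the paper).
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # stable, so K = 0 is an admissible initial policy
B = np.array([[0.0], [0.1]])
Qc, R = np.eye(2), 0.1 * np.eye(1)
n, m = 2, 1
rng = np.random.default_rng(0)

def phi(z):
    # Quadratic features: upper-triangular entries of z z^T (off-diagonals doubled),
    # so that phi(z) @ theta == z @ H @ z for the symmetric H built from theta.
    iu = np.triu_indices(len(z))
    scale = np.where(iu[0] == iu[1], 1.0, 2.0)
    return np.outer(z, z)[iu] * scale

K = np.zeros((m, n))                      # policy u = -K x
for _ in range(8):
    # Policy evaluation via the Bellman identity Q_K(x,u) = c(x,u) + Q_K(x', -K x').
    rows, costs = [], []
    for _ in range(200):
        x = rng.standard_normal(n)
        u = -K @ x + 0.5 * rng.standard_normal(m)   # exploratory input
        xn = A @ x + B @ u
        z, zn = np.concatenate([x, u]), np.concatenate([xn, -K @ xn])
        rows.append(phi(z) - phi(zn))
        costs.append(x @ Qc @ x + u @ R @ u)
    theta = np.linalg.lstsq(np.array(rows), np.array(costs), rcond=None)[0]
    H = np.zeros((n + m, n + m))
    H[np.triu_indices(n + m)] = theta
    H = H + H.T - np.diag(np.diag(H))
    # Policy improvement from the fitted Q: u* = -Huu^{-1} Hux x.
    K = np.linalg.solve(H[n:, n:], H[n:, :n])
```

Note that the dynamics (A, B) are only used to generate data; the gain update itself uses nothing but the fitted H, which is the model-free aspect the note refers to.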
Adding constraints:
Constraint handling has been addressed indirectly, by introducing penalties for constraint violations in the cost function (Ernst et al., 2009; Riedmiller, 2012; He and Jagannathan, 2007; Zhang et al., 2009. ?)
Direct constraint handling has only been investigated for input constraints (ref?).
Feasibility of MPC has not been studied for reinforcement learning.
Synergies between MPC and RL for discrete-time linear time-invariant systems with state and input constraints and a quadratic cost function, exploiting model knowledge (Sutton et al., 1992).
Table 1: a comparison of RL algorithms (below).
Gaussian-process (GP) learning-based MPC uses an offline-trained GP model.
GP regression estimates the mean and covariance, which define the uncertainty sets for a robust MPC.
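A minimal sketch of the offline-trained-GP idea: fit a GP to scalar model-error data, then read off the posterior mean and standard deviation along query points to form a disturbance set for robust constraint tightening. The kernel, data, and 3-sigma bound are illustrative assumptions, not from any specific paper.

```python
import numpy as np

def rbf(X1, X2, ell=0.5, sf=1.0):
    # Squared-exponential kernel with length scale ell and signal std sf.
    d = X1[:, None] - X2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, 30)                      # training inputs (e.g. states)
y = np.sin(X) + 0.05 * rng.standard_normal(30)  # observed model error (toy data)
sn = 0.05                                       # observation noise std

# Offline training: Cholesky factor of the kernel matrix and the weight vector alpha.
L = np.linalg.cholesky(rbf(X, X) + sn**2 * np.eye(30))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

def gp_posterior(xq):
    # Standard GP posterior mean and standard deviation at query points xq.
    Ks = rbf(xq, X)
    mu = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = rbf(xq, xq).diagonal() - np.sum(v**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

# Along a predicted MPC trajectory, the posterior yields an uncertainty set
# W(x) = [mu - 3 sigma, mu + 3 sigma] handed to the robust MPC (3-sigma is an assumed choice).
xq = np.linspace(-2, 2, 5)
mu, sigma = gp_posterior(xq)
lower, upper = mu - 3 * sigma, mu + 3 * sigma
```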
Using the MPC optimal cost (stage costs plus terminal cost) to approximate the Q-function (Gros' paper).
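The MPC-as-Q-function idea can be sketched on a scalar linear-quadratic example: fix the first input to a, optimize the remaining moves, and take the resulting finite-horizon cost as Q(x, a). The system, horizon, and weights below are illustrative assumptions; in the Gros line of work the MPC parameters would then be tuned by RL updates.

```python
import numpy as np

# Illustrative scalar system x+ = a x + b u with stage cost q x^2 + r u^2 (assumed values).
a_dyn, b_dyn = 1.2, 0.5     # unstable open loop
q, r = 1.0, 0.1
N = 10                      # MPC horizon

# Backward Riccati recursion: after N-1 steps, p x^2 is the optimal cost of the
# remaining N-1 moves (with terminal weight q).
p = q
for _ in range(N - 1):
    p = q + a_dyn**2 * p - (a_dyn * b_dyn * p) ** 2 / (r + b_dyn**2 * p)

def q_mpc(x, u):
    # Q(x,u): first input fixed to u, remaining N-1 moves optimal.
    x_next = a_dyn * x + b_dyn * u
    return q * x**2 + r * u**2 + p * x_next**2

def pi_mpc(x):
    # The MPC policy is the minimizer of the quadratic Q over u, in closed form.
    return -a_dyn * b_dyn * p / (r + b_dyn**2 * p) * x
```

Evaluating q_mpc at the policy pi_mpc(x) gives the MPC value function at x; perturbing u away from pi_mpc(x) can only increase the cost.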
Algorithm | Policy                          | Return estimation      | Update constraints     | Data distribution
NFQ       | Discrete, deterministic         | 1-step Q               | Bootstrap with old     | Off-policy, fixed a priori
(D)DQN    | Discrete, deterministic         | 1-step Q               | Bootstrap with old     | Off-policy, experience replay
DDPG      | Continuous, deterministic       | 1-step Q               | Bootstrap with old     | Off-policy, experience replay
TRPO      | Discrete/continuous, stochastic | n-step Q               | Policy constraint      | On-policy
PPO       | Discrete/continuous, stochastic | n-step advantage (GAE) | Clipped objective      | On-policy
A3C       | Discrete/continuous, stochastic | n-step advantage (GAE) | -                      | On-policy
ACER      | Discrete/continuous, stochastic | n-step advantage (GAE) | Average policy network | On-policy + off-policy
Abbreviation | Full name
NFQ          | Neural Fitted Q iteration
DDQN         | Double Deep Q-Network
DDPG         | Deep Deterministic Policy Gradient
GAE          | Generalized Advantage Estimation
TRPO         | Trust Region Policy Optimisation
PPO          | Proximal Policy Optimisation
A3C          | Asynchronous Advantage Actor-Critic
ACER         | Actor-Critic with Experience Replay
MPC is used to deliver an (almost) optimal policy; RL is then considered an effective tool for tuning the MPC.