Integrate RL and MPC

Information Theoretic MPC for Model-based Reinforcement Learning

Grady Williams et al., ICRA 2017

  • progress: 20%
  • Youtube video
  • Main idea: use a trained fully-connected, multi-layer neural network as the function approximator (estimated dynamics model) for the MPC. The MPC uses this NN to predict n steps ahead and applies only the 1st step as the control input.
  • The system dynamics do not depend on the controller/cost function: for any input (action, in MDP terms), the next state is generated by the system dynamics alone, so a dataset collected with any controller can be used.
  • Two-layer structure (a minimal sketch follows this list):
    1. A fully-connected, multi-layer neural network is trained on the given data (state-action-acceleration tuples). The trained network serves as an approximate system model used to generate data by bootstrapping.
    2. Run MPC: get the current state, roll out sampled trajectories with the trained model, and use the information-theoretic (KL-divergence based) weighting to find the trajectory estimate closest to the optimal trajectory. Execute the first predicted action.
  • Third-party implementations: Ferreri Fabio, LemonPi
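
A minimal MPPI-style sketch of step 2 under my own assumptions: `model(s, a)` is the trained NN dynamics, `cost(s, a)` a per-step cost, and the weighting is the simplified softmin form (it omits the control-noise cross term of the full information-theoretic derivation):

```python
import numpy as np

def mppi_step(s0, U, model, cost, n_samples=256, lam=1.0, sigma=0.5):
    """One information-theoretic MPC update; returns the action to execute and the shifted plan."""
    horizon, nu = U.shape
    noise = np.random.randn(n_samples, horizon, nu) * sigma   # sampled control perturbations
    total_cost = np.zeros(n_samples)
    for k in range(n_samples):
        s = s0
        for t in range(horizon):
            a = U[t] + noise[k, t]
            total_cost[k] += cost(s, a)
            s = model(s, a)                       # roll out with the learned NN dynamics
    beta = total_cost.min()
    w = np.exp(-(total_cost - beta) / lam)        # exponential (softmin) weights, KL/free-energy view
    w /= w.sum()
    U = U + np.einsum("k,ktu->tu", w, noise)      # importance-weighted update of the control plan
    a0 = U[0].copy()                              # execute only the first action (receding horizon)
    U = np.roll(U, -1, axis=0); U[-1] = 0.0       # warm-start for the next time step
    return a0, U
```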

Learning-based model predictive control for safe exploration and reinforcement learning

Torsten Koller, arXiv: 1906.12189, 2019

  • Switches between a safe mode and a standard mode (a hybrid MPC?). The model itself is not being improved.
  • Learning-based MPC
    1. Improve the controller over time (by updating the model). Drawback: extensive offline computations.
    2. Instead of updating a model, samples are used to grow a safe set in a robust MPC setting over time.
  • Model-based RL, e.g., with Gaussian process dynamics models (a minimal sketch of the safe/standard switch follows).
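
A rough sketch of the mode switch, under my own assumptions (one sklearn `GaussianProcessRegressor` per state dimension, a hypothetical `in_safe_set` check, and a confidence scaling `kappa`; none of this is the paper's actual formulation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

class SafeSwitchingController:
    """Use the standard (learning) controller only if the GP's conservative
    one-step prediction stays inside the safe set; otherwise fall back."""

    def __init__(self, gp_models, standard_ctrl, safe_ctrl, in_safe_set, kappa=2.0):
        self.gp_models = gp_models          # one GaussianProcessRegressor per state dimension
        self.standard_ctrl = standard_ctrl  # e.g., the learning-based MPC / RL policy
        self.safe_ctrl = safe_ctrl          # conservative backup controller
        self.in_safe_set = in_safe_set      # callable: (lower, upper) state box -> bool
        self.kappa = kappa                  # confidence scaling (assumption, not from the paper)

    def act(self, s):
        a = self.standard_ctrl(s)
        z = np.concatenate([s, np.atleast_1d(a)]).reshape(1, -1)
        mean, std = zip(*[gp.predict(z, return_std=True) for gp in self.gp_models])
        mean, std = np.array(mean).ravel(), np.array(std).ravel()
        # conservative next-state box: mean +/- kappa * std per dimension
        if self.in_safe_set(mean - self.kappa * std, mean + self.kappa * std):
            return a                        # standard mode
        return self.safe_ctrl(s)            # safe mode fallback
```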

Practical reinforcement learning of stabilizing economic MPC

Mario Zanon et al., ECC 2019

  • link
  • Use RL to tune the MPC formulation (using MPC as a function approximator in RL)

Data-driven economic NMPC using reinforcement learning

Sebastien Gros et al., TAC2020

  • progress: 70%, link
  • Main idea: use the MPC cost function (e.g., an LQR-style cost) in place of the reward in RL, and follow the policy obtained by optimizing the parameterized MPC problem. A rough model (with parameters) is provided at the beginning; during execution, the scheme is refined (by adjusting the parameters) with TD-based RL (Q-learning / actor-critic). The purpose is to optimize performance (defined by a suitably chosen cost) rather than to penalize deviations from a given reference. In symbols, see the summary below.
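
A schematic summary (my notation, matching the table below; $L$ is the true stage cost and $J$ the closed-loop performance):

$$
\pi_\theta(s) = u_0^\star(s,\theta), \qquad u^\star(s,\theta) = \arg\min_{u_{0:N-1}} \ \text{(parameterized MPC cost, cf. Eq. 21 below)}
$$

$$
J(\pi_\theta) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k L\big(s_k, \pi_\theta(s_k)\big)\Big], \qquad \theta^\star = \arg\min_\theta J(\pi_\theta)
$$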

Key components of RL and their MPC counterparts:

| RL component | MPC counterpart |
| --- | --- |
| Value functions $V(s)$ and $Q(s,a)$ | Built from the parameterized MPC scheme as $V_\theta(s)$ and $Q_\theta(s,a)$ (see below). |
| Policy $\pi(s)$ | Given $s$, optimize the MPC problem to get an input sequence $u_0^\star,\dots,u_{N-1}^\star$; take the first input $u_0^\star$ as the policy for state $s$. |
| Reward (stage cost) | Implied by the value function and Q function. Option 1: learn the true stage cost from data (hard). Option 2: learn the ENMPC stage cost from data: the parameterized stage cost $\ell_\theta$ plus weighted penalties on the constraints, where $h_\theta$ are the constraints. |
| Transition / model $P(s'\mid s,a)$ | Dirac distribution: probability 1 at $s' = f(s,a)$, 0 anywhere else. The next state is determined by the true system; predictions can be made with a rough system model $f_\theta$, where $\theta$ are the parameters of the system model to be adjusted. |
| $V$ function | $V_\theta(s)$: the optimal cost of the parameterized MPC problem with constraints, see Eq. 21. Performance index: shares the same form as the MDP value function? Terms: $\lambda_\theta$ extra cost; $V_\theta^{\mathrm{f}}$ terminal cost built by a classic heuristic (e.g., LQR); $\ell_\theta$ stage cost; $w_{\mathrm{f}}^\top\sigma_N$ weighted constraint violation for the (terminal) state; $w^\top\sigma_k$ weighted constraint violation for each state-action pair. |
| $Q$-value | $Q_\theta(s,a)$: the same MPC problem with the first input constrained to $a$ (written out below). |
| $V$ and $Q$ relationship | $V_\theta$, $Q_\theta$, and $\pi_\theta$ satisfy the usual relationships (see the equations below). |
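
The last two rows written out (notation as in the table and in Eq. 21 below):

$$
Q_\theta(s,a) = \min_{x,u,\sigma} \ \text{same objective as } V_\theta(s) \quad \text{s.t. the same constraints and } u_0 = a
$$

$$
V_\theta(s) = \min_a Q_\theta(s,a), \qquad \pi_\theta(s) = \arg\min_a Q_\theta(s,a)
$$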

Q-learning for ENMPC

  • Q-learning needs (Sutton’s book):

    • Maintain a Q table starting with a random one
    • Know the reward.
      • Option 1: Feed $(s, a)$ to the system dynamics to predict the transition to $s'$, and use the cost function to estimate the stage cost as the reward. (pp. 2-3)
      • Option 2: Learn the stage cost from data (p. 2 and Section IV).
    • A policy to select the next action (given by MPC optimization)
  • Parameter updates

    • On/off-policy here refers to whether the current policy is used when updating the parameters of the system model / MPC scheme, not to the Q-value update itself (Q-learning does that off-policy).
    • On-policy: $\theta \leftarrow \theta + \alpha\,\delta\,\nabla_\theta Q_\theta(s,a)$ with $\delta = L(s,a) + \gamma V_\theta(s') - Q_\theta(s,a)$, where $a = \pi_\theta(s)$. (Only the input constraint needs to be considered, because $a$ is selected among the admissible control inputs during the optimization.) Changes in the NMPC parameters are readily applied in the closed loop after each update.
    • Off-policy (batch update): the applied action is selected according to a fixed NMPC policy, while the learning is performed within an alternative NMPC scheme based on a second set of parameters that is not applied to the real system. Two sets of parameters: the “real-system” one generates the closed-loop data for the RL iterations, while the “alternative” one is refined by the learning and can be plugged in to replace the “real-system” parameters after it converges.
  • Compare to [[1.theory/control/rl_overview#td-lambda|TD(λ)]] (Sutton’s book):

    $\hat{v}(s,\mathbf{w})$: approximate value of state $s$, with $\mathbf{w}$ the vector of feature weights (here $\mathbf{w}$ plays the role of $\theta$).

  • NO guarantee to find the global optimum of the parameters

  • Pseudocode

    initialize Q (start with any Q-table) and the MPC parameters θ
    loop n episodes:
        reset the environment, observe s
        optimize the MPC scheme (Eq. 21) to select an action: a = π_θ(s)
        while not done:
            s', r = env.step(a)          # take this step
            use Eq. 21 to optimize V_θ(s')
            a' = π_θ(s')                 # select the next action by using the MPC
            δ = r + γ·Q[s'][a'] - Q[s][a]
            Q[s][a] += α·δ
            (if on-policy, update the parameters here: θ += α·δ·∇_θ Q_θ(s, a))
            s, a = s', a'
    return Q
    (if off-policy, update the parameters here, then start new episodes)
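
A runnable-shaped Python sketch of the same loop. Everything here is a stand-in of my own: `solve_mpc`, `q_value`, and `grad_q` represent the parameterized NMPC solver and its parametric sensitivity, and `env.step` is assumed to return `(next_state, stage_cost, done)`:

```python
import numpy as np

def q_learning_enmpc(env, solve_mpc, q_value, grad_q, theta,
                     episodes=50, alpha=1e-3, gamma=0.99):
    """Q-learning where both the policy and the Q-function come from a parameterized MPC.

    Assumed placeholder interfaces (not from the paper):
      solve_mpc(s, theta) -> pi_theta(s), the first optimal input of the MPC (Eq. 21)
      q_value(s, a, theta) -> Q_theta(s, a), the MPC cost with u_0 fixed to a
      grad_q(s, a, theta)  -> gradient of Q_theta(s, a) w.r.t. theta
    """
    for _ in range(episodes):
        s, done = env.reset(), False
        a = solve_mpc(s, theta)                        # greedy action from the MPC scheme
        while not done:
            s_next, stage_cost, done = env.step(a)     # cost plays the role of the (negative) reward
            a_next = solve_mpc(s_next, theta)          # greedy next action, so Q(s', a') = V(s')
            td_error = (stage_cost
                        + gamma * q_value(s_next, a_next, theta)
                        - q_value(s, a, theta))
            theta = theta + alpha * td_error * grad_q(s, a, theta)  # semi-gradient parameter update
            s, a = s_next, a_next
    return theta
```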

Ref: value function (@p5, Eq. 21), roughly of the form (symbols as in the table below):

$$
V_\theta(s) = \min_{x,\,u,\,\sigma}\ \lambda_\theta(x_0) + \gamma^N\big(V_\theta^{\mathrm{f}}(x_N) + w_{\mathrm{f}}^\top\sigma_N\big) + \sum_{k=0}^{N-1}\gamma^k\big(\ell_\theta(x_k,u_k) + w^\top\sigma_k\big)
$$

subject to $x_0 = s$, the model $x_{k+1} = f_\theta(x_k,u_k)$, the (hard) input constraints on $u_k$, and the relaxed constraints $h_\theta(x_k,u_k)\le\sigma_k$, $h_\theta^{\mathrm{f}}(x_N)\le\sigma_N$, $\sigma\ge 0$.

| Variable / function | Note |
| --- | --- |
| $f_\theta$ | parameterized model |
| $x$ (state $s$) | state |
| $u$ (action $a$) | control input |
| $\theta$ | parameters, can be adjusted via RL tools |
| $\lambda_\theta$ | extra cost; its classic ENMPC counterpart is discussed in Section V-A |
| $V_\theta^{\mathrm{f}}$ | terminal cost, e.g., a quadratic cost from LQR (f: final?) |
| $w$, $w_{\mathrm{f}}$ | weights, manually defined |
| $\ell_\theta$ | parameterized stage cost |
| $\sigma$ | slack variables, control the relaxation of the constraints ($\sigma \ge 0$) |
| relaxed constraints | Section IV-C: hard constraints would yield an infinite value; relaxing them keeps the value function finite so that it can be used |
| input constraint | hard constraint on the control input $u_k$ |
| $h_\theta$ | parameterized constraints |
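
For concreteness, a compact sketch of how an Eq. 21-style relaxed, parameterized MPC value function could be set up with CasADi/IPOPT. The linear model, quadratic costs, box constraints, and the choice of letting only the model matrices play the role of θ are my own simplifications, not the paper's formulation:

```python
import casadi as ca
import numpy as np

def build_mpc_value_function(nx=2, nu=1, N=20, gamma=0.99, w_penalty=100.0):
    """Return V(s, A, B): discounted stage costs + terminal cost + weighted slack
    penalties, subject to a parameterized model and relaxed state constraints."""
    opti = ca.Opti()
    X = opti.variable(nx, N + 1)      # predicted states x_0..x_N
    U = opti.variable(nu, N)          # control inputs u_0..u_{N-1}
    S = opti.variable(nx, N)          # slack variables sigma_k >= 0
    s0 = opti.parameter(nx)           # current state, x_0 = s
    A = opti.parameter(nx, nx)        # "theta": parameterized linear model x+ = A x + B u
    B = opti.parameter(nx, nu)

    cost = 0
    opti.subject_to(X[:, 0] == s0)
    for k in range(N):
        cost += gamma**k * (ca.sumsqr(X[:, k]) + 0.1 * ca.sumsqr(U[:, k])  # stage cost l_theta
                            + w_penalty * ca.sum1(S[:, k]))                # weighted slacks w^T sigma_k
        opti.subject_to(X[:, k + 1] == ca.mtimes(A, X[:, k]) + ca.mtimes(B, U[:, k]))
        opti.subject_to(opti.bounded(-1, U[:, k], 1))                      # hard input constraint
        opti.subject_to(X[:, k] <= 1 + S[:, k])                            # relaxed state constraint h <= sigma
        opti.subject_to(S[:, k] >= 0)
    cost += gamma**N * ca.sumsqr(X[:, N])                                  # terminal cost (LQR-like stand-in)
    opti.minimize(cost)
    opti.solver("ipopt")

    def V(s, A_val, B_val):
        opti.set_value(s0, s)
        opti.set_value(A, A_val)
        opti.set_value(B, B_val)
        sol = opti.solve()
        return float(sol.value(cost)), np.atleast_1d(sol.value(U[:, 0]))   # V_theta(s), pi_theta(s)
    return V
```

In the paper's setting, the cost terms ($\lambda_\theta$, $\ell_\theta$, $V_\theta^{\mathrm{f}}$) and constraints $h_\theta$ would also carry adjustable parameters, and the gradients used by the RL updates come from parametric NLP sensitivity analysis rather than from this simplified setup.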

Reinforcement Learning for MDPs with Constraints

Peter Geibel, ECML 2006, link

Progress: