Integrate RL and MPC
Information Theoretic MPC for Model-based Reinforcement Learning
Grady Williams et al., ICRA 2017
- progress: 20%
- Youtube video
- Main idea: use a trained fully-connected, multi-layer neural network as the function approximator (estimated model) for the MPC. The MPC uses this NN to predict n steps ahead and applies the 1st step as the control input.
- System dynamics do not depend on the controller/cost function: for any input (action, in MDP terms), the next state is generated by the system dynamics, so a dataset collected under any controller can be used.
- Two-layer structure:
  - A fully-connected, multi-layer neural network is trained on the given data (state, action, acceleration). The trained network serves as an approximate system model used to generate data by bootstrapping.
  - Use MPC: get the current state, roll out trajectories with the trained model, use KL divergence (the information-theoretic trajectory weighting) to find the trajectory estimate closest to the optimal trajectory, and execute the first predicted action (rough sketch after this list).
- Someone’s implementations (Ferreri Fabio) (LemonPi)
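A minimal Python sketch of the loop above, assuming a learned dynamics model `nn_model(state, action) -> next_state` and a per-step `cost(state, action)` function (both placeholder names, not the paper's code); trajectories are weighted by exponentiated cost, which is the information-theoretic flavor of the method:

```python
import numpy as np

def mpc_step(nn_model, cost, state, horizon=20, n_samples=500,
             action_dim=1, noise_std=0.5, temperature=1.0):
    # Sample random action sequences around a zero nominal plan.
    actions = np.random.normal(0.0, noise_std, size=(n_samples, horizon, action_dim))
    total_costs = np.zeros(n_samples)
    for i in range(n_samples):
        s = state
        for t in range(horizon):
            a = actions[i, t]
            total_costs[i] += cost(s, a)
            s = nn_model(s, a)                  # roll out with the learned NN model
    # Exponentiated-cost weighting of the sampled trajectories (soft-min).
    weights = np.exp(-(total_costs - total_costs.min()) / temperature)
    weights /= weights.sum()
    # Execute only the weighted first action; re-plan at the next step.
    return weights @ actions[:, 0, :]
```

Since the dynamics do not depend on the controller, the closed-loop data collected while running this controller can be fed back to retrain `nn_model`.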
Learning-based model predictive control for safe exploration and reinforcement learning
Torsten Koller, arXiv: 1906.12189, 2019
Related work
- Switch between safe mode and standard mode (hybrid MPC?). Model is not being improved.
- Learning-based MPC
- Improve the controller over time (updating the model). Drawback: extensive offline computations
- Instead of updating a model, samples are used to grow a safe set in a robust MPC setting over time.
- Model-based RL e.g. Gaussian processes
Practical reinforcement learning of stabilizing economic MPC
Mario Zanon et al., ECC 2019
- link
- Use RL to tune the MPC formulation (using MPC as a function approximator in RL)
Data-driven economic NMPC using reinforcement learning
Sebastien Gros et al., TAC 2020
- progress: 70%, link
- Main idea: use the MPC cost function (e.g., an LQR-type cost) in place of the reward in RL, and follow the policy selected by optimizing the MPC cost function. A rough parameterized model is provided at the beginning; during execution, the model is refined (its parameters adjusted) by RL TD/actor-critic methods. The purpose is to optimize performance (defined by a suitably chosen cost) rather than to penalize deviations from a given reference (see the compact restatement after the table below).
Key components of RL and MPC:

| RL component | MPC component |
|---|---|
| Given the state, take an action. | |
| Reward: implied by the value function and the Q function. Option 1: estimate it via the cost function. Option 2: learn it from data. | ENMPC stage cost |
| The next state is determined by the true system; predictions can be made with a rough system model. | |
| Value function, with constraints. See Eq. 21. | Performance index: shares the same equation as the MDP’s value function? |
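A compact restatement of the idea (using $\theta$ for the tunable MPC parameters): the MPC optimum delivers both the value estimate and the greedy policy, and RL only adjusts $\theta$:

$$
V_\theta(s) \;=\; \min_{a} Q_\theta(s,a), \qquad \pi_\theta(s) \;=\; \arg\min_{a} Q_\theta(s,a)
$$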
Q-learning for ENMPC
- Q-learning needs (Sutton’s book):
- Maintain a Q table starting with a random one
- Know the reward.
  - Option 1: by feeding the state and action into the system dynamics, we can predict the transition to the next state; use the cost function to estimate the stage cost as the reward. (p2~3)
  - Option 2: learn the reward from data. (p2 and Section IV)
- A policy to select the next action (given by MPC optimization)
- Parameter updates
- On-/off-policy here refers to whether the current policy is used to update the system dynamics’ parameters, not the Q value (Q-learning updates the Q value off-policy).
- On-policy: $\theta \leftarrow \theta + \alpha\,\delta\,\nabla_\theta Q_\theta(s,a)$, where $\delta = \ell(s,a) + \gamma V_\theta(s') - Q_\theta(s,a)$. (Only the constraint on the input needs to be considered, because $a$ is selected among the possible control inputs during the optimization.) Changes in the NMPC parameters are readily applied in the closed loop right after each update.
- Off-policy: batch update. The applied action is selected according to the fixed NMPC policy, while the learning is performed within an alternative NMPC scheme based on a second set of parameters that is not applied to the real system. Of the two parameter sets, the “real system” one is used to run the RL iterations, while the “alternative” one is refined and can be plugged in to replace the “real system” parameters after it has converged.
- Compare to [[1.theory/control/rl_overview#td-lambda|TD(λ)]] (Sutton’s book): approximate the value of state $s$ as $\hat{v}(s,\mathbf{w})$, with $\mathbf{w}$ the vector of feature weights (here, let $\mathbf{w}$ play the role of the MPC parameters $\theta$).
- NO guarantee of finding the global optimum of the parameters.
- Pseudocode:

```
initialize Q (start with any theta / Q-table)
loop over n episodes:
    reset the environment, observe s
    optimize the MPC scheme (Eq. 21) to select an action a
    while not done:
        s', r = env.step(a)                          # take this step
        use Eq. 21 to optimize a'                    # select the next action with the MPC
        Q[s][a] += alpha * (r + gamma * Q[s'][a'] - Q[s][a])
        (if on-policy: update the parameters theta here)
        s, a = s', a'
    (if off-policy: update the parameters theta here in a batch, then start new episodes)
return Q
```
Ref: value function (@p5, Eq. 21):
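A rough sketch of the structure of this value function, assembled from the components in the table below (the symbols are shorthand for those components and may not match the paper's exact notation):

$$
V_\theta(s) \;=\; \min_{x,\,u,\,\sigma}\;\; \lambda_\theta(x_0) \;+\; \gamma^{N} V^{\mathrm{f}}_\theta(x_N) \;+\; \sum_{k=0}^{N-1}\gamma^{k}\big(\ell_\theta(x_k,u_k) + w^{\top}\sigma_k\big)
$$

$$
\text{s.t.}\quad x_0 = s,\quad x_{k+1} = f_\theta(x_k,u_k),\quad g(u_k)\le 0,\quad h_\theta(x_k,u_k)\le \sigma_k,\quad \sigma_k \ge 0
$$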
| variable/function | Note |
|---|---|
| $f_\theta$ | parameterized model |
| $x_k$ | state |
| $u_k$ | control input |
| $\theta$ | parameter, can be adjusted via RL tools |
| $\lambda_\theta$ | extra cost, classic ENMPC |
| | classic ENMPC |
| $V^{\mathrm{f}}_\theta$ | terminal cost, e.g., a quadratic cost from LQR (f: final?) |
| $w$ | weight, manually defined |
| $\ell_\theta$ | parameterized stage cost |
| $\sigma_k$ | slack variable, controls the relaxation of constraints (Section IV-C: constraints yield infinite value, so relaxing them keeps the value function finite) |
| $g$ | input constraint |
| $h_\theta$ | parameterized constraints |
Reinforcement Learning for MDPs with Constraints
Peter Geibel, ECML 2006, link