1. An automated measure of MDP similarity for transfer in reinforcement learning
Haitham Bou Ammar et al., AAAI 2014
- Main idea: measure the similarity between two MDPs by using a Restricted Boltzmann Machine (RBM) with MDP transition tuples (s, a, s') as its input.
- To measure the similarity of two MDPs (see the sketch below):
    - Sample both uniformly to generate data sets D1 and D2. (No policy is needed here; just take random actions to cover as much of the space as possible.)
    - Use data set D1 to train the RBM.
    - Once the RBM is trained, feed data set D2 through it and compute the reconstruction error/distance using the trained RBM parameters.
    - This error is the similarity measure.
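A minimal sketch of this loop, assuming a Gymnasium-style environment API and substituting scikit-learn's `BernoulliRBM` on min-max-scaled tuples for the paper's RBM (the paper's exact RBM variant and error definition may differ); `sample_tuples` and `rbm_distance` are hypothetical names:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

def sample_tuples(env, n=5000, seed=0):
    """Collect (s, a, s') tuples with uniformly random actions -- no policy needed."""
    env.action_space.seed(seed)
    data = []
    s, _ = env.reset(seed=seed)
    for _ in range(n):
        a = env.action_space.sample()
        s_next, _, terminated, truncated, _ = env.step(a)
        data.append(np.concatenate([np.atleast_1d(s), np.atleast_1d(a), np.atleast_1d(s_next)]))
        s = env.reset()[0] if terminated or truncated else s_next
    return np.asarray(data, dtype=float)

def rbm_distance(D1, D2, n_hidden=32):
    """Train an RBM on tuples from MDP 1; return the mean reconstruction error on tuples from MDP 2."""
    scaler = MinMaxScaler().fit(np.vstack([D1, D2]))  # BernoulliRBM expects inputs in [0, 1]
    X1, X2 = scaler.transform(D1), scaler.transform(D2)
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05, n_iter=50, random_state=0).fit(X1)
    # One deterministic up-down pass: visible -> hidden probabilities -> reconstructed visible.
    h = 1.0 / (1.0 + np.exp(-(X2 @ rbm.components_.T + rbm.intercept_hidden_)))
    v = 1.0 / (1.0 + np.exp(-(h @ rbm.components_ + rbm.intercept_visible_)))
    return float(np.mean((X2 - v) ** 2))
```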
- Experiments:
    - Uses 3 different environments: Inverted Pendulum (IP), Cart Pole (CP), and Mountain Car (MC). MDP comparisons are made within each environment.
    - By adjusting physical parameters (inertia and damping constant for IP, length and damping constant for CP, mass for MC), 50 MDPs are generated per environment, and 5000 (s, a, s') tuples are sampled within each MDP.
    - Errors are computed for all pairs within each 50-MDP group (see the sketch after this list).
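A hypothetical sketch of the within-environment comparison, reusing `sample_tuples` and `rbm_distance` from above; the list of parameterized variants (`envs`) is assumed to come from whatever factory generates the 50 MDPs:

```python
import numpy as np

def pairwise_errors(envs, n_samples=5000):
    """All-pairs RBM reconstruction errors between MDP variants of one environment.

    The matrix is generally asymmetric: E[i, j] trains on variant i
    and evaluates on variant j.
    """
    datasets = [sample_tuples(env, n=n_samples, seed=i) for i, env in enumerate(envs)]
    n = len(datasets)
    E = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            E[i, j] = rbm_distance(datasets[i], datasets[j])
    return E
```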
- Result:
    - The error can be used to cluster the MDPs into the environment's different dynamic phases (oscillatory, damped, critically damped), as sketched after this list.
    - It can also predict transfer performance: when a policy trained on a reference MDP is applied to another MDP, the smaller the error between the two, the higher the average return the transferred policy achieves.
    - The measure is not used for transferring a trained policy across environments (e.g., from IP to CP); all comparisons stay within one environment.
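To illustrate the clustering result, one plausible approach (an assumption for illustration, not necessarily the paper's method) is to symmetrize the pairwise error matrix and run hierarchical clustering on it:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_mdps(E, n_clusters=3):
    """Group MDP variants (e.g. oscillatory / damped / critically damped) from the error matrix."""
    D = (E + E.T) / 2.0        # symmetrize: E[i, j] != E[j, i] in general
    np.fill_diagonal(D, 0.0)   # squareform requires a zero diagonal
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```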