1. An automated measure of MDP similarity for transfer in reinforcement learning

Haitham Bou Ammar et al., AAAI 2014

  • Main idea: measure the similarity between MDPs using Restricted Boltzmann Machines (RBMs), with MDP transition tuples (s, a, s') as the input.
  • To measure two MDPs:
    1. Sample both MDPs uniformly to generate one data set per MDP. (No policy is needed here; just take random actions to cover as much of the state-action space as possible.)
    2. Train the RBM on the first data set.
    3. Once the RBM is trained, feed the second data set through it and use the trained parameters to compute a reconstruction error/distance.
    4. This reconstruction error is the similarity measure: the smaller it is, the more similar the two MDPs.
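The steps above can be sketched as follows. This is a minimal illustration, not the paper's exact model: it uses a plain Bernoulli RBM trained with one-step contrastive divergence (CD-1) on features scaled to [0, 1], and all hyperparameters (hidden units, learning rate, epochs) are assumptions. The random data sets `D1` and `D2` stand in for (s, a, s') samples from two MDPs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden=8, lr=0.05, epochs=50):
    """Train a Bernoulli RBM with one-step contrastive divergence (CD-1).

    `data` is an (N, d) array of (s, a, s') tuples scaled to [0, 1].
    Hyperparameters are illustrative, not taken from the paper."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)   # visible bias
    c = np.zeros(n_hidden)    # hidden bias
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W + c)                       # hidden probabilities
        h0_s = (rng.random(h0.shape) < h0).astype(float)  # sampled hidden states
        v1 = sigmoid(h0_s @ W.T + b)                   # reconstruction
        h1 = sigmoid(v1 @ W + c)
        # CD-1 gradient step on weights and biases
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
        b += lr * (v0 - v1).mean(axis=0)
        c += lr * (h0 - h1).mean(axis=0)
    return W, b, c

def reconstruction_error(data, W, b, c):
    """Mean squared reconstruction error of `data` under the trained RBM."""
    h = sigmoid(data @ W + c)
    v = sigmoid(h @ W.T + b)
    return float(np.mean((data - v) ** 2))

# Hypothetical usage: D1, D2 stand in for tuples sampled from two MDPs.
D1 = rng.random((500, 5))
D2 = rng.random((500, 5))
W, b, c = train_rbm(D1)            # step 2: train on the first data set
dist = reconstruction_error(D2, W, b, c)  # steps 3-4: smaller => more similar
```

The key design point is that the RBM is only ever trained on the first MDP's data; the second MDP's data is purely held out, so the error reflects how well one MDP's dynamics model explains the other's transitions.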
  • Experiments:
    1. Use 3 different environments: Inverted Pendulum (IP), Cart Pole (CP), and Mountain Car (MC). MDP comparisons are made within each environment.
    2. By varying physical parameters (inertia and damping constant for IP; length and damping constant for CP; mass for MC), generate 50 MDPs per environment and sample 5000 tuples from each MDP.
    3. Compare all pairs within each 50-MDP group and compute the error.
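The experimental protocol has a simple pairwise structure, sketched below. Everything here is a placeholder: `sample_mdp` is a hypothetical stand-in for simulating one environment with a varied physical parameter, and `mdp_distance` uses a crude difference of empirical means in place of the paper's RBM reconstruction error, just to show the shape of the comparison loop (5 MDPs instead of the paper's 50, for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mdp(param, n_tuples=5000, dim=5):
    """Hypothetical stand-in for sampling (s, a, s') tuples from one MDP
    whose dynamics depend on `param` (e.g. pendulum inertia)."""
    return param * rng.random((n_tuples, dim))

def mdp_distance(D_train, D_eval):
    """Placeholder for the RBM reconstruction error: a crude difference
    of empirical feature means, just to show the loop structure."""
    return float(np.linalg.norm(D_train.mean(axis=0) - D_eval.mean(axis=0)))

# Generate a family of MDPs by varying one physical parameter,
# then compare every pair within the group.
params = np.linspace(0.5, 1.5, 5)
datasets = [sample_mdp(p) for p in params]
errors = np.zeros((len(datasets), len(datasets)))
for i, Di in enumerate(datasets):
    for j, Dj in enumerate(datasets):
        errors[i, j] = mdp_distance(Di, Dj)
```

The resulting error matrix is what the paper's result section analyzes: nearby parameter settings should yield small entries, and clusters in this matrix correspond to qualitatively different dynamics.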
  • Result:
    1. The error can be used to cluster MDPs into different dynamical phases of an environment (oscillatory, damped, critically damped).
    2. It predicts transfer performance: applying a policy trained on a reference MDP to another MDP yields a higher average return when the error between the two MDPs is smaller.
    3. The measure is not useful for transferring a trained policy across environments (e.g., from IP to CP).