Paper reference
Paper1: Towards Resolving Unidentifiability in Inverse Reinforcement Learning HERE
Paper2: Identifiability in inverse reinforcement learning HERE
Paper3: Identifiability and generalizability from multiple experts in Inverse Reinforcement Learning HERE
These papers are quite theoretical and not easy to read, but, at least for me, they reveal something about generalization.
Preliminaries: IRL & Identifiability
IRL, as a subset of Imitation Learning, aims to recover the reward function of a certain MDP, given the reward-free environment $E$ and an optimal agent policy $\pi$. The goal is to deduce a reward function $R$ such that the policy $\pi$ is optimal in the MDP $(E,R)$.
However, there usually exist infinitely many reward functions that meet this requirement. In other words, it is impossible to recover the exact, original reward function. This is called the unidentifiability issue. We do not just want to learn a reward that makes an agent imitate the expert in the current environment; we want to predict the expert's actions even after the environment changes.
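To make this concrete, here is a tiny sketch of my own (not taken from the papers): a hypothetical 2-state, 2-action MDP in which a second reward function, obtained from the first by potential-based shaping, induces exactly the same optimal policy, so behaviour alone cannot tell the two rewards apart.

```python
import numpy as np

# Toy example (mine, not from the papers): two different reward functions
# on the same 2-state, 2-action MDP that induce the same optimal behaviour.
gamma = 0.9
n_states, n_actions = 2, 2

# Transition probabilities T[s, a, s'].
T = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.9, 0.1]],
])

def optimal_policy(R):
    """Greedy optimal policy from standard value iteration on rewards R[s, a]."""
    V = np.zeros(n_states)
    for _ in range(2000):
        Q = R + gamma * T @ V          # Q[s, a] = R[s, a] + gamma * E[V(s')]
        V = Q.max(axis=1)
    return (R + gamma * T @ V).argmax(axis=1)

R1 = np.array([[1.0, 0.0],
               [0.0, 2.0]])

# Potential-based shaping: R2(s, a) = R1(s, a) + gamma * E[Phi(s')] - Phi(s).
# R2 is a genuinely different function, yet it leaves the optimal policy unchanged.
Phi = np.array([5.0, -3.0])
R2 = R1 + gamma * (T @ Phi) - Phi[:, None]

print(optimal_policy(R1), optimal_policy(R2))   # -> [0 1] [0 1], the same policy
```

A learner watching only the optimal behaviour in this environment has no way to decide whether the expert is optimizing R1 or R2.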
Progress made
Paper1 made several steps towards resolving the issue.
Paper1 first separates unidentifiability into 3 classes: 1) the trivial reward function, which assigns the same reward to every state-action pair; 2) reward functions that are behaviorally invariant under certain arithmetic operations, say rescaling; 3) cases where the observed behavior is not sufficient to distinguish two reward functions. The authors call 1) & 2) representational unidentifiability and 3) experimental unidentifiability.
Paper1 then points out that experimental unidentifiability is avoidable, and finally sets up a “richer model” for IRL in which the learner can observe the agent behaving optimally in a number of environments of the learner’s choice.
Later, at NeurIPS 2021, Paper2 made further progress. It reaches several conclusions: 1) the reward can be fully determined given both the optimal policy and the value function, but the optimal policy alone gives us no direct information about the value function; 2) given knowledge of the optimal policy under two different discount rates, or under sufficiently different transition laws, we can uniquely identify the reward (up to a constant shift).
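The “up to a constant shift” part can be seen with a one-line calculation (a standard argument I am adding for completeness, not a reproduction of Paper2’s proof). Shifting the reward by a constant $c$ shifts every value by the same geometric series:

$$r'(s,a)=r(s,a)+c \;\Longrightarrow\; V'(s)=V(s)+\frac{c}{1-\gamma},\qquad Q'(s,a)=Q(s,a)+\frac{c}{1-\gamma},$$
$$\text{so that}\quad Q'(s,a)-V'(s)=Q(s,a)-V(s)\quad\text{for every }(s,a).$$

Both the classical greedy policy and the entropy-regularized softmax policy depend on the reward only through these differences, so they are unchanged, and no amount of behavioural data can pin down the constant $c$.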
The entropy-regularized MDP, a common variation of the classic MDP, is closely tied to the IRL setting. Whereas a classic MDP aims to maximize the sum of discounted rewards, the entropy-regularized version maximizes an entropy-regularized sum:
$$V^{\pi}_{\lambda}(s):=E^{\pi}_s\Big[\sum^{\infty}_{t=0}\gamma^t \big(r(s_t, a_t)+\lambda\mathcal{H}(\pi(\cdot|s_t))\big)\Big]$$
where $\mathcal{H}(\pi(\cdot|s_t))=-\sum_{a\in A}\pi(a|s_t)\log\pi(a|s_t)$ is the entropy of $\pi(\cdot|s_t)$. Through a series of derivations, Paper2 rewrites the optimal policy as:
$$\pi^{optimal}_{\lambda}(a|s)=\exp\big(\big( Q^{\pi^{optimal}}_{\lambda}(s,a) - V^{optimal}_{\lambda}(s)\big)/\lambda\big)$$
$$\pi^{optimal}_{\lambda}(a|s)=\exp\big(\big(f(s,a)+\gamma\, E_{s_1\sim T(\cdot|s,a)}\big[V^{optimal}_{\lambda}(s_1)\big]-V^{optimal}_{\lambda}(s)\big)/\lambda\big)$$
where $f(s,a)$ denotes the reward.
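Before going through the observations Paper2 draws from these formulas, here is a small numerical sanity check of my own (the random tabular MDP, the soft value iteration loop, and all variable names are my assumptions, not Paper2’s code): it computes $V^{optimal}_{\lambda}$ by iterating the soft Bellman backup, forms the softmax policy above, and then recovers $f(s,a)$ exactly from the pair $(\pi^{optimal}_{\lambda}, V^{optimal}_{\lambda})$, which is conclusion 1) of Paper2.

```python
import numpy as np

# Toy sanity check (mine, not from Paper2): recover the reward from (policy, value).
np.random.seed(0)
n_states, n_actions, gamma, lam = 3, 2, 0.9, 0.5

T = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
f = np.random.randn(n_states, n_actions)                                # reward f[s, a]

# Soft value iteration: V(s) = lam * log sum_a exp((f(s,a) + gamma * E[V(s_1)]) / lam)
V = np.zeros(n_states)
for _ in range(2000):
    Q = f + gamma * T @ V                            # Q[s, a] = f(s,a) + gamma * E[V(s_1)]
    V = lam * np.logaddexp.reduce(Q / lam, axis=1)   # soft (log-sum-exp) backup
Q = f + gamma * T @ V

# Softmax-optimal policy, as in the displayed formula.
pi = np.exp((Q - V[:, None]) / lam)                  # pi[s, a], rows sum to 1

# Conclusion 1): knowing both pi and V pins down the reward,
# f(s, a) = lam * log pi(a|s) + V(s) - gamma * E[V(s_1)]
f_recovered = lam * np.log(pi) + V[:, None] - gamma * (T @ V)
print(np.max(np.abs(f_recovered - f)))               # ~0, up to numerical precision
```

If only the policy were observed, $V^{optimal}_{\lambda}$ would be unknown and the same equation would leave the reward undetermined, which is the other half of conclusion 1).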
These formulas lead Paper2 to the following observations:
- The optimal policy will select every action in $A$ with some positive probability
- If $\lambda$ is increased, this has the effect of ‘flattening out’ the choice of actions
- Conversely, sending $\lambda \rightarrow 0$ will result in a true maximizer being chosen, and the regularized problem degenerates to the classical one (a small numerical check of these last two points follows below)
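A quick numerical illustration of the last two bullet points (again a toy sketch of my own under the same assumptions as before, not code from the papers): sweeping $\lambda$ for a fixed state shows the action distribution flattening as $\lambda$ grows and collapsing onto the argmax action as $\lambda \rightarrow 0$.

```python
import numpy as np

# Toy illustration of how the softmax-optimal policy depends on lambda.
np.random.seed(1)
n_states, n_actions, gamma = 3, 4, 0.9

T = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
f = np.random.randn(n_states, n_actions)                                # reward f[s, a]

def soft_optimal_policy(lam, iters=2000):
    """Softmax-optimal policy of the entropy-regularized MDP at temperature lam."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = f + gamma * T @ V
        V = lam * np.logaddexp.reduce(Q / lam, axis=1)   # soft Bellman backup
    Q = f + gamma * T @ V
    return np.exp((Q - V[:, None]) / lam)                # pi[s, a]

for lam in [5.0, 0.5, 0.01]:
    print(lam, np.round(soft_optimal_policy(lam)[0], 3))
# Larger lam flattens the distribution over actions (all probabilities stay positive).
# lam -> 0 puts essentially all mass on the argmax action, i.e. the classical greedy choice.
```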