From DAgger to HG-DAgger and more recent advances
DAgger
Dataset Aggregation (DAgger) is an imitation learning algorithm proposed in the AISTATS 2011 paper A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning by Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. It is a simple yet effective algorithm that has been widely used in imitation learning, and as you can tell from the title, it’s not related to human-in-the-loop RL.
How does DAgger work?
DAgger is simple and efficient, so it is helpful to first walk through the whole learning process.
First, we have two policies, which we call the novice policy $\pi_n$ and the expert policy $\pi^\star$. We also have a dataset $\mathcal{D}$ that stores state-action pairs.
At the very beginning of DAgger (Round 1), we initialize an empty $\mathcal{D}$ and our novice policy, which is now called $\pi_{n,1}$ since we are in the first round. Then we collect data by running a mixed policy $\pi_1$ in the environment, where $\pi_i$ is a convex combination of the expert $\pi^\star$ and the novice $\pi_{n,i}$:
$$ \pi_i(a|s)=\beta_i\pi^\star(a|s)+(1-\beta_i)\pi_{n,i}(a|s) $$
where $\beta_i$ is a hyperparameter that controls the mixture ratio. The expert’s share decays as the rounds go on, e.g. $\beta_i=p^{i-1}$. Another option, $\beta_i=\mathbb{I}(i=1)$ (use the expert only in the first round), also works well.
After sampling $T$ steps, we obtain a dataset $\mathcal{D}_1$ and label it with the expert policy: $\mathcal{D}_1=\{(s,\pi^\star(s))\}$. The newly labelled dataset is aggregated into the base dataset by $\mathcal{D}=\mathcal{D}\cup\mathcal{D}_1$. Finally, we train a new novice policy $\pi_{n,2}$ on the aggregated dataset $\mathcal{D}$ via behavioral cloning or other supervised methods. We keep running rounds like this until the novice policy is good enough.
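To make the loop concrete, here is a minimal Python sketch of DAgger. All interfaces are assumptions for illustration: `env.reset()`/`env.step(a)` are assumed to return a state (and a done flag), the policies map a state to an action, and `train` fits a new novice by behavioral cloning on `(state, expert_action)` pairs.

```python
import random

def dagger(env, novice, expert, train, n_rounds=10, horizon=1000, p=0.5):
    """Minimal DAgger loop (sketch, assumed interfaces)."""
    dataset = []                              # the aggregated dataset D
    for i in range(1, n_rounds + 1):
        beta = p ** (i - 1)                   # expert's share, decays per round
        s = env.reset()
        for _ in range(horizon):
            a_exp = expert(s)                 # query the expert label for this state
            # sample from the mixed policy: expert w.p. beta, novice otherwise
            a = a_exp if random.random() < beta else novice(s)
            dataset.append((s, a_exp))        # every visited state gets an expert label
            s, done = env.step(a)
            if done:
                s = env.reset()
        novice = train(dataset)               # behavioral cloning on the aggregated D
    return novice
```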
Why is DAgger good?
Imitation learning is fundamentally a sequential decision-making problem: the data we collect depends on the policy we run. At the time DAgger was proposed, however, most imitation learning algorithms relied on plain supervised learning, which ignores this sequential nature and handles distribution shift poorly. Classification errors accumulate at every step and compound into a large overall error. As shown in the paper, DAgger keeps the expected number of mistakes linear in the task horizon $T$ and the classification loss (cost) $\epsilon$, whereas naive supervised learning compounds quadratically.
A naive supervised learning approach,
$$ \hat\pi = \arg\min_{\pi \in \Pi}\mathbb{E}_ {s\sim d_{\pi^\star}}[l(s,\pi)] $$
ignores the shifting distribution of the sampled data. Theorem 2.1 in the paper by Ross and Bagnell provides a tight bound on the resulting cost:
$$ \text{Let } \mathbb{E}_ {s\sim d_{\pi^\star}}[l(s,\pi)]=\epsilon, \text{then } J(\pi)\leq J(\pi^\star)+T^2\epsilon $$
DAgger, unlike these supervised methods, leverages online learning to obtain a no-regret guarantee.
A no-regret algorithm is one that produces a sequence of policies $\pi_1,\pi_2,\dots,\pi_N$ such that the average regret with respect to the best policy in hindsight goes to 0 as $N$ goes to $\infty$.
Note that the full analysis of DAgger is beyond the scope of this blog, but you are encouraged to read the original paper linked here; it is well-written and easy to follow.
HG-DAgger
HG-DAgger is a human-in-the-loop imitation learning algorithm proposed in HG-DAgger: Interactive Imitation Learning with Human Experts by Michael Kelly, Chelsea Sidrane, Katherine Driggs-Campbell, and Mykel J. Kochenderfer. HG-DAgger is an important reference point for interactive imitation learning (IIL), as many later IIL algorithms resemble it. It works particularly well in robotics and is easy to implement.
Human Gated
Human-gated control is a simple idea: when the agent is taking, or is about to take, a dangerous action, the human intervenes and takes control. The agent then learns from the human’s actions. This idea recurs in the following sections.
How does HG-DAgger work?
HG-DAgger is a simple algorithm that is easy to implement. It is a combination of DAgger and human-gated control. Like DAgger, it assumes two policies: a novice policy $\pi_n$ and an expert policy $\pi^\star$.
At the very beginning, a base dataset $\mathcal{D}$ is initialized (typically from expert demonstrations) and a novice policy $\pi_{n,1}$ is trained on it. Then we run the novice policy in the environment with human gating: during the rollout, whenever the human observes undesired behavior, they intervene and take control. The newly collected intervention data is aggregated into the base dataset $\mathcal{D}$, and a new novice policy $\pi_{n,2}$ is trained on the bigger dataset. The process is repeated until the novice policy is good enough.
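A minimal sketch of this loop, assuming a hypothetical `human(s, a_n)` interface that reports whether the human takes over and, if so, which action they apply; the environment interface is the same assumption as in the DAgger sketch above.

```python
def hg_dagger(env, novice, human, train, n_rounds=10, horizon=1000):
    """Minimal HG-DAgger sketch (assumed interfaces).
    `human(s, a_n)` returns (intervene: bool, a_h): whether the human takes
    over and, if so, the human action. Only intervention data is aggregated."""
    dataset = []                           # assumed to be pre-filled with demonstrations
    for _ in range(n_rounds):
        s = env.reset()
        for _ in range(horizon):
            a_n = novice(s)
            intervene, a_h = human(s, a_n)
            if intervene:
                dataset.append((s, a_h))   # learn only from human takeovers
                a = a_h
            else:
                a = a_n
            s, done = env.step(a)
            if done:
                s = env.reset()
        novice = train(dataset)            # behavioral cloning on the aggregated data
    return novice
```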
IWR & SIRIUS: HG-DAgger Improved with Reweighting
There are two papers with a similar idea that improve HG-DAgger with data reweighting. The first is IWR, proposed in Human-in-the-Loop Imitation Learning using Remote Teleoperation by Ajay Mandlekar et al. The second is SIRIUS, proposed in Robot Learning on the Job: Human-in-the-Loop Autonomy and Learning During Deployment by Huihan Liu et al.
The key idea shared by these works is that data collected during a rollout is not all equally valuable. The state-action pairs generated by the policy itself are on-distribution and thus regularize the policy (so it does not drift away wildly while following the human), while human interventions are a more precious resource that should be fully exploited.
The intuition behind IWR is that humans usually take over control near bottleneck situations. IWR therefore balances samples from the human and the agent with equal weight: during training, 50% of the data is generated by the policy itself and serves as a regularization term, while the other 50% is generated by the human, so the agent learns more about the hard parts of the task.
In SIRIUS, the reweighting is designed more carefully: samples are binned into human demonstration, pre-intervention, intervention, and robot action. The data is then reweighted according to its value, assigning higher weights to good data and lower weights to bad data; for example, pre-intervention data is considered bad because the robot was about to fail.
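To illustrate how such reweighting can enter training, here is a sketch of a weighted behavioral cloning loss in PyTorch. The category weights and the `policy.log_prob(states, actions)` interface are illustrative placeholders, not the exact values or APIs used in IWR or SIRIUS.

```python
import torch

# Illustrative weights per data category (placeholders, not the papers' values):
# intervention data is up-weighted, pre-intervention data is down-weighted.
CATEGORY_WEIGHTS = {"demo": 1.0, "intervention": 2.0, "robot": 1.0, "pre_intervention": 0.1}

def weighted_bc_loss(policy, states, actions, categories):
    """Weighted behavioral cloning: each sample's negative log-likelihood is
    scaled by the weight of its category before averaging."""
    log_probs = policy.log_prob(states, actions)            # assumed policy API
    weights = torch.tensor([CATEGORY_WEIGHTS[c] for c in categories])
    return -(weights * log_probs).mean()
```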
HACO
HACO was first proposed in the ICLR 2022 paper Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization by Quanyi Li, Zhenghao Peng and Bolei Zhou. It follows a similar idea to the HG-DAgger family in that the human actively decides whether or not to intervene. However, it uses an actor-critic framework, which distinguishes it from the HG-DAgger family and achieves better performance.
Overview
So how does HACO work?
As before, we denote the agent’s novice policy as $\pi_n(a_n|s)$ and the human intervention indicator as a Boolean $I(s,a_n)$, which equals 1 when the human intervenes. The safe action applied to the environment is $\hat{a}=I(s,a_n)a_h+(1-I(s,a_n))a_n$, where $a_h$ is the human action. The overall policy is
$$ \pi(a|s)=(1-I(s,a_n))\pi_n(a|s)+\pi_h(a|s)G(s) $$
where $G(s)$ is the expected value of the human intervention indicator $I(s,a_n)$, i.e., the probability that the agent chooses an action that will be rejected by the human.
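To make the shared-control rollout concrete, here is a small sketch of a single HACO-style environment step; the `human(s, a_n)` and `env.step` interfaces are assumptions for illustration, and the buffer simply stores the tuples used by the objectives below.

```python
def haco_step(env, s, novice, human, buffer):
    """One environment step under HACO-style shared control (sketch).
    `human(s, a_n)` is an assumed interface returning (intervened: bool, a_h)."""
    a_n = novice(s)
    intervened, a_h = human(s, a_n)
    a_hat = a_h if intervened else a_n       # safe action actually executed
    s_next, done = env.step(a_hat)           # HACO needs no environment reward
    buffer.append((s, a_n, a_h, intervened, s_next, done))
    return s_next, done
```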
The HACO paper states three objectives:
- Proxy Value: The agent should maximize a proxy value that reflects the human’s intention.
- Exhaustiveness: The agent should explore the state space as much as possible, within what the human permits.
- Automation: The agent should learn to automate the task as much as possible, reducing human burden.
The three objectives introduce different optimization targets, which are finally combined as:
$$ \max_{\pi}\mathbb{E}[Q(s,a) + \mathcal{H}(\pi) - Q^I(s,a)] $$
We will detail the 3 objectives in the following sections.
Proxy Value
HACO operates in a reward-free setting, which makes it closer to an imitation learning algorithm. The proxy value is learned by conservative Q-learning, optimizing the objective:
$$ \min_{\phi}\mathbb{E}_{(s_t,a_n,a_h,I(s,a_n))\sim \mathcal{B}}[I(s,a_n)(Q(s,a_n;\phi)-Q(s,a_h;\phi))] $$
In plain language, the objective forces the Q-function to assign a higher value to the human action than to the novice action whenever human intervention is triggered.
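A minimal PyTorch sketch of this objective, assuming a state-action critic `q_net(s, a)` that returns a batch of Q-values and an `intervened` tensor holding $I(s,a_n)$ as 0/1 floats:

```python
import torch

def proxy_value_loss(q_net, s, a_novice, a_human, intervened):
    """Proxy-value objective (sketch): when the human intervened, push
    Q(s, a_human) above Q(s, a_novice)."""
    gap = q_net(s, a_novice) - q_net(s, a_human)
    return (intervened * gap).mean()
```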
Exhaustiveness (Entropy Regularization)
The agent explores and learns within a human-permitted subspace. However, as we saw in the Proxy Value section, high-value states might never be encountered during autonomous sampling. In other words, without being encouraged, the agent is unlikely to visit the states that evoke high-value state-action pairs. The HACO paper argues that this hinders the propagation of the learned proxy value.
To tackle this, HACO introduces an entropy regularization term to encourage the agent to explore more states:
$$ \min_{\phi}\mathbb{E}\big[\big(y-Q(s_t,\hat{a}_t;\phi)\big)^2\big] $$
$$ y=\gamma \mathbb{E}[Q(s_{t+1},a';\phi')-\alpha\log\pi_n(a'|s_{t+1})] $$ where $s_t,\hat{a}_t, s_{t+1}$ are sampled from $\mathcal{B}$ and $a'\sim\pi_n(\cdot|s_{t+1})$. Notice that $\phi'$ denotes the parameters of the target network for $\phi$.
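A PyTorch sketch of this reward-free, entropy-regularized TD objective; the `policy.sample(s)` interface returning an action and its log-probability is an assumption:

```python
import torch

def td_entropy_loss(q_net, target_q_net, policy, s, a_hat, s_next, alpha=0.2, gamma=0.99):
    """Reward-free TD objective with entropy regularization (sketch).
    The target uses the target network and an action resampled from the
    novice policy at the next state."""
    with torch.no_grad():
        a_next, log_prob = policy.sample(s_next)      # assumed policy API
        y = gamma * (target_q_net(s_next, a_next) - alpha * log_prob)
    return ((y - q_net(s, a_hat)) ** 2).mean()
```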
Automation (Reduce Human Intervention)
In HACO’s framework, the human always helps the agent when it gets into trouble. The agent could therefore abuse the intervention mechanism by deliberately getting into trouble to obtain the human’s help. To prevent this, HACO introduces a term that penalizes the agent for triggering human intervention.
If a step triggers an intervention, say step $t$ where $I(s_{t-1},a_{n,t-1})=0$ and $I(s_t,a_{n,t})=1$, we add a cost: $$ C(s,a_n)=1-\frac{a_n^\top a_h}{\lVert a_n\rVert\,\lVert a_h\rVert} $$
The cost is the cosine distance between the agent’s action and the human’s action. The intuition is that if the agent’s action at the moment of intervention is almost the same as the human’s, it should not be punished much.
The intervention penalty is learned separately by an auxiliary Q-function, namely $Q^I(s,a)$. Its update target is:
$$ Q^I(s_t,a_{n,t})=C(s_t,a_{n,t})+ \gamma \mathbb{E}[Q^I(s_{t+1},a_{t+1})] $$
$$ s_{t+1}\sim \mathcal{B}, a_{t+1}\sim\pi_n(\cdot|s_{t+1}) $$
which is very similar to a SARSA update.
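Putting the cosine cost and the auxiliary critic together, a hedged PyTorch sketch could look like this; the batching convention and the `policy.sample` interface are assumptions:

```python
import torch
import torch.nn.functional as F

def intervention_cost(a_novice, a_human):
    """Cosine-distance cost between the novice and human actions (batched)."""
    return 1.0 - F.cosine_similarity(a_novice, a_human, dim=-1)

def q_intervention_loss(qi_net, target_qi_net, policy, s, a_novice, a_human,
                        new_intervention, s_next, gamma=0.99):
    """SARSA-like update for the auxiliary intervention-cost critic (sketch).
    The cost is nonzero only on steps where an intervention starts;
    `new_intervention` is a 0/1 float tensor marking those steps."""
    cost = new_intervention * intervention_cost(a_novice, a_human)
    with torch.no_grad():
        a_next, _ = policy.sample(s_next)                 # assumed policy API
        target = cost + gamma * target_qi_net(s_next, a_next)
    return ((qi_net(s, a_novice) - target) ** 2).mean()
```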
Policy Learning
The policy is learned with a conventional actor-critic framework, using HACO’s carefully designed Q-functions: $$ \max_{\theta}\mathbb{E}_{s_t\sim\mathcal{B}}[Q(s_t,a_n)-\alpha\log\pi_n(a_n|s_t;\theta)-Q^I(s_t,a_n)] $$ where $a_n\sim\pi_n(\cdot|s_t;\theta)$.
Notice that $Q^I$ enters with a negative sign, as it represents a cost rather than a reward.
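A PyTorch sketch of the corresponding actor loss (negated so it can be minimized with a standard optimizer); the reparameterized `policy.rsample(s)` interface is an assumption:

```python
import torch

def policy_loss(q_net, qi_net, policy, s, alpha=0.2):
    """HACO-style actor objective (sketch): maximize the proxy value plus
    entropy, minus the intervention-cost critic."""
    a_n, log_prob = policy.rsample(s)                 # assumed reparameterized sample
    objective = q_net(s, a_n) - alpha * log_prob - qi_net(s, a_n)
    return -objective.mean()
```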
PVP
PVP, proposed in the NeurIPS 2023 paper Reward-free Policy Learning through Active Human Involvement by Zhenghao Peng, Wenjie Mo, Chenda Duan, Quanyi Li, and Bolei Zhou, is one of the latest advances in the IIL family, published around the same time as SIRIUS (RSS 2023). It is a simplified version of HACO, with a Q-function that learns directly from intervention data. PVP has two key components: proxy value and balanced sampling. The overall architecture is a DQN for discrete action spaces and a TD3 for continuous action spaces.
Proxy Value
HACO, as we have seen, uses a conservative Q-function to learn the proxy value. PVP, on the other hand, takes a more direct approach: it uses a Q-function to learn the value of human intervention, assigning +1 to the human action and -1 to the agent action during intervention. The objective for the Q-function on intervention data is: $$ \mathcal{L}_{PVP} = MSE(Q(s,a_h),1) + MSE(Q(s,a_n),-1) $$
During the non-intervention phase, the Q-function simply performs TD learning with zero reward: $$ \mathcal{L}_{TD} = MSE(Q(s,a_n),r+\gamma Q(s',\pi(s'))),\quad r=0 $$
Notice that this Q-function does not satisfy the Bellman equation, and thus is not a conventional Q-function in the RL sense. Instead, it is more of a good-or-bad indicator, since $\mathcal{L}_{PVP}$ can be decomposed as: $$ \begin{aligned} &MSE(Q(s,a_h),1) + MSE(Q(s,a_n),-1)\newline &= (Q(s,a_h)-1)^2 + (Q(s,a_n)-(-1))^2 \newline &= Q(s,a_h)^2 + Q(s,a_n)^2 -2 (Q(s,a_h) - Q(s,a_n)) +2 \end{aligned} $$ where the middle term $Q(s,a_h) - Q(s,a_n)$ forces Q to assign higher values to human actions than to novice actions.
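The two losses side by side in a PyTorch sketch; the batch layout and the `policy(s_next)` call returning a greedy next action are assumptions:

```python
import torch
import torch.nn.functional as F

def pvp_losses(q_net, target_q_net, policy, human_batch, novice_batch, gamma=0.99):
    """PVP objectives (sketch): the proxy-value loss on intervention data and
    plain TD learning with zero reward on non-intervention data. Batches are
    assumed to be dicts of tensors."""
    # Proxy-value loss on intervention samples: Q(s, a_h) -> +1, Q(s, a_n) -> -1.
    s, a_h, a_n = human_batch["s"], human_batch["a_human"], human_batch["a_novice"]
    q_h, q_n = q_net(s, a_h), q_net(s, a_n)
    loss_pvp = F.mse_loss(q_h, torch.ones_like(q_h)) + F.mse_loss(q_n, -torch.ones_like(q_n))

    # TD loss on agent samples with the reward fixed to zero.
    s, a, s_next = novice_batch["s"], novice_batch["a"], novice_batch["s_next"]
    with torch.no_grad():
        target = gamma * target_q_net(s_next, policy(s_next))   # r = 0
    loss_td = F.mse_loss(q_net(s, a), target)
    return loss_pvp + loss_td
```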
Balanced Sampling
During PVP training, we maintain two buffers: a novice buffer and a human buffer. The novice buffer contains data generated autonomously by the policy, while the human buffer contains intervention data. We sample half of each training batch from each buffer, similar to the idea of IWR mentioned above.
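A minimal sketch of this balanced sampling, assuming the two buffers are plain Python lists of transitions:

```python
import random

def sample_balanced_batch(novice_buffer, human_buffer, batch_size=256):
    """Balanced sampling (sketch): draw half of each training batch from the
    novice buffer and half from the human (intervention) buffer, regardless
    of how many transitions each buffer actually holds."""
    half = batch_size // 2
    batch = random.choices(novice_buffer, k=half) + random.choices(human_buffer, k=half)
    random.shuffle(batch)
    return batch
```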
Perspective
I’ve implemented a few of these algorithms, and they are indeed efficient. They also serve as a nice safety mechanism: human gating keeps the agent in a reasonable region. The idea can be further unified with RLHF, language feedback, and more. However, IIL, especially the active-involvement methods, assumes that the human is able to control the robot and is quick enough to foresee danger and intervene. This is obviously a strong assumption outside of autonomous driving and easy-to-control robots.