The intervention-based imitation learning (IIL) family

Update Nov 2025: I am surprised that PI has integrated an IIL method into the $\pi^\star_{0.6}$ model, and I firmly believe that humans / end-users will be integrated into the post-post-training of robotic foundation models in some way.

In this blog, we discuss imitation learning done online with a human in the loop, from DAgger to HG-DAgger and more recent advances.

DAgger

Dataset Aggregation (DAgger) is an imitation learning algorithm proposed in the AISTATS11 paper A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning by Stéphane Ross, Geoffrey J. Gordon and J. Andrew Bagnell. It is a simple yet effective algorithm that has been widely used in imitation learning, and as you can tell from the title, it’s not related to human-in-the-loop RL.

How does DAgger work?

DAgger is simple and efficient, so it helps to first walk through the whole learning process.

First we have 2 policies, which we call novice policy $\pi_n$ and expert policy $\pi^\star$. Also we have a dataset $\mathcal{D}$ that contains some state-action pairs.

At the very beginning of DAgger (Round 1), we initialize an empty $\mathcal{D}$ and our novice policy, which we call $\pi_{n,1}$ since this is the 1st round. Then we collect data by running a mixed policy $\pi_1$ in the environment, where $\pi_1$ is a convex combination of $\pi^\star$ and $\pi_{n,1}$:

$$ \pi_1(a|s)=\beta\pi^\star(a|s)+(1-\beta)\pi_{n,1}(a|s) $$

where $\beta$ is a hyperparameter that controls the mixture ratio, i.e. the expert’s share of control. That share decays as the rounds go on, e.g. $\beta_i=p^{i-1}$ with $p\in(0,1)$, so that $\beta_1=1$: the first round is driven entirely by the expert, and control is gradually handed over to the novice. Another option, $\beta_i=\mathbb{I}(i=1)$ (expert-only on the first round, novice-only thereafter), also works well.
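The two schedules can be sketched in a few lines. This is only an illustration: the function names `beta_exponential` and `beta_indicator` and the value `p=0.5` are my own choices, not something prescribed by the paper.

```python
def beta_exponential(i: int, p: float = 0.5) -> float:
    """Expert mixing weight beta_i = p**(i-1) for round i >= 1."""
    if i < 1:
        raise ValueError("rounds are 1-indexed")
    return p ** (i - 1)


def beta_indicator(i: int) -> float:
    """beta_i = 1 on the first round and 0 on every later round."""
    if i < 1:
        raise ValueError("rounds are 1-indexed")
    return 1.0 if i == 1 else 0.0


# Expert share over the first four rounds with the illustrative p = 0.5.
schedule = [beta_exponential(i, p=0.5) for i in range(1, 5)]
```

Both schedules give the expert full control in round 1, which is what lets DAgger start from an empty dataset.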

After sampling $T$ steps, we take the visited states and label them with the expert policy, giving $\mathcal{D}_1=\lbrace(s,\pi^\star(s))\rbrace$. The newly labelled dataset is aggregated into the base dataset by $\mathcal{D}=\mathcal{D}\cup\mathcal{D}_1$. Finally, we train a new novice policy $\pi_{n,2}$ on the aggregated dataset $\mathcal{D}$ by Behavioral Cloning or other methods. We keep running new rounds like this until the novice policy is good enough.
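The whole loop fits in a short sketch. Everything below is a toy of my own making, not the setup from the paper: a 1-D chain environment, a scripted `expert`, and a tabular "behavioral cloning" step (`fit`) that takes a majority vote per state.

```python
import random
from collections import Counter, defaultdict

N_STATES, GOAL, HORIZON = 10, 7, 15  # toy chain; all values are illustrative


def expert(s):
    """Expert action: step toward GOAL, stay put once there."""
    return (GOAL > s) - (GOAL < s)


def step(s, a):
    """Deterministic chain dynamics, clipped to [0, N_STATES - 1]."""
    return min(max(s + a, 0), N_STATES - 1)


def fit(dataset):
    """Tabular stand-in for behavioral cloning: majority label per state."""
    votes = defaultdict(Counter)
    for s, a in dataset:
        votes[s][a] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, -1)  # unseen states default to action -1


def dagger(n_rounds=5, p=0.5, seed=0):
    rng = random.Random(seed)
    dataset = []                  # the aggregated dataset D
    novice = fit(dataset)         # pi_{n,1}
    for i in range(1, n_rounds + 1):
        beta = p ** (i - 1)       # expert share for this round
        s = rng.randrange(N_STATES)
        for _ in range(HORIZON):
            dataset.append((s, expert(s)))  # expert labels every visited state
            a = expert(s) if rng.random() < beta else novice(s)
            s = step(s, a)        # the mixed policy drives the rollout
        novice = fit(dataset)     # retrain on the aggregated data
    return novice, dataset


novice, dataset = dagger()
```

The point to notice is that the states come from the mixed policy while the labels always come from the expert, so the novice gets corrections on the states it actually visits.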

Why is DAgger good?

Imitation learning, fundamentally, involves the task of sequential decision making, where the data collected is generated by following a sampled policy. Nevertheless, during the era of DAgger, most imitation learning algorithms primarily employed supervised learning, a method that lacks the sequential nature required for handling distribution shifts effectively. Consequently, this approach resulted in the accumulation of classification errors at each step, ultimately leading to a substantial compounded error. As described in the paper, for a naive supervised learner these mistakes compound quadratically in the task horizon $T$ (for a per-step classification loss $\epsilon$); DAgger is what brings this down to linear in $T$.

The naive supervised learning objective,

$$ \hat\pi = \argmin_{\pi \in \Pi}\mathbb{E}_ {s\sim d_{\pi^\star}}[l(s,\pi)] $$

ignores the sampled data’s shifting distribution: the policy is fit under the expert’s state distribution $d_{\pi^\star}$, yet at deployment it visits its own induced distribution $d_{\hat\pi}$. Theorem 2.1 of the DAgger paper (Ross, Gordon & Bagnell, 2011; the quadratic bound itself goes back to Ross & Bagnell, 2010) provides a tight bound for the loss:

$$ \text{Let } \mathbb{E}_ {s\sim d_{\pi^\star}}[l(s,\pi)]=\epsilon, \text{then } J(\pi)\leq J(\pi^\star)+T^2\epsilon $$
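To get a feel for the gap, here is a worked example with illustrative numbers (not taken from the paper), assuming per-step costs in $[0,1]$. With $T=100$ and $\epsilon=0.01$:

$$ T^2\epsilon = 100^2\times 0.01 = 100, \qquad T\epsilon = 100\times 0.01 = 1 $$

A total cost gap of 100 over 100 steps is the worst case possible, so the quadratic bound says nothing useful here, while a bound that grows linearly in $T$ (up to a task-dependent constant) still does.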

DAgger, unlike these supervised methods, utilizes online-learning to provide no-regret learning.

A no-regret algorithm is an algorithm that produces a sequence of policies $\pi_1,\pi_2,\dots,\pi_N$ such that the average regret with respect to the best policy in hindsight goes to 0 as $N$ goes to $\infty$.
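Written out in the usual online-learning notation, where $\ell_i$ is the loss function presented at round $i$ and $\gamma_N$ is the average regret (both symbols are introduced here only for this definition), the condition reads:

$$ \frac{1}{N}\sum_{i=1}^{N}\ell_i(\pi_i)-\min_{\pi\in\Pi}\frac{1}{N}\sum_{i=1}^{N}\ell_i(\pi)\leq\gamma_N, \qquad \lim_{N\to\infty}\gamma_N=0 $$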

The analysis of DAgger is outside the scope of this blog, but you are encouraged to read the original paper. It’s quite well-written and easy to follow.

HG-DAgger

HG-DAgger is a human-in-the-loop imitation learning algorithm proposed in HG-DAgger: Interactive Imitation Learning with Human Experts by Michael Kelly, Chelsea Sidrane, Katherine Driggs-Campbell, and Mykel J. Kochenderfer. HG-DAgger matters for IIL because many other IIL algorithms resemble it. It is well suited to robotics and easy to implement.

Human Gated

Human-gated control is a simple idea: when the agent is taking, or is about to take, a dangerous action, the human intervenes and takes control. The agent then learns from the human’s actions. This idea is used throughout the following sections.

How does HG-DAgger work?

HG-DAgger is a simple algorithm that is easy to implement. It’s a combination of DAgger and human-gated control. Similar to DAgger, it assumes 2 policies, novice policy $\pi_n$ and expert policy $\pi^\star$.

At the very beginning, a base dataset $\mathcal{D}$ is initialized, and a novice policy $\pi_{n,1}$ is trained on it. Then we run the novice policy in the environment with human gating: during the rollout, if the human observes any undesired behavior, they intervene and take control. The state-action pairs recorded while the human is in control are aggregated into the base dataset $\mathcal{D}$, and a new novice policy $\pi_{n,2}$ is trained on the bigger dataset. The process is repeated until the novice policy is good enough.
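A gated rollout can be sketched as follows. This is a toy under my own assumptions: the callables `step`, `novice`, `human` and `wants_control` are hypothetical stand-ins, and I assume (as I read HG-DAgger) that only the pairs recorded under human control are added to $\mathcal{D}$.

```python
def gated_rollout(s0, step, novice, human, wants_control, horizon):
    """Roll out the novice; hand control to the human whenever the gate fires.

    Returns only the (state, human_action) pairs recorded under human control.
    """
    s, human_data = s0, []
    for _ in range(horizon):
        if wants_control(s):      # the human decides to take over (hypothetical gate)
            a = human(s)
            human_data.append((s, a))
        else:
            a = novice(s)
        s = step(s, a)
    return human_data


# Toy 1-D example: the novice drifts right, the human pulls back once s > 3.
data = gated_rollout(
    s0=0,
    step=lambda s, a: s + a,
    novice=lambda s: 1,
    human=lambda s: -1,
    wants_control=lambda s: s > 3,
    horizon=8,
)
```

In this toy run the human only ever acts at `s = 4`, so the new data concentrates exactly on the boundary where the novice misbehaves.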

Worth noting: beyond the human-gated data collection, HG-DAgger also learns a safety threshold on a model-uncertainty-based risk metric (using an ensemble of networks), which can be used to predict where the trained novice is reliable in the state space. This uncertainty estimate is part of what distinguishes HG-DAgger from “DAgger plus a human switch.”
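As a rough sketch of such an ensemble-based risk metric: as far as I understand the paper, the "doubt" is the L2 norm of the diagonal of the covariance of the ensemble outputs. Treat that reading, the function name `doubt`, and the threshold `TAU` below as my assumptions rather than the reference implementation.

```python
import numpy as np


def doubt(ensemble_actions: np.ndarray) -> float:
    """L2 norm of the per-dimension variance across ensemble members.

    ensemble_actions has shape (n_members, action_dim).
    """
    per_dim_var = np.var(ensemble_actions, axis=0, ddof=1)
    return float(np.linalg.norm(per_dim_var))


agree = np.array([[0.5, 0.0], [0.5, 0.0], [0.5, 0.0]])
disagree = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]])

TAU = 0.5  # hypothetical threshold; HG-DAgger learns it from intervention data
is_risky = doubt(disagree) > TAU
```

When the ensemble members agree, the doubt is near zero; where they disagree, the state is flagged as one in which the novice should not be trusted.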

One theoretical caveat for the whole human-gated family: classic DAgger’s no-regret guarantee relies on on-policy sampling from the $\beta$-mixed policy. Once a human gates control, that sampling assumption breaks, so HG-DAgger and its descendants generally trade away DAgger’s formal guarantee in exchange for safety and higher-quality labels.

IWR & SIRIUS: HG-DAgger Improved with Reweighting

There are 2 papers with a similar idea that improve HG-DAgger with data reweighting. The first one is IWR proposed in Human-in-the-Loop Imitation Learning using Remote Teleoperation by Ajay Mandlekar et al. The second one is SIRIUS proposed in Robot Learning on the Job: Human-in-the-Loop Autonomy and Learning During Deployment by Huihan Liu et al.

The key idea shared between these works is that the data has different value during the rollout. The state-action pairs generated by the policy itself are on-distribution and thus regularize the policy (so it won’t crazily drift away when following the human), while human intervention is a more precious resource that should be utilized.

In IWR, the intuition is that humans often take over control near bottleneck situations. So IWR balances the samples from human and agent by equal weights. During training, 50% of the data is generated by the policy itself, serving as a regularization term, while the other 50% is generated by the human, and the agent learns more about the hard part of the task.
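One way to realize this 50/50 split is to turn the intervention flags into per-sample sampling probabilities. The helper name `iwr_sampling_probs` and the toy flags are mine; this is a sketch of the idea, not the authors' code.

```python
import numpy as np


def iwr_sampling_probs(is_intervention) -> np.ndarray:
    """Per-sample probabilities that give each class a total mass of 0.5."""
    is_intervention = np.asarray(is_intervention, dtype=bool)
    n_int = int(is_intervention.sum())
    n_rob = int((~is_intervention).sum())
    if n_int == 0 or n_rob == 0:
        raise ValueError("need at least one sample of each class")
    return np.where(is_intervention, 0.5 / n_int, 0.5 / n_rob)


# 8 robot samples and 2 human-intervention samples.
flags = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=bool)
probs = iwr_sampling_probs(flags)

rng = np.random.default_rng(0)
batch_idx = rng.choice(len(flags), size=4, p=probs)  # indices for one minibatch
```

Here each of the two intervention samples is four times as likely to be drawn as a robot sample, even though interventions make up only 20% of the data.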

In SIRIUS, the reweighting is more carefully designed: samples are binned into human demonstration, pre-intervention, intervention and robot action. The data is then reweighted based on its value, assigning higher weights to good data and lower weights to bad data; pre-intervention samples, for example, are bad data since the robot is about to fail.
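The class-based reweighting amounts to a weighted behavioral cloning loss. In the sketch below the numbers in `CLASS_WEIGHTS` are made up purely for illustration (the paper derives its own weights), and the squared-error loss is just a convenient stand-in.

```python
import numpy as np

# Hypothetical per-class weights, chosen only for illustration.
CLASS_WEIGHTS = {
    "demo": 1.0,
    "intervention": 2.0,
    "robot": 1.0,
    "pre_intervention": 0.0,
}


def weighted_bc_loss(pred, target, classes, weights=CLASS_WEIGHTS):
    """Weighted mean of per-sample squared errors between predicted and logged actions."""
    w = np.array([weights[c] for c in classes], dtype=float)
    per_sample = np.sum((np.asarray(pred) - np.asarray(target)) ** 2, axis=-1)
    return float(np.sum(w * per_sample) / np.sum(w))


pred = np.array([[0.0], [0.0], [0.0]])
target = np.array([[1.0], [1.0], [2.0]])
classes = ["robot", "pre_intervention", "intervention"]
loss = weighted_bc_loss(pred, target, classes)
```

With these weights the pre-intervention sample contributes nothing, and the intervention sample counts twice as much as the robot sample.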

HACO

HACO was first proposed in the ICLR22 paper Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization by Quanyi Li, Zhenghao Peng and Bolei Zhou. It follows a similar idea as the HG-DAgger family in that the human can actively decide whether to intervene or not. It uses an Actor-Critic framework that distinguishes itself from the HG-DAgger family and achieves better performance. (HACO itself builds on the group’s earlier Expert Guided Policy Optimization, EGPO, CoRL 2021, where a PPO expert — rather than a human — provided the interventions.)

Overview

So how does HACO work?

Similarly, we denote the agent’s novice policy as $\pi_n(a_n|s)$ and the human intervention indicator as a boolean $I(s,a_n)$, which equals 1 when the human takes over. The safe action applied to the environment is $\hat{a}=I(s,a_n)a_h+(1-I(s,a_n))a_n$ where $a_h$ is the human action. The overall policy is

$$ \pi(a|s)=(1-I(s,a_n))\pi_n(a|s)+\pi_h(a|s)G(s) $$

where $G(s)$ is the expected value of the human intervention indicator $I(s,a_n)$, i.e. the probability of the agent choosing an action that will be rejected by the human.
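A small sketch of the switching rule and of $G(s)$, assuming $I=1$ means the human has taken over (which is what the definition of $G(s)$ implies). The gate `would_intervene` and its 0.5 steering threshold are hypothetical.

```python
import numpy as np


def applied_action(a_n, a_h, intervene: bool):
    """Action sent to the environment: the human's if they intervene, else the novice's."""
    return a_h if intervene else a_n


def estimate_G(novice_samples, would_intervene) -> float:
    """Monte Carlo estimate of G(s) = E_{a_n ~ pi_n}[I(s, a_n)] for one fixed state."""
    return float(np.mean([would_intervene(a) for a in novice_samples]))


# Hypothetical gate: the human rejects any steering command with |a| > 0.5.
samples = np.array([-0.9, -0.2, 0.1, 0.4, 0.8])  # novice actions sampled at one state
G = estimate_G(samples, lambda a: abs(a) > 0.5)
```

Two of the five sampled actions would be rejected, so the estimate of $G(s)$ at this state is 0.4.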

The HACO paper states 3 objectives:

Proxy Value: The agent should maximize a proxy value that reflects human’s intention.
Exhaustiveness: The agent should explore the state space as much as possible, under human permission.
Automation: The agent should learn to automate the task as much as possible, reducing human burden.

The 3 objectives introduce different optimization targets that finally combine as:

$$ \max_{\pi}\mathbb{E}[Q(s,a) + \mathcal{H}(\pi) - Q^I(s,a)] $$

We will detail the 3 objectives in the following sections.

Proxy Value

HACO is under a reward-free setting, which makes it more of an imitation learning algorithm. The proxy value is learnt by conservative Q-Learning, optimizing the objective:

$$ \min_{\phi}\mathbb{E}_{(s_t,a_n,a_h,I(s,a_n))\sim \mathcal{B}}[I(s,a_n)(Q(s,a_n;\phi)-Q(s,a_h;\phi))] $$

The objective, in plain language, is forcing the Q-function to assign higher value to the human action than the novice action, when human intervention is triggered.
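On a batch of scalar Q-values the objective is just a masked difference. The sketch below uses plain NumPy arrays instead of a network, and the function name `proxy_value_loss` is mine.

```python
import numpy as np


def proxy_value_loss(q_novice, q_human, intervened) -> float:
    """Batch mean of I(s, a_n) * (Q(s, a_n) - Q(s, a_h))."""
    q_novice = np.asarray(q_novice, dtype=float)
    q_human = np.asarray(q_human, dtype=float)
    mask = np.asarray(intervened, dtype=float)
    return float(np.mean(mask * (q_novice - q_human)))


loss = proxy_value_loss(
    q_novice=[1.0, 0.5, 2.0],
    q_human=[0.0, 3.0, 2.0],
    intervened=[1, 1, 0],   # the third sample has no intervention and is masked out
)
```

The loss goes down whenever $Q(s,a_h)$ rises above $Q(s,a_n)$ on intervened samples, and samples without intervention contribute nothing.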

Exhaustiveness (Entropy Regularization)

The agent explores and learns inside a human-permitted subspace. However, as we saw in the Proxy Value section, the high-value states might not be encountered during autonomous sampling. In other words, without encouragement the agent is unlikely to visit the states that yield high-value state-action pairs. The HACO paper claims that this hinders the propagation of the learnt proxy value.

To tackle this, HACO introduces an entropy regularization term to encourage the agent to explore more states:

$$ \min_{\phi}\mathbb{E}\left[(y-Q(s_t,\hat{a}_t;\phi))^2\right] $$

$$ y=\gamma \mathbb{E}[Q(s_{t+1},a';\phi')-\alpha\log\pi_n(a'|s_{t+1})] $$ where $s_t,\hat{a}_t, s_{t+1}$ are sampled from $\mathcal{B}$ and $a'\sim\pi_n(\cdot|s_{t+1})$. Notice that $\phi'$ denotes the parameters of the target network for $\phi$.
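Numerically, the target is a zero-reward soft Bellman backup. The sketch below assumes a squared-error critic loss (standard for TD critics) and uses illustrative values for `gamma` and `alpha`; the function names are mine.

```python
import numpy as np


def soft_td_target(q_next, logp_next, gamma=0.99, alpha=0.2) -> float:
    """y = gamma * mean over a' of [Q'(s', a') - alpha * log pi_n(a'|s')], with zero reward."""
    q_next = np.asarray(q_next, dtype=float)
    logp_next = np.asarray(logp_next, dtype=float)
    return float(gamma * np.mean(q_next - alpha * logp_next))


def critic_loss(q_pred: float, y: float) -> float:
    """Squared TD error for a single transition."""
    return float((y - q_pred) ** 2)


# Two sampled next actions a' with target-network values and log-probabilities.
y = soft_td_target(q_next=[1.0, 3.0], logp_next=[-1.0, -1.0], gamma=0.5, alpha=0.5)
loss = critic_loss(q_pred=1.0, y=y)
```

The `- alpha * logp_next` term raises the target for low-probability actions, which is what pushes the novice toward broader exploration.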

Automation (Reduce Human Intervention)

In HACO’s framework, the human always helps the agent when it’s getting into trouble. In this manner, the agent can possibly abuse the intervention, e.g. deliberately getting into trouble to get the human’s help. To prevent this, HACO introduces a term to penalize the agent for triggering human intervention.

If step $t$ induces an intervention, i.e. $I(s_{t-1},a_{n,t-1})=0$ and $I(s_t,a_{n,t})=1$, we add a cost: $$ C(s,a_n)=1-\frac{a_n^\top a_h}{\|a_n\|\,\|a_h\|} $$

The cost is the cosine distance between the agent’s action and the human’s action. The intuition is that if the intervened agent acts almost the same as the human, it is barely penalized.
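Here is a minimal sketch of the cost and of picking out the steps where an intervention starts. The `eps` guard against zero-norm actions and the helper `onset_mask` are my additions, not part of the formula above.

```python
import numpy as np


def intervention_cost(a_n, a_h, eps=1e-8) -> float:
    """Cosine distance 1 - cos(a_n, a_h); eps guards against zero-norm actions."""
    a_n = np.asarray(a_n, dtype=float)
    a_h = np.asarray(a_h, dtype=float)
    denom = np.linalg.norm(a_n) * np.linalg.norm(a_h) + eps
    return float(1.0 - (a_n @ a_h) / denom)


def onset_mask(intervened) -> np.ndarray:
    """True only at steps where an intervention starts (the previous step had none)."""
    flags = np.asarray(intervened, dtype=bool)
    prev = np.concatenate(([False], flags[:-1]))
    return flags & ~prev


same = intervention_cost([1.0, 0.0], [1.0, 0.0])       # aligned actions
orthogonal = intervention_cost([1.0, 0.0], [0.0, 1.0])
opposite = intervention_cost([1.0, 0.0], [-1.0, 0.0])
onsets = onset_mask([0, 1, 1, 0, 1])
```

The cost ranges from about 0 (the novice already agrees with the human) to about 2 (the novice wanted the exact opposite), and it is only charged at the onset steps.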

The intervention penalty is separately learnt by an auxiliary Q-function, namely $Q^I(s,a)$. The objective is:

$$ Q^I(s_t,a_{n,t})=C(s_t,a_{n,t})+ \gamma \mathbb{E}[Q^I(s_{t+1},a_{t+1})] $$

$$ s_{t+1}\sim \mathcal{B}, a_{t+1}\sim\pi_n(\cdot|s_{t+1}) $$

which is very similar to a SARSA update.

Policy Learning

The policy is learnt by a conventional actor-critic framework, with HACO’s carefully designed Q-function: $$ \max_{\theta}\mathbb{E}_{s_t\sim\mathcal{B}}[Q(s_t,a_n)-\alpha\log\pi_n(a_n|s_t;\theta)-Q^I(s_t,a_n)] $$

Notice that $Q^I$ enters with a negative sign because it represents a cost rather than a reward.
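To see how the three terms trade off, here is the objective evaluated on plain numbers. This is a sketch with scalar stand-ins for the networks; `actor_objective` and the value of `alpha` are illustrative.

```python
import numpy as np


def actor_objective(q, logp, q_int, alpha=0.2) -> float:
    """Batch mean of Q(s, a_n) - alpha * log pi_n(a_n|s) - Q^I(s, a_n), to be maximized."""
    q, logp, q_int = (np.asarray(x, dtype=float) for x in (q, logp, q_int))
    return float(np.mean(q - alpha * logp - q_int))


# Same proxy value and entropy bonus, different expected intervention cost.
safe = actor_objective(q=[1.0], logp=[-1.0], q_int=[0.0], alpha=0.5)
risky = actor_objective(q=[1.0], logp=[-1.0], q_int=[2.0], alpha=0.5)
```

An action that is expected to trigger interventions scores lower even when its proxy value is identical, which is the automation objective at work.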

PVP

PVP, proposed in the NeurIPS23 paper Learning from Active Human Involvement through Proxy Value Propagation by Zhenghao Peng, Wenjie Mo, Chenda Duan, Quanyi Li, Bolei Zhou, is the latest advance in the IIL family, with a similar publication time as SIRIUS (RSS23). It’s a simplified version of HACO, with a Q-function directly learning from intervention data. PVP involves 2 key components: Proxy Value and Balanced Sampling. The overall structure for PVP is a DQN for discrete action spaces and a TD3 for continuous action spaces.

Proxy Value

HACO, as we have seen, uses a conservative Q-function to learn the proxy value. PVP, on the other hand, uses a more direct approach. It uses a Q-function to learn the value of human intervention, assigning +1 to the human action and -1 to the agent action during the intervention. The objective for the Q-function on intervention data is: $$ \mathcal{L}_{PVP} = MSE(Q(s,a_h),1) + MSE(Q(s,a_n),-1) $$

And during the non-intervention phase, the Q-function simply does TD-learning with 0 reward: $$ \mathcal{L}_{TD} = MSE(Q(s,a_n),r+\gamma Q(s',\pi(s'))),\quad r=0 $$

This TD step is the “propagation” that gives the method its name: the $\pm 1$ labels from intervention data are propagated through the value function to the unlabeled exploration data. Notice that this Q-function does not satisfy the Bellman equation, and thus is not a conventional Q-function in the RL context. Instead, it is more of a score for how good or bad an action is. The $\mathcal{L}_{PVP}$ can be decomposed as: $$ \begin{aligned} &MSE(Q(s,a_h),1) + MSE(Q(s,a_n),-1) \\ =& (Q(s,a_h)-1)^2 + (Q(s,a_n)-(-1))^2 \\ =& Q(s,a_h)^2 + Q(s,a_n)^2 -2 (Q(s,a_h) - Q(s,a_n)) +2 \end{aligned} $$ where minimizing the cross term $-2(Q(s,a_h) - Q(s,a_n))$ forces $Q$ to assign higher value to human actions than to novice actions.
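Both losses, plus a numeric check of the decomposition, can be sketched on scalar Q-values. The function names and the sample values are mine, and a real implementation would use a target network for the TD term.

```python
import numpy as np


def pvp_loss(q_h, q_n) -> float:
    """MSE(Q(s, a_h), +1) + MSE(Q(s, a_n), -1) on intervention samples."""
    q_h = np.asarray(q_h, dtype=float)
    q_n = np.asarray(q_n, dtype=float)
    return float(np.mean((q_h - 1.0) ** 2) + np.mean((q_n + 1.0) ** 2))


def td_loss(q_sa, q_next, gamma=0.99) -> float:
    """MSE(Q(s, a_n), 0 + gamma * Q(s', pi(s'))) on non-intervention samples."""
    q_sa = np.asarray(q_sa, dtype=float)
    q_next = np.asarray(q_next, dtype=float)
    return float(np.mean((q_sa - gamma * q_next) ** 2))


q_h, q_n = np.array([0.5]), np.array([0.25])
direct = pvp_loss(q_h, q_n)
# Right-hand side of the decomposition, evaluated term by term.
expanded = float(q_h[0] ** 2 + q_n[0] ** 2 - 2 * (q_h[0] - q_n[0]) + 2)
consistent_td = td_loss(q_sa=[1.0], q_next=[2.0], gamma=0.5)
```

`direct` and `expanded` agree, and the TD loss vanishes when $Q(s,a_n)$ already equals the discounted next value.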

Balanced Sampling

During PVP training, we maintain 2 buffers: a novice buffer and a human buffer. The novice buffer contains data that are automatically generated by the policy, while the human buffer contains intervention data. We sample half-half from them, similar to the idea of IWR mentioned above.
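A minimal sketch of the half-half sampling, assuming both buffers are plain Python lists and sampling is done with replacement. The helper `sample_balanced` is hypothetical, not the authors' implementation.

```python
import random


def sample_balanced(novice_buffer, human_buffer, batch_size, rng=random):
    """Draw half of the batch from each buffer (with replacement)."""
    half = batch_size // 2
    novice_part = [rng.choice(novice_buffer) for _ in range(half)]
    human_part = [rng.choice(human_buffer) for _ in range(batch_size - half)]
    return novice_part, human_part


# The human buffer is usually far smaller than the novice buffer.
novice_buffer = [("n", i) for i in range(100)]
human_buffer = [("h", i) for i in range(5)]
novice_part, human_part = sample_balanced(
    novice_buffer, human_buffer, batch_size=8, rng=random.Random(0)
)
```

Even with 20 times more novice data, every minibatch still contains as many human samples as novice samples.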

Perspective

I’ve implemented a few of these algorithms, and they are indeed efficient. They also serve as a nice safety mechanism: human gating keeps the agent in a reasonable region. The idea can be further unified with RLHF, language feedback and more. However, IIL, especially the active-involvement methods, assumes that the human is able to control the robot and is quick enough to foresee danger and intervene. This is obviously a strong assumption outside autonomous driving and easy-to-control robots.