Benchmarking Data Quality in Imitation Learning

ICML DataWorld 2025 Workshop (presented), UCLA Robot Intelligence Lab

RINSE is a framework for evaluating and filtering demonstration quality in imitation learning using two complementary smoothness-based metrics: Spectral Arc Length (SAL) and Trajectory-Envelope Distance (TED). These metrics operate without policy training, rollouts, or dataset-global statistics, enabling efficient data curation before any learning begins.

RINSE method overview

Method overview: SAL operates in the frequency domain on speed profiles, while TED measures geometric deviation from a smooth Bézier envelope with contact-aware partitioning.
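To make the frequency-domain idea concrete, here is a minimal sketch of a spectral arc length computation on a speed profile. It follows the commonly used definition (arc length of the DC-normalized magnitude spectrum up to an adaptive cutoff), which may differ from the exact variant used in RINSE. The function name, the parameters `pad_factor`, `cutoff_hz`, and `amp_threshold`, and the synthetic profiles are all assumptions made for illustration.

```python
import cmath
import math


def spectral_arc_length(speed, fs, pad_factor=4, cutoff_hz=10.0, amp_threshold=0.05):
    """Spectral arc length of a speed profile (more negative = less smooth).

    Illustrative sketch only; parameter defaults are assumptions.
    """
    n = len(speed)
    if n < 2:
        raise ValueError("need at least two speed samples")
    n_fft = pad_factor * n  # zero-padding gives a finer frequency grid
    freqs, mags = [], []
    for k in range(n_fft // 2 + 1):
        f = k * fs / n_fft
        if f > cutoff_hz:
            break
        # Naive DFT bin; fine for a sketch, use an FFT library for real data.
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n_fft)
                    for t, v in enumerate(speed))
        freqs.append(f)
        mags.append(abs(coeff))
    if mags[0] == 0.0:
        raise ValueError("zero-mean speed profile cannot be normalized")
    mags = [m / mags[0] for m in mags]  # normalize by the DC magnitude
    # Adaptive cutoff: keep bins up to the last one above the threshold.
    last = max(i for i, m in enumerate(mags) if m >= amp_threshold)
    freqs, mags = freqs[: last + 1], mags[: last + 1]
    if len(freqs) < 2:
        return 0.0
    span = freqs[-1] - freqs[0]
    arc = 0.0
    for i in range(1, len(freqs)):
        df = (freqs[i] - freqs[i - 1]) / span  # frequency axis rescaled to [0, 1]
        dm = mags[i] - mags[i - 1]
        arc += math.hypot(df, dm)
    return -arc


fs = 50.0  # hypothetical sampling rate in Hz
n = 100
smooth = [math.sin(math.pi * i / (n - 1)) ** 2 for i in range(n)]
# Same bell-shaped profile with a synthetic 8 Hz tremor modulating the speed.
jittery = [v * (1 + 0.3 * math.sin(2 * math.pi * 8.0 * i / fs))
           for i, v in enumerate(smooth)]

sal_smooth = spectral_arc_length(smooth, fs)
sal_jittery = spectral_arc_length(jittery, fs)
```

Under this definition the jittery profile scores lower (more negative) than the smooth one, which matches the "higher = smoother" convention used for SAL on this page.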

Motivation

Behavioral cloning (BC) is fundamentally limited by demonstration quality. The BC policy loss decomposes as:

\[ \mathcal{L}(\theta) = \mathbb{E}_s\!\left[\|\pi_\theta(s) - \mu^*(s)\|^2\right] + \mathbb{E}_s\!\left[\mathrm{Tr}\!\left(\mathrm{Var}[a \mid s]\right)\right] \]

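Here \(\mu^*(s) = \mathbb{E}[a \mid s]\) is the mean demonstrated action in state \(s\). The first term measures how well the policy fits that mean; the second is a noise floor \(\bar{\sigma}^2\) that no amount of training can reduce. With the BC regret bound \(J(\hat{\pi}) - J(\pi^*) \leq H^2 \varepsilon\), where \(H\) is the task horizon and \(\varepsilon\) the per-step imitation error, even modest reductions in per-step error amplify into substantial rollout improvements through the \(H^2\) compounding factor. RINSE targets this noise floor by identifying and removing low-quality demonstrations.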
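The decomposition can be checked numerically. The sketch below assumes scalar actions and a small discrete state set so that the conditional mean and variance can be computed exactly per state; the demonstrator model (`0.5 * s` plus Gaussian jitter) and all names are hypothetical.

```python
import random
from collections import defaultdict


def bc_loss_terms(states, actions, policy):
    """Return (total MSE, fit term, noise-floor term) for scalar actions."""
    by_state = defaultdict(list)
    for s, a in zip(states, actions):
        by_state[s].append(a)
    n = len(actions)
    total = sum((policy(s) - a) ** 2 for s, a in zip(states, actions)) / n
    fit = noise = 0.0
    for s, acts in by_state.items():
        mu = sum(acts) / len(acts)                          # empirical E[a | s]
        var = sum((a - mu) ** 2 for a in acts) / len(acts)  # empirical Var[a | s]
        weight = len(acts) / n                              # empirical state frequency
        fit += weight * (policy(s) - mu) ** 2
        noise += weight * var
    return total, fit, noise


rng = random.Random(0)
states = [rng.randrange(5) for _ in range(2000)]
# Hypothetical demonstrator: ideal action 0.5 * s plus operator jitter (std 0.2).
actions = [0.5 * s + rng.gauss(0.0, 0.2) for s in states]

total, fit, noise = bc_loss_terms(states, actions, policy=lambda s: 0.5 * s)
```

Even though the policy here matches the ideal action, `total` stays close to the jitter variance (about 0.04), because `noise` does not depend on the policy at all. Only changing the data, for example by dropping jittery demonstrations, can lower it.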

Quality Diagnostic: Data Collection Modality

Applied to the RTIS benchmark (5 manipulation tasks, 3 collection modalities), both SAL and TED consistently rank kinesthetic > VR > spacemouse, matching the downstream policy success ordering. Rate-controlled spacemouse teleoperation produces TED scores up to 4× larger than kinesthetic teaching.

RTIS modality comparison

SAL (higher = smoother) and TED (lower = smoother) for the Open Drawer task. Scores correlate with policy success across collection modalities.

Filtering: RoboMimic Benchmarks

Smoothness-based filtering is evaluated on three RoboMimic benchmarks with Diffusion Policy (Transformer backbone). Demonstrations are ranked by TED or SAL, and the top \(K\) are kept as the training budget. The Full column reports training on all \(|\mathcal{D}|\) demonstrations.

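| Task | Source | \(\lvert\mathcal{D}\rvert\) | \(K\) | TED | SAL | Full |
|---|---|---|---|---|---|---|
| Square | mh | 300 | 50 | 75% | 72% | 76% |
| Square | better | 100 | 50 | 82% | 82% | 78% |
| Tool Hang | ph | 200 | 100 | 58% | 71% | 63% |
| Transport | mh | 300 | 50 | 46% | 55% | 39% |
| Transport | b-b | 50 | 25 | 36% | 54% | 42% |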

SAL filtering improves success on Transport (a bimanual task with substantial free-space motion) by 16 percentage points (55% vs. 39%) using only 50 of 300 demos. On Tool Hang, SAL with half the demonstrations exceeds full-data performance by 8 percentage points (71% vs. 63%). On Square, the filtered top-50 subset converges to 75% success in ~40k steps vs. ~450k for the full 300 demos (a >10× convergence speed-up).
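The rank-and-budget step used in this section can be sketched in a few lines. The helper name `select_top_k` and the example scores are hypothetical; the only detail taken from this page is the direction of each metric (SAL: higher is smoother, TED: lower is smoother).

```python
def select_top_k(scores, k, higher_is_better):
    """Return indices of the k best demonstrations under a smoothness score."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of demonstrations")
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    return order[:k]


# Made-up per-demonstration scores for four demos.
sal_scores = [-1.4, -2.9, -1.6, -3.5]  # SAL: higher (less negative) = smoother
ted_scores = [0.02, 0.11, 0.04, 0.20]  # TED: lower = smoother

keep_by_sal = select_top_k(sal_scores, k=2, higher_is_better=True)   # [0, 2]
keep_by_ted = select_top_k(ted_scores, k=2, higher_is_better=False)  # [0, 2]
```

Because each score depends only on its own trajectory, the ranking needs no policy training or dataset-level statistics, so it can run once before any learning starts.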

Filtering: Real-World Experiments

Evaluated on a U-Factory xArm 7-DOF robot with two RealSense cameras (wrist and side-mounted). All policies use Diffusion Policy (1D Conv UNet backbone), evaluated at epoch 50 over 25 randomized trials.

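| Task | \(\lvert\mathcal{D}\rvert\) | \(K\) | Random | TED | SAL | Full |
|---|---|---|---|---|---|---|
| Push Block | 200 | 150 | 64% | 80% | 72% | 68% |
| Push Block | 200 | 100 | 59% | 88% | 84% | 68% |
| Push Block | 200 | 50 | 52% | 84% | 84% | 68% |
| Pick & Place | 80 | 40 | 55% | 76% | 68% | 76% |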

TED filtering achieves 20 percentage points higher success than the full dataset on Push Block with half the data (88% vs. 68% at \(K = 100\)). At every subset size, TED outperforms random subsets by 16 to 32 percentage points, confirming that the quality ranking (not just subset size) drives the improvement.
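The gaps quoted above follow directly from the table. As a quick arithmetic check, using only the success rates reported in it (the variable names are illustrative):

```python
# (task, K, Random, TED, SAL, Full) success rates in percent, copied from the table.
rows = [
    ("Push Block", 150, 64, 80, 72, 68),
    ("Push Block", 100, 59, 88, 84, 68),
    ("Push Block", 50, 52, 84, 84, 68),
    ("Pick & Place", 40, 55, 76, 68, 76),
]

# TED advantage over a random subset of the same size, in percentage points.
ted_vs_random = [ted - rnd for _, _, rnd, ted, _, _ in rows]

# TED advantage over the full dataset for Push Block at K = 100.
ted_vs_full_half = rows[1][3] - rows[1][5]
```

`ted_vs_random` evaluates to `[16, 29, 32, 21]` and `ted_vs_full_half` to `20`.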

Filtering: Retrieval-Augmented Learning (LIBERO-10)

Integrated into the STRAP retrieval framework on LIBERO-10 (10 language-conditioned manipulation tasks). The top 400 SDTW candidates per query are re-ranked by TED or SAL, retaining the top 200.

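| Method | Mug-MW | Moka | Soup-S | CCB | Mug-P | Stove | Bowl-C | Soup-C | Mug2 | Book-C | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| STRAP | 16 | 0 | 40 | 16 | 22 | 84 | 76 | 24 | 48 | 80 | 40.6 |
| +TED | 14 | 0 | 32 | 48 | 38 | 72 | 92 | 34 | 48 | 84 | 46.2 |
| +SAL | 14 | 0 | 44 | 8 | 34 | 68 | 84 | 40 | 58 | 100 | 45.0 |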

TED+STRAP and SAL+STRAP improve average success by 5.6 and 4.4 percentage points, respectively (from 40.6% to 46.2% and 45.0%). TED excels on contact-sensitive tasks (Cream-Cheese-Butter: +32 points), while SAL excels on tasks with free-space motion (Book-Caddy: +20 points).
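The retrieve-then-re-rank step described in this section can be sketched as a two-stage sort. This is not the STRAP API: the function name, the tuple layout, and the toy candidates are assumptions, and the default sizes only mirror the 400 and 200 quoted above.

```python
def rerank_candidates(candidates, n_retrieve=400, n_keep=200, higher_is_better=False):
    """Keep the smoothest n_keep among the n_retrieve nearest candidates.

    candidates: list of (segment_id, sdtw_cost, quality_score) tuples
    (hypothetical layout). Lower SDTW cost means a closer match to the query.
    """
    nearest = sorted(candidates, key=lambda c: c[1])[:n_retrieve]
    reranked = sorted(nearest, key=lambda c: c[2], reverse=higher_is_better)
    return [c[0] for c in reranked[:n_keep]]


# Toy example with TED-style scores (lower = smoother).
candidates = [
    ("a", 0.1, 0.30),
    ("b", 0.2, 0.05),
    ("c", 0.3, 0.10),
    ("d", 0.9, 0.01),  # smoothest, but too far from the query to be retrieved
]
kept = rerank_candidates(candidates, n_retrieve=3, n_keep=2)  # ["b", "c"]
```

Relevance is decided first and quality second, so a very smooth but irrelevant segment such as `"d"` never enters the training set.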

Re-Mix Domain Reweighting

RINSE scores are used as soft quality weights within the Re-Mix framework for Group-DRO-style domain reweighting across 7 Open X-Embodiment domains. RINSE-weighted allocations achieve Spearman \(\rho \geq 0.89\) and cosine similarity \(\geq 0.989\) with Re-Mix base allocations, which are validated to improve downstream policy performance. All variants concentrate weight on the toto domain, identified as having the highest marginal training value.
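For reference, the two agreement measures quoted above can be computed as follows. The allocation vectors are made-up numbers for seven domains, not the values from the experiments, and the helper names are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i : j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical domain weights (each sums to 1); the third and fourth swap order.
base_alloc = [0.40, 0.20, 0.15, 0.10, 0.08, 0.05, 0.02]
rinse_alloc = [0.38, 0.22, 0.12, 0.13, 0.07, 0.06, 0.02]

rho = spearman_rho(base_alloc, rinse_alloc)
cos = cosine_similarity(base_alloc, rinse_alloc)
```

Spearman \(\rho\) compares only the ordering of domains, while cosine similarity also compares how much weight each domain receives, so reporting both covers rank agreement and magnitude agreement.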

Metric Complementarity

SAL excels at detecting free-space jitter (e.g., Transport task), while TED captures spatial artifacts near contact (e.g., Square task). Together they provide comprehensive trajectory quality assessment across different task structures.
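To illustrate the geometric side of this complementarity, the sketch below measures how far trajectory samples deviate from a smooth reference curve, segment by segment. It is a simplified stand-in and not the TED definition used in RINSE: the reference is a single quadratic Bézier per segment (through the endpoints and the middle sample), segments are split at caller-supplied contact indices, and every name and parameter is an assumption.

```python
import math


def _bezier_point(p0, p1, p2, t):
    """Point on a quadratic Bezier curve at parameter t in [0, 1]."""
    return tuple((1 - t) ** 2 * a + 2 * (1 - t) * t * b + t ** 2 * c
                 for a, b, c in zip(p0, p1, p2))


def segment_deviation(points, n_curve=200):
    """Mean distance from each sample to a quadratic Bezier reference curve."""
    if len(points) < 3:
        return 0.0
    p0, pm, p2 = points[0], points[len(points) // 2], points[-1]
    # Control point chosen so the curve passes through the middle sample at t = 0.5.
    p1 = tuple(2 * m - 0.5 * (a + c) for a, m, c in zip(p0, pm, p2))
    curve = [_bezier_point(p0, p1, p2, i / (n_curve - 1)) for i in range(n_curve)]
    return sum(min(math.dist(p, q) for q in curve) for p in points) / len(points)


def envelope_distance(points, contact_indices=()):
    """Average per-segment deviation, with segments split at contact indices."""
    cuts = [0, *sorted(contact_indices), len(points) - 1]
    segments = [points[a : b + 1] for a, b in zip(cuts, cuts[1:]) if b > a]
    return sum(segment_deviation(seg) for seg in segments) / len(segments)


# Straight-line reach vs. the same reach with alternating lateral jitter.
smooth_path = [(i / 20, 0.0) for i in range(21)]
jittery_path = [(i / 20, 0.05 * (-1) ** i) for i in range(21)]

d_smooth = envelope_distance(smooth_path, contact_indices=(10,))
d_jittery = envelope_distance(jittery_path, contact_indices=(10,))
```

The jittery path gets a larger distance than the straight one, in line with the "lower = smoother" convention for TED. Splitting at contact indices keeps a legitimate change of direction at contact from being penalized as if it were jitter.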

SAL filtered variance grid

SAL trajectory-filtered variance reduction across tasks

TED filtered variance grid

TED trajectory-filtered variance reduction across tasks

Video 1: Policy rollout for the RoboMimic Square task: lift the square nut and place it on the peg with high precision.

Video 2: Policy rollout for the Push-T task: push the T block to the traced goal location.

Filtration test results

Mean score plots for the filtration test: filtered demos outperform the remaining set and even the full-dataset policy