ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

Figure: Schematic overview of ReinFlow.

What is ReinFlow?

ReinFlow is a general online RL fine-tuning framework for diverse flow matching policies. It is...

How ReinFlow Works
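The full training code is in the repository. The sketch below is only an unofficial, minimal PyTorch illustration of the mechanism that the ablations in Section 3 refer to: a flow-matching velocity field unrolled for $K$ denoising steps, with a small noise head conditioned on state and time whose standard deviation is kept within a chosen range. The class name, network sizes, and the noise bounds below are placeholders rather than values from the paper.

```python
import torch
import torch.nn as nn


class NoisyFlowPolicy(nn.Module):
    """Sketch: a flow-matching policy whose K denoising steps are perturbed by a
    learned Gaussian noise head, so each step becomes a tractable Gaussian
    transition that an on-policy RL update (e.g. a PPO-style objective) can fine-tune."""

    def __init__(self, obs_dim, act_dim, hidden=256, K=4, sigma_min=0.05, sigma_max=0.3):
        super().__init__()
        self.act_dim, self.K = act_dim, K
        self.sigma_min, self.sigma_max = sigma_min, sigma_max
        # Velocity field v(a, s, t): the pre-trained flow-matching policy.
        self.velocity = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))
        # Noise head sigma(s, t), conditioned on state and time (cf. Sec. 3.3).
        self.noise = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, obs):
        batch = obs.shape[0]
        a = torch.randn(batch, self.act_dim, device=obs.device)   # a_0 ~ N(0, I)
        log_prob = torch.zeros(batch, device=obs.device)
        dt = 1.0 / self.K
        for k in range(self.K):
            t = torch.full((batch, 1), k * dt, device=obs.device)
            mean = a + dt * self.velocity(torch.cat([obs, a, t], dim=-1))
            # Keep the noise scale within bounds so exploration neither collapses
            # nor destroys the pre-trained action quality (cf. Sec. 3.4).
            raw = torch.sigmoid(self.noise(torch.cat([obs, t], dim=-1)))
            sigma = self.sigma_min + (self.sigma_max - self.sigma_min) * raw
            step = torch.distributions.Normal(mean, sigma)
            a = step.sample()
            log_prob = log_prob + step.log_prob(a).sum(-1)         # exact chain log-likelihood
        return a, log_prob
```

Under this construction every denoising step is an explicit Gaussian, so summing the per-step log-probabilities yields an exact likelihood of the sampled action chain, which is what lets a standard on-policy policy-gradient update be applied to the flow policy.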

Experiments

1. Environment and Data

| Input Type | Reward Type | Environment | Data Source |
| --- | --- | --- | --- |
| State | Dense | OpenAI Gym | D4RL data |
| State | Sparse | Franka Kitchen | Human-teleoperated data from D4RL |
| Visual | Sparse | Robomimic | Human-teleoperated data, processed as in the DPPO paper |
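For reference, a minimal and hedged way to pull the offline data for the first two rows is sketched below; the exact dataset versions and preprocessing used for pre-training are defined in the repository, and the Robomimic data follows DPPO's processing pipeline, which is not shown here.

```python
import gym
import d4rl  # importing d4rl registers the offline datasets with gym

# Dataset names are illustrative; use the versions your config expects.
locomotion = gym.make("hopper-medium-v2")   # OpenAI Gym locomotion, dense reward
kitchen = gym.make("kitchen-mixed-v0")      # Franka Kitchen, human-teleoperated demos

data = locomotion.get_dataset()             # dict of observations, actions, rewards, terminals
print(data["observations"].shape, data["actions"].shape)
```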

2. Performance Evaluation and Comparisons

2.1 Comparison with DPPO and FQL

πŸ“ˆ ReinFlow consistently enhances performance across the Gym and Franka Kitchen benchmarks, and boosts success rates by 40.09% on Robomimic tasks with fewer steps than DPPO.

Figure: Wall-time efficiency results on state-based locomotion tasks in OpenAI Gym. Panels: (A) Hopper-v2, (B) Walker2d-v2, (C) Ant-v2, (D) Humanoid-v3. Dashed lines indicate the behavior cloning level.
Figure: Task completion rates on state-input manipulation tasks in Franka Kitchen. Panels: (A) Kitchen-complete (wall time), (B) Kitchen-complete (sample cost), (C) Kitchen-mixed (sample cost), (D) Kitchen-partial (sample cost). Panels (A) and (B) show completion rates against wall time and sample cost; for brevity, only sample-cost plots are shown for agents trained on the mixed and partial datasets.
Figure: Success rates on visual manipulation tasks in Robomimic. Panels: (A) Can, (B) Square, (C) Transport.

2.2 Comparison with Other Diffusion RL Methods

βš–οΈ ReinFlow outperforms diffusion RL baselines in stability and asymptotic performance across continuous control tasks like Ant-v0, Hopper-v2, and Walker2d-v2.

πŸ“Š With hyperparameters kept consistent with prior work, ReinFlow demonstrates superior performance, as shown in Fig. 5(A-C).

Figure: Fine-tuning the locomotion tasks Ant-v0, Hopper-v2, and Walker2d-v2 with diffusion RL baselines and ReinFlow. Panels: (A) Ant-v0, (B) Hopper-v2, (C) Walker2d-v2.

3. The Design Choice and Key Factors Affecting ReinFlow

We examine how pre-trained models, the flow's time distribution, the noise network's inputs, noise levels, regularization, and the number of denoising steps affect ReinFlow.

3.1 RL Provides Another Scaling Dimension

βž– Scaling data or inference steps: quickly plateaus.

πŸ“ˆ Fine-tuning with RL (ReinFlow): consistently enhances performance.

Figure: RL provides another way of scaling beyond increasing pre-training data and inference compute, both of which quickly plateau. The improvement is invariant to the flow policy's time distribution and is achievable at four denoising steps in Hopper (A) and one step in Square (B, C). Panels: (A) ReFlow policy in Hopper-v2, (B) Shortcut policy in Square, (C) ReFlow policy in Square.

3.2 Flow Matching's Time Distribution

ReinFlow's performance is robust to changes in the time-sampling distribution; a small sketch of the three samplers follows the list below.

βœ… Uniform

βœ… Logit-normal

βœ… Beta (slightly better for single-step fine-tuning)
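A minimal sketch of the three samplers is below; the Beta shape parameters are placeholders, not the values used in the paper.

```python
import torch
from torch.distributions import Beta


def sample_flow_time(batch_size: int, dist: str = "uniform") -> torch.Tensor:
    """Sample flow-matching times t in (0, 1) from one of the candidate distributions."""
    if dist == "uniform":
        return torch.rand(batch_size)
    if dist == "logit_normal":
        return torch.sigmoid(torch.randn(batch_size))   # logit-normal(0, 1)
    if dist == "beta":
        return Beta(1.5, 1.0).sample((batch_size,))      # placeholder shape parameters
    raise ValueError(f"unknown time distribution: {dist}")
```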

3.3 Noise Network Inputs

Conditioning noise on state alone works well.

Conditioning noise on both state and time generates more diverse actions and improves success rates.
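As a hedged illustration of the two conditioning choices (all dimensions and hidden sizes below are arbitrary placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, hidden = 60, 9, 128            # placeholder sizes

noise_state_only = nn.Sequential(                # sigma(s): state-only conditioning
    nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim))

noise_state_time = nn.Sequential(                # sigma(s, t): state-and-time conditioning
    nn.Linear(obs_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, act_dim))

s, t = torch.randn(32, obs_dim), torch.rand(32, 1)
sigma_s = F.softplus(noise_state_only(s))                            # constant across denoising steps
sigma_st = F.softplus(noise_state_time(torch.cat([s, t], dim=-1)))   # varies with the denoising time
```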

Figure: Conditioning the noise on state and time yields a higher success rate than conditioning on states alone. Panels: (A) effect of the noise input in Ant-v0, (B) effect of the noise conditioning in Kitchen-partial-v0.

3.4 Noise Level and Exploration

βš–οΈ Noise magnitude affects exploration: too little traps the policy, too much hurts execution.

πŸš€ Optimal noise levels enable significant gains, especially in complex tasks.

πŸ”’ Once fine-tuning has moved the policy into a good region, it becomes less sensitive to the noise level.

3.5 Regularization and Exploration

πŸ… Entropy regularization promotes exploration, while behavior cloning regularization can trap the policy and is unnecessary.

πŸ… Entropy regularization outperforms Wasserstein-2 constraints used in offline RL methods like FQL.

Figure: Effect of noise level and regularization. (A) shows how constant noise with different standard deviations affects ReinFlow's exploration in Ant-v0. (B) shows how entropy regularization with coefficient $\alpha$ and $W_2$ regularization with different coefficients $\beta$ influence ReinFlow in Humanoid-v3.

3.6 The Number of Denoising Steps

πŸ“ˆ Increasing the number of denoising steps $K$ in ReinFlow boosts initial rewards for the Shortcut policy in Franka Kitchen, but the rewards then plateau quickly, as shown in Fig. 4(A).

🎯 In visual manipulation tasks, lowering the noise standard deviation while using a higher $K$ improves performance for pre-trained policies with low success rates.

Figure: Fine-tuning the Shortcut policy in Kitchen-complete-v0 at denoising steps $K = 1, 2, 4$, averaged over three seeds (0, 42, 3407).

Common Questions about ReinFlow