Q-Flow: Stable and Expressive Reinforcement Learning
with Flow-Based Policy

1 KAIST
* Corresponding author  ·  jdoo2@kaist.ac.kr

Overview

There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. Q-Flow resolves this dilemma by learning value functions over the intermediate states of the flow, enabling stable and expressive policy optimization without unrolling the solver.

Key Insights
  1. Flow dynamics are deterministic, so value is invariant along the trajectory. Every intermediate state on a trajectory integrates to the same terminal action, so that action's value transfers back to each waypoint.
  2. Intermediate value gradients steer the velocity field without BPTT. Using the intermediate value gradient as guidance updates the policy without unrolling the solver, removing the instability at its source.
  3. Q-Flow achieves the best empirical performance. It attains the highest average score among the compared methods on OGBench in both evaluation settings.

Challenge: The Expressivity–Stability Dilemma

Flow-based policies can model complex, multimodal action distributions that simpler policies cannot. But optimizing a flow policy for reward is difficult: each action is produced by integrating a velocity field through a numerical ODE solver, so improving it for return means differentiating through that entire generation process. Two prior strategies address this, but each comes with a fundamental limitation.

2D Experiments: behavior under varying BC strength

To see the dilemma concretely, we visualize each method on two 2D synthetic tasks, sweeping the BC regularization strength:

Figure: the two synthetic testbeds, Swiss Roll and Two Spirals.

Figure: policies learned by each method on Swiss Roll and Two Spirals.
  • Backprop Through Time (BPTT): Expressivity ✓, Stability ✗
  • One-step Distillation: Expressivity ✗, Stability ✓
  • Q-Flow (Ours): Expressivity ✓, Stability ✓

Backprop Through Time (BPTT)

Maximize the critic directly on the action $a \sim \pi_\theta(\cdot \mid s)$ generated by the flow, backpropagating the $Q$-gradient through the entire ODE rollout, with a conditional flow-matching (behavioral-cloning) term anchoring the policy to the data:

$$\mathcal{L}_\pi(\theta) = -\,\mathbb{E}_{\substack{s \sim \mathcal{D} \\ a \sim \pi_\theta(\cdot \mid s)}}\!\bigg[Q_\phi(s, a)\bigg] + \alpha\, \mathbb{E}_{\substack{\tau \sim \mathcal{U}(0,1) \\ x_0 \sim \mathcal{N}(0, I) \\ (s, a) \sim \mathcal{D}}}\!\bigg[\big\|v_\theta(x_\tau, \tau, s) - (a - x_0)\big\|^2\bigg]$$

where $v_\theta$ is the flow's velocity field and $x_\tau = (1-\tau)\,x_0 + \tau\,a$.

In the 2D experiments above:

  • Strong BC: Expressive, capturing the complex data structure.
  • Weak BC: Unstable. Backpropagating through the ODE solver destabilizes policy optimization, and the policy drifts off the data distribution.
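A discretized view makes the source of this instability explicit. As a sketch, assume the solver is a $K$-step Euler scheme with step size $h = 1/K$ (the solver is not specified above, so this choice is an assumption):

$$x_{k+1} = x_k + h\, v_\theta(x_k, \tau_k, s), \qquad k = 0, \dots, K-1, \qquad a = x_K.$$

Writing $J_j = \partial v_\theta(x_j, \tau_j, s) / \partial x_j$, the chain rule gives

$$\nabla_\theta\, Q_\phi(s, a) \;=\; \sum_{k=0}^{K-1} h \left(\frac{\partial v_\theta(x_k, \tau_k, s)}{\partial \theta}\right)^{\!\top} A_k^{\top}\, \nabla_a Q_\phi(s, a), \qquad A_k \;=\; (I + h J_{K-1})\,(I + h J_{K-2}) \cdots (I + h J_{k+1}).$$

Every term carries a product of up to $K-1$ Jacobian factors, so the critic's gradient can be amplified or attenuated as it passes back through the rollout, and the whole trajectory must be kept in memory to evaluate it.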

One-step Distillation

FQL [2] first fits a behavioral-cloning flow policy $\pi^{\text{BC}}_\theta$, whose velocity field $v_\theta$ is trained by flow matching:

$$\mathcal{L}_{\mathrm{BC}}(\theta) = \mathbb{E}_{\substack{\tau \sim \mathcal{U}(0,1) \\ x_0 \sim \mathcal{N}(0, I) \\ (s, a) \sim \mathcal{D}}}\!\bigg[\big\|v_\theta(x_\tau, \tau, s) - (a - x_0)\big\|^2\bigg]$$

with $x_\tau = (1-\tau)\,x_0 + \tau\,a$.
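The flow-matching objective can be estimated on a batch in a few lines. The following is a minimal NumPy sketch, not the authors' implementation: `flow_matching_loss`, `oracle`, and `zero` are hypothetical names, and the velocity field is any callable `v_theta(x_tau, tau, s)`. The sanity check uses a dataset containing a single action `a_star`, for which the velocity `(a_star - x) / (1 - tau)` reproduces the regression target `a - x_0` exactly.

```python
import numpy as np

def flow_matching_loss(v_theta, s, a, rng):
    """Monte-Carlo estimate of the BC flow-matching loss on one batch."""
    batch = a.shape[0]
    x0 = rng.standard_normal(a.shape)              # x_0 ~ N(0, I)
    tau = rng.uniform(0.0, 1.0, size=(batch, 1))   # tau ~ U(0, 1), one per sample
    x_tau = (1.0 - tau) * x0 + tau * a             # linear interpolant
    target = a - x0                                # straight-line velocity
    pred = v_theta(x_tau, tau, s)
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

# Hypothetical toy data: every dataset action equals a_star.
a_star = np.array([0.7, -0.2])
a = np.tile(a_star, (256, 1))
s = np.zeros((256, 3))                             # states are unused by the toy fields

oracle = lambda x, tau, s: (a_star - x) / (1.0 - tau)  # exact for a single-action dataset
zero = lambda x, tau, s: np.zeros_like(x)              # untrained baseline

loss_oracle = flow_matching_loss(oracle, s, a, np.random.default_rng(0))
loss_zero = flow_matching_loss(zero, s, a, np.random.default_rng(0))
```

The oracle drives the loss to numerical zero, while the zero field leaves the full squared norm of the target as error.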

It then distills this flow into a one-step policy $\pi^{\text{one-step}}_\omega$ that maps noise $x_0$ directly to an action, trained to maximize the critic while staying close to the flow's output (coefficient $\alpha$):

$$\mathcal{L}_\pi(\omega) = -\,\mathbb{E}_{\substack{s \sim \mathcal{D} \\ a^\omega \sim \pi^{\text{one-step}}_\omega(\cdot \mid s)}}\!\bigg[Q_\phi(s, a^\omega)\bigg] + \alpha\, \mathbb{E}_{\substack{s \sim \mathcal{D} \\ a^\omega \sim \pi^{\text{one-step}}_\omega(\cdot \mid s) \\ a^\theta \sim \pi^{\text{BC}}_\theta(\cdot \mid s)}}\!\bigg[\big\|a^\omega - a^\theta\big\|^2\bigg]$$

In the 2D experiments above:

  • Strong BC: Limited expressivity due to one-step prediction.
  • Weak BC: Stable, because the solver is never unrolled over multiple steps.

Q-Flow escapes this trade-off. It keeps the full flow policy yet steers it with the gradient of a learned intermediate value rather than backpropagating through the solver, giving stable optimization at no cost to expressivity. See the method below.


Method

A flow policy turns noise $x_0 \sim \mathcal{N}(0, I)$ into an action $a = x_1$ by integrating a velocity field along a trajectory $x_\tau$, $\tau \in [0, 1]$. Its dynamics are deterministic, so every intermediate state $x_\tau$ flows to exactly one terminal action:

$$x_1 \;:=\; \Psi^\pi_{1,\tau}(x_\tau,\, s),$$

where $\Psi^\pi_{1,\tau}$ rolls policy $\pi$ forward from $x_\tau$ to $x_1$. This determinism underlies both components below.
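As an illustration of the map $\Psi^\pi_{1,\tau}$, the sketch below approximates it with explicit Euler steps. The solver choice, the helper names `integrate` and `rollout`, and the toy velocity field are assumptions made for the example. It checks that rolling forward from a waypoint $x_{0.5}$ lands on the same action as rolling forward from the original noise $x_0$.

```python
import numpy as np

def integrate(v, x, t_start, t_end, s, n_steps):
    """Explicit Euler integration of dx/dt = v(x, t, s) from t_start to t_end."""
    x = np.asarray(x, dtype=float)
    h = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x = x + h * v(x, t, s)      # one Euler step
        t += h
    return x

def rollout(v, x_tau, tau, s, n_steps=10):
    """Approximate Psi_{1,tau}: carry x_tau forward to the terminal action x_1."""
    return integrate(v, x_tau, tau, 1.0, s, n_steps)

# Hypothetical velocity field that pulls the sample toward the state s.
def toy_velocity(x, t, s):
    return s - x

s = np.array([1.0, -1.0])
x0 = np.array([0.3, 0.8])                                   # one fixed noise sample

a_full = rollout(toy_velocity, x0, 0.0, s, n_steps=10)      # x_0 -> x_1
x_half = integrate(toy_velocity, x0, 0.0, 0.5, s, 5)        # x_0 -> x_0.5
a_from_half = rollout(toy_velocity, x_half, 0.5, s, n_steps=5)  # x_0.5 -> x_1
```

Both routes use the same step size, so `a_full` and `a_from_half` coincide: the waypoint and the noise it came from share one terminal action.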

1 Flow-Consistent Value

Since $x_\tau$ deterministically reaches $x_1$, Q-Flow assigns each intermediate state the value of the action it will produce:

$$V^\pi(s,\, x_\tau,\, \tau) \;:=\; Q\!\left(s,\; \Psi^\pi_{1,\tau}(x_\tau, s)\right)$$
  • $V^\pi(s, x_\tau, \tau)$: Value of intermediate state $x_\tau$ at flow-time $\tau$.
  • $Q(s, \cdot)$: Standard outer critic, evaluated at the produced action.

Every waypoint gets a flow-consistent value, turning the flow trajectory into a value field the policy can follow.
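The invariance behind this definition can be checked numerically. In the sketch below, `q` and `v` are hypothetical stand-ins for a trained critic and velocity field, and $\Psi^\pi_{1,\tau}$ is approximated by Euler steps on a shared grid (an assumption of the example). Every waypoint along one trajectory is assigned the same value as the final action.

```python
import numpy as np

def euler(v, x, t0, t1, s, n):
    """Explicit Euler integration from flow-time t0 to t1 in n steps."""
    h = (t1 - t0) / n
    for k in range(n):
        x = x + h * v(x, t0 + k * h, s)
    return x

def flow_consistent_value(q, v, s, x_tau, tau, n_steps):
    """V(s, x_tau, tau) := Q(s, Psi_{1,tau}(x_tau, s)), with Psi approximated by Euler."""
    return q(s, euler(v, x_tau, tau, 1.0, s, n_steps))

# Hypothetical stand-ins for the learned components.
v = lambda x, t, s: s - x                          # velocity field
q = lambda s, a: -float(np.sum((a - s) ** 2))      # critic preferring actions near s

s = np.array([0.5, -0.5])
x = np.array([2.0, 1.0])                           # x_0
n = 8
h = 1.0 / n

values = []
for k in range(n):
    tau = k * h
    # Value of the waypoint x_tau, using the n - k steps that remain on the grid.
    values.append(flow_consistent_value(q, v, s, x, tau, n - k))
    x = x + h * v(x, tau, s)                       # advance to the next waypoint

terminal_q = q(s, x)                               # Q at the produced action x_1
```

All entries of `values` equal `terminal_q`, which is what allows a value network over $(s, x_\tau, \tau)$ to be regressed onto the outer critic.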

2 Intermediate Value Gradient Matching

With a value at every $x_\tau$, Q-Flow steers the policy by the gradient of that value without backpropagating through the ODE solver. The target velocity combines a behavior-cloning (BC) anchor with intermediate value guidance:

$$v_{\text{target}}(x_1, x_0, \tau, s) \;=\; \underbrace{(x_1 - x_0)}_{\text{BC anchor}} \;+\; \frac{1}{\lambda}\,\underbrace{\nabla_{x_\tau} V^\pi_\omega(s, x_\tau, \tau)}_{\text{value guidance}}$$
  • $(x_1 - x_0)$: BC anchor that keeps generation on the data support.
  • $\nabla_{x_\tau} V^\pi_\omega$: Value guidance toward latent actions that lead to high-value clean actions.
  • $\lambda > 0$: Guidance coefficient, applied as $1/\lambda$: smaller $\lambda$ favors the learned value and larger $\lambda$ favors the data.

Matching is local to each $\tau$, so the policy improves without unrolling the ODE solver while keeping the flow's full expressivity.
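A small worked example shows how the two terms of the target combine and how $\lambda$ trades them off. This is a sketch under assumptions: `value_guided_target` is a hypothetical helper, the value function is a toy quadratic with a known gradient, and the dependence of $V$ on $s$ and $\tau$ is dropped for brevity.

```python
import numpy as np

def value_guided_target(x0, x1, tau, grad_v, lam):
    """v_target = (x1 - x0) + (1 / lam) * grad V, with the gradient taken at x_tau."""
    x_tau = (1.0 - tau) * x0 + tau * x1
    return (x1 - x0) + grad_v(x_tau) / lam

# Hypothetical value V(x) = -||x - goal||^2, whose gradient is -2 (x - goal).
goal = np.array([0.5, 1.0])
grad_v = lambda x: -2.0 * (x - goal)

x0 = np.array([0.0, 0.0])          # noise sample
x1 = np.array([1.0, 0.0])          # dataset action

# At tau = 0.5, x_tau = [0.5, 0] and grad V = [0, 2].
strong = value_guided_target(x0, x1, 0.5, grad_v, lam=2.0)   # anchor [1, 0] + [0, 1]
weak = value_guided_target(x0, x1, 0.5, grad_v, lam=1e6)     # nearly the pure BC anchor
```

With `lam=2.0` the target is bent toward the high-value region, while a very large `lam` recovers the plain flow-matching target `x1 - x0`. Computing the target needs only one gradient of the value network at $x_\tau$, with no solver rollout inside the policy update.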

Full Algorithm

Q-Flow jointly learns three networks: the outer critic $Q_\phi$, the inner value $V^\pi_\omega$, and the flow policy $v_\theta$.

Value learning. The outer critic backs up environment reward; the inner value distills it onto latent actions $x_\tau$.

$$\mathcal{L}_Q(\phi) \;=\; \mathbb{E}_{\substack{(s, a, r, s') \sim \mathcal{D} \\ a' \sim \pi_\theta(\cdot \mid s')}}\!\left[\big(Q_\phi(s, a) - r - \gamma\, Q_{\bar\phi}(s', a')\big)^2\right]$$

$$\mathcal{L}_V(\omega) \;=\; \mathbb{E}_{\substack{\tau \sim \mathcal{U}(0,1) \\ x_0 \sim \mathcal{N}(0, I) \\ (s,\, x_1) \sim \mathcal{D}}}\!\left[\big(V^\pi_\omega(s, x_\tau, \tau) - Q_{\bar\phi}\big(s,\, \Psi^\pi_{1,\tau}(x_\tau, s)\big)\big)^2\right]$$

Policy learning. Match the flow velocity to the value-guided target at every step.

$$\mathcal{L}_\pi(\theta) \;=\; \mathbb{E}_{\substack{\tau \sim \mathcal{U}(0,1) \\ x_0 \sim \mathcal{N}(0, I) \\ (s,\, x_1) \sim \mathcal{D}}}\!\left[\big\|v_\theta(x_\tau, \tau, s) - \mathrm{sg}\big[\,v_{\text{target}}(x_1, x_0, \tau, s)\,\big]\big\|^2\right]$$
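The three objectives can be written as batch estimators. The sketch below is illustrative only: every callable (`q`, `q_targ`, `v_net`, `grad_v`, `flow`, `v_theta`) is a hypothetical stand-in for a trained network or solver, and the discount `gamma=0.99` is an assumed default. NumPy does not differentiate anything, so the stop-gradient is implicit here; in an autodiff framework the policy target would be wrapped in a stop-gradient operator.

```python
import numpy as np

def td_loss(q, q_targ, s, a, r, s_next, a_next, gamma=0.99):
    """Outer critic: one-step TD error against the target critic."""
    target = r + gamma * q_targ(s_next, a_next)
    return float(np.mean((q(s, a) - target) ** 2))

def value_loss(v_net, q_targ, flow, s, x1, x0, tau):
    """Inner value: regress V(s, x_tau, tau) onto Q at the action x_tau flows to."""
    x_tau = (1.0 - tau) * x0 + tau * x1
    a_term = flow(x_tau, tau, s)               # Psi_{1,tau}, e.g. a solver rollout
    return float(np.mean((v_net(s, x_tau, tau) - q_targ(s, a_term)) ** 2))

def policy_loss(v_theta, grad_v, s, x1, x0, tau, lam):
    """Policy: match the velocity to the value-guided target, held constant."""
    x_tau = (1.0 - tau) * x0 + tau * x1
    target = (x1 - x0) + grad_v(s, x_tau, tau) / lam
    return float(np.mean(np.sum((v_theta(x_tau, tau, s) - target) ** 2, axis=-1)))

# Hypothetical toy components with values that can be checked by hand.
q = q_targ = lambda s, a: -np.sum(a ** 2, axis=-1)
v_net = lambda s, x, tau: -np.sum(x ** 2, axis=-1)
grad_v = lambda s, x, tau: -2.0 * x            # gradient of v_net with respect to x
flow = lambda x, tau, s: x                     # identity flow, for illustration only
v_theta = lambda x, tau, s: np.zeros_like(x)

B = 4
s = np.zeros((B, 1)); s_next = np.zeros((B, 1))
a = np.ones((B, 2)); a_next = np.zeros((B, 2)); r = np.ones(B)
x0 = np.zeros((B, 2)); x1 = np.ones((B, 2)); tau = np.full((B, 1), 0.5)

l_q = td_loss(q, q_targ, s, a, r, s_next, a_next)            # (-2 - 1)^2 = 9
l_v = value_loss(v_net, q_targ, flow, s, x1, x0, tau)        # 0: V already matches Q
l_pi = policy_loss(v_theta, grad_v, s, x1, x0, tau, lam=2.0) # target 0.5 per dim -> 0.5
```

Only `value_loss` calls the flow map, and it uses the result as a regression label, so no objective differentiates through the solver.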

Flow Evolution

We provide flow-evolution visualizations of Q-Flow on the 2D experiments. Each video pairs the two policies (BC and Q-Flow) column by column over the full flow trajectory ($\tau = 0 \to 1$), from the same noise initialization.

Videos: Swiss Roll and Two Spirals. In each video:

(Top) Sample trajectories from noise to action: BC flow (left) vs. Q-Flow (right).
(Bottom) BC velocity field (left) and the learned intermediate value landscape $V^\pi_\omega$ with its gradient $\nabla_{x_\tau} V^\pi_\omega$ (right).


Results

We evaluate on OGBench [1], a challenging suite spanning navigation, locomotion, manipulation, and puzzle tasks.

Standard Setting

Q-Flow achieves an average score of 54.4, outperforming FQL (43.8) by 10.6 pp, and attains the highest average on both task types in the results table (locomotion and manipulation). Largest gains over FQL: +31 pp on antmaze-giant, +25 pp on humanoidmaze-medium, +19 pp on puzzle-3×3.

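Column groups by policy class (left to right): Gaussian, Diffusion, Flow.

| Task type | Environment | IQL [4] | ReBRAC [5] | IDQL [6] | CAC [7] | FAWAC [2] | FBRAC [2] | IFQL [6] | FQL [2] | Q-Flow (ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| Locomotion | antmaze-large | 53±3 | 81±5 | 21±5 | 33±4 | 6±1 | 60±6 | 28±5 | 79±3 | 89±5 |
| Locomotion | antmaze-giant | 4±1 | 26±8 | 0±0 | 0±0 | 0±0 | 4±4 | 3±2 | 9±6 | 40±4 |
| Locomotion | humanoidmaze-medium | 33±2 | 22±8 | 1±0 | 53±8 | 19±1 | 38±5 | 60±14 | 58±5 | 83±4 |
| Locomotion | humanoidmaze-large | 2±1 | 2±1 | 1±0 | 0±0 | 0±0 | 2±0 | 11±2 | 4±2 | 8±2 |
| Locomotion | antsoccer | 8±2 | 0±0 | 12±4 | 2±4 | 12±0 | 16±1 | 33±6 | 60±2 | 56±4 |
| Manipulation | scene | 28±1 | 41±3 | 46±3 | 40±7 | 30±3 | 45±5 | 30±3 | 56±2 | 60±2 |
| Manipulation | puzzle-3×3 | 9±1 | 21±1 | 10±2 | 19±0 | 6±2 | 14±4 | 19±1 | 30±1 | 49±3 |
| Manipulation | puzzle-4×4 | 7±1 | 14±1 | 29±3 | 15±3 | 1±0 | 13±1 | 25±5 | 17±2 | 29±2 |
| Manipulation | cube-single | 83±3 | 91±2 | 95±2 | 85±9 | 81±4 | 79±7 | 79±2 | 96±1 | 95±1 |
| Manipulation | cube-double | 7±1 | 12±1 | 15±6 | 6±2 | 5±2 | 15±3 | 14±3 | 29±2 | 36±3 |
| | Average | 23.4 | 31.0 | 23.0 | 25.3 | 16.0 | 28.6 | 30.2 | 43.8 | 54.4 |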
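The headline numbers can be cross-checked against the per-task scores. The short script below recomputes the FQL and Q-Flow averages and the per-task gains from the rounded table entries; the variable names are illustrative, and `x` stands in for `×` in the task names.

```python
tasks = [
    "antmaze-large", "antmaze-giant", "humanoidmaze-medium", "humanoidmaze-large",
    "antsoccer", "scene", "puzzle-3x3", "puzzle-4x4", "cube-single", "cube-double",
]
fql = [79, 9, 58, 4, 60, 56, 30, 17, 96, 29]      # FQL column, rounded means
qflow = [89, 40, 83, 8, 56, 60, 49, 29, 95, 36]   # Q-Flow column, rounded means

mean_fql = sum(fql) / len(fql)        # 43.8, matching the reported average
mean_qflow = sum(qflow) / len(qflow)  # 54.5 from rounded entries (54.4 is reported)

# Per-task gain of Q-Flow over FQL, largest first.
gains = {t: q - f for t, q, f in zip(tasks, qflow, fql)}
top3 = sorted(gains, key=gains.get, reverse=True)[:3]
# top3 == ["antmaze-giant", "humanoidmaze-medium", "puzzle-3x3"], gains 31, 25, 19
```

The 0.1 difference between the recomputed and reported Q-Flow average is presumably because the reported figure averages unrounded per-task scores; that explanation is an assumption, not something stated in the source.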

+ Advanced Offline RL Techniques

Following the advanced evaluation protocol introduced in QAM [3], we combine Q-Flow with three offline RL techniques: ensemble critics (10 critics for more stable value estimates), pessimistic value backup (lower confidence bound to penalize out-of-distribution actions), and action chunking (predicting a sequence of future actions at each step). Under this harder protocol, Q-Flow achieves 47.5 on average, outperforming the strongest baseline QAM-E (44.8) by +2.7 pp.

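Column groups by policy class (left to right): Gaussian, Diffusion, Flow.

| Task type | Environment | ReBRAC [5] | DAC [8] | QSM [9] | FBRAC [2] | IFQL [6] | FQL [2] | QAM [3] | QAM-E [3] | Q-Flow (ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| Locomotion | antmaze-large | 94±1 | 88±2 | 90±3 | 2±2 | 33±4 | 75±6 | 77±5 | 81±3 | 94±1 |
| Locomotion | antmaze-giant | 54±4 | 14±6 | 13±5 | 0±0 | 1±0 | 1±2 | 15±7 | 1±2 | 41±4 |
| Locomotion | humanoidmaze-medium | 67±8 | 82±3 | 83±5 | 36±3 | 83±2 | 66±4 | 64±3 | 56±6 | 85±2 |
| Locomotion | humanoidmaze-large | 16±3 | 0±0 | 9±2 | 0±0 | 22±5 | 8±2 | 10±4 | 2±2 | 7±1 |
| Manipulation | scene-sparse | 65±7 | 67±5 | 85±1 | 45±6 | 84±2 | 79±1 | 97±1 | 97±1 | 97±1 |
| Manipulation | puzzle-3×3-sparse | 77±8 | 58±10 | 55±8 | 0±0 | 100±0 | 70±12 | 99±1 | 100±0 | 100±0 |
| Manipulation | puzzle-4×4-sparse | 0±0 | 0±0 | 0±0 | 17±4 | 0±0 | 5±3 | 0±0 | 36±5 | 0±0 |
| Manipulation | cube-double | 9±2 | 34±2 | 56±3 | 0±0 | 11±1 | 45±3 | 64±5 | 65±5 | 38±3 |
| Manipulation | cube-triple | 1±0 | 5±2 | 3±1 | 0±0 | 0±0 | 3±1 | 3±1 | 5±1 | 3±1 |
| Manipulation | cube-quadruple | 8±4 | 2±2 | 19±0 | 0±0 | 2±1 | 2±2 | 2±1 | 5±2 | 10±4 |
| | Average | 39.1 | 35.0 | 41.3 | 10.0 | 33.6 | 35.4 | 43.1 | 44.8 | 47.5 |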

Citation

@inproceedings{doo2026qflow,
  title     = {Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy},
  author    = {JaeHyeok Doo and Byeongguk Jeon and Seonghyeon Ye and Kimin Lee and Minjoon Seo},
  booktitle = {International Conference on Machine Learning},
  year      = {2026}
}

References

  1. S. Park, K. Frans, B. Eysenbach, and S. Levine. "OGBench: Benchmarking Offline Goal-Conditioned RL." arXiv preprint arXiv:2406.05027, 2025.
  2. S. Park, Q. Li, and S. Levine. "Flow Q-Learning." International Conference on Learning Representations (ICLR), 2025. FQL · FBRAC · FAWAC · IFQL
  3. Q. Li and S. Levine. "Q-Learning with Adjoint Matching." International Conference on Learning Representations (ICLR), 2026. QAM · QAM-E
  4. I. Kostrikov, A. Nair, and S. Levine. "Offline Reinforcement Learning with Implicit Q-Learning." International Conference on Learning Representations (ICLR), 2022. IQL
  5. D. Tarasov, V. Kolesnichenko, and S. Kolesnikov. "Revisiting the Minimalist Approach to Offline Reinforcement Learning." Advances in Neural Information Processing Systems (NeurIPS), 2023. ReBRAC
  6. P. Hansen-Estruch, I. Kostrikov, M. Janner, J. Kuba, and S. Levine. "IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies." arXiv preprint arXiv:2304.10573, 2023. IDQL · IFQL
  7. Z. Ding and C. Jin. "Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning." International Conference on Learning Representations (ICLR), 2024. CAC
  8. L. Fang, L. Wan, S. Liu, A. Gretton, and C. Zhang. "Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning." International Conference on Learning Representations (ICLR), 2025. DAC
  9. M. Psenka, A. Escontrela, P. Abbeel, and A. Ma. "Learning a Diffusion Model Policy from Rewards via Q-Score Matching." arXiv preprint arXiv:2312.11752, 2024. QSM