A flow policy turns noise $x_0 \sim \mathcal{N}(0, I)$ into an action $a = x_1$ by integrating a velocity
field along a trajectory $x_\tau$, $\tau \in [0, 1]$. Its dynamics are deterministic, so every intermediate state $x_\tau$ flows to exactly one terminal action:
$$x_1 \;=\; \Psi^\pi_{1,\tau}(x_\tau, s)$$
where $\Psi^\pi_{1,\tau}$ rolls policy $\pi$ forward from $x_\tau$ to $x_1$. This determinism underlies both components below.
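As a rough illustration of the rollout map $\Psi^\pi_{1,\tau}$, the sketch below integrates a velocity field from an intermediate state $x_\tau$ to $\tau = 1$ with fixed-step Euler updates. The integrator, the step count, the call signature `velocity(x, tau, s)`, and the toy field are all assumptions made for illustration; the source does not say which solver is used.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical signature for a velocity field v(x, tau, s); not taken from the source.
Velocity = Callable[[Vector, float, Vector], Vector]


def flow_rollout(velocity: Velocity, x_tau: Vector, tau: float, s: Vector,
                 n_steps: int = 10) -> Vector:
    """Approximate Psi_{1,tau}(x_tau, s) with fixed-step Euler integration.

    ASSUMPTION: Euler with n_steps uniform steps; any ODE solver could be used.
    """
    x = list(x_tau)
    dt = (1.0 - tau) / n_steps  # remaining flow time split into equal steps
    t = tau
    for _ in range(n_steps):
        v = velocity(x, t, s)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


def toy_velocity(x: Vector, tau: float, s: Vector) -> Vector:
    """Toy stand-in for v_theta: a constant drift equal to the state vector s."""
    return list(s)


# Same (x_tau, tau, s) in, same terminal action out: the rollout has no randomness.
a_first = flow_rollout(toy_velocity, [0.0, 0.0], 0.5, [1.0, -2.0])
a_second = flow_rollout(toy_velocity, [0.0, 0.0], 0.5, [1.0, -2.0])
```

Because the only randomness in the policy is the initial draw $x_0$, repeated rollouts from the same $(x_\tau, \tau, s)$ return the same terminal action.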
Q-Flow jointly learns three networks: the outer critic $Q_\phi$, the inner
value $V^\pi_\omega$, and the flow policy $v_\theta$.
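Value learning. The outer critic $Q_\phi$ backs up environment reward through a TD target; the inner value $V^\pi_\omega$ distills the critic onto latent actions $x_\tau$ by regressing to the target critic's value at the terminal action that $x_\tau$ flows to.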
$$\mathcal{L}_Q(\phi) \;=\; \mathbb{E}_{\substack{(s, a, r, s') \sim \mathcal{D} \\ a' \sim \pi_\theta}}\!\left[\big(Q_\phi(s, a) - r - \gamma\, Q_{\bar\phi}(s', a')\big)^2\right]$$
$$\mathcal{L}_V(\omega) \;=\; \mathbb{E}_{\substack{\tau \sim \mathcal{U}(0,1) \\ x_0 \sim \mathcal{N}(0, I) \\ (s,\, x_1) \sim \mathcal{D}}}\!\left[\big(V^\pi_\omega(s, x_\tau, \tau) - Q_{\bar\phi}\big(s,\, \Psi^\pi_{1,\tau}(x_\tau, s)\big)\big)^2\right]$$
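The two objectives above can be sketched per sample as follows. This is a minimal illustration in plain Python, not the authors' implementation: the callables `q`, `q_target`, `v_inner`, and `rollout` are hypothetical stand-ins for $Q_\phi$, $Q_{\bar\phi}$, $V^\pi_\omega$, and $\Psi^\pi_{1,\tau}$, and the linear path $x_\tau = (1 - \tau)\,x_0 + \tau\,x_1$ is an assumption, since the source does not state how $x_\tau$ is built from $x_0$ and $x_1$.

```python
from typing import Callable, List

Vector = List[float]


def critic_loss(q: Callable[[Vector, Vector], float],
                q_target: Callable[[Vector, Vector], float],
                s: Vector, a: Vector, r: float,
                s_next: Vector, a_next: Vector, gamma: float) -> float:
    """Per-sample squared TD error: (Q(s, a) - r - gamma * Q_target(s', a'))^2.

    a_next is assumed to have been sampled from the flow policy at s_next.
    """
    td_target = r + gamma * q_target(s_next, a_next)
    return (q(s, a) - td_target) ** 2


def interpolate(x0: Vector, x1: Vector, tau: float) -> Vector:
    """ASSUMPTION: linear path x_tau = (1 - tau) * x0 + tau * x1."""
    return [(1.0 - tau) * a + tau * b for a, b in zip(x0, x1)]


def inner_value_loss(v_inner: Callable[[Vector, Vector, float], float],
                     q_target: Callable[[Vector, Vector], float],
                     rollout: Callable[[Vector, float, Vector], Vector],
                     s: Vector, x0: Vector, x1: Vector, tau: float) -> float:
    """Per-sample squared error between V(s, x_tau, tau) and Q_target at the rolled-out action."""
    x_tau = interpolate(x0, x1, tau)
    a_terminal = rollout(x_tau, tau, s)  # stands in for Psi_{1,tau}(x_tau, s)
    return (v_inner(s, x_tau, tau) - q_target(s, a_terminal)) ** 2
```

In both functions the target critic is only evaluated, never updated, which mirrors the use of $Q_{\bar\phi}$ on the right-hand side of each loss.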
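Policy learning. The velocity field $v_\theta$ is regressed onto a value-guided target velocity at every flow time $\tau$; the stop-gradient $\mathrm{sg}[\cdot]$ treats that target as a constant, so gradients reach only $v_\theta$.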
$$\mathcal{L}_\pi(\theta) \;=\; \mathbb{E}_{\substack{\tau \sim \mathcal{U}(0,1) \\ x_0 \sim \mathcal{N}(0, I) \\ (s,\, x_1) \sim \mathcal{D}}}\!\left[\big\|v_\theta(x_\tau, \tau, s) - \mathrm{sg}\big[\,v_{\text{target}}(x_1, x_0, \tau, s)\,\big]\big\|^2\right]$$
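The regression above can be sketched per sample as below. The form of $v_{\text{target}}$ is not given in this section, so the code takes it as a hypothetical callable. The stand-in `plain_fm_target` returns $x_1 - x_0$, the unguided flow-matching velocity for a linear path, and is used only to make the example runnable. The linear interpolation for $x_\tau$ is likewise an assumption.

```python
from typing import Callable, List

Vector = List[float]


def policy_loss(v_theta: Callable[[Vector, float, Vector], Vector],
                v_target: Callable[[Vector, Vector, float, Vector], Vector],
                s: Vector, x0: Vector, x1: Vector, tau: float) -> float:
    """Per-sample squared distance between v_theta(x_tau, tau, s) and a fixed target."""
    # ASSUMPTION: linear path between noise x0 and dataset action x1.
    x_tau = [(1.0 - tau) * a + tau * b for a, b in zip(x0, x1)]
    # The target is computed as plain numbers and never differentiated,
    # which plays the role of the stop-gradient sg[.] in the loss.
    target = v_target(x1, x0, tau, s)
    pred = v_theta(x_tau, tau, s)
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def plain_fm_target(x1: Vector, x0: Vector, tau: float, s: Vector) -> Vector:
    """Hypothetical stand-in target: x1 - x0, with no value guidance."""
    return [b - a for a, b in zip(x0, x1)]


def zero_velocity(x: Vector, tau: float, s: Vector) -> Vector:
    """Toy stand-in for an untrained v_theta that predicts zero velocity."""
    return [0.0 for _ in x]


loss = policy_loss(zero_velocity, plain_fm_target, [0.0], [0.0, 0.0], [3.0, 4.0], 0.3)
```

In an autodiff framework the same effect would come from detaching the target before the squared error is taken.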
We provide flow-evolution visualizations of Q-Flow on the 2D experiments. Each video pairs the two policies (BC and Q-Flow) column by column over the full flow trajectory
($\tau = 0 \to 1$), from the same noise initialization.