Five diffusion papers worth reading this Monday (June 8, 2026)

This extended weekend batch (June 6–8, ~72h) surfaces five preprints from distinct corners of the diffusion pipeline. DAVE (KAIST, ICML 2026) identifies DC-component trajectory lock-in in Flow Matching transformers and removes it without any training, raising Recall by 93% on MS-COCO. TrioPose (CAS) redesigns pose conditioning as a native third stream in MM-DiT with zero-initialized residuals, reaching AP 64.33 on Human-Art (+30% over GRPose). STREAM (DEEPNOID) applies Riemannian Flow Matching in UNI feature space for histopathology generation, achieving gFID 6.61 on TCGA-BRCA. AsyncPatch (Google DeepMind) proves what its authors call the first valid ELBO for joint diffusion with per-pixel asynchronous noise, enabling zero-shot inpainting with FID 15–20% below RePaint on most masks. FreeAnimate (Tsinghua SIGS, ICASSP 2026) achieves training-free human animation by using DDIM-inverted preview frames as structural scaffolding, outperforming trained baselines on out-of-domain data.

ArXiv Diffusion Models Digest
2026. 6. 8. · 23:20

Research brief

Today's digest covers the extended weekend batch — ArXiv submissions from June 6–8 bundled into Monday's release. Eleven diffusion-model preprints surfaced across cs.CV and cs.LG; five made the cut. Two carry venue confirmations (ICML 2026 and ICASSP 2026). The set spans training-free diversity enhancement, native pose conditioning in MM-DiT, Riemannian flow matching for pathology, per-pixel asynchronous noise theory from Google DeepMind, and a training-free animation framework. All five papers were submitted June 5 — community engagement data is not yet available, which is expected for sub-72h submissions.

1. DAVE: breaking trajectory lock-in to restore generation diversity

ArXiv: 2606.06813 | KAIST | cs.CV | ICML 2026
Peer-review status: ICML 2026 accepted. Code: github.com/daheekwon/DAVE (MIT License; release pending publication).
When you run the same Flow Matching model on the same prompt five times, the outputs often look suspiciously similar. The paper's diagnosis: in SD3's Transformer block 5, the DC component — the zero-frequency spatial mean of internal representations — accounts for 51.2% of feature energy and reaches pairwise cosine similarity ≥ 0.99 across different noise seeds by the early denoising steps. The authors, Dahee Kwon, Haeun Lee, and Jaesik Choi, call this "early DC drift": trajectories converge before the model has meaningfully committed to content. 1
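To make the two diagnostics concrete, here is a minimal NumPy sketch of how a DC energy share and a cross-seed cosine similarity could be measured. It assumes features are stored as a (seeds, tokens, channels) array and that "DC component" means the mean over tokens; the toy data and the function names are illustrative, not taken from the paper.

```python
import numpy as np

def dc_component(features):
    """Spatial mean over tokens: (seeds, tokens, channels) -> (seeds, channels)."""
    return features.mean(axis=1)

def dc_energy_fraction(features):
    """Share of the squared feature norm carried by the token-mean (DC) part."""
    dc = dc_component(features)
    n_tokens = features.shape[1]
    dc_energy = n_tokens * np.sum(dc ** 2, axis=-1)   # energy of the mean broadcast to all tokens
    total_energy = np.sum(features ** 2, axis=(1, 2))
    return dc_energy / total_energy

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)
    sim = unit @ unit.T
    upper = np.triu_indices(len(vectors), k=1)
    return float(sim[upper].mean())

# Toy stand-in: a seed-independent offset plus small seed-dependent detail.
rng = np.random.default_rng(0)
shared_offset = rng.normal(size=(1, 1, 64))
per_seed_detail = 0.1 * rng.normal(size=(5, 256, 64))
features = shared_offset + per_seed_detail

frac = dc_energy_fraction(features)
cos = mean_pairwise_cosine(dc_component(features))
print(frac.round(3), round(cos, 4))
```

On real activations, `features` would come from a forward hook on the block of interest, collected once per noise seed at the same denoising step.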
The fix is deliberately minimal. DAVE attenuates the DC component by a factor α=0.5 during the first 15–20% of denoising steps (τ=0.15), applied only to a selected pool of Transformer blocks ℒ. It requires no retraining and no weight changes, and it targets exactly the behavior the authors report: "the zero-frequency spatial average (DC) component exhibits strong trajectory lock-in in the early denoising steps across seeds." 1
Figure: DAVE overview. The left panel shows the DC attenuation formula and a bar chart of spatial-frequency power; the right panel shows original vs. DAVE outputs for three prompts (girl, boxed meal, train in snow).
DAVE suppresses the DC component in early denoising steps, widening the distribution of generated samples with negligible cost. 1
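The attenuation step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it assumes "attenuate by α" means scaling the token-mean of a block's activations by α while leaving the remainder untouched, and the block pool `{4, 5, 6}` is a hypothetical choice.

```python
import numpy as np

def attenuate_dc(hidden, alpha=0.5):
    """Scale the token-mean (DC) part of `hidden` by alpha and keep the rest.

    hidden: (tokens, channels) activations of one Transformer block.
    """
    dc = hidden.mean(axis=0, keepdims=True)
    return hidden - (1.0 - alpha) * dc

def maybe_attenuate(hidden, step, num_steps, block_idx, block_pool,
                    alpha=0.5, tau=0.15):
    """Attenuate only in the first `tau` fraction of steps and only in pooled blocks."""
    is_early = step < tau * num_steps
    if is_early and block_idx in block_pool:
        return attenuate_dc(hidden, alpha)
    return hidden

rng = np.random.default_rng(0)
hidden = rng.normal(size=(256, 64)) + 3.0   # toy activations with a large mean
pool = {4, 5, 6}                            # hypothetical block pool

early = maybe_attenuate(hidden, step=2, num_steps=28, block_idx=5, block_pool=pool)
late = maybe_attenuate(hidden, step=20, num_steps=28, block_idx=5, block_pool=pool)
```

In the early call the token-mean is halved and the zero-mean residual is unchanged; the late call returns the input as is.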
The results on SD3.5 (MS-COCO) are substantial: Recall jumps from 0.2546 to 0.4916 (+93%) and FID improves from 36.38 to 29.56. On ImageNet with SD3.5, Recall rises from 0.2589 to 0.6489 (+151%). The CLIP score dips slightly (0.3129 → 0.3055) — a modest fidelity-diversity trade-off rather than a collapse. In in-batch diversity evaluation (batch=4, MS-COCO), DAVE's Recall of 0.441 beats DiverseFlow (0.209), OSCAR (0.245), SPELL (0.235), and PG (0.204) by a wide margin. 1
A useful ablation: block-wise analysis reveals that block 0 affects object scale, early blocks (1–3) primarily modulate color, and block 14 modulates texture. DAVE also transfers to SD3.5-Large-Turbo (Recall: 0.384 → 0.431) without modification, which suggests the mechanism is not specific to the full-step model.
Why read it: Diversity collapse in flow-matching T2I models is a known issue with few practical handles. DAVE's finding that DC attenuation at τ=0.15 unlocks most of the diversity budget is the kind of result that's immediately testable on any SD3-family checkpoint. For anyone building sampling strategies or studying representation geometry in DiTs, the block-wise frequency analysis alone is worth the read.

2. TrioPose: pose as a native modality in MM-DiT

ArXiv: 2606.07053 | Institute of Automation, CAS | cs.CV
Peer-review status: Preprint. No public code repository.
ControlNet-style adapters were designed for UNet-based diffusion models. When researchers tried to port them to MM-DiT (SD3, SD3.5), they ran into a structural problem: "naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions," as Dian Gu and Zhengyi Yang of the Chinese Academy of Sciences put it. The symptom is convergence failure and structural deformation in multi-person generation scenarios. 2
TrioPose's answer is to treat pose as a third native stream alongside text and image from the start of training, rather than as an injected side-channel. The architecture — Triple-Stream Pose-Aware DiT, or TSPA-DiT — adds a dedicated Pose Stream to SD3.5M's MM-DiT blocks, connected via zero-initialized dual residual injection (ZeroConv₁ before Joint Attention, ZeroConv₂ after MLP). Only the first 12 of the model's DiT blocks run the Pose Stream; deeper blocks where pose information is already integrated do not, saving compute. 2
Figure: TrioPose architecture. (a) The full TSPA-DiT pipeline with three input streams (text, image latent, pose latent); (b) one TSPA-DiT block with zero-conv dual residual injection; (c) Pose-Guided Spatial Loss Weighting using ViTPose heatmaps; (d) the Learnable Relational Bias Mask with five physical-state categories.
TrioPose processes pose, text, and image as parallel streams. Zero-initialized residuals prevent the pose stream from disrupting pre-trained alignment at training start. 2
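Why zero initialization protects the pre-trained model can be shown with a toy block. The sketch below is a simplification under stated assumptions: a zero-initialized linear layer stands in for each zero-conv, `toy_attention` and `toy_mlp` are placeholders for the real sublayers, and pose tokens are assumed to share the image tokens' shape.

```python
import numpy as np

class ZeroLinear:
    """Linear projection whose weight and bias start at zero (zero-conv stand-in)."""
    def __init__(self, dim):
        self.weight = np.zeros((dim, dim))
        self.bias = np.zeros(dim)

    def __call__(self, x):
        return x @ self.weight + self.bias

def toy_attention(h):
    return np.tanh(h)          # placeholder for joint attention

def toy_mlp(h):
    return 0.5 * h             # placeholder for the MLP sublayer

def block_without_pose(image_tokens):
    h = image_tokens + toy_attention(image_tokens)
    return h + toy_mlp(h)

def block_with_pose(image_tokens, pose_tokens, zero_conv_1, zero_conv_2):
    h = image_tokens + zero_conv_1(pose_tokens)   # injection before attention
    h = h + toy_attention(h)
    h = h + toy_mlp(h)
    return h + zero_conv_2(pose_tokens)           # injection after the MLP

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16, 8))
pose_tokens = rng.normal(size=(16, 8))
zc1, zc2 = ZeroLinear(8), ZeroLinear(8)

out_at_init = block_with_pose(image_tokens, pose_tokens, zc1, zc2)
baseline = block_without_pose(image_tokens)
print(np.allclose(out_at_init, baseline))
```

At initialization both injections contribute exactly zero, so the block reproduces the pre-trained output; the pose signal only enters as the zero-initialized weights move during training.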
Two supporting components address multi-person complexity specifically. The first, Pose-Guided Spatial Loss Weighting, weights the training loss spatially using ViTPose heatmaps. The second, the Learnable Relational Bias Mask (LRBM), classifies limb-to-limb relationships into five physical states (non-related, pure self, overlap internal, background, host-to-overlap) and maps each to a continuous attention bias, replacing the hard binary masks that fail when limbs from different people overlap. Ablation confirms the value: swapping LRBM for a binary mask drops AP from 64.33 to 62.63 and raises FID from 1.65 to 1.82. 2
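The LRBM idea of mapping a categorical relation to a continuous attention bias can be sketched as a table lookup added to the attention logits. Everything specific here is an assumption for illustration: the bias values, the random state assignment, and the single-head layout are not from the paper.

```python
import numpy as np

STATES = ["non-related", "pure self", "overlap internal",
          "background", "host-to-overlap"]

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def biased_attention(q, k, v, state_ids, state_bias):
    """Single-head attention with an additive bias looked up per token pair.

    state_ids:  (T, T) integer array, one of the five states per pair.
    state_bias: (5,) vector of biases; learnable in training, fixed here.
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])
    probs = softmax(logits + state_bias[state_ids])
    return probs @ v, probs

rng = np.random.default_rng(0)
T, d = 6, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
state_ids = rng.integers(0, len(STATES), size=(T, T))
state_bias = np.array([-4.0, 0.0, 0.5, -1.0, 0.25])   # hypothetical values

out, probs = biased_attention(q, k, v, state_ids, state_bias)
```

A hard binary mask corresponds to biases restricted to 0 and negative infinity; a learnable vector lets overlapping limbs be down-weighted by degrees instead of being cut off entirely.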
Results across three benchmarks, all versus the strongest prior baselines:
| Benchmark | TrioPose AP | Prior best AP | FID (TrioPose) |
|---|---|---|---|
| Human-Art | 64.33 | GRPose: 49.50 (+30%) | 1.65 |
| CrowdPose | 58.56 | Stable-Pose: 50.25 | 0.78 |
| OCHuman | 62.59 | Stable-Pose: 58.84 | 0.92 |
A diagnostic result worth noting: the TSPA-DiT architecture alone, with no LRBM and no Pose-Guided Spatial Loss Weighting, lifts AP from 8.17 (vanilla SD3.5M baseline) to 63.50. The vanilla baseline's very low score reflects structural collapse that generic adapter injection cannot fix. 2
Why read it: Any lab adapting ControlNet-style pose conditioning to DiT/MM-DiT architectures will encounter the same distribution-disruption problem TrioPose identifies. The TSPA-DiT's zero-init residual scheme and the LRBM's continuous attention bias are both transferable ideas. No code is available yet, but the architecture is described with enough detail to reproduce.

3. STREAM: Riemannian flow matching for histopathology

ArXiv: 2606.07036 | DEEPNOID Inc. | cs.CV
Peer-review status: Preprint. Code to be released upon acceptance.
Existing SOTA histopathology generation models (ZoomLDM, PixCell) work in VAE latent space. Won June Cho and colleagues at DEEPNOID, a Korean AI pathology company, ran a diagnostic test: linear probe AUROC on SPIDER-breast reveals that UNI (ViT-L/16) patch features score 0.995, while ZoomLDM's VAE latents score 0.624 and PixCell's 0.640. VAE compression discards the very semantic detail that makes pathology generation medically useful. 3
There is a second problem. The team found that 62–76% of output diversity in both ZoomLDM and PixCell comes from the conditioning VFM signal rather than from the learned latent distribution — what they call conditioning collapse. PixCell's design exposes this most starkly: forced to generate without a conditioning image, its gFID on TCGA-BRCA explodes to 104.18. The model never learned to generate unconditionally. 3
STREAM addresses both problems by operating directly in UNI feature space. UNI's ℓ₂-normalized patch tokens naturally lie on the unit hypersphere 𝒮^(d−1), so STREAM applies Riemannian Flow Matching with a bridge-type stochastic schedule: a Brownian-bridge perturbation with σ(t)=σ_max·sin(πt), which vanishes at both endpoints (t=0 and t=1) and peaks at t=0.5, replaces the deterministic SLERP conditioning path. 3
Figure: STREAM framework. A frozen UNI ViT-L/16 encoder maps histopathology patches to 256 tokens on the unit hypersphere; a DiT learns bridge-perturbed geodesic transport; a separately trained anisotropic decoder uses SVD of the velocity-field Jacobian to reconstruct tissue patches with directionally calibrated noise.
STREAM routes the generative process through UNI's high-AUROC feature space on the hypersphere, rather than through VAE latents with low semantic fidelity. 3
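The bridge-perturbed path on the sphere can be illustrated as follows. This is a sketch under assumptions: it uses tangent-space Gaussian noise followed by renormalization as a simple retraction, which approximates but is not identical to sampling via the exponential map, and the values of `sigma_max` and the dimension are arbitrary.

```python
import numpy as np

def slerp(x0, x1, t):
    """Geodesic interpolation between two unit vectors."""
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if omega < 1e-8:
        return x0.copy()
    return (np.sin((1.0 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

def bridge_point(x0, x1, t, sigma_max, rng):
    """Geodesic point perturbed by tangent noise of scale sigma_max * sin(pi * t)."""
    mu = slerp(x0, x1, t)
    sigma = sigma_max * np.sin(np.pi * t)
    noise = rng.normal(size=mu.shape)
    tangent = noise - np.dot(noise, mu) * mu   # remove the radial part
    y = mu + sigma * tangent
    return y / np.linalg.norm(y)               # retract back onto the sphere

def random_unit(dim, rng):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
x0, x1 = random_unit(16, rng), random_unit(16, rng)

start = bridge_point(x0, x1, 0.0, sigma_max=0.3, rng=rng)
middle = bridge_point(x0, x1, 0.5, sigma_max=0.3, rng=rng)
end = bridge_point(x0, x1, 1.0, sigma_max=0.3, rng=rng)
```

Because sin(πt) is zero at t=0 and t=1, the path is pinned to both endpoints and only the interior is stochastic, which is the property that distinguishes it from plain SLERP.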
The second stage is an Anisotropic Decoder whose noise covariance is determined by SVD of the trained DiT's velocity-field Jacobian. High-energy directions (U_H) receive small noise — protecting pathologically meaningful variance — while low-energy directions (U_L) receive large noise to introduce stochasticity. The two stages interact super-additively: ablations show that Bridge alone gives rFID 6.51, Anisotropic Decoder alone gives 5.09, but together they reach rFID 3.52 (gFID 6.86 vs. 9.07 for each component alone). 3
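A covariance shaped by a Jacobian's singular directions can be built like this. The sketch assumes the left singular vectors define the directions, that the top `k` count as "high-energy", and that a random matrix can stand in for the velocity-field Jacobian; the two noise scales are illustrative.

```python
import numpy as np

def anisotropic_covariance(jacobian, k, sigma_for_high, sigma_for_low):
    """Covariance with small variance on the top-k singular directions
    and large variance on the remaining ones."""
    u, _, _ = np.linalg.svd(jacobian)        # columns of u sorted by singular value
    scales = np.full(u.shape[1], sigma_for_low)
    scales[:k] = sigma_for_high
    cov = (u * scales ** 2) @ u.T            # U diag(scales^2) U^T
    return cov, u, scales

def sample_noise(u, scales, rng):
    """Draw one noise vector with the covariance defined above."""
    return u @ (scales * rng.normal(size=scales.shape))

rng = np.random.default_rng(0)
jacobian = rng.normal(size=(12, 12))         # toy stand-in for the Jacobian
cov, u, scales = anisotropic_covariance(jacobian, k=4,
                                        sigma_for_high=0.01, sigma_for_low=0.5)
noise = sample_noise(u, scales, rng)
```

Projecting `cov` onto the first singular direction gives the small variance and onto the last gives the large one, so the decoder's stochasticity is concentrated where the velocity field is least sensitive.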
Final numbers on TCGA-BRCA: gFID 6.61 (vs. ZoomLDM 7.43), rFID 2.42 (vs. PixCell 2.88), FvD 78.04 (vs. ZoomLDM 196.41). On TCGA-COADREAD: gFID 7.68 (vs. ZoomLDM 8.09). 3
Why read it: STREAM introduces a principled geometry-aware alternative to VAE-based generative pipelines for medical imaging — a domain where semantic fidelity matters more than aesthetic diversity. The conditioning-collapse diagnostic methodology (measuring ρ_cond in existing models) is independently useful for auditing any VFM-conditioned generative system. The super-additivity result is a reminder that noise schedule and decoder are not independent design choices.

4. AsyncPatch: per-pixel asynchronous noise and the first joint-diffusion ELBO

ArXiv: 2606.07079 | Google DeepMind | cs.CV
Peer-review status: Preprint. No public code (Google DeepMind paper).
Standard diffusion models assign one noise level to the entire image at each timestep. Every pixel is equally corrupted, equally denoised. Samuele Papa and colleagues at Google DeepMind ask what happens if different spatial regions are allowed to have independent noise levels — and work out the full theoretical framework to make this principled. 4
The key theoretical result: for joint diffusion over N tokens with independent timesteps, the paper proves "the first valid ELBO for joint diffusion, which requires different approaches than classical diffusion models." The derivation uses concentration results for monotone random walks on (0,1)^N, averaging the DDPM-style ELBO over all possible path orderings. This makes the objective tractable without additional approximation. 4
Training on independent timesteps naively creates a distributional mismatch: during training, the model almost never sees states where all patches are at the same noise level, but those states are essential for standard sampling. AsyncPatch resolves this with a budget-constrained timestep sampling strategy — fix the global average corruption level, then sample patch-wise timesteps subject to that constraint. This aligns the training and sampling distributions without discarding the flexibility. 4
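One way to sample patch-wise timesteps under a fixed average budget is sketched below. The digest does not describe the paper's actual sampler, so this scheme (a zero-mean spread pattern scaled to stay inside [0, 1]) is a hypothetical construction that only shares the stated constraint.

```python
import numpy as np

def sample_patch_timesteps(num_patches, budget, rng):
    """Sample per-patch timesteps in [0, 1] whose mean equals `budget`."""
    u = rng.uniform(size=num_patches)
    dev = u - u.mean()                        # zero-mean spread pattern
    # Largest scale that keeps budget + scale * dev inside [0, 1].
    with np.errstate(divide="ignore", invalid="ignore"):
        upper = np.where(dev > 0, (1.0 - budget) / dev, np.inf)
        lower = np.where(dev < 0, budget / -dev, np.inf)
    max_scale = min(upper.min(), lower.min())
    scale = rng.uniform(0.0, min(1.0, max_scale))
    return budget + scale * dev

rng = np.random.default_rng(0)
t = sample_patch_timesteps(num_patches=64, budget=0.3, rng=rng)
print(round(float(t.mean()), 6), float(t.min()) >= 0.0, float(t.max()) <= 1.0)
```

When the sampled scale is zero every patch sits at the budget level, which recovers the uniform-noise states that standard sampling relies on; larger scales give spatially varied states with the same average corruption.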
Figure: AsyncPatch Figure 1. A comparison of four sampling schedules: traditional (all patches same noise), autoregressive (raster-scan noise decay), inpainting (known region stays clean, unknown region denoises), and adaptive (uncertainty-guided), with example forward diffusion paths showing spatially varying noise levels across patches.
AsyncPatch supports four sampling modes within one model: traditional generation, autoregressive raster-scan, zero-shot inpainting, and adaptive uncertainty-guided generation. 4
The inpainting case is the most directly useful application: keep the known region's noise level at 0 (clean) and denoise only the masked region. No inpainting-specific training objective, no fine-tuning: "inpainting is obtained by choosing a spatial noise schedule that keeps the known region clean and denoises only the complementary mask, while no task-specific inpainting objective is required." 4
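The inpainting schedule amounts to a per-patch timestep map. The sketch below shows only the forward-noising side and rests on assumptions: a cosine variance-preserving schedule, flattened patches, and a boolean mask where `True` marks the region to fill.

```python
import numpy as np

def inpainting_timesteps(mask, t):
    """Per-patch timesteps: 0 for known patches, t for masked patches."""
    return np.where(mask, t, 0.0)

def noise_patches(x0, t_map, rng):
    """Variance-preserving noising with a separate level per patch.

    x0: (patches, dim); t_map: (patches,). Assumes alpha_bar(t) = cos^2(pi * t / 2).
    """
    alpha_bar = np.cos(0.5 * np.pi * t_map) ** 2
    eps = rng.normal(size=x0.shape)
    signal = np.sqrt(alpha_bar)[:, None] * x0
    noise = np.sqrt(1.0 - alpha_bar)[:, None] * eps
    return signal + noise

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 48))               # 16 patches, 48 values each
mask = np.zeros(16, dtype=bool)
mask[5:11] = True                            # patches to be inpainted

t_map = inpainting_timesteps(mask, t=0.8)
xt = noise_patches(x0, t_map, rng)
```

Known patches pass through unchanged because their timestep is 0; during sampling only the masked patches' timesteps would be stepped down toward 0.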
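| Task | AsyncPatch | Baseline | Δ |
|---|---|---|---|
| ImageNet 256 FID (unconditional) | 8.06 | LDM: 8.24 | −2% |
| Extrema mask inpainting FID | 39.0 | RePaint: 45.7 | −15% |
| Wide mask inpainting FID | 22.1 | RePaint: 27.6 | −20% |
| LSUN Bedroom, Extrema FID | 20.2 | RePaint: 23.7 | −15% |
| LSUN Bedroom, Square FID | 14.5 | RePaint: 15.6 | −7% |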
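The Δ column is a relative FID change against the baseline. A quick check of the arithmetic, using only the numbers listed above (the row labels are shortened for the example):

```python
rows = {
    "ImageNet 256, unconditional (vs. LDM)": (8.06, 8.24),
    "Extrema mask (vs. RePaint)": (39.0, 45.7),
    "Wide mask (vs. RePaint)": (22.1, 27.6),
    "LSUN Bedroom, Extrema (vs. RePaint)": (20.2, 23.7),
    "LSUN Bedroom, Square (vs. RePaint)": (14.5, 15.6),
}

def relative_change_pct(ours, baseline):
    """Signed percentage change of `ours` relative to `baseline`."""
    return 100.0 * (ours - baseline) / baseline

deltas = {name: round(relative_change_pct(ours, base))
          for name, (ours, base) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta}%")
```

Three of the four inpainting rows fall in the 15–20% range; the LSUN Bedroom square-mask gap is smaller, at about 7%.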
AsyncPatch also enables autoregressive generation (raster-scan across 8×8 patches, 16 steps per patch) and an Input Guidance mechanism that conditions texture synthesis on the score difference between clean and partially corrupted regions — outperforming Firefly 3 and Gen-3 on texture transfer in the paper's qualitative examples. 4
Why read it: The theoretical contribution — a valid ELBO for joint diffusion — is self-contained and opens up a class of spatially non-uniform generative processes that were previously unprincipled. For practitioners, the zero-shot inpainting performance without any inpainting-specific training is the compelling result. Code has not been released, but the problem formulation and the ELBO derivation are fully described.

5. FreeAnimate: training-free human image animation via preview-guided denoising

ArXiv: 2606.06885 | Tsinghua SIGS / HIT / Peng Cheng Laboratory | cs.CV | ICASSP 2026
Peer-review status: ICASSP 2026 accepted. Project page: freeani.github.io. No public code repository.
Training a human animation model means committing to a dataset distribution. DisCo (a trained baseline) achieves FID 30.75 on TikTok data, but when evaluated on the out-of-domain TED-Talks dataset its FID degrades to 75.48. FreeAnimate avoids this entirely: built on SD v1.5 + ControlNet, it requires no additional training and — as Yuan Zeng, Yujia Shi, Zongqing Lu, and QingMin Liao from Tsinghua SIGS put it — "being training-free, FreeAnimate is less affected by data distribution shifts, maintaining consistent performance across datasets." Its FID goes 27.82 (TikTok) → 24.31 (TED-Talks). 5
The framework has three stages. First, a Preview Generation Strategy constructs a structurally consistent preview frame for each target pose using MasaCtrl (self-attention token swap for appearance), T2I-Adapter (pose conditioning), Grounded-SAM (person segmentation), and MAT inpainting (background restoration). The preview is not the final output — it is a structural scaffold aligned to the target frame. 5
Figure: FreeAnimate framework. The reference image and pose sequence feed into Preview Generation, which produces preview frames. DDIM Inversion of those frames generates initial noise and stores self-/cross-attention maps. During denoising, ControlNet applies pose conditioning while Inversion-Boosted Attention reuses the stored maps; Reference-Anchored Self-Attention keeps the reference image as an anchor.
FreeAnimate uses DDIM inversion of preview frames as structural scaffolding, storing attention maps that guide the final denoising pass without any weight updates. 5
Second, those preview frames are DDIM-inverted to extract initial noise and cache self/cross-attention maps. Third, during the actual denoising pass, Inversion-Boosted Attention (IBA) injects those cached attention maps to preserve structural alignment, while Reference-Anchored Self-Attention (RA-SA) keeps the original reference image as a persistent anchor for appearance consistency across the full animation. 5
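The two attention hooks can be sketched separately. This is a simplified, single-head illustration: the `blend` parameter is an assumption (the digest only says cached maps are injected), and appending reference keys and values is one common way to realize an anchor, not necessarily the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_probs(q, k):
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def inversion_boosted_attention(q, k, v, cached_probs, blend=1.0):
    """Mix attention maps cached during DDIM inversion into the current pass.

    blend=1.0 replaces the fresh map entirely; blend=0.0 ignores the cache.
    """
    probs = blend * cached_probs + (1.0 - blend) * attention_probs(q, k)
    return probs @ v

def reference_anchored_attention(q, k, v, ref_k, ref_v):
    """Let every frame token also attend to the reference image's tokens."""
    k_all = np.concatenate([k, ref_k], axis=0)
    v_all = np.concatenate([v, ref_v], axis=0)
    return attention_probs(q, k_all) @ v_all

rng = np.random.default_rng(0)
T, R, d = 6, 4, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
ref_k, ref_v = rng.normal(size=(R, d)), rng.normal(size=(R, d))
cached = attention_probs(rng.normal(size=(T, d)), rng.normal(size=(T, d)))

iba_out = inversion_boosted_attention(q, k, v, cached, blend=1.0)
rasa_out = reference_anchored_attention(q, k, v, ref_k, ref_v)
```

Neither hook touches model weights, which is what keeps the pipeline training-free: structure comes from the cached maps and appearance from the appended reference tokens.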
Ablation isolates each component's contribution on the TikTok benchmark:
| Configuration | FID | FVD |
|---|---|---|
| Full FreeAnimate | 27.82 | 170.18 |
| No Preview Strategy | 50.07 | not given |
| No IBA | 39.34 | not given |
| No RA-SA | 39.20 | 259.02 |
The Preview Generation Strategy is the single largest contributor: removing it raises FID by over 22 points. The paper also shows the strategy is model-agnostic: plugging it into MagicPose (a trained baseline) improves that model's FID from 25.50 to 24.61 and FVD from 216.01 to 180.49. 5
Inference cost: approximately 5,496 ms per frame at 5,561 MB peak VRAM (split: preview 3,103 ms + DDIM inversion 1,043 ms + denoising 1,350 ms). An upper-bound on quality: when high-quality driving video frames replace the synthesized previews, FID drops further to 23.54 and SSIM rises to 0.817, indicating preview quality is still the primary performance ceiling. 5
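To put the per-frame cost in perspective, the stage timings above can be turned into shares and a clip-level estimate. The 100-frame clip length is a hypothetical example, and the estimate assumes frames are processed one at a time with no batching or caching across frames.

```python
stage_ms = {"preview": 3103, "ddim_inversion": 1043, "denoising": 1350}

total_ms = sum(stage_ms.values())
shares = {name: ms / total_ms for name, ms in stage_ms.items()}

def clip_seconds(num_frames, per_frame_ms=total_ms):
    """Wall-clock estimate for a clip, assuming strictly sequential frames."""
    return num_frames * per_frame_ms / 1000.0

print(total_ms)                                   # 5496
print({name: f"{share:.0%}" for name, share in shares.items()})
print(clip_seconds(100))                          # 549.6
```

Preview generation accounts for a little over half of the per-frame time, so it is both the main quality ceiling and the main latency cost.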
Why read it: Training-free animation on a frozen SD v1.5 + ControlNet stack outperforming trained baselines on out-of-domain data is a result worth examining carefully. The Preview Generation pipeline is a practical recipe — MasaCtrl, T2I-Adapter, Grounded-SAM, MAT are all publicly available — and the model-agnostic ablation means the preview strategy can be tested as a plug-in on other animation frameworks. ICASSP 2026 acceptance confirms the experimental bar was met.

Quick reference

| Paper | ArXiv ID | Institution | Core method | Key result | Code |
|---|---|---|---|---|---|
| DAVE | 2606.06813 | KAIST | DC component attenuation in early denoising | Recall +93% (MS-COCO, SD3.5); ICML 2026 | GitHub (pending) |
| TrioPose | 2606.07053 | CAS (CASIA) | Triple-stream pose MM-DiT + LRBM | AP +30% over GRPose on Human-Art; FID 1.65 | None |
| STREAM | 2606.07036 | DEEPNOID | Riemannian FM + Anisotropic Decoder in UNI feature space | gFID 6.61, rFID 2.42 on TCGA-BRCA | Pending acceptance |
| AsyncPatch | 2606.07079 | Google DeepMind | Per-pixel async noise + first joint-diffusion ELBO | Zero-shot inpainting FID 15–20% below RePaint on most masks | None |
| FreeAnimate | 2606.06885 | Tsinghua SIGS | Preview-guided DDIM inversion + IBA + RA-SA | FID 27.82 (TikTok), 24.31 (TED-Talks); ICASSP 2026 | None (project page) |
Two papers — DAVE and FreeAnimate — have confirmed venue acceptances. Three are preprints without announced venues (TrioPose's submission follows NeurIPS 2026 formatting but has not declared acceptance). Code availability remains limited this batch: DAVE's repository exists but is gated on publication; STREAM will open-source upon acceptance; the remaining three have neither code nor release timelines. That asymmetry is worth noting when deciding which papers to read versus which to prototype from.
Cover: AI-generated illustration
