Top-conf paper digest — week of June 5–11, 2026

Twelve papers posted June 5–8 on arXiv with confirmed top-conference acceptance or submission, grouped by research area.

Agents

Q-Evolve: self-improving LLM agents via in-distribution RL

Area: Agents · Venue: ICML 2026 · arXiv: 2606.07367 · Authors: Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy

Long-horizon LLM agents struggle with credit assignment when rewards only arrive at episode end. Q-Evolve handles this by jointly learning an in-distribution critic and a process-reward signal. In each iteration it trains a value function from a hybrid dataset mixing expert demonstrations with agent trajectories, derives per-step advantages, and then runs behavior-proximal policy optimization over the same distribution. The key claim is iterative self-improvement without distribution shift, because supervision and policy stay in the same in-distribution loop. Evaluated on ALFWorld, WebShop, and ScienceWorld, Q-Evolve outperforms strong baselines on sample efficiency and task completion rate. Prior work such as ReAct and Reflexion relies on heuristic or human-provided process rewards; Q-Evolve automates this labeling. No code repo listed at submission. 1
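
The "derives per-step advantages" step can be illustrated with a small sketch. The digest does not say which estimator Q-Evolve uses, so the one-step temporal-difference form below, the helper name `td_advantages`, the discount `gamma`, and the toy numbers are all assumptions chosen for illustration, not the paper's method.

```python
def td_advantages(rewards, values, gamma=1.0):
    """One-step TD advantages: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    Hypothetical helper for illustration. The value after the final step is
    treated as 0 because the episode has ended.
    """
    advantages = []
    for t, reward in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        advantages.append(reward + gamma * next_value - values[t])
    return advantages


# Sparse reward: only the last step of a 4-step episode is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]
# Made-up critic estimates V(s_t) for the four visited states.
values = [0.2, 0.4, 0.7, 0.9]

advantages = td_advantages(rewards, values)
# Each step now carries its own learning signal even though the
# environment reward arrives only at the end of the episode.
print([round(a, 2) for a in advantages])
```

With a learned critic, the episode-end reward is redistributed into per-step signals, which is the credit-assignment problem the summary describes.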

LLM

MDP-GRPO: fixing GRPO instability under discrete rewards

Area: LLM · Venue: ACL 2026 Main · arXiv: 2606.06058 · Authors: Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti

Standard GRPO becomes pathological when rewards are discrete and low-dispersion: within-group reward distributions are often homogeneous, causing z-score normalization to produce zero gradients or amplified noise. The paper formalizes three failure modes (low-variance amplification, mean-centering blindness, zero-variance collapse) and addresses them with four changes: multi-temperature sampling to spread the reward distribution, dual-anchor advantages to restore gradients in homogeneous groups, prospect-theoretic shaping based on Kahneman–Tversky loss, and asymmetric KL regularization. On FollowBench and IFEval, MDP-GRPO improves strict constraint satisfaction by up to 5% on Llama-3.2-3B over standard GRPO, while preserving general capability on MMLU and ARC. It also supports stable training with small group sizes, which is useful when compute per rollout is limited. Code not linked at submission. 2
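
To make the normalization problem concrete, here is a minimal sketch of the group-wise z-score that standard GRPO applies to rewards. It shows only the baseline behavior, not any of the paper's four fixes. The function name, the `eps` constant, and the reward values are illustrative assumptions.

```python
import statistics


def group_zscore_advantages(rewards, eps=1e-6):
    """Z-score each reward within its rollout group (standard GRPO style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Homogeneous group: every rollout earns the same discrete reward,
# so every advantage is zero and the group contributes no gradient.
all_pass = group_zscore_advantages([1.0, 1.0, 1.0, 1.0])

# One failure in an otherwise uniform group with a reward gap of 1.0.
one_fail = group_zscore_advantages([1.0, 1.0, 1.0, 0.0])

# Same pattern with a reward gap of only 0.01: after dividing by the tiny
# standard deviation, the advantages are almost as large as above.
small_gap = group_zscore_advantages([0.50, 0.50, 0.50, 0.49])

print(all_pass)
print([round(a, 3) for a in one_fail])
print([round(a, 3) for a in small_gap])
```

The first case corresponds to what the summary calls zero-variance collapse, and the third to low-variance amplification: a negligible reward difference is rescaled to roughly the same magnitude as a large one.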

Generative models

GILC: plug-and-play guidance for discrete diffusion

Area: Generative · Venue: ICML 2026 · arXiv: 2606.06303 · Authors: Hongkun Dou, Zike Chen, Fengji Li, Hongjue Li, Yue Deng

Controlling discrete diffusion (DNA, protein, molecule generation) without retraining is hard because gradient signals are unstable in high-dimensional discrete spaces. GILC (Gradient-Informed Logit Correction) sidesteps this by using the pretrained denoiser as a variational proxy and applying a Jacobian-free correction directly to the clean prediction logits — no backprop through the full diffusion chain required. It supports both differentiable and non-differentiable reward functions. Results across DNA sequence design, protein sequence generation, and molecular generation show GILC at or above fine-tuned baselines without any additional training. The Jacobian-free design is the notable departure from classifier guidance approaches that require computing score Jacobians. 3

PhaseLock: preserving motion physics in video diffusion

Area: Generative · Venue: ICML 2026 · arXiv: 2606.06361 · Authors: Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang

Image-to-video diffusion models generate visually convincing frames but frequently violate physical motion. The paper makes a surprising observation: a 2-step diffusion output often has better physical consistency than a 50-step output from the same model. Via spectral analysis, the authors show the phase component of the latent (which encodes motion structure) degrades by ~18% from step 2 to step 50, while magnitude stays relatively stable. PhaseLock is a training-free framework that extracts the motion prior from just two denoising steps and enforces it throughout the full generation trajectory via Latent Delta Guidance. Across several video diffusion models, PhaseLock improves physical consistency scores by an average of 6.2 points with only 1.06× inference time and 1.02× memory overhead — a considerably lighter overhead than external guidance methods that run ~5× slower. 4
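
The phase/magnitude split behind the spectral analysis can be shown on a toy signal. The sketch below is not the paper's latent analysis: it uses a hand-made 1D "frame" and a plain discrete Fourier transform to illustrate the Fourier shift theorem, under which moving a pattern changes the phase spectrum but leaves the magnitude spectrum untouched. All names and values are illustrative assumptions.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def magnitude_and_phase(signal):
    """Split the spectrum into per-frequency magnitude and phase."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


# A small bump, and the same bump moved two positions to the right
# (a circular shift), standing in for motion between two frames.
frame_a = [0, 0, 1, 2, 1, 0, 0, 0]
frame_b = frame_a[-2:] + frame_a[:-2]

mag_a, phase_a = magnitude_and_phase(frame_a)
mag_b, phase_b = magnitude_and_phase(frame_b)

# Largest change in magnitude across frequencies (floating-point noise only).
magnitude_gap = max(abs(a - b) for a, b in zip(mag_a, mag_b))
# Change in phase at the first non-zero frequency (clearly non-zero).
phase_gap = abs(phase_a[1] - phase_b[1])
print(magnitude_gap, phase_gap)
```

In this toy setting, where a pattern sits is carried entirely by phase, which is consistent with the summary's point that degrading phase harms motion structure even when magnitude is stable.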

GReinSS: policy gradients for discrete latent structure recovery

Area: Generative / Scientific ML · Venue: ICML 2026 · arXiv: 2606.07400 · Authors: Stefan Ivanovic, Ge Liu, Mohammed El-Kebir

Recovering mechanistic latent states from indirect observations is a core challenge in computational biology and systems science. EM-based approaches don't scale to combinatorially large spaces; VAEs tend to produce artifacts rather than ground-truth latent structure. GReinSS frames this as policy learning with dynamically rescaled rewards, learning distributions over latent sets and graphs that maximize observed data likelihood. On simulated data it recovers latent sets and latent graphs more accurately than baselines. On real RNA sequencing data, GReinSS reconstructs isoforms from short-read data that better match long-read sequencing results than the RSEM baseline, a concrete empirical anchor beyond synthetic benchmarks. The dynamic reward rescaling is the mechanism enabling stable training in combinatorially large latent spaces. 5

Vision and video

OMTG: one-to-many temporal grounding in video

Area: Vision / Video · Venue: ICML 2026 · arXiv: 2606.06294 · Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

Prior temporal grounding assumes a query maps to a single video segment. OMTG targets the harder one-to-many setting where a query can match multiple disjoint segments, requiring cardinality perception. State-of-the-art MLLMs optimized for one-to-one settings score near zero on this task. The paper introduces three contributions: a benchmark with new metrics (Count Accuracy, C-Acc; Effective Temporal F1, EtF1), a 56K-sample training dataset built via a chain-of-thought construction pipeline, and novel temporal and caption reward functions. The caption reward explicitly uses CoT reasoning over dense video captions to guide policy optimization toward both precision and completeness. The resulting model achieves 43.65% EtF1 on the benchmark, exceeding Gemini 2.5 Pro and Seed-1.8 by 15.85 and 15.61 percentage points respectively. 6

StoryVideoQA: deep video understanding at scale

Area: Vision / Video · Venue: IJCV 2026 · arXiv: 2606.06338 · Authors: Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang

Existing VideoQA datasets focus on factoid questions; deep video understanding (DVU) requires comprehension of storylines spanning full TV episodes or movies. StoryVideoQA is the largest DVU dataset to date: 363K+ QA pairs over 393 hours of diverse story video (TV series averaging ~27 minutes; movies averaging ~131 minutes per clip). Construction uses StoryMindv2, a multi-agent framework with supervisor-guided generation and multi-reviewer voting. Evaluating 20 VideoQA methods on the benchmark reveals that none maintain long-range character associations or coherent storyline understanding at this scale. The paper also proposes PlotTree, a video understanding agent that reorganizes video into hierarchical plot structures for storyline reasoning. Code and project page available at github.com/nercms-mmap/StoryVideoQA. 7

DBD: adversarial attacks as test-time defenders for VLMs

Area: Vision / Robustness · Venue: ICLR 2026 · arXiv: 2606.06186 · Authors: Liangsheng Liu, Si Chen, Jiamin Wu, Weiwei Feng, Zhixin Cheng, Xiaotian Yin, Wenfei Yang, Tianzhu Zhang

Standard adversarial defenses for VLMs (e.g., CLIP) require either retraining or expensive inference-time denoising. DBD (Directional Bias-guided Defense) starts from an empirical finding: under diverse input transformations, adversarial images in CLIP's feature space consistently shift along a single dominant direction, while clean images scatter. The paper argues this "Defense Direction" points back toward the correct class center, i.e., the adversarial perturbation itself encodes directional information about the true decision boundary. DBD estimates this direction at test time and applies a two-stream reconstruction strategy using a DB-score. Across 15 datasets, DBD reaches SOTA adversarial robustness while preserving clean accuracy — and in some cases adversarial accuracy exceeds clean accuracy, supporting the hypothesis that perturbations encode useful priors. No retraining required. 8

RL

Online KL-regularized RL under model misspecification

Area: RL · Venue: RLC 2026 · arXiv: 2606.06053 · Authors: Haoyang Hong, Zichen Wang, Quanquan Gu, Huazheng Wang

KL-regularized RL (the basis of RLHF-style policy optimization) is typically analyzed under realizability. This paper studies what happens when the model is misspecified — i.e., when the hypothesis class doesn't contain the true model. The authors introduce KL misspecification formulations for contextual bandits and episodic RL, then analyze regression-based algorithms with Gibbs policy updates. The resulting high-probability regret bounds include explicit misspecification error terms and reduce to standard realizable bounds as a special case. This gives a theoretical foundation for understanding performance degradation in RLHF when reward models or policy classes are approximate, a common practical scenario. 9
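
For readers unfamiliar with Gibbs policy updates, a minimal sketch follows. It uses the standard closed-form maximizer of expected reward minus a KL penalty against a reference policy, pi(a) proportional to pi_ref(a) * exp(r(a) / beta). The digest does not spell out the paper's exact update, so the function name, the three-action setup, and the reward estimates are assumptions for illustration.

```python
import math


def gibbs_policy(ref_probs, est_rewards, beta):
    """Return pi(a) proportional to ref_probs[a] * exp(est_rewards[a] / beta)."""
    top = max(est_rewards)  # subtract the max reward for numerical stability
    weights = [
        p * math.exp((r - top) / beta) for p, r in zip(ref_probs, est_rewards)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Uniform reference policy over three actions.
ref_probs = [1 / 3, 1 / 3, 1 / 3]
# Reward estimates from a regression model, which may be misspecified.
est_rewards = [1.0, 0.0, 0.0]

# Large beta: the KL penalty dominates and the policy stays near the reference.
near_reference = gibbs_policy(ref_probs, est_rewards, beta=100.0)
# Small beta: the policy concentrates on the action with the highest estimate.
near_greedy = gibbs_policy(ref_probs, est_rewards, beta=0.1)

print([round(p, 3) for p in near_reference])
print([round(p, 3) for p in near_greedy])
```

Because the policy is computed from estimated rewards, any error in those estimates flows directly into the action probabilities, which is why misspecification terms matter for the regret analysis.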

ML methods

TabSwift: efficient tabular foundation model (ICML Spotlight)

Area: ML Methods · Venue: ICML 2026 Spotlight · arXiv: 2606.07345 · Authors: Si-Yang Liu, Han-Jia Ye

Recent tabular foundation models improve accuracy by adding architectural complexity, at the cost of inference latency. TabSwift revisits the minimal TabPFN design and shows that a row-wise attention-only backbone with two additions — gated attention stabilization and a small set of learnable register tokens — is competitive with heavier models (TabPFN v2, TabICL) on both classification and regression. An additional adaptive layer-wise early-exit mechanism allows dynamic adjustment of inference depth per sample at serving time. The result is a tabular in-context learner that is competitive on accuracy while substantially faster to deploy. Awarded Spotlight at ICML 2026. Code available at github.com/automl/AlphaPFN (via companion α-PFN repo). 10

CorSW: Sliced-Wasserstein for EEG domain generalization

Area: ML Methods / BCI · Venue: KDD 2026 · arXiv: 2606.06104 · Authors: Chen Hu, Rui Wang, Jiale Zhou, Jingjun Yi, Shaocheng Jin, Yidong Song, Yefeng Zheng

EEG decoding pipelines commonly use covariance matrices as features, but covariance is sensitive to channel-wise scaling. Full-rank correlation matrices are scale-invariant but geometrically non-Euclidean, complicating Wasserstein-based distance computations. CorSW extends Sliced Wasserstein (SW) to correlation matrix manifolds via a Pullback Euclidean Metric framework, instantiating two correlation geometries (Off-Log Metric and Log-Scaled Metric). A domain generalization framework for EEG decoding built on CorSW shows improved generalization under distribution shifts across three EEG datasets with low training overhead and no additional inference cost. Code at github.com/ChenHu-ML/CorSW. 11
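
As background, here is a minimal sketch of the ordinary Euclidean Sliced Wasserstein estimate that CorSW generalizes. It does not implement the paper's pullback metrics or correlation geometries: points are plain vectors, both sets have equal size, and the function name, projection count, and sample points are illustrative assumptions.

```python
import math
import random


def sliced_wasserstein_sq(xs, ys, n_projections=200, seed=0):
    """Monte Carlo estimate of squared SW-2 between equal-sized point sets."""
    assert len(xs) == len(ys), "this sketch assumes equal sample counts"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        direction = [v / norm for v in direction]
        # Project both sets onto the direction. In 1D, optimal transport
        # simply matches sorted values to sorted values.
        px = sorted(sum(a * b for a, b in zip(p, direction)) for p in xs)
        py = sorted(sum(a * b for a, b in zip(p, direction)) for p in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections


xs = [(0.0, 0.0), (1.0, 0.5), (2.0, -1.0), (0.5, 2.0)]
ys = [(x + 1.0, y) for x, y in xs]  # the same points shifted by (1, 0)

same = sliced_wasserstein_sq(xs, xs)
shifted = sliced_wasserstein_sq(xs, ys)
print(same, shifted)
```

Each projection reduces the problem to sorting in one dimension, which is what keeps the estimate cheap. Extending it to correlation matrices requires a geometry in which such projections make sense, and that is the gap the summary says the Pullback Euclidean Metric framework fills.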

Scientific ML

Reactive Flux Matching: data-driven reaction coordinates for molecular simulation

Area: Scientific ML · Venue: NeurIPS 2026 (submitted) · arXiv: 2606.06295 · Authors: Rishal Aggarwal, David Ryan Koes, Nicholas M. Boffi, Eric Vanden-Eijnden

Path sampling methods generate reactive trajectories between molecular metastable states, but extracting mechanistic insight from trajectory ensembles is non-trivial. Flux Matching learns two objects directly from reactive path data without knowing the underlying dynamics: a current velocity u(z) whose streamlines trace dominant reaction pathways, and a scalar potential h(z) from a weighted Helmholtz–Hodge decomposition that serves as a data-driven reaction coordinate. Both quantities minimize quadratic functionals analogous to flow matching objectives in generative modeling. Unlike committor-based methods, u and h remain well-defined under non-Markovian projections onto collective variables. Validated on molecular systems for current velocity generation and rate constant estimation. Submitted to NeurIPS 2026 (preprint). 12

References