all things vision LMs

community

Activity Feed

AI & ML interests

None defined yet.

sergiopaniego

posted an update 3 days ago

Post

181

Frontier agents are this good partly because the model was trained inside the very harness it ships with.

NVIDIA's new paper "Polar: Agentic RL on Any Harness at Scale" brings that recipe to the open: it turns coding harnesses like Codex, Claude Code, Qwen Code or Pi into RL training environments without touching their internals.

The core idea: every agent, however complex or closed, talks to a model through an API, so they put a proxy there. The harness runs exactly like in production while the proxy records prompts, sampled token ids and logprobs. Trajectories get rebuilt outside, token faithful, so gradients hit the exact tokens the policy sampled.

The gains are consistent across all four harnesses. Same Qwen3.5-4B, plain GRPO, evaluated on SWE-Bench Verified:

Codex 3.8 → 26.4 (+22.6)
Claude Code 29.8 → 34.6 (+4.8)
Qwen Code 34.6 → 35.2 (+0.6)
Pi 34.2 → 40.4 (+6.2)

The biggest gains appear on unfamiliar execution paths, Codex being the clearest case. The takeaway: you are not just training a model, you are training the model + harness system.

Two engineering pieces make it work at scale. Async worker pools isolate container boots (CPU), agent execution (GPU) and long tail test runs, so slow runtimes never block the GPUs. And prefix merging stitches hundreds of captured API calls back into contiguous traces: 5.4x faster trainer updates and rollout GPUs at 88% utilization.

It also doubles as an SFT data factory: 504 test verified agent traces from a 122B teacher, multi-turn conversations averaging 104 messages each, coming to the Hub under Apache 2.0 (release pending review).

Paper authors: Binfeng Xu, Hao Zhang, Shaokun Zhang, Songyang Han, Mingjie Liu, Jian Hu, Shizhe Diao, Zhenghui Jin, Yunheng Zou, Michael Demoret, Jan Kautz and Yi Dong.

> Paper: Polar: Agentic RL on Any Harness at Scale (2605.24220)
> Code: https://github.com/NVIDIA-NeMo/ProRL-Agent-Server
> Training data: NovaSky-AI/SkyRL-v0-293-data

sergiopaniego

posted an update 5 days ago

Post

172

The recording from our talk: "From Responses To Trajectories: Multi-Turn and Multi-Environment RL" from PyTorch Conf Europe is live!

@kashif and I covered the latest advances in multi-turn GRPO in TRL: trajectories, tool use, envs, and agentic post-training at scale

https://www.youtube.com/watch?v=rPBeXFntJSU

sergiopaniego

posted an update 5 days ago

Post

130

how do you sync a trillion parameter model every RL step without a shared cluster? we just wrote a blog about it, led by @aminediroHF

what I like the most is the way it proves you can use the Hub for basically everything 🧐 → trainer on one machine, vLLM in a HF Space, the wordle env in another HF Space and weights going through a Hub Bucket. no shared cluster, just HTTPS

it works because ~99% of bf16 weights don't change between RL steps so you only sync the diff. 1.2 GB to 25 MB of payload per step

https://huggingface.co/blog/delta-weight-sync

sergiopaniego

posted an update 6 days ago

Post

2293

most multi-turn RL loops have a silent bug: you decode the model's output to detect tool calls, then re-tokenize the conversation for the next turn. BPE isn't invertible, so decode then re-encode can land on different ids. gradient ends up on tokens the model never sampled. no crash, just quietly wrong math and broken training

@qgallouedec wrote a super educational blog on MITO (message-in, token-out) vs TITO (token-in, token-out) and how you might fix the problem above

go read it 🤓

https://huggingface.co/proxy/qgallouedec-tito.hf.space/

sergiopaniego

posted an update 7 days ago

Post

6229

new banger blog alert 🚨

@ariG23498 is starting a blog series about profiling in pytorch and part 1 just dropped

takes you from the simplest scenario to actually knowing what your gpu is doing. if you have never opened a profiler trace this is where you start

covers torch.profiler from scratch. reading tables and traces, overhead bound vs compute bound, the full dispatch chain from python to gpu kernels, and what torch.compile is actually fusing under the hood

find it here: https://huggingface.co/blog/torch-profiler

1 reply

sergiopaniego

posted an update 10 days ago

Post

169

If you have a github repo, you basically have an RL training environment

We're introducing Repo2RLEnv (built by @AdithyaSK ), a tool that mines PRs, commits, CVEs and turns them into verifiable sandboxed tasks with real reward signals, automatically

Outputs to Harbor spec so you can plug it straight into RL training or coding-agent eval

> repo: https://github.com/huggingface/Repo2RLEnv
> collection with envs: https://huggingface.co/collections/AdithyaSK/repo2rlenv-verifiable-rl-environments

sergiopaniego

posted an update 11 days ago

Post

234

periodic reminder 🧐

some HF blog posts use a special template, long-form, animated, super deep

I keep them all in one collection that gets updated every time a new one drops so you don't lose track

https://hf.co/collections/sergiopaniego/research-and-long-form-blog-posts

spot one missing? let me know

sergiopaniego

posted an update 14 days ago

Post

9964

Harness, Scaffold, Context Engineering, Agent... do you actually know what they mean?

We wrote an AI agent glossary and tried to make sense of it all with simple definitions and real examples

↓ go read it ↓

https://huggingface.co/blog/agent-glossary

2 replies

qgallouedec

posted an update 29 days ago

Post

10260

Shipped hf-sandbox! 🥡

🧪 Running an eval that executes model-generated C on a few thousand prompts? You probably don't want any of that on your laptop.
Just shipped hf-sandbox, a Modal-style sandbox API on top of Hugging Face Jobs. Spin up an isolated, ephemeral container, run untrusted code, get the result back. No Docker on your laptop, no infra to manage.

Just pip install hf-sandbox.

Early days (v0.1); feedback and issues very welcome:
👉 https://github.com/huggingface/hf-sandbox

1 reply

qgallouedec

posted an update about 1 month ago

Post

375

**TRL v1.4 is out 🚀** Chunked NLL loss for SFT and a first-class **OpenReward** integration.

**Chunked NLL loss for SFT — drops peak VRAM by up to 14×**

Standard SFT materializes a full [batch × seq × vocab] logits tensor before computing cross-entropy, which dominates peak memory at long context lengths. The new loss_type="chunked_nll" path drops ignored-label tokens before the lm_head matmul and computes cross-entropy in checkpointed chunks of 256.

Peak GPU memory, AdamW fp32:
- Qwen3-14B, 8×H100 FSDP2, 16k seq: 58.9 GB → 38.9 GB
- Qwen3-4B, 1×H100 80GB, 16k seq: OOM → 63.8 GB
- Qwen3-32B, 8×H100 FSDP2, 8k seq: OOM → 71.2 GB

End-to-end it's consistently as fast or faster than nll, and unlocks sequence lengths that don't fit at all under the standard path.

SFTConfig(loss_type="chunked_nll")

Works with PEFT and VLMs out of the box.

**Open Reward Standard environment adapter**

The new trl.experimental.openreward adapter plugs any environment speaking the [Open Reward Standard](https://openrewardstandard.io) protocol into any TRL trainer that takes an environment_factory. One string — a catalog name or a URL — wires the dataset, factory, and reward_func slots; tools are bound dynamically from JSON Schema, no per-env wrapper code:

from trl import GRPOTrainer
from trl.experimental.openreward import OpenRewardSpec

spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)

trainer = GRPOTrainer(
    ...,
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)

v1.4 also brings MFU helpers for dense + MoE models, GRPO support for Liger 0.8.0 (delta clipping + VESPO + KL bias correction), Tülu 3's length-normalized DPO loss, four more training chat templates (Cohere, Cohere2, Gemma 3, Qwen3-2507), and a 5+ GB CUDA memory leak fix in activation offloading.

Full release notes: https://github.com/huggingface/trl/releases/tag/v1.4.0

sergiopaniego

posted an update about 1 month ago

Post

1886

OpenEnv is growing fast in tutorials. If you're looking to get started with RL environments, check them out

> evaluate your agents using OpenEnv
> learn how rewards work via rubrics
> connect agents via MCP
> many moreeeee!

anything you think it's missing?

https://meta-pytorch.org/OpenEnv/tutorials/index.html

sergiopaniego

posted an update about 1 month ago

Post

874

OpenEnv already ships 🚢 with a ready-to-deploy RLM environment on free HF Spaces

Drop "Attention Is All You Need", write code that spawns parallel LLM calls → ✅ correct answer, reward 1.0, in 4.2s

Run GRPO (TRL) → model learns to write that search strategy itself

test it yourself → sergiopaniego/repl-env
check out OpenEnv → https://github.com/meta-pytorch/OpenEnv

qgallouedec

posted an update about 1 month ago

Post

8087

TRL v1.3 ships day-one training support for Qwen 3.6 🚀

The new Qwen 3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) reuses the Qwen3.5-MoE architecture but ships a slightly different chat template, so we updated the stack end-to-end: new training template with {% generation %} markers, tool-call response schema routing, tiny test models for the VLM matrix.

SFT with assistant-only loss works out of the box:

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3.6-27B",
    args=SFTConfig(assistant_only_loss=True),
    train_dataset=dataset,
)
trainer.train()

So does GRPO tool-calling — just hand tools=[...] to GRPOTrainer.

v1.3 also brings a new experimental TPO trainer (Triple Preference Optimization), speculative decoding in trl vllm-serve (Qwen3 MTP / Eagle3 drafts), 12 more KTO ↔ DPO alignment PRs (KTO promotion to stable is now in reach), three more {% generation %} chat templates (Gemma/Gemma 2, Phi-3, GLM-4-MoE), and a chunky SFT entropy bug fix.

Full release notes: https://github.com/huggingface/trl/releases/tag/v1.3.0

qgallouedec

posted an update about 2 months ago

Post

2032

TRL v1.2 introduces the SSDTrainer 🚀

Simple Self-Distillation (SSD) from Apple's paper "Embarrassingly Simple Self-Distillation Improves Code Generation" is now available as an experimental trainer in TRL.

The recipe is as minimal as the name suggests: sample completions from the model itself at a training-time temperature, then fine-tune on those raw, unverified samples with plain cross-entropy. No reward model. No verifier. No teacher model. No reinforcement learning. Just prompts and the model.

from trl.experimental.ssd import SSDConfig, SSDTrainer

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(temperature=0.6, top_k=20, top_p=0.95),
    train_dataset=dataset,
)
trainer.train()

v1.2 also ships expanded tool-calling support (LLaMA 3.1 / 3.2, DeepSeek-V3), another round of KTO ↔ DPO alignment getting us closer to promoting KTO to stable, a big GRPO simplification for overlong tool results, deprecation of use_transformers_paged, and key fixes for VLM response parsing.

Full release notes: https://github.com/huggingface/trl/releases/tag/v1.2.0

sergiopaniego

posted an update about 2 months ago

Post

1426

Earlier this month, Apple introduced Simple Self-Distillation: a fine-tuning method that improves models on coding tasks just by sampling from the model and training on its own outputs with plain cross-entropy

And… it's already supported in TRL, built by Kashif Rasul. you can really feel the pace of development in the team 🐎

Paper by Ruixiang ZHANG, He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang at Apple 🍎

How it works: the model generates completions at a training-time temperature (T_train) with top_k/top_p truncation, then fine-tunes on them with plain cross-entropy. no labels or verifier needed

You can try it right away with this ready-to-run example (Qwen3-4B on rStar-Coder):
https://github.com/huggingface/trl/blob/main/trl/experimental/ssd/ssd.py
or benchmark a checkpoint with the eval script:
https://github.com/huggingface/trl/blob/main/trl/experimental/ssd/ssd_eval.py

One neat insight from the paper: T_train and T_eval compose into an effective T_eff = T_train × T_eval, so a broad band of configs works well. even very noisy samples still help

Want to dig deeper?

Paper: Embarrassingly Simple Self-Distillation Improves Code Generation (2604.01193)
Trainer docs: https://huggingface.co/docs/trl/main/en/ssd_trainer

sergiopaniego

posted an update about 2 months ago

Post

507

Great experience yesterday at PyTorch Conf Europe in Paris 🇫🇷

We (w/ @kashif ) talked about training LLMs through interaction, using trajectories across games, browsers, or simulators

Room was packed, a clear sign of interest in where RL post-training is heading.

sharing the slides! 🤓
https://drive.google.com/file/d/16k7YRnf5EJEo0XjXGlRJ_hVeLoFWKyNP/view?usp=sharing

sergiopaniego

posted an update 2 months ago

Post

2897

Gemma 4 💎 is here and it’s strong!

to celebrate, we’re rolling out in TRL:

> support for multimodal tool responses for environments (OpenEnv)
> an example to train it in CARLA for autonomous driving with image-based tool calls

go check it out 🏎️🏎️

blog: https://huggingface.co/blog/gemma4
script: https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/carla_vlm_gemma.py

sergiopaniego

posted an update 2 months ago

Post

2107

TRL is officially an adult 🥳

excited to announce TRL v1.0❗️

head to the blog to see how we got here and what’s next for this post-training library, designed to keep pace with the field

https://huggingface.co/blog/trl-v1

2 replies

qgallouedec

posted an update 2 months ago

Post

2460

TRL v1.0 is out!

Hugging Face's TRL library is downloaded 3 million times a month. Over 130k models trained with it are public on the Hub, and major projects like @unsloth and @axolotl-ai-co build directly on top of it. v1.0 is the moment we acknowledged that responsibility explicitly, with a real stability contract.

The field hasn't settled. Building stable software in a domain that keeps invalidating its own assumptions is the actual problem we're solving. The answer is a design that can absorb the next shift without breaking what people rely on.

What's in v1.0:
Deep Hugging Face integration, low infrastructure burden
What's next: asynchronous GRPO, better scaling support, and making training legible enough that agents can inspect and steer it.

pip install --upgrade trl

AI & ML interests

Team members 5

visionLMsftw's activity