arxiv:2603.14465

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Published on Mar 15

Submitted by Xuyan Ye on Mar 18

Renmin University of China


Authors: Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Wenkai Yang, Shuqi Ye, Yankai Lin

Abstract

AgentProcessBench introduces a benchmark for evaluating step-level effectiveness in tool-augmented agent interactions, featuring diverse trajectories with detailed human annotations to improve process-level understanding and model performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
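To make the ternary labeling scheme and the error propagation rule more concrete, here is a minimal Python sketch. The abstract does not spell out either one, so everything here is an assumption for illustration: the integer encoding (+1, 0, -1), the names `StepLabel`, `propagate_errors`, and `correct_ratio`, and the specific rule that every step after an error is treated as erroneous until an optional recovery point. The benchmark's own annotation guidelines may differ.

```python
from enum import IntEnum
from typing import List, Optional


class StepLabel(IntEnum):
    """Assumed encoding of the ternary scheme; not taken from the paper."""
    ERROR = -1    # step is wrong or harmful
    NEUTRAL = 0   # exploratory step, neither helps nor hurts
    CORRECT = 1   # step advances the task


def propagate_errors(
    labels: List[StepLabel], recovered_at: Optional[int] = None
) -> List[StepLabel]:
    """Hypothetical propagation rule: once an ERROR occurs, later steps are
    also marked ERROR until the index `recovered_at` (if any) is reached."""
    out: List[StepLabel] = []
    in_error = False
    for i, label in enumerate(labels):
        if recovered_at is not None and i >= recovered_at:
            in_error = False
        if label == StepLabel.ERROR:
            in_error = True
        out.append(StepLabel.ERROR if in_error else label)
    return out


def correct_ratio(labels: List[StepLabel]) -> float:
    """Fraction of steps labeled CORRECT in one trajectory."""
    if not labels:
        return 0.0
    return sum(label == StepLabel.CORRECT for label in labels) / len(labels)


C, N, E = StepLabel.CORRECT, StepLabel.NEUTRAL, StepLabel.ERROR
raw = [C, N, E, C, C]
propagated = propagate_errors(raw)
recovered = propagate_errors(raw, recovered_at=4)

# Made-up trajectories: one that stops early, one that keeps going and errs.
short_ratio = correct_ratio([C, C])
long_ratio = correct_ratio(propagate_errors([C, C, N, E, C, C]))
```

The two ratios at the end illustrate, on made-up trajectories, how a run that terminates after a few steps can show a higher share of correct steps than a longer run that attempts more and makes a mistake, which is the kind of inflation described in finding (1).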


Community

LulaCola (paper author, paper submitter), Mar 18

😏 𝑨𝒈𝒆𝒏𝒕𝑷𝒓𝒐𝒄𝒆𝒔𝒔𝑩𝒆𝒏𝒄𝒉 𝑨𝒗𝒂𝒊𝒍𝒂𝒃𝒍𝒆 𝑵𝒐𝒘

When Process Reward Models (PRMs) guide Reinforcement Learning (RL) training, accurately identifying the contribution of each step in a trajectory is essential for precise reward signals. To evaluate models' capabilities as PRMs more rigorously and comprehensively, we have developed a PRM evaluation benchmark designed specifically for tool-using agents. The benchmark comprises 1,000 trajectories totaling 8,509 steps, every one labeled by human annotators. Our goal is to provide a more fine-grained testing platform for PRM research in agent-based scenarios.
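As a rough idea of what scoring a PRM against step-level human labels could look like, here is a small Python sketch. It is not the benchmark's official evaluation code: the function names `step_accuracy` and `per_label_recall`, the +1/0/-1 label encoding, and the choice of metrics are all assumptions for illustration, and the toy labels are invented.

```python
from collections import Counter
from typing import Dict, List, Sequence


def _flatten(trajectories: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate per-trajectory step labels into one flat list."""
    return [label for steps in trajectories for label in steps]


def step_accuracy(
    gold: Sequence[Sequence[int]], pred: Sequence[Sequence[int]]
) -> float:
    """Share of steps where the predicted label matches the human label."""
    g, p = _flatten(gold), _flatten(pred)
    if len(g) != len(p):
        raise ValueError("gold and pred must contain the same number of steps")
    if not g:
        return 0.0
    return sum(a == b for a, b in zip(g, p)) / len(g)


def per_label_recall(
    gold: Sequence[Sequence[int]], pred: Sequence[Sequence[int]]
) -> Dict[int, float]:
    """Recall for each label that appears in the human annotations."""
    g, p = _flatten(gold), _flatten(pred)
    if len(g) != len(p):
        raise ValueError("gold and pred must contain the same number of steps")
    totals = Counter(g)
    hits = Counter(a for a, b in zip(g, p) if a == b)
    return {label: hits[label] / count for label, count in totals.items()}


# Toy data (invented): two trajectories, labels assumed to be +1 / 0 / -1.
gold_labels = [[1, 0, -1], [1, 1]]
pred_labels = [[1, -1, -1], [1, 0]]

accuracy = step_accuracy(gold_labels, pred_labels)
recall = per_label_recall(gold_labels, pred_labels)
```

Reporting recall per label next to overall accuracy keeps confusion between neutral and erroneous steps visible, since a single accuracy number can hide it when correct steps dominate.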

🌍 𝑯𝒐𝒎𝒆𝒑𝒂𝒈𝒆
rucbm.github.io/AgentProcessBench-Homepage/
🤗 𝑯𝑭
huggingface.co/datasets/LulaCola/AgentProcessBench
💻 𝑮𝒊𝒕𝒉𝒖𝒃
github.com/RUCBM/AgentProcessBench
📑 𝒂𝒓𝑿𝒊𝒗
arxiv.org/abs/2603.14465