arxiv:2603.14465

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Published on Mar 15 · Submitted by Xuyan Ye on Mar 18

Abstract

AgentProcessBench introduces a benchmark for evaluating step-level effectiveness in tool-augmented agent interactions, featuring diverse trajectories with detailed human annotations to improve process-level understanding and model performance.

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
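Finding (1) in the abstract can be illustrated with a small worked example. The sketch below is an assumption-laden toy, not the benchmark's evaluation code: the label names `correct`, `neutral`, and `error`, the helper `correct_step_ratio`, and both trajectories are hypothetical and invented for illustration. It shows how a trajectory that stops early can score a higher ratio of correct steps than a longer one, simply because it had fewer chances to make a mistake.

```python
from typing import List


def correct_step_ratio(labels: List[str]) -> float:
    """Return the fraction of steps labeled 'correct' in one trajectory."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "correct") / len(labels)


# Toy trajectories, invented for illustration (not benchmark data).
# A weaker agent that gives up after two steps without solving the task:
early_stop = ["correct", "correct"]
# A stronger agent that explores, makes one mistake, and keeps going:
full_run = [
    "correct", "neutral", "correct", "error",
    "correct", "correct", "neutral", "correct",
]

print(correct_step_ratio(early_stop))  # 1.0
print(correct_step_ratio(full_run))    # 0.625
```

Under these assumptions the early-terminating trajectory reaches a ratio of 1.0 against 0.625 for the longer run, which is why a raw correct-step ratio can flatter weaker policies.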

Community


AgentProcessBench Available Now

When using Process Reward Models (PRMs) to guide Reinforcement Learning (RL) training, accurately identifying the contribution of each step within a trajectory is essential for providing precise reward signals. To evaluate models' capabilities as PRMs more rigorously and comprehensively, we have developed a PRM evaluation benchmark specifically designed for tool-using agents. The benchmark comprises 1,000 trajectories totaling 8,509 steps, every one of which is labeled by human annotators. Our goal is to provide a more fine-grained testing platform for PRM research in agent-based scenarios.
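As a rough sketch of what step-level PRM evaluation against human labels might look like, the snippet below compares predicted step labels with gold labels and reports overall accuracy plus per-label recall. Everything here is an assumption made for illustration: the label names, the functions `step_accuracy` and `per_label_recall`, and the toy data are hypothetical and do not reflect the benchmark's actual data format or official metrics.

```python
from typing import Dict, List

# Hypothetical names for the three step labels (assumed, not from the paper).
LABELS = ("correct", "neutral", "error")


def step_accuracy(pred: List[str], gold: List[str]) -> float:
    """Fraction of steps where the predicted label equals the human label."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same number of steps")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def per_label_recall(pred: List[str], gold: List[str]) -> Dict[str, float]:
    """For each label, the share of gold steps with that label that were predicted correctly."""
    recall = {}
    for label in LABELS:
        total = sum(1 for g in gold if g == label)
        hits = sum(1 for p, g in zip(pred, gold) if g == label and p == label)
        recall[label] = hits / total if total else 0.0
    return recall


# Toy data, invented for illustration (not benchmark data).
gold = ["correct", "neutral", "error", "correct", "neutral", "error"]
pred = ["correct", "error", "error", "correct", "neutral", "neutral"]

print(step_accuracy(pred, gold))
print(per_label_recall(pred, gold))
```

Reporting recall per label, rather than accuracy alone, makes confusions between neutral and erroneous steps visible, which the abstract identifies as a remaining challenge for current models.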

Homepage: rucbm.github.io/AgentProcessBench-Homepage/
HF: huggingface.co/datasets/LulaCola/AgentProcessBench
GitHub: github.com/RUCBM/AgentProcessBench
arXiv: arxiv.org/abs/2603.14465


