AgentPerf from Artificial Analysis, the industry’s first agentic AI benchmark, gives developers, enterprises and infrastructure providers a transparent way to compare systems for agentic AI. In the first round of published results, the NVIDIA Blackwell Ultra NVL72 platform delivers leading performance across the agentic AI workloads tested, running 20x more agents per megawatt than NVIDIA Hopper.
Agentic AI is a fundamentally different workload from conversational AI. A single chat completion is a sprint: one large language model (LLM) call, one response. An agent functions more like a relay: it breaks a goal into many steps and keeps going until the task is done.

That results in dozens to hundreds of LLM calls chained together, each passing growing context to the next, with tool calls like code compilation and execution, database search and web browsing at every handoff. The complexity isn’t additive; it’s multiplicative.
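To make the "multiplicative" point concrete, here is a minimal sketch of how the input tokens a system must handle add up across a chain of calls when each call carries a growing context. The function name and all numbers are illustrative assumptions, not AgentPerf figures, and the sketch ignores optimizations such as KV-cache reuse that reduce the real cost.

```python
def total_context_tokens(num_calls, initial_context, growth_per_call):
    """Sum the context tokens presented to the model across a chain of calls.

    Hypothetical model: every call sees the full context so far, and each
    step (model output plus tool result) appends `growth_per_call` tokens.
    """
    total = 0
    context = initial_context
    for _ in range(num_calls):
        total += context            # this call reads the whole context
        context += growth_per_call  # the next call inherits a longer one
    return total


# Illustrative numbers only (assumptions, not measurements).
single_chat = total_context_tokens(num_calls=1, initial_context=2000, growth_per_call=0)
agent_task = total_context_tokens(num_calls=50, initial_context=2000, growth_per_call=1500)

print(single_chat)               # 2000
print(agent_task)                # 1937500
print(agent_task / single_chat)  # 968.75
```

Under these assumed numbers, 50 chained calls present roughly 970 times the context of a single chat turn, not 50 times, because the number of calls and the length of each call's context both grow.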
The distinction matters enormously for performance measurement. Existing AI inference benchmarks measure one LLM call: how fast an LLM responds to a single request and how many simultaneous requests a system can handle. They weren’t designed for agentic workloads, where chained LLM calls, tool call delays and growing context stress accelerated computing systems in fundamentally different ways than a single LLM call ever could.
For companies building and deploying agents at scale, it’s important to understand how responsive agents are, how many can be deployed concurrently and how much useful work AI infrastructure can deliver for every dollar and watt invested.
NVIDIA GB300 NVL72 Runs 20x More Agents per Megawatt
In this first round, AgentPerf measures agentic performance with DeepSeek V4 Pro, a large mixture-of-experts (MoE) model that represents the class of frontier models powering today’s most capable agents. On this workload, NVIDIA GB300 NVL72 delivers the highest performance in the benchmark, running up to 20x more agents per megawatt than the NVIDIA HGX H200 system.

The performance advantage comes from extreme codesign across the full stack. GB300 NVL72 connects 72 GPUs into a single rack-scale system, enabling large MoE models like DeepSeek V4 Pro to distribute model execution efficiently at scale.
CUDA kernels accelerate this further by overlapping communication and compute, so the cost of coordinating across experts is absorbed rather than added to latency.
NVIDIA TensorRT LLM sustains efficiency as concurrent agent sessions scale. For example, it separates the processing of inputs from the generation of outputs so each can be optimized independently.
These results are grounded in a benchmark methodology built from the ground up to reflect how agentic AI actually works in production.
Artificial Analysis AgentPerf: Built on Real-World Agentic Workloads
AgentPerf is built on real coding agent trajectories: an agent receives a task, reads files, writes and edits code, executes commands and iterates based on the results. All trajectories are drawn from real public code repositories across 12+ programming languages. The long sequence lengths, tool call patterns and delays are all representative of real-world coding workflows.
AgentPerf then measures how many of these agentic tasks a platform can support concurrently while meeting defined performance thresholds for responsiveness and output token rate. Tool calls are not executed but simulated using representative CPU processing time, so differences in results reflect accelerated computing performance only.
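The idea of "maximum concurrency under thresholds" can be sketched as follows. This is an illustration, not AgentPerf's actual harness: the `Measurement` type, the use of time to first token as the responsiveness metric, the threshold values and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One hypothetical benchmark run at a fixed number of concurrent agents."""
    concurrency: int
    ttft_s: float        # assumed responsiveness metric: time to first token
    tokens_per_s: float  # output token rate seen by each agent


def max_supported_concurrency(measurements, max_ttft_s, min_tokens_per_s):
    """Return the highest concurrency level that meets both thresholds.

    Returns 0 when no measured level passes.
    """
    passing = [
        m.concurrency
        for m in measurements
        if m.ttft_s <= max_ttft_s and m.tokens_per_s >= min_tokens_per_s
    ]
    return max(passing, default=0)


# Made-up sample data: latency rises and per-agent rate falls under load.
runs = [
    Measurement(concurrency=8, ttft_s=0.4, tokens_per_s=60.0),
    Measurement(concurrency=16, ttft_s=0.7, tokens_per_s=48.0),
    Measurement(concurrency=32, ttft_s=1.4, tokens_per_s=35.0),
    Measurement(concurrency=64, ttft_s=3.2, tokens_per_s=18.0),
]

print(max_supported_concurrency(runs, max_ttft_s=2.0, min_tokens_per_s=30.0))  # 32
```

Tightening either threshold lowers the reported concurrency, which is why a benchmark of this kind has to state its thresholds alongside its results.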
The results translate directly into infrastructure decisions: how many concurrent agentic tasks can be run per accelerator and per megawatt of power. For enterprises deploying AI agents at scale, these numbers determine how much productive work a given infrastructure investment can actually deliver.
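As a rough illustration of the per-megawatt figure, the normalization is simple arithmetic: divide the concurrent agents a system sustains by its power draw in megawatts. The function name, the two systems and their numbers below are placeholders chosen for the example, not published results.

```python
def agents_per_megawatt(concurrent_agents, system_power_kw):
    """Normalize sustained concurrent agents by system power, in agents per MW."""
    if system_power_kw <= 0:
        raise ValueError("system_power_kw must be positive")
    return concurrent_agents * 1000 / system_power_kw


# Placeholder values for two hypothetical systems (not measured data).
system_a = agents_per_megawatt(concurrent_agents=600, system_power_kw=120)
system_b = agents_per_megawatt(concurrent_agents=25, system_power_kw=10)

print(system_a)             # 5000.0
print(system_b)             # 2500.0
print(system_a / system_b)  # 2.0
```

Normalizing by power rather than by accelerator count makes systems of very different sizes comparable, which matters when a data center is limited by its power budget.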
NVIDIA Ecosystem Partners Harness Blackwell’s Leading Performance
Leading inference providers including Baseten, DeepInfra and Together AI are already serving agentic workloads on frontier models such as DeepSeek V4 Pro on NVIDIA Blackwell and powering production agentic applications today.
Together AI powers real-time inference for Cursor, an AI-powered agentic coding platform, on NVIDIA Blackwell. Cursor’s agents debug issues, generate solutions and execute refactors while developers continue working.
DeepInfra powers Pam.ai, an AI workforce platform for automotive dealerships, which deploys agents to book service appointments, handle calls and run outbound sales campaigns, entirely on NVIDIA Blackwell.
As NVIDIA and the open source ecosystem continue to optimize inference software, performance and efficiency on agentic workloads will only improve. The NVIDIA Vera Rubin architecture is now in full production, bringing the next generation of infrastructure capability to meet the growing demands of agentic AI at scale.
Dive deeper into AgentPerf’s methodology and NVIDIA’s full-stack optimizations for agentic AI in this technical blog.

