CAISI Report: DeepSeek V4 Pro Trails US Frontier by 8 Months

The CAISI Verdict Is In — And It Isn't Pretty

In April 2026, the U.S. National Institute of Standards and Technology (NIST), through its Center for AI Standards and Innovation (CAISI), dropped its first major independent evaluation of DeepSeek V4 Pro, the Chinese AI startup's flagship open-weight model. The verdict? DeepSeek V4 Pro lags behind the U.S. frontier by roughly eight months, performing on par with OpenAI's GPT-5 from August 2025 rather than the GPT-5.4 or Opus 4.6-class models DeepSeek itself claims to rival.

This isn't just a numbers game. It's the first time a U.S. government agency has independently stress-tested a Chinese frontier model using its own held-out benchmarks — uncontaminated evaluations that model vendors can't train toward. And the gap it reveals is both narrower than the hype suggests and wider than DeepSeek's marketing would have you believe.

Benchmark Deep-Dive: Where DeepSeek Shines and Where It Falls Short

CAISI evaluated DeepSeek V4 Pro across five critical domains (cyber, software engineering, natural sciences, abstract reasoning, and mathematics) using nine benchmarks, including two held-out evaluations: ARC-AGI-2's semi-private dataset and CAISI's internally built PortBench. The results paint a nuanced picture.

Mathematics & Natural Sciences: The Bright Spots

DeepSeek V4 Pro holds its own where raw reasoning meets structured problem-solving. On OTIS-AIME-2025, it scored 97% — second only to GPT-5.5's perfect 100%. On GPQA-Diamond, it hit 90%, within striking distance of Opus 4.6 (91%) and GPT-5.5 (96%). On PUMaC 2024 and SMT 2025, it matched or nearly matched the top U.S. models. These are the benchmarks that typically reward rigorous training data and strong mathematical foundations — and DeepSeek clearly invested here.

Cyber & Software Engineering: The Exposure

The cracks appear where the tasks get messier. On CTF-Archive-Diamond (a cyber challenge benchmark), DeepSeek managed just 32%, tied for last and well behind Opus 4.6's 46% and GPT-5.5's commanding 71%. PortBench, CAISI's internal software engineering evaluation, was similarly unforgiving: DeepSeek scored 44%, lagging behind Opus 4.6's 60% and GPT-5.5's 78%. Even on SWE-Bench Verified, where it scored a respectable 74%, it fell short of Opus 4.6 (79%) and GPT-5.5 (81%).

These aren't edge cases. Cyber and software engineering are the domains that matter most for real-world agentic deployment — automated code generation, vulnerability discovery, and autonomous system repair. A model that excels at math competitions but fumbles on practical engineering tasks has a ceiling on its utility.

Abstract Reasoning: A Wide Delta on Held-Out Data

One of the widest gaps appeared on ARC-AGI-2's semi-private dataset, a benchmark designed specifically to resist training data contamination. Here, GPT-5.5 scored 79%. Opus 4.6 managed 63%. DeepSeek V4 Pro? Just 46%. This benchmark tests a model's ability to generalize from minimal examples, a proxy for fluid intelligence. Among the scores quoted here, the 33-point gap to GPT-5.5 is exceeded only by the 39-point gap on CTF-Archive-Diamond and the 34-point gap on PortBench. It may reflect differences in how DeepSeek's model handles novel pattern recognition, though benchmark scores alone cannot establish an architectural cause.
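To make the per-benchmark comparison easier to check, here is a minimal Python sketch that tabulates only the scores quoted in this article and computes how far DeepSeek V4 Pro sits behind a reference model. The names `SCORES` and `gaps_to` are illustrative, not part of any CAISI tooling, and benchmarks for which the article gives no three-way numbers (PUMaC 2024, SMT 2025) are left out rather than guessed.

```python
# Scores in percent, copied from the figures quoted in this article.
# None marks a score the article does not report.
SCORES = {
    "OTIS-AIME-2025": {"DeepSeek V4 Pro": 97, "Opus 4.6": None, "GPT-5.5": 100},
    "GPQA-Diamond": {"DeepSeek V4 Pro": 90, "Opus 4.6": 91, "GPT-5.5": 96},
    "CTF-Archive-Diamond": {"DeepSeek V4 Pro": 32, "Opus 4.6": 46, "GPT-5.5": 71},
    "PortBench": {"DeepSeek V4 Pro": 44, "Opus 4.6": 60, "GPT-5.5": 78},
    "SWE-Bench Verified": {"DeepSeek V4 Pro": 74, "Opus 4.6": 79, "GPT-5.5": 81},
    "ARC-AGI-2 (semi-private)": {"DeepSeek V4 Pro": 46, "Opus 4.6": 63, "GPT-5.5": 79},
}


def gaps_to(reference, model="DeepSeek V4 Pro"):
    """Return {benchmark: reference score minus model score} in percentage points.

    Benchmarks where either score is missing are skipped.
    """
    gaps = {}
    for benchmark, row in SCORES.items():
        ref_score = row.get(reference)
        own_score = row.get(model)
        if ref_score is not None and own_score is not None:
            gaps[benchmark] = ref_score - own_score
    return gaps


if __name__ == "__main__":
    # Print the gaps to GPT-5.5, widest first.
    ranked = sorted(gaps_to("GPT-5.5").items(), key=lambda item: -item[1])
    for benchmark, gap in ranked:
        print(f"{benchmark:26s} {gap:3d} points behind GPT-5.5")
```

On these quoted figures, the math and science gaps are single digits, while the cyber, PortBench, and ARC-AGI-2 gaps all exceed 30 points.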

Cost Efficiency: A More Complicated Picture

DeepSeek's headline-grabbing 75% permanent price cut on V4 Pro, announced in late May, would seem to reinforce its value proposition. But CAISI's analysis complicates the narrative:

Compared to GPT-5.4 mini (the most cost-competitive U.S. reference model), DeepSeek V4 Pro was cheaper on 5 out of 7 benchmarks — but the range was wide: from 53% less expensive to 41% more expensive, depending on the task.
DeepSeek V4 Pro's IRT-estimated Elo came in at 800 ± 28 — far behind GPT-5.5's 1260 ± 28, and trailing Opus 4.6's 999 ± 27. Every 200-point increase in this IRT scale represents a 3x improvement in the odds of solving a given task, meaning the gap in raw capability is significant.
DeepSeek's pricing advantage is partly an artifact of Chinese government subsidies and its integration with Huawei's chip ecosystem — not pure engineering efficiency.
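The Elo figures above can be turned into odds with a little arithmetic. The following Python sketch applies the rule as stated in this article (every 200 points multiplies the odds of solving a task by 3) to the quoted point estimates. It assumes the rule compounds multiplicatively across larger gaps, and it ignores the ± intervals. The helper names `odds_ratio` and `solve_probability` and the 20% baseline are hypothetical, chosen for illustration.

```python
# Rule as stated in the article: +200 rating points = 3x the odds of solving a task.
ODDS_FACTOR_PER_200 = 3.0

# Point estimates quoted in the article (uncertainty intervals ignored here).
RATINGS = {"DeepSeek V4 Pro": 800, "Opus 4.6": 999, "GPT-5.5": 1260}


def odds_ratio(rating_a, rating_b):
    """Odds multiplier of model A over model B, assuming the rule compounds."""
    return ODDS_FACTOR_PER_200 ** ((rating_a - rating_b) / 200.0)


def solve_probability(base_prob, ratio):
    """Apply an odds ratio to a baseline success probability (0 < base_prob < 1)."""
    odds = base_prob / (1.0 - base_prob) * ratio
    return odds / (1.0 + odds)


if __name__ == "__main__":
    baseline = RATINGS["DeepSeek V4 Pro"]
    for name in ("Opus 4.6", "GPT-5.5"):
        ratio = odds_ratio(RATINGS[name], baseline)
        # Hypothetical task that DeepSeek V4 Pro solves 20% of the time.
        prob = solve_probability(0.20, ratio)
        print(f"{name}: {ratio:.1f}x odds, {prob:.0%} on a task DeepSeek solves 20% of the time")
```

Under these assumptions, the 460-point gap to GPT-5.5 works out to roughly 12.5 times the odds, and the 199-point gap to Opus 4.6 to roughly 3 times.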

The Geopolitical Subtext

The CAISI evaluation lands at a moment of heightened tension. The White House's Office of Science and Technology Policy has publicly accused Chinese AI firms of conducting large-scale distillation attacks against U.S. frontier models — building fake accounts by the thousands to siphon capabilities. DeepSeek, specifically, has been accused of creating over 24,000 fake accounts and conducting more than 16 million interactions with Anthropic's Claude models to extract performance data.

Meanwhile, DeepSeek V4 Pro was reportedly trained on smuggled Nvidia Blackwell chips — still under U.S. export controls — though unlike the V3 paper, the V4 technical report is conspicuously silent on training hardware. And in a twist that undermines its own go-to-market strategy, DeepSeek admits it currently cannot serve V4 Pro to most customers because it lacks the chips to do so at scale.

What This Means for Developers and Enterprises

For the engineering audience, the takeaway is measured: DeepSeek V4 Pro is a genuinely capable open-weight model that offers strong performance in mathematics and scientific reasoning at competitive prices. Its 1.6-trillion-parameter architecture and million-token context window represent real engineering achievements, particularly the hybrid attention mechanism that slashes inference costs versus V3.

But the CAISI data should give pause to anyone considering DeepSeek V4 Pro for production agentic workflows — especially in security-sensitive or software-engineering-heavy domains, where the gap to U.S. frontier models remains wide. The open-weight advantage is real, but it comes with strings attached: geopolitical risk, supply-chain exposure to restricted hardware, and performance ceilings that become apparent once you move beyond curated benchmarks.

The U.S. frontier is sprinting — GPT-5.5 and Claude Mythos Preview both show significant gains over their predecessors — and the evidence suggests the gap is widening, not shrinking. DeepSeek V4 Pro is the best open-weight Chinese model, but "best in class" doesn't mean "competitive at the frontier." And in AI, the frontier is where the future gets built.