DeepSeek-R1 Hallucinates 4x More Than V3, Raising Red Flags for Crypto AI Agent Tokens

DeepSeek-R1, the flagship reasoning model from Chinese lab DeepSeek, hallucinates at a 14.3% rate according to Vectara’s HHEM 2.1 benchmark. That is nearly four times higher than its non-reasoning predecessor DeepSeek-V3, which scored 3.9%.
The gap raises hard questions for the crypto sector. A fast-growing class of AI agent tokens now leans on reasoning-style LLMs for autonomous trading, signals, and on-chain execution.
Vectara Data Shows R1 ‘Overhelps’ With False Facts
Vectara ran both DeepSeek models through HHEM 2.1, its dedicated hallucination evaluation framework. The team also cross-checked the results using Google’s FACTS methodology. R1 produced more false or unsupported statements than V3 in every test configuration.
The cause was not reasoning depth alone. Vectara’s analysts found that R1 tends to “overhelp.” The model adds information that does not appear in the source text.
That added detail can be factually correct on its own and still count as a hallucination. The behavior smuggles fabricated context into otherwise sound answers.
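Benchmarks like HHEM boil down to asking whether each claim in a model’s output is supported by the source it was given. As a rough illustration only (Vectara’s actual evaluator uses a trained entailment model, not word overlap), a toy grounding check might look like this:

```python
# Toy sketch of a groundedness check in the spirit of hallucination
# benchmarks like HHEM. This is NOT Vectara's method: it just flags
# output sentences whose content words are mostly absent from the source.

def _content_words(text: str) -> list[str]:
    # Crude content-word filter: strip punctuation, drop short words.
    return [w.strip(".,!?").lower() for w in text.split()
            if len(w.strip(".,!?")) > 3]

def unsupported_sentences(source: str, output: str) -> list[str]:
    """Return output sentences poorly grounded in the source text."""
    source_words = set(_content_words(source))
    flagged = []
    for sentence in output.split(". "):
        content = _content_words(sentence)
        if content and sum(w in source_words for w in content) / len(content) < 0.5:
            flagged.append(sentence)
    return flagged

def hallucination_rate(pairs: list[tuple[str, str]]) -> float:
    # Fraction of (source, output) pairs with at least one flagged sentence.
    hallucinated = sum(1 for src, out in pairs if unsupported_sentences(src, out))
    return hallucinated / len(pairs)
```

The key point the sketch captures: an added sentence can be true in the real world and still count against the model, because it is not supported by the source it was asked to summarize.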
Vectara stated the finding directly in a public post on X: “DeepSeek-R1 shows a 14.3% hallucination rate, nearly 4x higher than DeepSeek-V3.”
The pattern is not unique to DeepSeek. Industry trackers note the same trade-off across reasoning-trained models from other labs. Reinforcement learning that sharpens chain-of-thought also rewards bolder and more confident generation.
Why Crypto AI Tokens Sit on This Trade-Off
The crypto market now hosts hundreds of AI agent tokens, led by Virtuals Protocol (VIRTUAL), ai16z (AI16Z), and aixbt (AIXBT).
The category has posted roughly 39.4% growth over a recent 30-day window. Virtuals alone has surpassed $576 million in market capitalization.
Most of these agents wrap a large language model in tooling. That tooling lets the agent post on social media, route trades, mint tokens, or generate market commentary.
When the underlying model fabricates a price level, a partnership, or a contract address, the consequences can land on-chain.
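One common mitigation is to treat model output as untrusted input and gate execution behind a human-curated allowlist. A minimal sketch, assuming a hypothetical EVM-based agent (the function names and allowlist here are illustrative, not any real protocol’s API):

```python
import re

# Hypothetical guardrail for an LLM-driven trading agent: never execute
# against a contract address the model merely asserted. Addresses must be
# well-formed AND appear on a human-curated allowlist.

EVM_ADDRESS = re.compile(r"0x[0-9a-fA-F]{40}")

def safe_to_execute(address: str, allowlist: set[str]) -> bool:
    """Return True only for a syntactically valid, pre-approved address."""
    if not EVM_ADDRESS.fullmatch(address):
        return False
    # Compare case-insensitively; allowlist entries are stored lowercase.
    return address.lower() in allowlist

# Placeholder allowlist entry, not a real contract.
ALLOWLIST = {"0x" + "ab" * 20}
```

The design choice matters more than the code: the allowlist lives outside the model, so a hallucinated address fails closed instead of triggering a trade.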
One BeInCrypto analysis of AIXBT showed the agent had shilled 416 tokens with a 19% average return. The same mechanic, however, exposes followers to bad calls when the model hallucinates.
The risk surface scales with autonomy. Read-only agents that summarize sentiment differ in stakes from agents that hold treasury keys.
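That difference in stakes can be made explicit in an agent’s configuration. A hypothetical tiering scheme (the tier names and levels are assumptions for illustration, not any real agent framework’s API):

```python
from enum import IntEnum

class Autonomy(IntEnum):
    # Illustrative autonomy tiers for an AI agent, ordered by risk.
    READ_ONLY = 0   # summarize sentiment, generate commentary
    SIGNALS = 1     # publish trade calls, no execution rights
    EXECUTION = 2   # route trades with access to treasury keys

def allowed(action_level: Autonomy, agent_level: Autonomy) -> bool:
    # An agent may only perform actions at or below its configured tier.
    return action_level <= agent_level
```

Under this framing, a hallucination by a READ_ONLY agent costs credibility; the same hallucination by an EXECUTION agent can cost the treasury.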
Reasoning models are especially attractive for agents that plan across multiple steps. That is also the use case where Vectara’s 14.3% figure bites hardest.