
DeepSeek

The Chinese AI company that reportedly trained a world-class model for about $6 million, wiped $593 billion off Nvidia's market value in one day, and was called a "Sputnik moment" for American AI dominance.

Table of Contents

Introduction
History
Models Breakdown
Architecture Innovations
The $6 Million Model
The "Sputnik Moment"
Open Source and Licensing
The Chatbot and App
Training Infrastructure
Company Strategy
Controversies
Global Impact
Comparison with Alternatives

🔍

Introduction

Imagine a company that comes out of nowhere, trains a world-class AI model for about the cost of a decent house in San Francisco, releases it for free, and promptly wipes half a trillion dollars off Nvidia's market cap in a single day. That's DeepSeek.

DeepSeek is a Chinese AI company based in Hangzhou, founded in 2023 by Liang Wenfeng, a quantitative hedge fund manager who decided that if he was already using AI to trade stocks, he might as well try to build artificial general intelligence. The company is wholly owned and funded by High-Flyer, Liang's hedge fund, which means it operates with zero pressure to turn a profit.

What makes DeepSeek genuinely fascinating is that they achieved performance competitive with OpenAI's GPT-4 and Anthropic's Claude while spending a fraction of the compute budget. Their V3 model, a 671-billion-parameter beast, was reportedly trained for around $6 million, while OpenAI spent an estimated $100 million on GPT-4. By mid-2026, DeepSeek had released models up to V4, with a 1.6-trillion-parameter architecture and a 1-million-token context window.

📜

History

The DeepSeek story starts not in an AI lab, but on a trading floor. In 2015, Liang Wenfeng, born in 1985 in Guangdong, China, co-founded High-Flyer, a quantitative hedge fund. By 2021, the firm was using AI exclusively for stock trading. Before the US imposed export restrictions on advanced AI chips to China, Liang acquired 10,000 Nvidia A100 GPUs and poured over a billion yuan into building Fire-Flyer 2, a computing cluster.

On April 14, 2023, High-Flyer announced it was launching an AGI research lab. Three months later, that lab was spun off into an independent company: DeepSeek. Venture capitalists were reluctant to invest, so Liang just kept funding it himself through High-Flyer.

Model Release Timeline

Model	Date	Significance
DeepSeek Coder	Nov 2023	First model, code-focused LLM
DeepSeek-MoE	Jan 2024	First MoE architecture
DeepSeek-V2	May 2024	MLA + MoE, 128K context
DeepSeek-V3	Dec 2024	671B MoE, ~$5.6M reported training cost, matched GPT-4o
DeepSeek-R1	Jan 2025	Reasoning model, caused global market shock
DeepSeek-V3.1	Aug 2025	Hybrid thinking/non-thinking modes
DeepSeek-V3.2	Dec 2025	Sparse Attention, Speciale reasoning variant
DeepSeek-V4 Pro/Flash	Apr 2026	1.6T params, 1M context, mHC architecture

🏗️

Models Breakdown: From V2 to V4

DeepSeek-V2 (May 2024) - Introduced Multi-Head Latent Attention (MLA) and a novel Mixture of Experts design with shared experts. MLA compresses the KV cache into a latent space, reducing cache memory by a reported 80-90%. V2 was extremely cheap at 2 RMB per million output tokens.
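
To see where a saving of that size can come from, here is a minimal sketch of the per-token cache arithmetic. The dimensions (32 heads, head size 128, latent size 1024) are illustrative assumptions chosen for this example, not DeepSeek-V2's actual configuration, and the function names are hypothetical.

```python
def full_kv_elems(n_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer by standard multi-head attention:
    one key vector and one value vector for every head."""
    return 2 * n_heads * head_dim


def latent_kv_elems(latent_dim: int) -> int:
    """Elements cached per token per layer when only a shared latent vector
    is stored and keys/values are re-expanded from it at attention time."""
    return latent_dim


def cache_reduction(n_heads: int, head_dim: int, latent_dim: int) -> float:
    """Fraction of KV-cache memory saved by caching the latent instead."""
    return 1.0 - latent_kv_elems(latent_dim) / full_kv_elems(n_heads, head_dim)


# Illustrative (assumed) sizes, not DeepSeek-V2's real hyperparameters.
saving = cache_reduction(n_heads=32, head_dim=128, latent_dim=1024)
print(f"KV cache reduced by {saving:.1%}")
```

The saving depends heavily on the baseline being compared against: a model that already shares keys and values across heads starts from a smaller cache, so the relative gain from a latent cache is smaller.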

DeepSeek-V3 (Dec 2024) - The bombshell. A 671B MoE model with 37B active parameters per token, trained on 14.8 trillion tokens for a reported $5.6 million. Matched GPT-4o on benchmarks. Released under an open-weight license.

DeepSeek-R1 (Jan 2025) - A reasoning model trained with reinforcement learning (RL) to produce chain-of-thought reasoning. R1-Zero was trained with pure RL and no supervised fine-tuning data, spontaneously developing self-reflection and verification behaviors. Distilled versions from 1.5B to 70B were also released.

DeepSeek-V4 (Apr 2026) - Two models: V4-Pro (1.6T params, 49B active) and V4-Flash (284B params, 13B active). Both feature 1M token context window, Manifold-constrained Hyper Connections (mHC), Constrained Sparse Attention, and Heavily Compressed Attention.
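
The gap between total and active parameters is what makes these MoE models cheap to run per token. A small sketch of that arithmetic, using only the parameter counts quoted above (the helper name is made up for this example):

```python
# (total parameters, active parameters per token), in billions, as quoted above.
MODELS = {
    "DeepSeek-V3": (671, 37),
    "DeepSeek-V4-Pro": (1600, 49),
    "DeepSeek-V4-Flash": (284, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's weights that participate in processing one token."""
    return active_b / total_b


for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of weights active per token")
```

On these figures, each token touches only a few percent of the weights, so per-token compute scales with the active count rather than the headline parameter count.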

⚙️

Architecture Innovations

Mixture of Experts (MoE) with Shared Experts - Keeps a set of "shared" experts always active for common knowledge, while "routed" experts handle specialized tasks. Reduces expert imbalance and improves parameter efficiency.
Multi-Head Latent Attention (MLA) - Compresses the KV cache into a low-dimensional latent space, reducing cache memory by a reported 80-90% while maintaining model quality. This makes 128K+ contexts practical with far less GPU memory.
FP8 Training - Most of the forward pass is done in 8-bit floating point, with custom GEMM routines for accurate accumulation at low precision. Dedicates 20 streaming multiprocessors per GPU to inter-GPU communication.
Load Balancing at Scale - Dynamic load-balancing that rearranges which physical machines host which experts every 10 minutes during training, keeping utilization remarkably even.
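
The shared-plus-routed expert idea from the first item above can be sketched in a few lines. This is a toy illustration under stated assumptions, not DeepSeek's implementation: tokens are single floats, experts are plain functions, gating is a generic softmax over the top-k router scores, and every name here is hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Expert = Callable[[float], float]


def route(router_logits: Sequence[float], top_k: int) -> Dict[int, float]:
    """Pick the top_k routed experts and softmax-normalise their gate weights."""
    if not 0 < top_k <= len(router_logits):
        raise ValueError("top_k must be between 1 and the number of routed experts")
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, shared: List[Expert], routed: List[Expert],
                router_logits: Sequence[float], top_k: int) -> float:
    """Shared experts always run; routed experts run only if selected."""
    out = sum(expert(x) for expert in shared)
    for idx, weight in route(router_logits, top_k).items():
        out += weight * routed[idx](x)
    return out


# Toy experts: one shared expert, four routed experts that scale the input.
shared_experts = [lambda x: x + 1.0]
routed_experts = [lambda x, k=k: k * x for k in range(4)]
y = moe_forward(2.0, shared_experts, routed_experts,
                router_logits=[0.1, 2.0, -1.0, 2.0], top_k=2)
print(y)
```

Because the shared expert handles every token, the routed experts are free to specialise, which is the motivation the list item gives for reduced expert imbalance.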

💰

The $6 Million Model

DeepSeek claimed that training V3 cost about $5.6 million in compute. OpenAI reportedly spent around $100 million on GPT-4. The implication was explosive: if DeepSeek could build a GPT-4-class model for roughly 6% of the cost, then the entire narrative about "moats" in AI was wrong.

The $6 million figure is controversial. Critics point out it only covers the final training run, not research, experimentation, data collection, or infrastructure. DeepSeek's total GPU investment was well over a billion yuan. But even if the real number is 10x higher, it's still dramatically less than Western companies spent, and that's because of genuine engineering optimization, not a PR trick.
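
The headline number is simple arithmetic. The sketch below assumes the inputs given in DeepSeek's V3 technical report, roughly 2.788 million H800 GPU-hours priced at an assumed rental rate of $2 per GPU-hour; the $100 million GPT-4 figure is the estimate quoted above, and the function name is made up for this example.

```python
GPU_HOURS = 2_788_000            # H800 GPU-hours reported for the V3 training run
USD_PER_GPU_HOUR = 2.0           # assumed rental price, not an actual invoice
GPT4_ESTIMATE_USD = 100_000_000  # estimate quoted in the text above


def training_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Compute cost of a single training run at a flat hourly GPU price."""
    return gpu_hours * usd_per_gpu_hour


cost = training_cost(GPU_HOURS, USD_PER_GPU_HOUR)
ratio = cost / GPT4_ESTIMATE_USD
print(f"V3 final run: ${cost / 1e6:.2f}M, or {ratio:.1%} of the GPT-4 estimate")
```

This makes the critics' point concrete: the figure is GPU-hours times a rental price for one run, so hardware purchases, failed experiments, and salaries are outside it by construction.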

🚀

The "Sputnik Moment": January 2025 Market Shock

On January 20, 2025, DeepSeek released the R1 model and a free chatbot app. By January 26, it hit #1 on the Apple App Store, surpassing ChatGPT. By January 27, the financial world was in chaos.

Nvidia's stock crashed about 17% in a single day
The company lost $593 billion in market value, the largest single-day loss for any company in US stock market history
The Nasdaq Composite fell over 3%
The sell-off spread globally, affecting semiconductor and AI stocks worldwide

Pundits called it a "Sputnik moment" for the US in AI, a wake-up call that American dominance couldn't be taken for granted, referencing the Soviet Union's 1957 satellite launch that triggered the space race.

🔓

Open Source and Licensing

Starting with R1 in January 2025, all major DeepSeek models have been released under the MIT License, one of the most permissive open-source licenses available. The models are technically "open-weight" (trained parameters are public, but training data is not). By releasing under MIT while OpenAI moved toward closed models, DeepSeek positioned itself as the open, accessible alternative, earning enormous goodwill in the open-source AI community.

📱

The Chatbot and App

The DeepSeek chatbot, powered by R1, launched on January 20, 2025, for iOS and Android. It was free with no usage limits. Within a week, it was the most downloaded free app in the US App Store. Users loved seeing R1's detailed reasoning traces inside <think> tags, the self-corrections, the moment where it caught its own mistakes. OpenAI's o1 kept reasoning hidden; DeepSeek showed you everything.
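
For anyone working with raw R1 output, separating the visible reasoning from the final answer is a small parsing job. A minimal sketch, assuming the reasoning arrives wrapped in <think>...</think> tags as described above; the function name and sample string are invented for illustration.

```python
import re
from typing import List, Tuple

# Non-greedy match so multiple <think> blocks are captured separately;
# DOTALL lets a reasoning trace span several lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[List[str], str]:
    """Return (reasoning traces, final answer) from a model response."""
    traces = [trace.strip() for trace in _THINK_RE.findall(text)]
    answer = _THINK_RE.sub("", text).strip()
    return traces, answer


sample = "<think>2 + 2 = 5? No, that is wrong. It is 4.</think>\nThe answer is 4."
traces, answer = split_reasoning(sample)
print(traces)
print(answer)
```

A response with no tags simply yields an empty trace list and the full text as the answer, so the helper is safe to apply to non-reasoning models too.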

🖥️

Training Infrastructure

Because of US export restrictions, DeepSeek couldn't get Nvidia's best chips (H100/B200). They worked with Nvidia H800 GPUs (a downgraded export-compliant version) and their own A100 clusters.

DeepSeek didn't use off-the-shelf PyTorch with NCCL. They built their own parallel training stack from the ground up:

3FS (Fire-Flyer File System) - Distributed parallel file system using Direct I/O and RDMA Read to stream data directly from storage to GPU memory.
hfreduce - Custom replacement for Nvidia's NCCL communication library, optimized for gradient allreduce, running asynchronously on CPU.
HaiScale DDP - Parallel training library supporting data, pipeline, tensor, expert parallelism, FSDP, and ZeRO optimization.
HAI Platform - Task scheduling, fault handling, and disaster recovery for the cluster.
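
For readers unfamiliar with the operation hfreduce accelerates, the sketch below shows what a gradient allreduce computes: every worker ends up with the same averaged gradient. It is a single-process illustration of the semantics only and says nothing about how hfreduce or NCCL implement it; the function name is hypothetical.

```python
from typing import List


def allreduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Average gradients element-wise and hand every worker the same result."""
    if not worker_grads:
        raise ValueError("need at least one worker")
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same shape")
    n = len(worker_grads)
    # Reduce step: sum each gradient element across workers, then average.
    mean = [sum(g[i] for g in worker_grads) / n for i in range(length)]
    # Broadcast step: each worker receives its own copy of the averaged gradient.
    return [list(mean) for _ in range(n)]


# Two workers, each with a two-element gradient from its own data shard.
synced = allreduce_mean([[1.0, 2.0], [3.0, 6.0]])
print(synced)
```

Real implementations avoid funnelling everything through one place; the engineering effort goes into doing this reduction across thousands of GPUs without stalling the training step.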

🎯

Company Strategy

No Commercialization Pressure - Funded by High-Flyer hedge fund, zero VC pressure to monetize. Models released for free, research published openly.
Unconventional Hiring - Emphasizes skills over credentials. Many hires are fresh graduates with no prior work experience. Recruits from outside CS: poets, mathematicians, domain experts.
Skirting Regulations - Positioned as a research lab rather than a consumer AI product, operating under more lenient regulatory frameworks.

⚠️

Controversies

Distillation Accusations - Anthropic formally accused DeepSeek of using thousands of fraudulent accounts to generate millions of conversations with Claude, then using those conversations to train their own models. DeepSeek has not publicly addressed these allegations.
Content Censorship - Models closely follow Chinese Communist Party ideology. The R1-0528 update was more tightly aligned with government positions.
Military-Academic Ties - Dozens of DeepSeek researchers have affiliations with People's Liberation Army laboratories and China's defense-oriented research institutions.
Chip Export Controls - Success with restricted hardware strengthened arguments for tighter US controls. Chinese authorities are reportedly encouraging adoption of Huawei Ascend chips.
Venture Capital Pivot - In April 2026, DeepSeek began speaking with investors about a $300 million funding round at a $10 billion valuation, a shift from years of saying "we're not commercializing."

🌍

Global Impact

US-China AI Competition - Fundamentally changed the narrative. Before DeepSeek: "The US leads in AI." After: "Chip restrictions might not matter as much as we thought."
African Expansion - DeepSeek models are less power-hungry and more affordable, making them attractive for African markets. They have reportedly bolstered African-language models and helped spawn AI startups in Nairobi.
China's AI Ecosystem - Triggered a wave of investment. ByteDance, Tencent, Baidu, and Alibaba all cut prices in response. DeepSeek was dubbed the "Pinduoduo of AI" for extreme affordability.

⚖️

Comparison with Alternatives

Aspect	DeepSeek	OpenAI GPT-4o	Claude Sonnet
Training cost	~$6M (V3)	~$100M	~$50M+
Open weights	MIT License	Closed/API only	API only
Context window	1M tokens (V4)	128K tokens	200K tokens
Reasoning visible	Public think traces	Hidden	Hidden
Multimodal	Text only (strong)	Vision, voice	Vision, audio
Math & coding	Industry-leading	Excellent	Excellent