DeepSeek
The Chinese AI company that trained a world-class model for $6 million, wiped $593 billion off Nvidia in one day, and became the "Sputnik moment" for American AI dominance.
Introduction
Imagine a company that comes out of nowhere, trains a world-class AI model for about the cost of a decent house in San Francisco, releases it for free, and promptly wipes more than half a trillion dollars off Nvidia's market cap in a single day. That's DeepSeek.
DeepSeek is a Chinese AI company based in Hangzhou, founded in 2023 by Liang Wenfeng, a quantitative hedge fund manager who decided that if he was already using AI to trade stocks, he might as well try to build artificial general intelligence. The company is wholly owned and funded by High-Flyer, Liang's hedge fund, which means it operates with zero pressure to turn a profit.
What makes DeepSeek genuinely fascinating is that it achieved performance competitive with OpenAI's GPT-4 and Anthropic's Claude on a fraction of the compute budget. Its V3 model, a 671-billion-parameter beast, was reportedly trained for around $6 million, while OpenAI spent an estimated $100 million on GPT-4. By mid-2026, DeepSeek had released models up to V4, with a 1.6-trillion-parameter architecture and a 1-million-token context window.
History
The DeepSeek story starts not in an AI lab, but on a trading floor. In 2015, Liang Wenfeng, born in 1985 in Guangdong, China, co-founded High-Flyer, a quantitative hedge fund. By 2021, the firm was using AI exclusively for stock trading. Before the US imposed export restrictions on advanced AI chips to China, Liang acquired 10,000 Nvidia A100 GPUs and poured over a billion yuan into building Fire-Flyer 2, a computing cluster.
On April 14, 2023, High-Flyer announced it was launching an AGI research lab. Three months later, that lab was spun off into an independent company: DeepSeek. Venture capitalists were reluctant to invest, so Liang just kept funding it himself through High-Flyer.
Model Release Timeline
| Model | Date | Significance |
|---|---|---|
| DeepSeek Coder | Nov 2023 | First model, code-focused LLM |
| DeepSeek-MoE | Jan 2024 | First MoE architecture |
| DeepSeek-V2 | May 2024 | MLA + MoE, 128K context |
| DeepSeek-V3 | Dec 2024 | 671B MoE, $6M training cost, matched GPT-4o |
| DeepSeek-R1 | Jan 2025 | Reasoning model, caused global market shock |
| DeepSeek-V3.1 | Aug 2025 | Hybrid thinking/non-thinking modes |
| DeepSeek-V3.2 | Dec 2025 | Sparse Attention, Speciale reasoning variant |
| DeepSeek-V4 Pro/Flash | Apr 2026 | 1.6T params, 1M context, mHC architecture |
Models Breakdown: From V2 to V4
DeepSeek-V2 (May 2024) - Introduced Multi-Head Latent Attention (MLA) and a novel Mixture of Experts design with shared experts. MLA compresses the key-value (KV) cache into a latent space, reducing cache memory by 80-90%. V2 was extremely cheap to use, at 2 RMB per million output tokens.
DeepSeek-V3 (Dec 2024) - The bombshell. A 671B MoE model with 37B active parameters per token, trained on 14.8 trillion tokens for $5.6 million. Matched GPT-4o on benchmarks. Released under an open-weight license.
DeepSeek-R1 (Jan 2025) - A reasoning model trained with reinforcement learning (RL) to produce chain-of-thought reasoning. Its precursor, R1-Zero, was trained with pure RL and no supervised fine-tuning, and spontaneously developed self-reflection and verification behaviors. Distilled versions from 1.5B to 70B parameters were also released.
DeepSeek-V4 (Apr 2026) - Two models: V4-Pro (1.6T params, 49B active) and V4-Flash (284B params, 13B active). Both feature a 1M-token context window, Manifold-constrained Hyper Connections (mHC), Constrained Sparse Attention, and Heavily Compressed Attention.
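The cache saving credited to MLA in the V2 entry above comes from storing one small latent vector per token instead of full keys and values for every attention head. The sketch below works through that arithmetic. All sizes in it (`N_HEADS`, `HEAD_DIM`, `LATENT_DIM`, `LAYERS`, `TOKENS`) are hypothetical values chosen for illustration, not DeepSeek's actual configuration, and the real saving depends on the baseline being compared against.

```python
def kv_cache_bytes(tokens, layers, per_token_elems, bytes_per_elem=2):
    """Total cache size for `tokens` cached positions across `layers` layers."""
    return tokens * layers * per_token_elems * bytes_per_elem

# Hypothetical sizes for illustration only (not DeepSeek's real configuration).
N_HEADS, HEAD_DIM, LATENT_DIM = 32, 128, 1024
LAYERS, TOKENS = 60, 128_000

mha_elems = 2 * N_HEADS * HEAD_DIM  # keys + values stored for every head
mla_elems = LATENT_DIM              # one compressed latent vector per token

mha_bytes = kv_cache_bytes(TOKENS, LAYERS, mha_elems)
mla_bytes = kv_cache_bytes(TOKENS, LAYERS, mla_elems)
saving = 1 - mla_bytes / mha_bytes

print(f"standard cache: {mha_bytes / 2**30:.1f} GiB")
print(f"latent cache:   {mla_bytes / 2**30:.1f} GiB")
print(f"saving:         {saving:.1%}")  # 87.5% with these assumed sizes
```

With these assumed sizes the saving lands inside the 80-90% range quoted above. A smaller latent or a larger head count would push it higher.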
Architecture Innovations
- Mixture of Experts (MoE) with Shared Experts - Keeps a set of "shared" experts always active for common knowledge, while "routed" experts handle specialized tasks. Reduces expert imbalance and improves parameter efficiency.
- Multi-Head Latent Attention (MLA) - Compresses the KV cache into a low-dimensional latent space, reducing cache memory by 80-90% while maintaining model quality. Makes 128K+ contexts far more practical to serve.
- FP8 Training - Most of the forward pass runs in 8-bit floating point, with custom GEMM routines that keep accumulation accurate at low precision. Twenty streaming multiprocessors per GPU are dedicated to inter-GPU communication.
- Load Balancing at Scale - Dynamic load balancing that rearranges which physical machines host which experts roughly every 10 minutes (the V3 report describes this for the serving cluster), keeping utilization remarkably even.
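The shared-plus-routed expert idea from the first item in this list can be sketched in a few lines. This is a toy illustration, not DeepSeek's implementation: the experts are plain scalar functions, the router logits are passed in by hand, and the softmax gating with renormalized top-k weights is an assumption made for simplicity. The names `moe_layer`, `shared_experts`, and `routed_experts` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_logits, shared_experts, routed_experts, top_k=2):
    """Combine always-on shared experts with the top_k routed experts."""
    # Shared experts see every token, regardless of the router.
    out = sum(expert(x) for expert in shared_experts)

    # The router scores each routed expert; only the top_k are evaluated.
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]

    # Renormalize the selected gate weights so they sum to 1.
    gate_total = sum(probs[i] for i in chosen)
    for i in chosen:
        out += (probs[i] / gate_total) * routed_experts[i](x)
    return out, sorted(chosen)

# Toy experts: one shared identity expert, four routed "multiply by k" experts.
shared = [lambda x: x]
routed = [lambda x, k=k: k * x for k in range(4)]

output, active = moe_layer(2.0, [0.1, 2.0, -1.0, 2.0], shared, routed)
print(active, output)
```

Only two of the four routed experts run for this token, which is the same mechanism that lets a 671B-parameter model activate about 37B parameters per token.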
The $6 Million Model
DeepSeek claimed that training V3 cost about $5.6 million in compute. OpenAI reportedly spent around $100 million on GPT-4. The implication was explosive: if DeepSeek could build a GPT-4-class model for roughly 5 to 6% of the cost, then the entire narrative about "moats" in AI was wrong.
The $6 million figure is controversial. Critics point out it only covers the final training run, not research, experimentation, data collection, or infrastructure. DeepSeek's total GPU investment was well over a billion yuan. But even if the real number is 10x higher, it's still dramatically less than Western companies spent, and that's because of genuine engineering optimization, not a PR trick.
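For readers who want to see where the headline number comes from, the arithmetic is short. The GPU-hour breakdown and the $2 per GPU-hour rental price below are taken from DeepSeek's V3 technical report rather than from this article, and the $100 million GPT-4 figure is the same rough estimate quoted above, so treat the output as a back-of-the-envelope check and not an audited cost.

```python
# Figures as reported in DeepSeek's V3 technical report (not stated in this article).
GPU_HOURS = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
RENTAL_USD_PER_GPU_HOUR = 2.0  # H800 rental price assumed in the report

total_hours = sum(GPU_HOURS.values())
v3_cost = total_hours * RENTAL_USD_PER_GPU_HOUR

GPT4_ESTIMATE = 100e6  # rough outside estimate quoted in this article

print(f"GPU-hours:               {total_hours:,}")
print(f"V3 final-run cost:       ${v3_cost / 1e6:.3f}M")
print(f"share of GPT-4 estimate: {v3_cost / GPT4_ESTIMATE:.1%}")
print(f"10x scenario:            ${10 * v3_cost / 1e6:.1f}M")
```

Even the 10x scenario stays at a little over half of the GPT-4 estimate, which is the point the paragraph above is making.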
The "Sputnik Moment", January 2025 Market Shock
On January 20, 2025, DeepSeek released the R1 model and a free chatbot app. By January 26, it hit #1 on the Apple App Store, surpassing ChatGPT. By January 27, the financial world was in chaos.
- Nvidia's stock fell about 17% in a single day
- The company lost $593 billion in market value, the largest single-day loss for any company in US stock market history
- The Nasdaq Composite fell over 3%
- The sell-off spread globally, affecting semiconductor and AI stocks worldwide
Pundits called it a "Sputnik moment" for the US in AI, a wake-up call that American dominance couldn't be taken for granted, referencing the Soviet Union's 1957 satellite launch that triggered the space race.
Open Source and Licensing
Starting with R1 in January 2025, all major DeepSeek models have been released under the MIT License, one of the most permissive open-source licenses available. The models are technically "open-weight" (trained parameters are public, but training data is not). By releasing under MIT while OpenAI moved toward closed models, DeepSeek positioned itself as the open, accessible alternative, earning enormous goodwill in the open-source AI community.
The Chatbot and App
The DeepSeek chatbot, powered by R1, launched on January 20, 2025, for iOS and Android. It was free with no usage limits. Within a week, it was the most downloaded free app in the US App Store. Users loved seeing R1's detailed reasoning traces inside <think> tags, the self-corrections, the moment where it caught its own mistakes. OpenAI's o1 kept reasoning hidden; DeepSeek showed you everything.
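Because the reasoning arrives as ordinary text wrapped in `<think>` tags, separating it from the final answer takes only a few lines. The sketch below is a minimal example that assumes the response is a single string with well-formed, non-nested tags. The helper name `split_reasoning` and the sample text are made up for illustration.

```python
import re

# Non-greedy match so that multiple <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer) from a response containing <think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

sample = "<think>17 * 3 = 51. Check: 51 / 3 = 17.</think>\nThe answer is 51."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:   ", answer)
```

A response with no `<think>` block comes back with an empty reasoning string and the full text as the answer.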
Training Infrastructure
Because of US export restrictions, DeepSeek couldn't get Nvidia's best chips (H100/B200). They worked with Nvidia H800 GPUs (a downgraded export-compliant version) and their own A100 clusters.
DeepSeek didn't use off-the-shelf PyTorch with NCCL. They built their own parallel training stack from the ground up:
- 3FS (Fire-Flyer File System) - Distributed parallel file system using Direct I/O and RDMA Read to stream data directly from storage to GPU memory.
- hfreduce - Custom replacement for Nvidia's NCCL communication library, optimized for gradient allreduce, running asynchronously on CPU.
- HaiScale DDP - Parallel training library supporting data, pipeline, tensor, expert parallelism, FSDP, and ZeRO optimization.
- HAI Platform - Task scheduling, fault handling, and disaster recovery for the cluster.
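For context on what a gradient allreduce does, the sketch below shows only the semantics of the operation: every worker starts with its own gradients and ends with the same averaged result. It is a naive single-process reduce-then-broadcast written for illustration. It is not hfreduce's algorithm and it ignores the networking, asynchrony, and CPU offload that make the real library fast. The function name `allreduce_mean` is hypothetical.

```python
def allreduce_mean(worker_grads):
    """Give every worker the element-wise mean of all workers' gradients."""
    n_workers = len(worker_grads)
    # Reduce: sum the gradients element by element across workers.
    summed = [sum(vals) for vals in zip(*worker_grads)]
    mean = [s / n_workers for s in summed]
    # Broadcast: each worker receives its own copy of the result.
    return [list(mean) for _ in range(n_workers)]

# Three simulated workers, each holding gradients for two parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = allreduce_mean(grads)
print(synced)  # every worker now holds [3.0, 4.0]
```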
Company Strategy
- No Commercialization Pressure - Funded by High-Flyer hedge fund, zero VC pressure to monetize. Models released for free, research published openly.
- Unconventional Hiring - Emphasizes skills over credentials. Many hires are fresh graduates with no prior work experience. Recruits from outside CS: poets, mathematicians, domain experts.
- Skirting Regulations - Positioned as a research lab rather than consumer AI product, operating under more lenient regulatory frameworks.
Controversies
- Distillation Accusations - Anthropic formally accused DeepSeek of using thousands of fraudulent accounts to generate millions of conversations with Claude, then using those conversations to train their own models. DeepSeek has not publicly addressed these allegations.
- Content Censorship - Models closely follow Chinese Communist Party ideology. The R1-0528 update was more tightly aligned with government positions.
- Military-Academic Ties - Dozens of DeepSeek researchers have affiliations with People's Liberation Army laboratories and China's defense-oriented research institutions.
- Chip Export Controls - Success with restricted hardware strengthened arguments for tighter US controls. Chinese authorities are reportedly encouraging adoption of Huawei Ascend chips.
- Venture Capital Pivot - In April 2026, DeepSeek reportedly began speaking with investors about a $300 million funding round at a $10 billion valuation, a shift from years of saying "we're not commercializing."
Global Impact
- US-China AI Competition - Fundamentally changed the narrative. Before DeepSeek: "The US leads in AI." After: "Chip restrictions might not matter as much as we thought."
- African Expansion - DeepSeek models are less power-hungry and more affordable, making them attractive for African markets. Bolstered African language models and spawned AI startups in Nairobi.
- China's AI Ecosystem - Triggered a wave of investment. ByteDance, Tencent, Baidu, and Alibaba all cut prices in response. DeepSeek was dubbed the "Pinduoduo of AI" for extreme affordability.
Comparison with Alternatives
| Aspect | DeepSeek | OpenAI GPT-4o | Claude Sonnet |
|---|---|---|---|
| Training cost | ~$6M (V3, final run only) | ~$100M (GPT-4, est.) | ~$50M+ (est.) |
| Open weights | MIT License | Closed/API only | API only |
| Context window | 1M (V4) | 128K | 200K |
| Reasoning visible | Public think traces | Hidden | Hidden |
| Multimodal | Text only (strong) | Vision, voice | Vision |
| Math & coding | Industry-leading | Excellent | Excellent |