How to read this catalog
This page covers the full release history of the DeepSeek model family. The ai-model.html overview page focuses on the current generation for readers who want depth on what is shipping today; this page maps the complete timeline for readers who need the broader context of how the family has evolved.
The DeepSeek model family is organised into three persistent branches — a general-purpose chat branch (V series), a reasoning-tuned branch (R series), and a code-specialised branch (Coder series) — plus a research line exploring mixture-of-experts architectures (MoE series) and a set of distillation and specialised variants that appeared alongside the R1 release. Within each branch, parameter sweeps produce small, mid-size, and flagship variants that target different hardware budgets.
The version numbering is consistent enough once you know the pattern. The major version number (V1, V2, V3) marks the generation of the general-purpose branch. The R prefix marks the reasoning branch derived from that generation — R1 is derived from the V3 generation's training approach, not a separate pre-training from scratch. The Coder suffix marks code-specialised fine-tunes. A trailing parameter count is self-explanatory. The suffixes -Base, -Chat, and -Instruct mark the pre-trained base checkpoint, the chat-tuned variant, and the instruction-tuned variant respectively.
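The naming pattern above can be captured mechanically. The sketch below is illustrative, not an official parser: the regex, the field names, and the exact shape of the alternation are this page's own invention, and it covers only the name forms described in the paragraph above.

```python
import re

# Hypothetical name parser for the conventions described above.
# Branch, parameter count, and variant suffix are optional pieces
# after the "DeepSeek-" prefix; anything outside this pattern
# (e.g. distills that embed a base-model name) will not match.
NAME_RE = re.compile(
    r"DeepSeek-"
    r"(?P<branch>V\d|R\d(?:-Zero)?|Coder(?:-V\d)?|MoE)"
    r"(?:-Distill)?"
    r"(?:-(?P<params>[\d.]+B))?"
    r"(?:-(?P<variant>Base|Chat|Instruct))?$"
)

def parse_name(name: str) -> dict:
    """Split a DeepSeek model name into branch / params / variant fields."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognised model name: {name}")
    # Drop fields that were absent from the name.
    return {k: v for k, v in m.groupdict().items() if v is not None}
```

For example, `parse_name("DeepSeek-V3-Instruct")` yields the branch `V3` and the variant `Instruct`, with no parameter field because the flagship name omits it.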
Early generations: 2023 foundations
The DeepSeek model family began with foundational dense architecture releases in 2023, establishing the training data curation and evaluation methodology that later generations built on.
The first public DeepSeek releases appeared in late 2023. The initial V1 generation established the basic model family structure — a dense transformer architecture, a focus on multilingual coverage with particular strength in Chinese and English, and a policy of publishing both the base and instruction-tuned checkpoints at each parameter tier. The early Coder releases appeared in the same period, establishing the code-specialised branch with a fine-tuning corpus weighted toward programming languages.
What distinguished these early releases from contemporaneous open-weight offerings was the transparency of the technical reporting. Each release was accompanied by a brief model card noting training data composition, known weaknesses, and evaluation methodology. That transparency policy has persisted through subsequent generations, making the DeepSeek family one of the better-documented open-weight families for researchers who need to audit what a model was trained on.
V2 generation and MoE architecture
The V2 generation introduced a mixture-of-experts (MoE) architecture to the flagship tier while retaining dense architectures for the smaller variants. The MoE approach uses a gating mechanism to route each token to a subset of expert sub-networks, enabling a large total parameter count while keeping the compute cost per token comparable to a much smaller dense model. The DeepSeek-MoE research paper published alongside V2 provided architectural details that influenced how the community understood the trade-offs involved.
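The routing idea can be sketched in a few lines. This is a minimal top-k gating illustration, not DeepSeek's actual routing: the real DeepSeek-MoE design adds shared experts, fine-grained expert segmentation, and load-balancing terms that this sketch omits.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token's hidden state x to its top-k experts.

    x:       (d,) token hidden state
    gate_w:  (d, n_experts) gating weights
    experts: list of callables, each mapping (d,) -> (d,)
    """
    scores = x @ gate_w                   # one gating score per expert
    top = np.argsort(scores)[-k:]         # indices of the k highest-scoring experts
    probs = np.exp(scores[top] - scores[top].max())
    probs /= probs.sum()                  # softmax over the selected experts only
    # Weighted sum of the chosen experts' outputs. The unselected
    # experts are never evaluated, which is where the per-token
    # compute saving over a dense model of equal total size comes from.
    return sum(p * experts[i](x) for p, i in zip(probs, top))
```

With `k=2` out of, say, 8 experts, each token pays for two expert forward passes while the model's total capacity spans all eight — the same total-versus-activated distinction the V2 and V3 parameter counts describe.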
The V2 generation also expanded the parameter sweep: the flagship variant grew to a significantly larger total parameter count than V1 while keeping the activated-per-token parameter count lower, which improved cost-per-token at inference time for high-throughput deployments. V2 checkpoints remain on Hugging Face for teams that have existing fine-tunes or evaluation results tied to that generation and cannot easily migrate.
Page Pulse
If you need to pick a DeepSeek model variant today for a new project: the V3 Instruct checkpoint is the right starting point for general chat and instruction-following tasks; R1 is the right starting point for reasoning-heavy tasks where latency is secondary; DeepSeek Coder V2 Instruct is the right starting point for code generation workloads. Everything else in this catalog is either a research artifact, a legacy generation, or a distillation variant for resource-constrained deployment.
V3 and R1 generation: 2024–2025
The V3 generation, published in late 2024, represented the largest single capability jump in the family's history. The V3 flagship is an MoE architecture with 671 billion total parameters and approximately 37 billion activated parameters per token — a configuration that produces flagship-class output quality at a compute cost well below what a dense 671B model would require. The instruction-tuned V3 variant quickly became the community benchmark against which other open-weight instruction-following models were compared.
The R1 release followed in early 2025 and introduced a reasoning-focused fine-tuning approach using reinforcement learning from outcome feedback rather than the standard RLHF preference approach. R1's inference-time chain-of-thought mechanism — where the model produces an internal reasoning trace before the final answer — produced results on math, logic, and code benchmarks that placed it competitively against closed-weight models at the frontier. The R1 release also included a set of distillation variants (R1-Distill-7B through R1-Distill-70B) derived by training smaller Llama and Qwen base models on R1-generated reasoning traces.
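When serving a reasoning model of this kind, the trace usually needs to be separated from the final answer before display or logging. The helper below is a sketch assuming the trace is delimited by `<think>` tags; the tag name and the helper itself are assumptions for illustration, and your serving stack may use a different convention.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning trace, final answer).

    Assumes the trace is wrapped in <think>...</think>; if no trace
    is present, the whole text is treated as the answer.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    trace = m.group(1).strip()
    answer = text[m.end():].strip()   # everything after the closing tag
    return trace, answer
```

Keeping the split explicit also makes it easy to log traces for auditing while showing users only the final answer.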
For current-generation depth on V3 and R1, see the dedicated V3 page and R1 page. For the code-specialised branch, see DeepSeek Coder.
DeepSeek models: family branch, release names, and parameter classes
| Family branch | Release names | Parameter classes |
| --- | --- | --- |
| General-purpose V series | DeepSeek-V1, V2, V3 (Base and Instruct variants) | 7B, 16B, 67B (V1/V2); 7B, 32B, 671B (V3 MoE) |
| Reasoning R series | DeepSeek-R1, R1-Zero, R1-Distill variants | 7B, 8B, 14B, 32B, 70B (distills); 671B (flagship R1) |
| Code-specialised Coder series | DeepSeek-Coder V1, Coder V2, Coder-Instruct | 1.3B, 6.7B, 33B (V1); 16B, 236B (V2 MoE) |
| MoE architecture research | DeepSeek-MoE, DeepSeekMoE-16B | 16B total (2.8B activated per token); research-scale |
| Specialised and multimodal | DeepSeek-VL, DeepSeek-Math, DeepSeek-Prover | Varies by task; typically 7B–67B range |
The specialised and multimodal variants deserve brief mention: DeepSeek-VL is a vision-language model that extends the V series to image understanding; DeepSeek-Math is a mathematics-specialised variant fine-tuned on a mathematical corpus; DeepSeek-Prover applies the reasoning approach to formal theorem proving. These are narrower in intended use than the main branches but represent the breadth of what the lab has published publicly. See the documentation index for the full navigation structure of this reference site, and the multi-family comparison page for how the DeepSeek catalog compares with Llama, Mistral, Qwen, and Gemma.