DeepSeek models: a catalog overview of every release

A complete historical catalog of the DeepSeek model family — organised by year and family branch, with guidance on version naming, parameter class conventions, and where each variant lives.

How to read this catalog

This page covers the full release history of the DeepSeek model family. The ai-model.html overview page focuses on the current generation for readers who want depth on what is shipping today; this page maps the complete timeline for readers who need the broader context of how the family has evolved.

The DeepSeek model family is organised into three persistent branches — a general-purpose chat branch (V series), a reasoning-tuned branch (R series), and a code-specialised branch (Coder series) — plus a research line exploring mixture-of-experts architecture (MoE series) and a set of distillation and specialised variants that appeared alongside the R1 release. Within each branch, parameter sweeps produce small, mid-size, and flagship variants that target different hardware budgets.

The version numbering is consistent enough once you know the pattern. The major version number (V1, V2, V3) marks the generation of the general-purpose branch. The R prefix marks the reasoning branch derived from that generation — R1 is derived from the V3 generation's training approach, not a separate pre-training from scratch. The Coder suffix marks code-specialised fine-tunes. A trailing parameter count (7B, 67B, 671B) gives the approximate total parameter count. The suffixes -Base, -Chat, and -Instruct mark the pre-trained base checkpoint, the chat-tuned variant, and the instruction-tuned variant respectively.
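To make the pattern concrete, here is a small illustrative sketch that splits a checkpoint name into branch, size, and tuning variant. The regular expression and field names are assumptions made for this page, not an official DeepSeek naming grammar.

```python
import re

# Illustrative only: a rough parser for the naming pattern described above.
# The pattern and field names are assumptions made for this page, not an
# official convention beyond what the prose states.
NAME_RE = re.compile(
    r"DeepSeek-"
    r"(?P<branch>V\d|R\d|Coder|MoE|VL|Math|Prover)"            # generation or branch
    r"(?:-(?P<size>\d+(?:\.\d+)?B))?"                          # optional parameter count, e.g. 67B
    r"(?:-(?P<variant>Base|Chat|Instruct|Zero|Distill.*))?$"   # optional tuning suffix
)

def parse_name(name: str) -> dict:
    m = NAME_RE.match(name)
    return m.groupdict() if m else {}

for n in ["DeepSeek-V3-Base", "DeepSeek-R1-Zero", "DeepSeek-Coder-33B-Instruct"]:
    print(n, "->", parse_name(n))
```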

Early generations: 2023 foundations

The DeepSeek model family began with foundational dense architecture releases in 2023, establishing the training data curation and evaluation methodology that later generations built on.

The first public DeepSeek releases appeared in late 2023. The initial V1 generation established the basic model family structure — a dense transformer architecture, a focus on multilingual coverage with particular strength in Chinese and English, and a policy of publishing both the base and the instruction-tuned checkpoints at each parameter tier. The early Coder releases appeared in the same period, establishing the code-specialised branch with a fine-tuning corpus weighted toward programming languages.

What distinguished these early releases from contemporaneous open-weight offerings was the transparency of the technical reporting. Each release was accompanied by a brief model card noting training data composition, known weaknesses, and evaluation methodology. That transparency policy has persisted through subsequent generations, making the DeepSeek family one of the better-documented open-weight families for researchers who need to audit what a model was trained on.

V2 generation and MoE architecture

The V2 generation introduced a mixture-of-experts (MoE) architecture to the flagship tier while retaining dense architectures for the smaller variants. The MoE approach uses a gating mechanism to route each token to a subset of expert sub-networks, enabling a large total parameter count while keeping the compute cost per token comparable to a much smaller dense model. The DeepSeek-MoE research paper published ahead of V2 provided architectural details that influenced how the community understood the trade-offs involved.
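To make the routing idea concrete, the following minimal Python sketch implements top-k expert gating. The expert count, top-k value, and layer sizes are arbitrary illustration values, not the DeepSeek-MoE configuration.

```python
import numpy as np

# Minimal sketch of top-k mixture-of-experts routing. Shapes, expert count,
# and k are arbitrary for illustration; they are not the DeepSeek-MoE config.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Each "expert" here is a single small weight matrix standing in for an FFN.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ gate_w                                   # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]         # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()                          # softmax over the chosen experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])             # only k of n_experts run per token
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 64): full-width output at ~k/n of the expert compute
```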

The V2 generation also expanded the parameter sweep: the flagship variant grew to a significantly larger total parameter count than V1 while keeping the activated-per-token parameter count lower, which improved cost-per-token at inference time for high-throughput deployments. V2 checkpoints remain on Hugging Face for teams that have existing fine-tunes or evaluation results tied to that generation and cannot easily migrate.

Page Pulse

If you need to pick a DeepSeek model variant today for a new project: the V3 Instruct checkpoint is the right starting point for general chat and instruction-following tasks; R1 is the right starting point for reasoning-heavy tasks where latency is secondary; DeepSeek Coder V2 Instruct is the right starting point for code generation workloads. Everything else in this catalog is either a research artifact, a legacy generation, or a distillation variant for resource-constrained deployment.

V3 and R1 generation: 2024–2025

The V3 generation, published in late 2024, represented the largest single capability jump in the family's history. The V3 flagship is an MoE architecture with 671 billion total parameters and approximately 37 billion activated parameters per token — a configuration that produces flagship-class output quality at a compute cost well below what a dense 671B model would require. The instruction-tuned V3 variant quickly became the community benchmark against which other open-weight instruction-following models were compared.
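A quick back-of-the-envelope using the figures quoted above shows why: per-token compute in an MoE model scales roughly with the activated parameter count, so V3 runs in about the compute class of a ~37B dense model despite holding 671B parameters in total. The proportionality is an approximation, not an exact cost model.

```python
# Back-of-the-envelope using the figures quoted above (671B total, ~37B activated).
# Treating per-token FLOPs as roughly proportional to activated parameters is an
# approximation, not an exact cost model.
total_params = 671e9
activated_params = 37e9

fraction_active = activated_params / total_params
print(f"Activated fraction per token: {fraction_active:.1%}")          # ~5.5%
print(f"Rough dense-equivalent compute class: ~{activated_params / 1e9:.0f}B")
```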

The R1 release followed in early 2025 and introduced a reasoning-focused fine-tuning approach using reinforcement learning from outcome feedback rather than the standard RLHF preference approach. R1's inference-time chain-of-thought mechanism — where the model produces an internal reasoning trace before the final answer — produced results on math, logic, and code benchmarks that placed it competitively against closed-weight models at the frontier. The R1 release also included a set of distillation variants (R1-Distill checkpoints from 1.5B to 70B) derived by training smaller Llama and Qwen base models on R1-generated reasoning traces.
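As a rough illustration of what distillation on reasoning traces looks like at the data level, the sketch below packages a teacher trace and final answer into a supervised fine-tuning record. The field names and the <think> framing are illustrative assumptions, not the released R1 distillation format.

```python
# Hypothetical sketch of building a supervised fine-tuning record from a
# teacher-generated reasoning trace. Field names and the <think> framing are
# illustrative assumptions, not the released R1 distillation data format.
def build_distill_record(prompt: str, reasoning_trace: str, final_answer: str) -> dict:
    target = f"<think>\n{reasoning_trace}\n</think>\n{final_answer}"
    return {"prompt": prompt, "completion": target}

record = build_distill_record(
    prompt="What is 17 * 24?",
    reasoning_trace="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    final_answer="408",
)
print(record["completion"])
```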

For current-generation depth on V3 and R1, see the dedicated V3 page and R1 page. For the code-specialised branch, see DeepSeek Coder.

DeepSeek models: family branch, release names, and parameter classes
| Family branch | Release names | Parameter classes |
| --- | --- | --- |
| General-purpose V series | DeepSeek-V1, V2, V3 (Base and Instruct variants) | 7B, 67B (V1); 16B, 236B (V2 MoE); 671B (V3 MoE) |
| Reasoning R series | DeepSeek-R1, R1-Zero, R1-Distill variants | 1.5B, 7B, 8B, 14B, 32B, 70B (distills); 671B (flagship R1) |
| Code-specialised Coder series | DeepSeek-Coder V1, Coder V2, Coder-Instruct | 1.3B, 6.7B, 33B (V1); 16B, 236B (V2 MoE) |
| MoE architecture research | DeepSeek-MoE, DeepSeekMoE-16B | 16B total (2.8B activated per token); research-scale |
| Specialised and multimodal | DeepSeek-VL, DeepSeek-Math, DeepSeek-Prover | Varies by task; typically in the 1.3B–7B range |

The specialised and multimodal variants deserve brief mention: DeepSeek-VL is a vision-language model that extends the V series to image understanding; DeepSeek-Math is a mathematics-specialised variant fine-tuned on a mathematical corpus; DeepSeek-Prover applies the reasoning approach to formal theorem proving. These are narrower in intended use than the main branches but represent the breadth of what the lab has released publicly. See the documentation index for the full navigation structure of this reference site, and the multi-family comparison page for how the DeepSeek catalog sits relative to Llama, Mistral, Qwen, and Gemma.

Frequently asked questions about DeepSeek models

Five common questions from readers navigating the full DeepSeek model catalog.

How many DeepSeek models have been released?

The DeepSeek family has produced dozens of public checkpoint releases across the V, R, Coder, MoE, and specialised branches. Counting base and instruct variants separately, plus multiple parameter sizes per generation, the total runs to several dozen distinct downloadable checkpoints. The Hugging Face organisation page for deepseek-ai is the most current enumeration; new variants are added with each generation release.
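If you want to enumerate the current set programmatically, the huggingface_hub client can list the organisation's public repositories. This is a sketch rather than a stable count, since the list changes with every release.

```python
# Sketch: enumerate public model repos under the deepseek-ai organisation.
# Requires `pip install huggingface_hub`; the count changes with each release.
from huggingface_hub import HfApi

api = HfApi()
models = list(api.list_models(author="deepseek-ai"))
print(f"{len(models)} public model repos under deepseek-ai")
for m in models[:10]:
    print(m.id)
```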

What does the version numbering mean in DeepSeek model names?

The major version number (V1, V2, V3) marks the generation of the general-purpose chat branch. The R prefix marks the reasoning branch derived from that generation. A trailing parameter count (7B, 32B, 671B) indicates approximate total parameters. The suffix -Instruct or -Chat indicates an instruction-tuned or chat-tuned variant; the suffix -Base, or no suffix at all, indicates the base pre-trained checkpoint. -Distill indicates a smaller model trained on outputs from a larger teacher.

What is the difference between DeepSeek-V3 and DeepSeek-R1?

DeepSeek V3 is the third-generation general-purpose chat model — a 671B total parameter MoE architecture with broad instruction-following and multilingual capability. DeepSeek R1 is the reasoning-specialised variant derived from that generation, fine-tuned with reinforcement learning to produce inference-time chain-of-thought. R1 is slower per response but consistently stronger on math, code, and complex multi-step reasoning. For everyday chat tasks, V3 is the right pick.

Where can I find older DeepSeek model versions?

All publicly released DeepSeek model checkpoints remain available on the Hugging Face deepseek-ai organisation page. Older generations are not removed when newer ones are published — V1 and V2 checkpoints are still accessible alongside V3 and R1. The GitHub organisation retains release tags for older inference code versions as well. The download reference page covers how to pull any generation's weights.
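As a minimal example of pulling an older generation's weights, snapshot_download from huggingface_hub fetches a full repository. The repo id below is an older-generation example; substitute whichever checkpoint you need, and point local_dir wherever the files should land.

```python
# Sketch: download an older generation's weights with huggingface_hub.
# The repo id below is an older-generation example; substitute the checkpoint
# you need, and set local_dir to wherever you want the files stored.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="deepseek-ai/deepseek-llm-7b-base",   # V1-era base checkpoint
    local_dir="./deepseek-llm-7b-base",
)
print("Weights downloaded to", path)
```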

What are the DeepSeek MoE models?

DeepSeek-MoE is the research line exploring mixture-of-experts architecture, with the DeepSeekMoE-16B paper being the most cited published work from that branch. The key insight is that routing each token to a subset of expert sub-networks allows a large total parameter count at lower compute cost per token than a dense model of the same size. The V3 flagship is itself an MoE architecture — the research line provided the foundational work that enabled the production V3 release at its scale.