Compact Overview
DeepSeek V3 scores in the high-80s to low-90s on MMLU, posts top-tier open-weight results on HumanEval and GSM8K, and places competitively on the LMSYS Chatbot Arena. DeepSeek R1 leads the family on reasoning-specific benchmarks including MATH and competition-mathematics datasets. All benchmark scores should be treated as directional signals, not procurement verdicts — they age fast and may not reflect your specific workload.
Why public benchmarks matter — and where they fall short
Public benchmarks give a standardised starting point for model comparison but cannot substitute for task-specific evaluation. They measure proxies for capability, not capability itself.
When the DeepSeek team publishes a new model, the first thing most practitioners look at is the benchmark table. MMLU scores, HumanEval pass rates, GSM8K accuracy — these numbers provide a fast first filter for whether a new release is worth evaluating in depth. They also enable comparisons across model families that would otherwise require running every model on every task yourself, which is impractical at the pace the field moves. Public benchmarks serve a real function, and ignoring them entirely is not the right response to their limitations.
The limitations are real nonetheless. Static benchmark test sets are fixed while training corpora grow; as training data increasingly overlaps with benchmark test items — even inadvertently — scores inflate in ways that do not reflect genuine capability improvements. Frontier models from multiple families have shown benchmark saturation effects on MMLU and similar tests, where scores cluster near the ceiling and fine-grained differences become statistically meaningless. And benchmarks measure the tasks they contain, not the tasks your product actually runs — a model that scores 92 on MMLU may still fail consistently on the specific domain-adapted prompts your application sends. The NIST AI Risk Management Framework provides a useful structure for thinking about AI evaluation beyond benchmark scores, including in regulated contexts.
MMLU: general knowledge and multi-domain reasoning
MMLU measures broad academic knowledge across 57 domains; DeepSeek V3 scores in the high-80s to low-90s, placing it in the frontier tier of open-weight models.
The Massive Multitask Language Understanding benchmark consists of multiple-choice questions drawn from 57 academic and professional domains: mathematics, law, medicine, history, computer science, and more. It is a good signal of whether a model has absorbed broad factual knowledge and can apply it in an exam-style context. At the frontier tier, MMLU scores are approaching saturation — the spread between the leading open-weight and closed commercial models is now small enough that a 1–2 point delta is unlikely to predict meaningful differences on real-world tasks.
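For concreteness, here is a minimal sketch of how a 5-shot MMLU prompt is typically assembled. The header string follows the original MMLU evaluation convention; the field names (`question`, `choices`, `answer`) mirror common dataset loaders and are assumptions, not a DeepSeek-specific format.

```python
# Minimal 5-shot MMLU prompt construction. The header follows the original
# MMLU eval convention; field names mirror common loaders (an assumption).
LETTERS = "ABCD"

def format_example(ex: dict, include_answer: bool = True) -> str:
    lines = [ex["question"]]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(ex["choices"])]
    # `answer` is the index of the correct choice; the model is scored on
    # whether it completes the final "Answer:" line with the right letter.
    lines.append(f"Answer: {LETTERS[ex['answer']]}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_5shot_prompt(dev_examples: list[dict], test_example: dict, subject: str) -> str:
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_example(ex) for ex in dev_examples[:5])
    return header + shots + "\n\n" + format_example(test_example, include_answer=False)
```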
DeepSeek V3 posts scores in the high-80s to low-90s on MMLU 5-shot, consistent with other frontier-class models released in the same period. DeepSeek R1, which applies inference-time chain-of-thought, shows gains on the reasoning-heavy MMLU subsets — science, mathematics, and professional law — where the structured thinking approach has more room to help than on pure recall questions. For most product teams, the practical interpretation of these MMLU scores is that DeepSeek V3 and R1 are competitive with any open-weight alternative on general-knowledge tasks and should not be pre-disqualified on MMLU grounds.
HumanEval, GSM8K, and MATH: code and mathematical reasoning
DeepSeek Coder leads the family on HumanEval; DeepSeek R1 leads on GSM8K and MATH. Both post results competitive with the best open-weight alternatives in their respective domains.
HumanEval is a 164-problem Python code-synthesis benchmark that measures pass@1 — the fraction of problems solved correctly on the first generation attempt. DeepSeek Coder consistently places at the top of the open-weight code model rankings on this benchmark, and DeepSeek V3 also scores well above the open-weight average despite being a general-purpose model. The HumanEval scores for Coder have been cited as evidence that specialised pretraining on a code corpus pays off: Coder outperforms general-purpose models of comparable or larger parameter counts on this task class.
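The pass@1 metric generalises to pass@k, and the unbiased estimator from the original HumanEval paper (Chen et al., 2021) is compact enough to show in full:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper: generate n
    samples per problem, count the c samples that pass the unit tests,
    then estimate the chance that at least one of k draws would pass."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With 10 samples of which 3 pass, pass@1 reduces to c/n:
print(pass_at_k(10, 3, 1))  # 0.3 (up to floating-point error)
```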
GSM8K is an 8,500-problem grade-school mathematics benchmark that tests multi-step arithmetic and word-problem reasoning. DeepSeek R1's chain-of-thought approach is particularly well-suited to this task: generating intermediate reasoning steps lets the model verify arithmetic before committing to an answer, which reduces the simple calculation errors that plague single-pass generation on multi-step problems. R1 posts near-ceiling scores on GSM8K. MATH, a harder competition-mathematics dataset, is more discriminating: at that difficulty level, R1's reasoning traces are the primary mechanism keeping scores competitive with the best closed models.
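Scoring a chain-of-thought generation against GSM8K means extracting the final number from free-form text. GSM8K reference solutions end with a `#### <answer>` marker; falling back to the last number in the model's output is a common convention in open eval harnesses, not an official rule. A minimal sketch:

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Pull the final numeric answer from a solution. GSM8K reference
    answers end with '#### <number>'; for model generations we fall back
    to the last number in the text, a common open-harness convention."""
    marker = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    if marker:
        return marker.group(1).replace(",", "")
    numbers = re.findall(r"[-+]?[\d,]*\.?\d+", text)
    return numbers[-1].replace(",", "") if numbers else None

reference = "16 - 3 - 4 = 9 eggs remain, and 9 * 2 = 18.\n#### 18"
generation = "First, 16 - 3 - 4 = 9 eggs remain. Then 9 * $2 = $18, so the answer is 18."
print(extract_final_answer(generation) == extract_final_answer(reference))  # True
```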
Multilingual benchmarks and coverage
DeepSeek V3 performs well on Chinese-language benchmarks including C-Eval and CLUEWSC, and shows competitive results on major European language evaluations.
The DeepSeek training corpus is weighted toward English and Chinese, and this weighting is visible in the benchmark profile. On C-Eval — a comprehensive Chinese academic knowledge benchmark — and on CLUEWSC — a Winograd-schema-style coreference benchmark in Chinese — DeepSeek V3 consistently scores near the top of the open-weight field. This reflects both the training data weighting and the fact that few open-weight competitors have been optimised for Chinese-language performance as deliberately as DeepSeek has.
For European languages — French, German, Spanish, Italian — performance is strong on translation and instruction-following tasks but less well-documented on dedicated native-language academic benchmarks. Teams building multilingual products for these languages should run language-specific evaluations on their own task distribution rather than extrapolating from English or Chinese benchmark numbers. The family's multilingual coverage is a genuine strength relative to many open-weight alternatives, but it is not uniformly excellent across all languages in the training distribution.
Reasoning benchmarks specific to DeepSeek R1
R1's chain-of-thought approach enables scores on competition-level math and formal reasoning datasets that single-pass models of comparable size cannot match.
Beyond GSM8K and MATH, DeepSeek R1 has published results on harder reasoning datasets including AMC and AIME (the American Mathematics Competitions and the American Invitational Mathematics Examination) and competition programming benchmarks. These tasks are specifically designed to require multi-step reasoning where simple pattern completion fails. R1's thinking traces allow it to attempt systematic approaches to these problems rather than guessing based on surface features — which is why R1's scores on hard competition datasets separate it from V3 more sharply than MMLU scores do.
Hard reasoning benchmark scores are also particularly sensitive to prompting and sampling choices. The same model can produce substantially different scores depending on whether the system prompt encourages or discourages explicit reasoning, whether examples in few-shot prompts are problem-relevant, and whether the temperature and sampling settings are aligned with the benchmark's reference configuration. When comparing R1's published reasoning scores against other models, verify that the evaluation settings are comparable — the model card's methodology section is the right place to check.
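A practical defence is to pin and report every setting alongside every score. A minimal sketch of such a record; the field names and values are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalConfig:
    """Settings that can move reasoning-benchmark scores and should be
    reported with them. Illustrative fields, not a standard schema."""
    model: str
    temperature: float
    top_p: float
    num_fewshot: int
    max_new_tokens: int
    system_prompt: str
    seed: int

# Two runs of the same model that are not directly comparable:
greedy = EvalConfig("deepseek-r1", 0.0, 1.0, 0, 32768, "Think step by step.", 0)
sampled = EvalConfig("deepseek-r1", 0.6, 0.95, 4, 32768, "Answer concisely.", 0)
print(asdict(greedy))  # log the full config with every reported number
```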
The LMSYS leaderboard pattern
The LMSYS Chatbot Arena measures human preference rather than accuracy, making it a useful complement to automated benchmarks — and one where DeepSeek models have established competitive positions.
The LMSYS Chatbot Arena is a crowdsourced evaluation platform where real users rate anonymous model pairs on open-ended prompts. The resulting Elo-style ranking captures aggregate human preference across a wide distribution of query types rather than accuracy on a predetermined test set. This makes the Arena a useful sanity check on automated benchmarks: models that score well on standard evals but produce outputs that humans consistently disprefer will show a gap between their automated benchmark rank and their Arena rank.
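The ranking mechanics are worth understanding when reading Arena numbers. A textbook Elo update from a single pairwise vote looks like the sketch below; the Arena's production aggregation has evolved toward Bradley-Terry-style fitting, so treat this as the underlying idea rather than LMSYS's exact code:

```python
def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update from a pairwise vote. outcome = 1.0 if model A wins,
    0.0 if B wins, 0.5 for a tie. Textbook form, not LMSYS's exact method."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (outcome - expected_a)
    return r_a + delta, r_b - delta

# A 1200-rated model beating a 1250-rated one gains more than half of k:
print(elo_update(1200.0, 1250.0, 1.0))  # (~1218.3, ~1231.7)
```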
DeepSeek V3 and R1 have both appeared in competitive Arena positions, with R1 showing particular strength on the hard reasoning prompts submitted by technically engaged users. The Arena's open-ended nature also surfaces failure modes that static benchmarks do not: verbosity, over-qualification, instruction drift across long turns, and formatting inconsistencies are all more visible in open-ended human evaluation than in multiple-choice accuracy tests. For teams trying to estimate what real users will experience with a model, the Arena leaderboard is a more relevant signal than MMLU.
Why benchmark scores age fast
Three factors cause benchmark scores to become less informative over time: saturation, contamination, and benchmark-task mismatch. Understanding each helps calibrate how much weight to place on published numbers.
Saturation happens when scores cluster near the ceiling and fine-grained differences become statistically meaningless. MMLU is substantially saturated at the frontier tier — the difference between a 90.5 and a 91.2 on MMLU does not predict any measurable difference on real-world tasks. Teams making procurement decisions based on small MMLU deltas are reading precision that the benchmark does not actually contain. Harder benchmarks — competition-mathematics, multi-step reasoning, SWE-bench — are less saturated and carry more signal per percentage point.
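The statistical point is easy to check. Under a normal approximation, the sampling noise on an accuracy measured over the roughly 14,000 MMLU test questions (treat the exact count as an assumption) is about half a point at 95% confidence, and the noise on a difference between two independent runs is larger still:

```python
import math

def accuracy_ci(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% CI for an accuracy p
    measured on n independent questions."""
    return z * math.sqrt(p * (1.0 - p) / n)

n_mmlu = 14042  # approximate MMLU test-set size; treat as an assumption
half = accuracy_ci(0.905, n_mmlu)
print(f"90.5 +/- {100 * half:.2f} points (95% CI)")  # roughly +/- 0.48
# The noise on a *difference* between two independent runs is ~sqrt(2) larger:
print(f"delta noise ~ +/- {100 * half * math.sqrt(2):.2f} points")  # ~0.69
```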
Contamination is the subtler problem. As the community publishes benchmark test sets, they become part of the publicly available text that future training runs ingest — intentionally in some cases, incidentally in most. A model trained on text that includes solved versions of benchmark problems will score better on those problems without being genuinely more capable. The DeepSeek technical reports discuss contamination detection approaches (a minimal screening sketch follows below), but the problem is hard to eliminate entirely and should be factored into interpretation. Studies on evaluation methodology from Stanford CRFM provide useful academic context on this issue.

Benchmark-task mismatch is the simplest factor: the tasks your product runs are not the tasks benchmarks measure. Running your own evaluation harness on a representative sample of your production prompt distribution remains the most reliable way to predict model quality for your specific application, regardless of how any benchmark table reads.
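On the contamination point, one widely used screening approach is checking for verbatim n-gram overlap between benchmark items and training documents. The sketch below uses word-level 13-grams, the window size used in GPT-3's contamination analysis; it illustrates the generic technique and is not DeepSeek's reported method:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams; 13-grams are a common choice in contamination
    screens (e.g., GPT-3's analysis), used here as an assumption."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(benchmark_item: str, training_doc: str, n: int = 13) -> float:
    """Fraction of the benchmark item's n-grams appearing verbatim in a
    training document. High values flag the item as possibly contaminated."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)
```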
Public benchmarks used to evaluate the DeepSeek family, with typical score class and measurement scope
| Benchmark | What it measures | DeepSeek typical score class |
| --- | --- | --- |
| MMLU (5-shot) | Academic and professional knowledge across 57 domains (multiple choice) | High-80s to low-90s for V3; frontier-tier competitive |
| HumanEval (pass@1) | Python function synthesis from docstrings | Top-tier open-weight for Coder; strong for V3 |
| GSM8K | Grade-school multi-step mathematics word problems | Near-ceiling for R1; strong for V3 |
| MATH | Competition-level mathematics (AMC/AIME difficulty) | R1 leads; V3 competitive with large open-weight alternatives |
| C-Eval / CLUEWSC | Chinese academic knowledge and coreference (Chinese-language) | Top-tier among open-weight models with Chinese corpus emphasis |
| LMSYS Chatbot Arena | Human preference over open-ended prompts (Elo ranking) | V3 and R1 both in competitive Arena positions; R1 strong on hard reasoning prompts |
How to read DeepSeek benchmark numbers responsibly
Benchmark tables published with model releases are produced by the model authors, which creates an incentive toward favourable presentation. This is not unique to DeepSeek — every major model release does the same — but it is worth keeping in mind. The DeepSeek technical reports are generally more transparent than average about evaluation methodology, failure modes, and known limitations, and they should be consulted alongside the headline numbers. For benchmark coverage from independent third parties, the Open LLM Leaderboard on Hugging Face, the LMSYS Chatbot Arena, and academic papers from independent evaluation groups provide useful secondary sources that have no stake in the outcome.
For the models covered on this reference site — V3, R1, and Coder — the benchmark story is consistent: these are frontier-class models whose scores sit in the top tier of open-weight alternatives across the major evaluation suites. The differences between DeepSeek and the nearest open-weight competitors are often small on saturated benchmarks and more significant on harder reasoning and code-specific evals. For teams making a build-versus-buy decision, the benchmark data strongly supports the conclusion that DeepSeek is a viable open-weight alternative to leading closed APIs for most workloads; for the remaining workloads, the right path is a targeted evaluation on your own task sample before committing.
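That targeted evaluation does not need heavy tooling. A minimal harness over a JSONL sample of production prompts can look like the sketch below; the `prompt`/`reference` schema and the `generate`/`grade` callables are placeholders you would supply, not a standard interface:

```python
import json
from typing import Callable

def run_eval(
    prompts_path: str,
    generate: Callable[[str], str],
    grade: Callable[[str, str], bool],
) -> float:
    """Score a model on a JSONL sample of your own production prompts.
    The prompt/reference schema and both callables are placeholders for
    whatever client and grading logic your task actually needs."""
    correct = total = 0
    with open(prompts_path) as f:
        for line in f:
            case = json.loads(line)
            output = generate(case["prompt"])
            correct += int(grade(output, case["reference"]))
            total += 1
    return correct / total if total else 0.0

# Usage sketch (names hypothetical):
# score = run_eval("prod_sample.jsonl", my_client.complete, exact_match)
```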