Where DeepSeek weights live
The primary distribution point for DeepSeek model weights is the DeepSeek organisation on Hugging Face; GitHub releases carry code and tooling, not the raw weight files themselves.
When people search for a DeepSeek download, they are usually looking for one of three things: the full-precision base weights for fine-tuning or research, the instruction-tuned chat variant for inference, or a quantised GGUF or AWQ version that fits on consumer hardware. Each of those lives in a different repository on Hugging Face, and the naming conventions distinguish them clearly once you know what to look for.
The Hugging Face organisation page for DeepSeek lists every public release in reverse-chronological order. Each model has at least a base checkpoint repo and an instruct-tuned variant. Community contributors — typically established quantisers with long track records on the platform — maintain parallel repos containing GGUF files at multiple quantisation levels. These community repos are not maintained by the upstream lab but are widely used in the open-weight community because the full-precision files are too large for most individual hardware.
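When only a single quantisation level is needed from one of those community repos, the hub CLI can filter the download with an include pattern rather than pulling every file. A minimal sketch — the repo ID below is a placeholder, not a real repository; substitute the GGUF repo you actually find on the organisation page:

```bash
# Download only the Q4_K_M file from a community GGUF repo.
# "some-quantiser/DeepSeek-V3-GGUF" is an illustrative placeholder repo ID.
huggingface-cli download some-quantiser/DeepSeek-V3-GGUF \
  --include "*Q4_K_M*.gguf" \
  --local-dir ./models/deepseek-v3-gguf
```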
File naming conventions
DeepSeek weight filenames encode the model family, parameter class, variant type, and shard index in a consistent pattern that mirrors how other Hugging Face-hosted open-weight families name their files.
A typical base checkpoint filename reads something like model-00001-of-00030.safetensors — a zero-padded shard index followed by the total shard count. The safetensors format is the current default because it strips the Python pickle attack surface present in older .bin format checkpoints. For GGUF community quantisations, the filename typically includes the quantisation level: DeepSeek-V3-Q4_K_M.gguf identifies a V3 model file quantised with the Q4_K_M method, which is the standard 4-bit variant that balances size and quality for most use cases.
The parameter class usually appears either in the repository name or in the filename prefix. A 7B variant and a 67B variant will differ in the repo slug; within a single repo, different tensor-parallel sharding configurations will show up as different shard counts in the multi-part filenames. The config.json and tokenizer files in the root of each repo are the authoritative source on what the checkpoint actually contains.
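One way to check what a checkpoint actually contains before committing to a multi-hundred-gigabyte download is to pull just the config file. A sketch, assuming the deepseek-ai/DeepSeek-V3 repo ID and a local jq install:

```bash
# Fetch only config.json and print the architecture fields that matter for sizing.
huggingface-cli download deepseek-ai/DeepSeek-V3 config.json --local-dir ./inspect
jq '{architectures, hidden_size, num_hidden_layers, num_attention_heads}' ./inspect/config.json
```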
Field Notes
The safest download path for most developers is the Hugging Face hub CLI with huggingface-cli download <repo-id>. It handles shard integrity automatically, retries failed chunks, and stores files in a local cache that subsequent tool loads can reuse without re-downloading. For large flagship models, set --local-dir to a drive with enough headroom before starting.
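A typical invocation looks like the following — the repo ID and target path are illustrative; point --local-dir at whichever drive has the space:

```bash
# Download every shard of a repo into an explicit directory.
huggingface-cli download deepseek-ai/deepseek-llm-7b-chat \
  --local-dir /mnt/models/deepseek-llm-7b-chat
```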
Integrity verification
Hugging Face stores weight files through Git LFS, which records a SHA-256 hash for every large file. The safetensors format embeds a header containing the tensor names, dtypes, and shapes, which loaders like Transformers and vLLM validate on open. For a complete download-level check, compare the SHA-256 hash of each downloaded file against the value shown in the file's Git LFS details on the Hub, or against a published checksum manifest when one is provided.
On Linux or macOS the one-liner is sha256sum *.safetensors > local_checksums.txt followed by a diff against the upstream manifest. Windows users can use Get-FileHash in PowerShell. For GGUF files, llama.cpp's built-in model verification flag catches most loading-time corruption before inference starts.
The practical risk is not malicious tampering in most contexts — it is corrupt downloads from network interruption. Large shard files over slow connections are prone to truncation, which safetensors loaders will catch on the first load attempt. If a shard fails to load, delete just that shard file and re-run the hub CLI; it will re-download only the missing piece.
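Putting the two steps together — hashing the shards and re-fetching a bad one — looks roughly like the sketch below. The manifest filename, shard name, and repo ID are all illustrative:

```bash
# Hash every local shard and compare against an upstream manifest, if one is published.
sha256sum *.safetensors > local_checksums.txt
diff local_checksums.txt upstream_checksums.txt   # illustrative manifest name

# If one shard is corrupt, delete just that file and re-run the same download;
# files already present and complete are skipped.
rm model-00007-of-00030.safetensors               # illustrative shard name
huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir .
```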
Getting started with self-hosted inference
Once the weight files are on disk, the most common next step is loading them into one of three runtimes. For individual developers, Ollama is the lowest-friction path: ollama pull deepseek-r1:7b handles the download, and the Ollama service exposes a local API on port 11434 with no additional configuration. llama.cpp offers the most hardware-tuning knobs, including layer offloading to mix CPU and GPU memory. vLLM is the production-grade choice for higher-throughput multi-GPU deployments. Teams deploying in regulated contexts may also want to review NIST's AI risk-management guidance.
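In concrete terms, each of the three paths looks roughly like the following sketch. The model tags, file paths, and flag values are illustrative, and the llama.cpp binary name varies by build (older releases ship ./main rather than llama-cli):

```bash
# 1. Ollama: pull the model, then query the local API on port 11434.
ollama pull deepseek-r1:7b
curl http://localhost:11434/api/generate \
  -d '{"model": "deepseek-r1:7b", "prompt": "Hello", "stream": false}'

# 2. llama.cpp: run a GGUF file, offloading some layers to the GPU (-ngl).
llama-cli -m ./DeepSeek-V3-Q4_K_M.gguf -ngl 35 -p "Hello"

# 3. vLLM: OpenAI-compatible server, sharded across GPUs with tensor parallelism.
vllm serve deepseek-ai/deepseek-llm-7b-chat --tensor-parallel-size 2
```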
For projects using the DeepSeek GitHub repositories directly, the inference scripts in the repo expect the weight directory to follow the Hugging Face directory layout — config.json, tokenizer files, and sharded safetensors all in the same directory. The README in each inference repo notes any version-specific requirements.
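The expected layout, sketched with an illustrative directory name and shard count:

```
deepseek-model/
├── config.json
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json
├── model-00001-of-00030.safetensors
├── ...
└── model-00030-of-00030.safetensors
```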
See also the DeepSeek AI free access page for hosted alternatives that avoid the download step entirely, and the documentation index for the full reference structure of this site.
DeepSeek download: file pattern reference
| File pattern | Content | Typical use |
| --- | --- | --- |
| model-NNNNN-of-MMMMM.safetensors | Full-precision base or instruct checkpoint shard | Fine-tuning, research, multi-GPU inference via Transformers or vLLM |
| DeepSeek-*-Q4_K_M.gguf | 4-bit quantised single-file model (GGUF) | Consumer GPU or CPU inference via llama.cpp, Ollama, LM Studio |
| config.json | Model architecture config: hidden size, layers, attention heads | Required by all loaders; also used by eval harnesses for metadata |
| tokenizer.json / tokenizer_config.json | Tokeniser vocabulary and special-token map | Required for text encoding before inference and decoding after |
| generation_config.json | Default sampling parameters: temperature, top-p, repetition penalty | Used by Transformers pipeline as inference defaults; override per request |
A note on disk space planning: the full-precision V3 flagship checkpoint runs to roughly 600 GB across its shards, and the instruction-tuned variant is the same size. Most developers working outside a data centre choose the Q4_K_M GGUF quantisation of the 7B or 32B variant, which fits in 5–20 GB depending on the parameter class. Storage cost is the first filter when deciding which build to pull.
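A quick pre-flight check before starting a large pull — the mount point and directory below are illustrative:

```bash
# Confirm the target drive has headroom before starting, and measure usage afterwards.
df -h /mnt/models
du -sh /mnt/models/deepseek-llm-7b-chat
```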
Pascale O. Olabintan, Embedded Engineer at Goldfern Fabric Works in Tucson, AZ, notes: "The GGUF variants let us prototype on the same laptop we write firmware on. We pull the Q4_K_M build overnight once and reuse it for months of local inference without network dependency."