How to run Llama 4, DeepSeek, and Mistral on your own hardware — no API keys, no data leaving your machine, full model control.

By the end of this guide, you'll know how to run capable open-weight models on your own hardware — with setup commands, hardware requirements, and the use cases where local beats cloud.
TL;DR: Ollama makes running AI models locally as simple as docker pull. Install in one command, pull Llama 4 Scout or DeepSeek V3, and run inference on your own GPU — no API keys, no data leaving your machine, no per-token costs. The tradeoff: you need decent hardware (16GB+ RAM for small models, 64GB+ for large ones) and quality lags behind hosted frontier models on complex reasoning. But for sensitive data, air-gapped environments, and unlimited inference at fixed cost, local is the right architecture.
I've been running local models since the first usable Llama 2 release. Back then, the setup process involved compiling llama.cpp from source, fighting CUDA drivers, and accepting that you'd spend two hours before seeing a single token generate. The results were mostly disappointing — slow, inaccurate, and not worth the friction unless you had very specific reasons.
That's no longer the situation.
Ollama changed local AI in the same way Docker changed local infrastructure. One binary, one command, and you have a model running. The models themselves — Llama 4 Scout, DeepSeek V3, Mistral, Codestral — are now genuinely useful. Not "impressive for local" useful. Actually useful.
This guide covers the full setup, hardware reality, and the honest analysis of when local beats cloud and when it doesn't.
Ollama is a local model runtime — a tool that downloads, manages, and serves AI models entirely on your hardware. It wraps the complexity of model formats, quantization, and inference engines (primarily llama.cpp and its derivatives) behind a clean CLI and an OpenAI-compatible API.
The key thing to understand: Ollama isn't a model. It's the runtime that runs models. Think of it like Docker — you install Docker once, then pull and run any container image. Ollama installs once, then you pull and run any supported model.
What this gives you: one-command installation, Docker-style model pulls, an OpenAI-compatible local API, and fully offline inference once a model is downloaded.
macOS and Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
Download the installer from ollama.com. It installs as a background service with a system tray icon.
That's the full installation. Ollama starts automatically and listens on http://localhost:11434.
Verify it's running:
ollama --version
# ollama version 0.6.x
The model pull syntax mirrors Docker:
# Llama 4 Scout — Meta's efficient MoE model (17B active parameters)
ollama pull llama4-scout
# DeepSeek V3 — strong coding + reasoning, MIT license
ollama pull deepseek-v3
# Mistral 7B — fast, lean, good for most tasks
ollama pull mistral
# Codestral — Mistral's dedicated code model
ollama pull codestral
# Gemma 3 27B — Google's capable mid-size model
ollama pull gemma3:27b
# Phi-4 Mini — Microsoft's tiny but capable 3.8B model
ollama pull phi4-mini
Run a model interactively:
ollama run llama4-scout
Or send a one-shot prompt:
ollama run mistral "Explain async/await in JavaScript in 3 sentences"
The OpenAI-compatible API endpoint:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama4-scout",
"messages": [{"role": "user", "content": "Hello"}]
}'
That endpoint means any tool built for OpenAI's API — including many agent frameworks — can point at Ollama with a one-line URL change.
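The same call works from any OpenAI-style client. A minimal Python sketch using only the standard library (the helper names are mine, not part of Ollama):

```python
import json
import urllib.request

def chat_payload(model, prompt):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt, model="llama4-scout", base="http://localhost:11434/v1"):
    """POST to the local Ollama endpoint and return the assistant's reply."""
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Swapping `base` for a hosted provider's URL (plus an API key header) is the entire migration path in either direction — which is why the OpenAI-compatible endpoint matters so much.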
This is the part most guides bury or get wrong. Local AI performance is almost entirely a function of RAM (system or VRAM). The number that matters: the model must fit in memory to run at useful speeds. If it doesn't fit, it offloads to system RAM and CPU, which drops you from 30+ tokens/second to 3-5 tokens/second — technically functional, practically painful.
| Model Size | RAM Required | Recommended GPU | Tokens/sec (GPU) | Tokens/sec (CPU) |
|---|---|---|---|---|
| 3B-4B params | 4-6GB | Any modern GPU | 60-120 t/s | 10-20 t/s |
| 7B-8B params | 8-10GB | 8GB VRAM GPU | 30-60 t/s | 5-12 t/s |
| 13B-14B params | 16GB | 16GB VRAM GPU | 20-40 t/s | 3-7 t/s |
| 27B-34B params | 32-40GB | 24GB+ VRAM GPU | 15-25 t/s | 2-4 t/s |
| 70B params | 48-64GB | 2x 24GB GPU | 8-15 t/s | 1-2 t/s |
| 405B+ params | 256GB+ | Multi-GPU setup | 4-8 t/s | Not viable |
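The RAM column above can be roughly reproduced from parameter count and quantization width. A back-of-envelope sketch (the overhead fraction is my assumption, not a measured value):

```python
def model_memory_gb(params_billions, bits=4, overhead_frac=0.2):
    """Estimate memory for a quantized model: weight bytes plus a fudge
    factor for KV cache and runtime buffers (overhead_frac is an assumption)."""
    weights_gb = params_billions * bits / 8
    return weights_gb * (1 + overhead_frac)

# A 4-bit 7B model needs roughly 4-5GB for the model itself; the table's
# RAM column adds headroom for the OS, the context window, and other apps.
```

If `model_memory_gb` comes out above your VRAM, expect the CPU-column speeds, not the GPU-column ones.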
Practical minimum for useful work: 16GB system RAM + any discrete GPU with 8GB+ VRAM. This runs 7B-8B models (Mistral, Llama 4 Scout's efficient config, Phi-4 Mini) at comfortable speeds.
Sweet spot for most developers: 32-64GB system RAM + NVIDIA RTX 4090 (24GB VRAM) or AMD RX 7900 XTX (24GB VRAM). Runs 27B-34B models fully on GPU.
Mac note: Apple Silicon uses unified memory — the M3 Max with 96GB runs 70B models well. M2/M3 with 32GB runs 27B-34B models comfortably. Metal acceleration works natively in Ollama.
Storage: Models range from 4GB (quantized 7B) to 140GB+ (70B at full precision), and the largest open models exceed 400GB. Budget 500GB+ SSD space if you plan to experiment with multiple models.
I run Ollama on three different setups. Here's what real-world inference looks like:
Budget setup (16GB RAM, GTX 1080 8GB VRAM): runs 7B-8B models fully on GPU at comfortable speeds; anything larger spills to CPU and crawls.
Mid-range setup (64GB RAM, RTX 3090 24GB VRAM): runs 27B-34B models fully on GPU; 70B models run, but only partially offloaded and noticeably slower.
High-end setup (128GB RAM, 2x RTX 4090): runs 70B models split across both GPUs at interactive speeds.
This is where I've found local models genuinely superior to cloud alternatives:
Sensitive code and proprietary logic. When you're working with code that contains business logic, API keys in context, or unreleased product details — running it through a cloud API means that data touches someone else's infrastructure. Locally, the code stays on your machine. For any client work with NDA requirements, this isn't optional.
Air-gapped and offline environments. Factory floors, secure government networks, planes, servers without outbound internet. Once the model is pulled, Ollama runs completely offline. No connectivity dependency.
Unlimited inference workflows. Batch processing 50,000 documents. Running an evaluation loop 10,000 times. Testing a prompt variation 500 ways. Cloud APIs charge per token and rate-limit aggressively. Locally, you run until you're done. I use this pattern constantly when building and testing agent workflows in my ACOS setup.
Fine-tuning and model experimentation. You can't fine-tune Claude or GPT-4o on your own data via API. Locally, you can fine-tune Llama, Mistral, or Gemma on your specific dataset and run the result through Ollama. This is the path to truly customized models.
Cost-sensitive high-volume applications. At scale, per-token API costs compound fast. A setup processing 100M tokens/day at $5/1M tokens is $500/day — $182,500/year. One-time GPU hardware at $5,000-10,000 amortizes in weeks for that volume.
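That amortization math is simple enough to sketch (function names are mine; electricity and maintenance are ignored, which lengthens the real breakeven somewhat):

```python
def daily_api_cost(tokens_per_day, usd_per_million_tokens):
    """Daily API spend at a flat per-token rate."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens

def breakeven_days(hardware_usd, tokens_per_day, usd_per_million_tokens):
    """Days until a one-time hardware cost equals cumulative API spend."""
    return hardware_usd / daily_api_cost(tokens_per_day, usd_per_million_tokens)

# The volume from the example above: 100M tokens/day at $5/1M tokens
daily = daily_api_cost(100_000_000, 5)         # 500.0 dollars/day
yearly = daily * 365                           # 182,500 dollars/year
days = breakeven_days(2_000, 100_000_000, 5)   # 4.0 days for a $2,000 GPU
```

At hobbyist volumes (a few million tokens a month) the same arithmetic stretches breakeven into years — which is exactly the split the cost section below lands on.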
Local models have real limitations. I'm not pretending otherwise.
Complex multi-step reasoning. On tasks requiring extended chains of inference — difficult math, multi-step logical deduction, PhD-level analysis — Claude Opus, GPT-4o, and Gemini Ultra maintain a meaningful lead over local models in the 7B-70B range. The frontier models at hosted providers are still ahead on pure reasoning depth. See my full frontier model comparison for the benchmarks.
Multimodal tasks. Vision understanding, image generation, audio — local multimodal models exist but lag significantly behind hosted options like GPT-4o Vision or Gemini 1.5 Pro on complex visual reasoning.
Tool use and function calling. Hosted models have had more optimization for reliable, structured function calling. Local models can do it but require more careful prompting and produce less consistent JSON outputs.
Latest frontier capabilities. When Anthropic ships Claude Opus 5 or OpenAI ships o4, you get access immediately via API. With local models, you wait for the open-weight release cycle — which can be months behind or never for some capabilities.
One of the most practical local AI use cases for developers: using Ollama as a backend for AI coding tools. Tools like Cline, Roo Code, and similar extensions support configuring custom OpenAI-compatible endpoints.
In VS Code with Cline:
- Base URL: http://localhost:11434/v1
- API key: ollama (any string works)
- Model: codestral or deepseek-v3

Now your coding agent runs on local inference. Every file you share, every suggestion, every refactor — processed on your hardware. For enterprise developers working with proprietary codebases, this changes the risk profile entirely.
Codestral is my default for local coding work. At 22B parameters, it fits on a 24GB GPU and produces code quality that outperforms many much larger general-purpose models on focused coding tasks.
For the broader picture on AI models and model selection, the DeepSeek R1 open-weight analysis covers why some open models now genuinely compete on specific reasoning tasks.
The math depends heavily on your usage volume. Here's a rough framework:
Hardware costs (one-time): roughly $2,000 for an RTX 4090, and $5,000-10,000 for a multi-GPU workstation capable of 70B-class models.
API costs (monthly, for comparison): at roughly $5 per 1M tokens, 10M tokens/month costs $50; 100M costs $500.
Breakeven for RTX 4090 ($2,000 hardware): at that rate, $2,000 buys 400M API tokens — days at heavy production volume, years at hobbyist volume.
The honest conclusion: If you're using AI models for personal learning, occasional coding, or light experimentation — cloud APIs are cheaper and better quality. If you're running production workloads with sensitive data, high volume, or offline requirements — local infrastructure pays for itself fast.
I run both. Cloud for the tasks that demand frontier quality (complex research, long-context reasoning, nuanced writing). Local for privacy-sensitive work, batch processing, and the experiments I want to run without watching a cost meter.
The generative AI research hub has more on evaluating model trade-offs across tasks.
For developers getting started:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a fast, capable small model
ollama pull phi4-mini
# Pull a strong code model
ollama pull codestral
# Test it
ollama run phi4-mini "Write a Python function that validates email addresses"
Start with phi4-mini (3.8B) or mistral (7B) to verify your setup works before pulling larger models. They run fast on modest hardware and give you immediate feedback.
For production-grade local inference:
Consider ollama serve with the environment variable OLLAMA_HOST=0.0.0.0 to expose the API on your local network. Combine with Nginx for SSL termination and basic auth if exposing to a team.
For memory management across multiple models, Ollama keeps loaded models in memory for 5 minutes of inactivity by default. Set OLLAMA_KEEP_ALIVE=0 to unload immediately after each request if you're memory-constrained.
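Both settings can be combined when launching a shared instance. A minimal sketch, assuming a manual (non-service) launch:

```shell
# Listen on all interfaces and free memory immediately after each request.
# If Ollama runs as a managed Linux service, set these in the service
# environment instead of on the command line.
OLLAMA_HOST=0.0.0.0 OLLAMA_KEEP_ALIVE=0 ollama serve
```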
Does Ollama work on Windows?
Yes. Ollama has native Windows support via an installer. GPU acceleration works on NVIDIA cards with CUDA support. AMD GPU support on Windows is available but less mature than Linux.
Can I run multiple models simultaneously?
Ollama can load multiple models, but they compete for the same VRAM. In practice, you typically run one model at a time unless you have significant GPU memory (48GB+). The CLI switches between models seamlessly — just run ollama run <different-model> and it unloads the current one.
How does quantization affect quality?
Ollama defaults to Q4_K_M quantization for most models — a 4-bit format that reduces the original 16-bit model by ~75% in size with minimal quality degradation on most tasks. You can pull higher-quality quantizations (Q8, fp16) if you have the VRAM: ollama pull llama4-scout:q8_0.
Is the Ollama API fully OpenAI-compatible?
For chat completions and embeddings: largely yes. For function calling, streaming, and tool use: it depends on the model. Smaller models often produce malformed JSON for tool calls. Codestral and DeepSeek V3 are the most reliable local options for structured output tasks.
What happens to my data when I use Ollama?
Nothing leaves your machine. Ollama is a local process with no telemetry beyond optional, opt-in crash reports. The model weights are stored locally. Your prompts and responses exist only in memory and, optionally, your application's logs.
Local AI has crossed the threshold from "technically interesting" to "production-viable for specific use cases." Ollama is the reason that's accessible without a PhD in ML systems. Install it, pull a model, and see what fits your workflow. The privacy properties alone make it worth having in your toolkit — even if you keep cloud APIs for the tasks that demand their quality ceiling.
The right answer, for most developers, is both. Know where each belongs.