Creator SystemsMarch 21, 202612 min read

Running Local AI Models with Ollama: The Privacy Guide

How to run Llama 4, DeepSeek, and Mistral on your own hardware — no API keys, no data leaving your machine, full model control.

FrankX

AI Architect & Creator Systems Builder

Share Share

Reading Goal

You will know how to run frontier AI models on your own hardware — with setup commands, hardware requirements, and use cases where local beats cloud.

TL;DR: Ollama makes running AI models locally as simple as docker pull. Install in one command, pull Llama 4 Scout or DeepSeek V3, and run inference on your own GPU — no API keys, no data leaving your machine, no per-token costs. The tradeoff: you need decent hardware (16GB+ RAM for small models, 64GB+ for large ones) and quality lags behind hosted frontier models on complex reasoning. But for sensitive data, air-gapped environments, and unlimited inference at fixed cost, local is the right architecture.

I've been running local models since the first usable Llama 2 release. Back then, the setup process involved compiling llama.cpp from source, fighting CUDA drivers, and accepting that you'd spend two hours before seeing a single token generate. The results were mostly disappointing — slow, inaccurate, and not worth the friction unless you had very specific reasons.

That's no longer the situation.

Ollama changed local AI in the same way Docker changed local infrastructure. One binary, one command, and you have a model running. The models themselves — Llama 4 Scout, DeepSeek V3, Mistral, Codestral — are now genuinely useful. Not "impressive for local" useful. Actually useful.

This guide covers the full setup, hardware reality, and the honest analysis of when local beats cloud and when it doesn't.

What Is Ollama?

Ollama is a local model runtime — a tool that downloads, manages, and serves AI models entirely on your hardware. It wraps the complexity of model formats, quantization, and inference engines (primarily llama.cpp and its derivatives) behind a clean CLI and an OpenAI-compatible API.

The key thing to understand: Ollama isn't a model. It's the runtime that runs models. Think of it like Docker — you install Docker once, then pull and run any container image. Ollama installs once, then you pull and run any supported model.

What this gives you:

Zero data egress: Every token stays on your machine. Your code, your documents, your prompts — none of it leaves.
No API costs: Pay for hardware once (or use what you have). Inference cost is electricity.
No rate limits: Run 1,000 inference calls in a loop at 2am. Nobody cares.
Model control: Pin exact model versions. No silent updates, no capability regressions, no deprecations.
Offline capability: Works in air-gapped environments, planes, servers with no outbound internet.

Installation: One Command

macOS and Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com. It installs as a background service with a system tray icon.

That's the full installation. Ollama starts automatically and listens on http://localhost:11434.

Verify it's running:

ollama --version
# ollama version 0.6.x

Pulling Models

The model pull syntax mirrors Docker:

# Llama 4 Scout — Meta's efficient 17B MoE model
ollama pull llama4-scout

# DeepSeek V3 — strong coding + reasoning, MIT license
ollama pull deepseek-v3

# Mistral 7B — fast, lean, good for most tasks
ollama pull mistral

# Codestral — Mistral's dedicated code model
ollama pull codestral

# Gemma 3 27B — Google's capable mid-size model
ollama pull gemma3:27b

# Phi-4 Mini — Microsoft's tiny but capable 3.8B model
ollama pull phi4-mini

Run a model interactively:

ollama run llama4-scout

Or send a one-shot prompt:

ollama run mistral "Explain async/await in JavaScript in 3 sentences"

The OpenAI-compatible API endpoint:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4-scout",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

That endpoint means any tool built for OpenAI's API — including many agent frameworks — can point at Ollama with a one-line URL change.

Hardware Requirements: The Honest Table

This is the part most guides bury or get wrong. Local AI performance is almost entirely a function of RAM (system or VRAM). The number that matters: the model must fit in memory to run at useful speeds. If it doesn't fit, it offloads to system RAM and CPU, which drops you from 30+ tokens/second to 3-5 tokens/second — technically functional, practically painful.

Model Size	RAM Required	Recommended GPU	Tokens/sec (GPU)	Tokens/sec (CPU)
3B-4B params	4-6GB	Any modern GPU	60-120 t/s	10-20 t/s
7B-8B params	8-10GB	8GB VRAM GPU	30-60 t/s	5-12 t/s
13B-14B params	16GB	16GB VRAM GPU	20-40 t/s	3-7 t/s
27B-34B params	32-40GB	24GB+ VRAM GPU	15-25 t/s	2-4 t/s
70B params	48-64GB	2x 24GB GPU	8-15 t/s	1-2 t/s
405B+ params	256GB+	Multi-GPU setup	4-8 t/s	Not viable

Practical minimum for useful work: 16GB system RAM + any discrete GPU with 8GB+ VRAM. This runs 7B-8B models (Mistral, Llama 4 Scout's efficient config, Phi-4 Mini) at comfortable speeds.

Sweet spot for most developers: 32-64GB system RAM + NVIDIA RTX 4090 (24GB VRAM) or AMD RX 7900 XTX (24GB VRAM). Runs 27B-34B models fully on GPU.

Mac note: Apple Silicon uses unified memory — the M3 Max with 96GB runs 70B models well. M2/M3 with 32GB runs 27B-34B models comfortably. Metal acceleration works natively in Ollama.

Storage: Models range from 4GB (quantized 7B) to 400GB+ (70B in high quality). Budget 500GB+ SSD space if you plan to experiment with multiple models.

Performance Expectations by Hardware Tier

I run Ollama on three different setups. Here's what real-world inference looks like:

Budget setup (16GB RAM, GTX 1080 8GB VRAM):

7B models: ~25-35 t/s. Workable for most tasks. Response latency is acceptable.
13B models: ~8-12 t/s if partially offloaded. Noticeable lag on long responses.
Verdict: Good for personal projects, not production APIs.

Mid-range setup (64GB RAM, RTX 3090 24GB VRAM):

7B models: ~80-100 t/s. Faster than you can read.
27B models: ~30-45 t/s. Genuinely comfortable for interactive use.
70B models: ~8-12 t/s with partial CPU offload. Usable but slow.
Verdict: Excellent for most developer workflows.

High-end setup (128GB RAM, 2x RTX 4090):

27B models: ~60+ t/s. Competitive with hosted API response times.
70B models: ~25-35 t/s on full GPU. Fast enough for production.
Verdict: Matches hosted APIs on speed for most model sizes.

When Local Beats Cloud

This is where I've found local models genuinely superior to cloud alternatives:

Sensitive code and proprietary logic. When you're working with code that contains business logic, API keys in context, or unreleased product details — running it through a cloud API means that data touches someone else's infrastructure. Locally, the code stays on your machine. For any client work with NDA requirements, this isn't optional.

Air-gapped and offline environments. Factory floors, secure government networks, planes, servers without outbound internet. Once the model is pulled, Ollama runs completely offline. No connectivity dependency.

Unlimited inference workflows. Batch processing 50,000 documents. Running an evaluation loop 10,000 times. Testing a prompt variation 500 ways. Cloud APIs charge per token and rate-limit aggressively. Locally, you run until you're done. I use this pattern constantly when building and testing agent workflows in my ACOS setup.

Fine-tuning and model experimentation. You can't fine-tune Claude or GPT-4o on your own data via API. Locally, you can fine-tune Llama, Mistral, or Gemma on your specific dataset and run the result through Ollama. This is the path to truly customized models.

Cost-sensitive high-volume applications. At scale, per-token API costs compound fast. A setup processing 100M tokens/day at $5/1M tokens is $500/day — $182,500/year. One-time GPU hardware at $5,000-10,000 amortizes in weeks for that volume.

When Cloud Wins

Local models have real limitations. I'm not pretending otherwise.

Complex multi-step reasoning. On tasks requiring extended chains of inference — difficult math, multi-step logical deduction, PhD-level analysis — Claude Opus, GPT-4o, and Gemini Ultra maintain a meaningful lead over local models in the 7B-70B range. The frontier models at hosted providers are still ahead on pure reasoning depth. See my full frontier model comparison for the benchmarks.

Multimodal tasks. Vision understanding, image generation, audio — local multimodal models exist but lag significantly behind hosted options like GPT-4o Vision or Gemini 1.5 Pro on complex visual reasoning.

Tool use and function calling. Hosted models have had more optimization for reliable, structured function calling. Local models can do it but require more careful prompting and produce less consistent JSON outputs.

Latest frontier capabilities. When Anthropic ships Claude Opus 5 or OpenAI ships o4, you get access immediately via API. With local models, you wait for the open-weight release cycle — which can be months behind or never for some capabilities.

Integration with Claude Code and Coding Agents

One of the most practical local AI use cases for developers: using Ollama as a backend for AI coding tools. Tools like Cline, Roo Code, and similar extensions support configuring custom OpenAI-compatible endpoints.

In VS Code with Cline:

Open Cline settings
Set API Provider to "OpenAI Compatible"
Set Base URL to http://localhost:11434/v1
Set API Key to ollama (any string works)
Set model to codestral or deepseek-v3

Now your coding agent runs on local inference. Every file you share, every suggestion, every refactor — processed on your hardware. For enterprise developers working with proprietary codebases, this changes the risk profile entirely.

Codestral is my default for local coding work. At 22B parameters, it fits on a 24GB GPU and produces code quality that outperforms many much larger general-purpose models on focused coding tasks.

For the broader picture on AI models and model selection, the DeepSeek R1 open-weight analysis covers why some open models now genuinely compete on specific reasoning tasks.

Cost Analysis: GPU vs API Breakeven

The math depends heavily on your usage volume. Here's a rough framework:

Hardware costs (one-time):

Entry level (RTX 4070 12GB): ~$600
Mid-range (RTX 4090 24GB): ~$2,000
High-end (2x RTX 4090): ~$4,000

API costs (monthly, for comparison):

Claude Sonnet: ~$3/1M input tokens, $15/1M output tokens
GPT-4o: ~$2.50/1M input, $10/1M output
Gemini 1.5 Pro: ~$3.50/1M input, $10.50/1M output

Breakeven for RTX 4090 ($2,000 hardware):

At 500M tokens/month → breakeven in ~2 months
At 100M tokens/month → breakeven in ~10 months
At 10M tokens/month → breakeven in ~8 years (cloud wins)

The honest conclusion: If you're using AI models for personal learning, occasional coding, or light experimentation — cloud APIs are cheaper and better quality. If you're running production workloads with sensitive data, high volume, or offline requirements — local infrastructure pays for itself fast.

I run both. Cloud for the tasks that demand frontier quality (complex research, long-context reasoning, nuanced writing). Local for privacy-sensitive work, batch processing, and the experiments I want to run without watching a cost meter.

The generative AI research hub has more on evaluating model trade-offs across tasks.

Practical Setup Recommendations

For developers getting started:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a fast, capable small model
ollama pull phi4-mini

# Pull a strong code model
ollama pull codestral

# Test it
ollama run phi4-mini "Write a Python function that validates email addresses"

Start with phi4-mini (3.8B) or mistral (7B) to verify your setup works before pulling larger models. They run fast on modest hardware and give you immediate feedback.

For production-grade local inference:

Consider ollama serve with the environment variable OLLAMA_HOST=0.0.0.0 to expose the API on your local network. Combine with Nginx for SSL termination and basic auth if exposing to a team.

For memory management across multiple models, Ollama keeps loaded models in memory for 5 minutes of inactivity by default. Set OLLAMA_KEEP_ALIVE=0 to unload immediately after each request if you're memory-constrained.

FAQ

Does Ollama work on Windows?

Yes. Ollama has native Windows support via an installer. GPU acceleration works on NVIDIA cards with CUDA support. AMD GPU support on Windows is available but less mature than Linux.

Can I run multiple models simultaneously?

Ollama can load multiple models, but they compete for the same VRAM. In practice, you typically run one model at a time unless you have significant GPU memory (48GB+). The CLI switches between models seamlessly — just run ollama run <different-model> and it unloads the current one.

How does quantization affect quality?

Ollama defaults to Q4_K_M quantization for most models — a 4-bit format that reduces the original 16-bit model by ~75% in size with minimal quality degradation on most tasks. You can pull higher-quality quantizations (Q8, fp16) if you have the VRAM: ollama pull llama4-scout:q8_0.

Is the Ollama API fully OpenAI-compatible?

For chat completions and embeddings: largely yes. For function calling, streaming, and tool use: it depends on the model. Smaller models often produce malformed JSON for tool calls. Codestral and DeepSeek V3 are the most reliable local options for structured output tasks.

What happens to my data when I use Ollama?

Nothing leaves your machine. Ollama is a local process with no telemetry beyond optional, opt-in crash reports. The model weights are stored locally. Your prompts and responses exist only in memory and, optionally, your application's logs.

Local AI has crossed the threshold from "technically interesting" to "production-viable for specific use cases." Ollama is the reason that's accessible without a PhD in ML systems. Install it, pull a model, and see what fits your workflow. The privacy properties alone make it worth having in your toolkit — even if you keep cloud APIs for the tasks that demand their quality ceiling.

The right answer, for most developers, is both. Know where each belongs.

Get Started

Build your first AI system

Step-by-step guide to setting up ACOS, creating your first agent, and shipping real products with AI.

Start building

Templates & Blueprints

Production-ready architecture

Download AI architecture templates, multi-agent blueprints, and prompt engineering patterns.

Browse templates

Inner Circle

Join the builder community

Connect with creators and architects shipping AI products. Weekly office hours, shared resources, direct access.

Join the circle

Stay in the intelligence loop

Weekly field notes on AI systems, production patterns, and builder strategy.

Continue Reading

Intelligence Dispatches12 min read

Ollama vs LM Studio vs Jan 2026: The Best Way to Run AI Locally

A tested comparison of the three local-LLM runners in June 2026 — Ollama, LM Studio, and Jan — on ease of use, model library, GUI vs CLI, OpenAI-compatible API, hardware support, and privacy.

Read article

Creator SystemsMarch 21, 202612 min read

Running Local AI Models with Ollama: The Privacy Guide

How to run Llama 4, DeepSeek, and Mistral on your own hardware — no API keys, no data leaving your machine, full model control.

FrankX

AI Architect & Creator Systems Builder

Share Share

Reading Goal

You will know how to run frontier AI models on your own hardware — with setup commands, hardware requirements, and use cases where local beats cloud.

That's no longer the situation.

This guide covers the full setup, hardware reality, and the honest analysis of when local beats cloud and when it doesn't.

What Is Ollama?

What this gives you:

Zero data egress: Every token stays on your machine. Your code, your documents, your prompts — none of it leaves.
No API costs: Pay for hardware once (or use what you have). Inference cost is electricity.
No rate limits: Run 1,000 inference calls in a loop at 2am. Nobody cares.
Model control: Pin exact model versions. No silent updates, no capability regressions, no deprecations.
Offline capability: Works in air-gapped environments, planes, servers with no outbound internet.

Installation: One Command

macOS and Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com. It installs as a background service with a system tray icon.

That's the full installation. Ollama starts automatically and listens on http://localhost:11434.

Verify it's running:

ollama --version
# ollama version 0.6.x

Pulling Models

The model pull syntax mirrors Docker:

# Llama 4 Scout — Meta's efficient 17B MoE model
ollama pull llama4-scout

# DeepSeek V3 — strong coding + reasoning, MIT license
ollama pull deepseek-v3

# Mistral 7B — fast, lean, good for most tasks
ollama pull mistral

# Codestral — Mistral's dedicated code model
ollama pull codestral

# Gemma 3 27B — Google's capable mid-size model
ollama pull gemma3:27b

# Phi-4 Mini — Microsoft's tiny but capable 3.8B model
ollama pull phi4-mini

Run a model interactively:

ollama run llama4-scout

Or send a one-shot prompt:

ollama run mistral "Explain async/await in JavaScript in 3 sentences"

The OpenAI-compatible API endpoint:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4-scout",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

That endpoint means any tool built for OpenAI's API — including many agent frameworks — can point at Ollama with a one-line URL change.

Hardware Requirements: The Honest Table

Model Size	RAM Required	Recommended GPU	Tokens/sec (GPU)	Tokens/sec (CPU)
3B-4B params	4-6GB	Any modern GPU	60-120 t/s	10-20 t/s
7B-8B params	8-10GB	8GB VRAM GPU	30-60 t/s	5-12 t/s
13B-14B params	16GB	16GB VRAM GPU	20-40 t/s	3-7 t/s
27B-34B params	32-40GB	24GB+ VRAM GPU	15-25 t/s	2-4 t/s
70B params	48-64GB	2x 24GB GPU	8-15 t/s	1-2 t/s
405B+ params	256GB+	Multi-GPU setup	4-8 t/s	Not viable

Practical minimum for useful work: 16GB system RAM + any discrete GPU with 8GB+ VRAM. This runs 7B-8B models (Mistral, Llama 4 Scout's efficient config, Phi-4 Mini) at comfortable speeds.

Sweet spot for most developers: 32-64GB system RAM + NVIDIA RTX 4090 (24GB VRAM) or AMD RX 7900 XTX (24GB VRAM). Runs 27B-34B models fully on GPU.

Mac note: Apple Silicon uses unified memory — the M3 Max with 96GB runs 70B models well. M2/M3 with 32GB runs 27B-34B models comfortably. Metal acceleration works natively in Ollama.

Storage: Models range from 4GB (quantized 7B) to 400GB+ (70B in high quality). Budget 500GB+ SSD space if you plan to experiment with multiple models.

Performance Expectations by Hardware Tier

I run Ollama on three different setups. Here's what real-world inference looks like:

Budget setup (16GB RAM, GTX 1080 8GB VRAM):

7B models: ~25-35 t/s. Workable for most tasks. Response latency is acceptable.
13B models: ~8-12 t/s if partially offloaded. Noticeable lag on long responses.
Verdict: Good for personal projects, not production APIs.

Mid-range setup (64GB RAM, RTX 3090 24GB VRAM):

7B models: ~80-100 t/s. Faster than you can read.
27B models: ~30-45 t/s. Genuinely comfortable for interactive use.
70B models: ~8-12 t/s with partial CPU offload. Usable but slow.
Verdict: Excellent for most developer workflows.

High-end setup (128GB RAM, 2x RTX 4090):

27B models: ~60+ t/s. Competitive with hosted API response times.
70B models: ~25-35 t/s on full GPU. Fast enough for production.
Verdict: Matches hosted APIs on speed for most model sizes.

When Local Beats Cloud

This is where I've found local models genuinely superior to cloud alternatives:

When Cloud Wins

Local models have real limitations. I'm not pretending otherwise.

Integration with Claude Code and Coding Agents

In VS Code with Cline:

Open Cline settings
Set API Provider to "OpenAI Compatible"
Set Base URL to http://localhost:11434/v1
Set API Key to ollama (any string works)
Set model to codestral or deepseek-v3

Codestral is my default for local coding work. At 22B parameters, it fits on a 24GB GPU and produces code quality that outperforms many much larger general-purpose models on focused coding tasks.

For the broader picture on AI models and model selection, the DeepSeek R1 open-weight analysis covers why some open models now genuinely compete on specific reasoning tasks.

Cost Analysis: GPU vs API Breakeven

The math depends heavily on your usage volume. Here's a rough framework:

Hardware costs (one-time):

Entry level (RTX 4070 12GB): ~$600
Mid-range (RTX 4090 24GB): ~$2,000
High-end (2x RTX 4090): ~$4,000

API costs (monthly, for comparison):

Claude Sonnet: ~$3/1M input tokens, $15/1M output tokens
GPT-4o: ~$2.50/1M input, $10/1M output
Gemini 1.5 Pro: ~$3.50/1M input, $10.50/1M output

Breakeven for RTX 4090 ($2,000 hardware):

At 500M tokens/month → breakeven in ~2 months
At 100M tokens/month → breakeven in ~10 months
At 10M tokens/month → breakeven in ~8 years (cloud wins)

The generative AI research hub has more on evaluating model trade-offs across tasks.

Practical Setup Recommendations

For developers getting started:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a fast, capable small model
ollama pull phi4-mini

# Pull a strong code model
ollama pull codestral

# Test it
ollama run phi4-mini "Write a Python function that validates email addresses"

Start with phi4-mini (3.8B) or mistral (7B) to verify your setup works before pulling larger models. They run fast on modest hardware and give you immediate feedback.

For production-grade local inference:

Consider ollama serve with the environment variable OLLAMA_HOST=0.0.0.0 to expose the API on your local network. Combine with Nginx for SSL termination and basic auth if exposing to a team.

FAQ

Does Ollama work on Windows?

Yes. Ollama has native Windows support via an installer. GPU acceleration works on NVIDIA cards with CUDA support. AMD GPU support on Windows is available but less mature than Linux.

Can I run multiple models simultaneously?

How does quantization affect quality?

Is the Ollama API fully OpenAI-compatible?

What happens to my data when I use Ollama?

The right answer, for most developers, is both. Know where each belongs.

Get Started

Build your first AI system

Step-by-step guide to setting up ACOS, creating your first agent, and shipping real products with AI.

Start building

Templates & Blueprints

Production-ready architecture

Download AI architecture templates, multi-agent blueprints, and prompt engineering patterns.

Browse templates

Inner Circle

Join the builder community

Connect with creators and architects shipping AI products. Weekly office hours, shared resources, direct access.

Join the circle

Stay in the intelligence loop

Weekly field notes on AI systems, production patterns, and builder strategy.

Continue Reading

Intelligence Dispatches12 min read

Ollama vs LM Studio vs Jan 2026: The Best Way to Run AI Locally

A tested comparison of the three local-LLM runners in June 2026 — Ollama, LM Studio, and Jan — on ease of use, model library, GUI vs CLI, OpenAI-compatible API, hardware support, and privacy.

Read article

Running Local AI Models with Ollama: The Privacy Guide

What Is Ollama?

Installation: One Command

Pulling Models

Hardware Requirements: The Honest Table

Performance Expectations by Hardware Tier

When Local Beats Cloud

When Cloud Wins

Integration with Claude Code and Coding Agents

Cost Analysis: GPU vs API Breakeven

Practical Setup Recommendations

FAQ

Build your first AI system

Production-ready architecture

Join the builder community

Tags

Stay in the intelligence loop

Continue Reading

Ollama vs LM Studio vs Jan 2026: The Best Way to Run AI Locally

Running Local AI Models with Ollama: The Privacy Guide

What Is Ollama?

Installation: One Command

Pulling Models

Hardware Requirements: The Honest Table

Performance Expectations by Hardware Tier

When Local Beats Cloud

When Cloud Wins

Integration with Claude Code and Coding Agents

Cost Analysis: GPU vs API Breakeven

Practical Setup Recommendations

FAQ

Build your first AI system

Production-ready architecture

Join the builder community

Tags

Stay in the intelligence loop

Continue Reading

Ollama vs LM Studio vs Jan 2026: The Best Way to Run AI Locally