How to Run a Local LLM — Complete 2025 Guide
A complete step-by-step walkthrough for running large language models on your own hardware. Works on macOS, Linux, and Windows — no cloud account required.
Updated June 2026 · 15 min read
1. Prerequisites & Hardware Requirements
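Local LLMs are memory-intensive. A conservative rule of thumb is to budget roughly 1 GB of RAM per billion parameters at Q4 quantization (the most common compression level): the weights themselves take about 0.5–0.6 GB per billion parameters, and the rest is headroom for the context cache and the operating system. Below are the practical tiers: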
| Hardware | RAM / VRAM | Best Model | Speed |
|---|---|---|---|
| Budget laptop | 8–12 GB | Qwen3 8B (thinking mode) | ~8 tok/s CPU |
| Mid-range laptop | 16–24 GB | Phi-4-reasoning 14B | ~15 tok/s |
| High-end GPU (RTX 3090) | 24 GB VRAM | Gemma 4 31B Q4 | ~55 tok/s |
| Apple M3/M4 Max | 48–128 GB | Gemma 4 31B (MTP) | ~50 tok/s (MTP) |
| Workstation GPU (RTX 4090) | 24 GB VRAM | Gemma 4 31B Q4 | ~70 tok/s |
| Dual-GPU / Multi-GPU | 48–80+ GB | Qwen3 32B at FP16 | ~80+ tok/s |
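The rule of thumb above reduces to simple arithmetic: weight memory is roughly parameters × bits per weight ÷ 8. The sketch below is one way to script that estimate. The function name `estimate_gb` and the default 1.2 overhead factor (for context cache and runtime buffers) are illustrative assumptions, not values taken from Ollama.

```shell
#!/usr/bin/env bash
# Rough memory estimate in GB for a model.
#   $1 = parameters in billions
#   $2 = bits per weight (4 for Q4, 16 for FP16)
#   $3 = overhead multiplier (assumed default: 1.2)
estimate_gb() {
  local params_b="$1" bits="$2" overhead="${3:-1.2}"
  # billions of params * bits / 8 = GB of weights, then apply the overhead factor
  LC_ALL=C awk -v p="$params_b" -v b="$bits" -v o="$overhead" \
    'BEGIN { printf "%.1f\n", p * b / 8 * o }'
}
```

For example, `estimate_gb 8 4` prints 4.8 and `estimate_gb 32 16` prints 76.8, which lands in the multi-GPU tier of the table. Treat the result as a starting point; long context windows can push real usage higher.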
2. Install Ollama
Ollama is the easiest way to run local LLMs. It handles model downloads, quantized GGUF model files, and GPU offloading, and it exposes a simple REST API. Think of it as Docker, but for AI models.
macOS
Download the macOS installer from ollama.com or install via Homebrew. Ollama natively supports Apple Silicon (M1 and later) with Metal GPU acceleration.
# Option A: Homebrew
brew install ollama
# Option B: Direct install script
curl -fsSL https://ollama.com/install.sh | sh
macOS users: after install, Ollama runs as a background menu-bar app.
Linux
The one-line install script works on Ubuntu, Debian, Fedora, Arch, and most other distributions, and it auto-detects NVIDIA and AMD GPUs.
curl -fsSL https://ollama.com/install.sh | sh
NVIDIA GPUs require CUDA drivers. AMD GPUs require ROCm 5.7+.
Windows
Download the OllamaSetup.exe installer from ollama.com. WSL2 is not required: the native Windows binary supports NVIDIA and AMD GPUs directly.
# Download from: https://ollama.com/download
# Then run: OllamaSetup.exe
Windows users: restart your terminal after install to update PATH.
3. Download & Run Your First Model
Once Ollama is installed, running a model is a single command. Ollama automatically pulls the model if it isn't downloaded yet.
Verify Ollama is running
Check that the Ollama daemon is active before pulling any models.
ollama --version
Run a model (auto-downloads if needed)
The `ollama run` command downloads the model on first run, then starts an interactive chat session. For a fast first experience, start with Llama 3.2 3B or Mistral 7B.
# Fast & light (recommended first run)
ollama run llama3.2:3b
# Best quality on mid-range hardware
ollama run mistral:7b
Model downloads are cached at ~/.ollama/models, so you only download once.
List downloaded models
See all models you've pulled locally.
ollama list
Use Ollama as an API (optional)
Ollama exposes a local REST API on port 11434. The example below uses the native /api/generate endpoint; an OpenAI-compatible API is also available under /v1.
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2:3b","prompt":"Hello!","stream":false}'
4. Platform-Specific Tips
macOS (Apple Silicon)
- Unified memory architecture means GPU and CPU share RAM, so a 64 GB M3 Max can run 70B models at Q4 quantization.
- Metal GPU is used automatically. No CUDA or ROCm needed.
- For best performance, close Chrome and other memory-hungry apps before running 30B+ models.
- Monitor memory with: Activity Monitor → Memory tab.
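Because unified memory is the hard limit on Apple Silicon, it can help to check total RAM from the terminal before choosing a model. The following is a minimal sketch; the function name `total_ram_gb` is hypothetical, and a Linux branch is included only so the same function works on both systems.

```shell
#!/usr/bin/env bash
# Print total physical memory in whole GiB (rounded down) on macOS or Linux.
total_ram_gb() {
  case "$(uname -s)" in
    Darwin)
      # hw.memsize reports bytes
      echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
      ;;
    Linux)
      # MemTotal in /proc/meminfo is reported in kB
      awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
      ;;
    *)
      echo "unsupported OS" >&2
      return 1
      ;;
  esac
}
```

Running `total_ram_gb` gives a single number to compare against the hardware tiers in section 1. On Linux the value is usually slightly below the installed amount because the kernel reserves some memory.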
Linux (NVIDIA GPU)
- Install NVIDIA drivers 525+ and CUDA 12.x before installing Ollama.
- Check VRAM available: nvidia-smi
- Enable persistence mode for faster startup: sudo nvidia-smi -pm 1
- Multiple GPUs are auto-detected and used for layer distribution.
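To see per-GPU memory at a glance, nvidia-smi can print selected fields as CSV. The wrapper below is a small sketch; the function name `gpu_vram_report` is hypothetical, and it prints a notice instead of failing hard when nvidia-smi is not on PATH.

```shell
#!/usr/bin/env bash
# Print index, name, total VRAM, and used VRAM for each NVIDIA GPU, one CSV line per GPU.
gpu_vram_report() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; is the NVIDIA driver installed?"
    return 1
  fi
  nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv,noheader
}
```

On a multi-GPU machine each card gets its own line, which makes it easy to confirm that all of them are visible before loading a large model.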
Windows
- Use Windows Terminal or PowerShell; the native CMD prompt may have encoding issues.
- NVIDIA GPU: ensure you have the latest Game Ready or Studio drivers.
- AMD GPU: requires Windows 11 and ROCm for Radeon driver.
- WSL2 users: Ollama has a native WSL2 integration; NVIDIA drivers installed on Windows are visible inside WSL2.
5. GPU Acceleration
Ollama auto-detects GPUs on install. Here's what to know for each vendor:
NVIDIA (CUDA)
Auto-detected on Linux and Windows. Requires CUDA 12.x and driver 525+.
nvidia-smi # check VRAM & driver
AMD (ROCm)
Supported on Linux via ROCm 5.7+. On Windows, supported Radeon cards work through the native binary with AMD's ROCm driver.
rocm-smi # check GPU status
Apple Silicon
Metal GPU used automatically. Ollama ships a Metal-optimized binary. No extra setup.
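The vendor checks in this section can be combined into one quick script. This is only a sketch under a simple assumption: it looks for the vendor tools on PATH and does not ask Ollama what it actually detected. The function name `detect_backend` and its output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Guess the likely GPU backend: metal, cuda, rocm, or cpu.
detect_backend() {
  if [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    echo "metal"   # Apple Silicon Mac
  elif command -v nvidia-smi >/dev/null 2>&1; then
    echo "cuda"    # NVIDIA driver tools present
  elif command -v rocm-smi >/dev/null 2>&1; then
    echo "rocm"    # AMD ROCm tools present
  else
    echo "cpu"     # no known GPU tooling found
  fi
}
```

If `detect_backend` prints cpu on a machine that has a discrete GPU, the driver install is the first thing to check.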
6. Alternative Tools
Ollama is the fastest path, but alternatives exist for different needs:
LM Studio
GUI desktop app with model browser, chat UI, and OpenAI-compatible server. Best for non-technical users.
Jan
Open-source desktop AI assistant built on llama.cpp. Offline-first, plugin system, multiple model hubs.
llama.cpp
The C++ runtime that powers most local LLM tools under the hood. Best performance tuning, CLI only.
GPT4All
Cross-platform GUI app focused on privacy with a curated model library and document chat.
vLLM
Production-grade Python inference server with PagedAttention. Best for serving models to multiple users.
See our full local LLM tools comparison for a detailed breakdown.
7. Troubleshooting
Problem: Out of memory error / model crashes
Cause: The model is too large to fit in available RAM/VRAM.
Fix: Use a smaller quantization or smaller model. Try a Q4_K_M quantization instead of FP16.
ollama run llama3.3:70b-instruct-q4_K_M
Problem: Extremely slow inference (1–2 tokens/sec)
Cause: Model is running entirely on CPU with no GPU offloading.
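Fix: Check whether Ollama is detecting your GPU: `ollama ps` shows the CPU/GPU split for loaded models. Verify driver installation, and if needed set the number of offloaded layers with the num_gpu parameter.
ollama ps # PROCESSOR column shows the CPU/GPU split
# inside an `ollama run mistral:7b` session: /set parameter num_gpu 32
Problem: Ollama server not starting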
Cause: Port 11434 already in use or Ollama service not running.
Fix: Check what's on port 11434 and restart Ollama.
lsof -i :11434
ollama serve # start manually
Problem: Model gives repetitive or nonsense output
Cause: Context window overflow or bad temperature/top_p settings.
Fix: Start a new conversation to clear the context. Adjust temperature (lower = more deterministic).
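When calling the REST API, sampling settings can be passed per request in the `options` object. The sketch below builds such a request body. The helper names `build_generate_body` and `generate` are hypothetical, the defaults of 0.2 for temperature and 0.9 for top_p are illustrative rather than recommended values, and the prompt is assumed to contain no double quotes or backslashes because no JSON escaping is done.

```shell
#!/usr/bin/env bash
# Build a JSON body for POST /api/generate with explicit sampling options.
#   $1 = model tag, $2 = prompt, $3 = temperature (assumed default 0.2), $4 = top_p (assumed default 0.9)
build_generate_body() {
  local model="$1" prompt="$2" temperature="${3:-0.2}" top_p="${4:-0.9}"
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"temperature":%s,"top_p":%s}}\n' \
    "$model" "$prompt" "$temperature" "$top_p"
}

# Send the body to a local Ollama server (defined here, not called automatically).
generate() {
  curl -s http://localhost:11434/api/generate -d "$(build_generate_body "$@")"
}
```

With a server running, `generate llama3.2:3b "Hello!" 0.1` sends the same prompt as the earlier curl example but with a lower temperature, which is a quick way to test whether sampling settings are behind the bad output.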