How to Run a Local LLM — Complete 2025 Guide

A complete step-by-step walkthrough for running large language models on your own hardware. Works on macOS, Linux, and Windows — no cloud account required.

Updated June 2026 · 15 min read

1. Prerequisites & Hardware Requirements

Local LLMs are memory-intensive. The rule of thumb is that you need roughly 1 GB of RAM per billion parameters at Q4 quantization (the most common compression level). Below are the practical tiers:

HardwareRAM / VRAMBest ModelSpeed
Budget laptop8–12 GBQwen3 8B (thinking mode)~8 tok/s CPU
Mid-range laptop16–24 GBPhi-4-reasoning 14B~15 tok/s
High-end GPU (RTX 3090)24 GB VRAMGemma 4 31B Q4~55 tok/s
Apple M3/M4 Max48–128 GBGemma 4 31B (MTP)~50 tok/s (MTP)
Workstation GPU (RTX 4090)24 GB VRAMGemma 4 31B Q4~70 tok/s
Dual-GPU / Multi-GPU48–80+ GBQwen3 32B at FP16~80+ tok/s
Note: If the model does not fit entirely in VRAM/unified memory, Ollama automatically offloads layers to CPU RAM. This works but is slower. A model split across GPU and CPU runs at roughly 50–80% of pure GPU speed.

2. Install Ollama

Ollama is the easiest way to run local LLMs. It handles model downloads, GGUF quantization, GPU offloading, and exposes a simple REST API. Think of it as Docker, but for AI models.

🍎macOS

Download the macOS installer from ollama.com or install via Homebrew. Natively supports Apple Silicon (M1/M2/M3) with Metal GPU acceleration.

# Option A: Homebrew brew install ollama # Option B: Direct install script curl -fsSL https://ollama.com/install.sh | sh

macOS users: after install, Ollama runs as a background menu-bar app.

🐧Linux

One-line install script works on Ubuntu, Debian, Fedora, Arch, and most other distributions. Auto-detects NVIDIA and AMD GPUs.

curl -fsSL https://ollama.com/install.sh | sh

NVIDIA GPUs: requires CUDA drivers. AMD GPUs: requires ROCm 5.7+.

🪟Windows

Download the OllamaSetup.exe installer from ollama.com. WSL2 is not required — the native Windows binary supports NVIDIA and AMD GPUs directly.

# Download from: https://ollama.com/download # Then run: OllamaSetup.exe

Windows users: restart your terminal after install to update PATH.

3. Download & Run Your First Model

Once Ollama is installed, running a model is a single command. Ollama automatically pulls the model if it isn't downloaded yet.

1

Verify Ollama is running

Check that the Ollama daemon is active before pulling any models.

ollama --version
2

Run a model (auto-downloads if needed)

The `ollama run` command downloads the model on first run, then starts an interactive chat session. For a fast first experience, start with Llama 3.2 3B or Mistral 7B.

# Fast & light (recommended first run) ollama run llama3.2:3b # Best quality on mid-range hardware ollama run mistral:7b

Model downloads are cached at ~/.ollama/models — you only download once.

3

List downloaded models

See all models you've pulled locally.

ollama list
4

Use Ollama as an API (optional)

Ollama exposes a local REST API on port 11434, compatible with the OpenAI API format.

curl http://localhost:11434/api/generate \ -d '{"model":"llama3.2:3b","prompt":"Hello!","stream":false}'

4. Platform-Specific Tips

🍎

macOS (Apple Silicon)

  • Unified memory architecture means GPU and CPU share RAM — a 64 GB M3 Max can run 70B models.
  • Metal GPU is used automatically. No CUDA or ROCm needed.
  • For best performance, close Chrome and other memory-hungry apps before running 30B+ models.
  • Monitor memory with: Activity Monitor → Memory tab.
🐧

Linux (NVIDIA GPU)

  • Install NVIDIA drivers 525+ and CUDA 12.x before installing Ollama.
  • Check VRAM available: nvidia-smi
  • Enable persistence mode for faster startup: sudo nvidia-smi -pm 1
  • Multiple GPUs are auto-detected and used for layer distribution.
🪟

Windows

  • Use Windows Terminal or PowerShell — the native CMD prompt may have encoding issues.
  • NVIDIA GPU: ensure you have the latest Game Ready or Studio drivers.
  • AMD GPU: requires Windows 11 and ROCm for Radeon driver.
  • WSL2 users: Ollama has a native WSL2 integration; NVIDIA drivers installed on Windows are visible inside WSL2.

5. GPU Acceleration

Ollama auto-detects GPUs on install. Here's what to know for each vendor:

NVIDIA (CUDA)

Auto-detected on Linux and Windows. Requires CUDA 12.x and driver 525+.

nvidia-smi # check VRAM & driver

AMD (ROCm)

Supported on Linux via ROCm 5.7+. Windows support via DirectML is experimental.

rocm-smi # check GPU status

Apple Silicon

Metal GPU used automatically. Ollama ships a Metal-optimized binary. No extra setup.

6. Alternative Tools

Ollama is the fastest path, but alternatives exist for different needs:

LM Studio

GUI desktop app with model browser, chat UI, and OpenAI-compatible server. Best for non-technical users.

Best GUI

Jan

Open-source desktop AI assistant built on llama.cpp. Offline-first, plugin system, multiple model hubs.

Best Desktop App

llama.cpp

The C++ runtime that powers most local LLM tools under the hood. Best performance tuning, CLI only.

Best Performance

GPT4All

Cross-platform GUI app focused on privacy with a curated model library and document chat.

Best for Beginners

vLLM

Production-grade Python inference server with PagedAttention. Best for serving models to multiple users.

Best for Production

See our full local LLM tools comparison for a detailed breakdown.

7. Troubleshooting

Problem: Out of memory error / model crashes

Cause: The model is too large to fit in available RAM/VRAM.

Fix: Use a smaller quantization or smaller model. Try a Q4_K_M quantization instead of FP16.

ollama run llama3.3:70b-instruct-q4_K_M

Problem: Extremely slow inference (1–2 tokens/sec)

Cause: Model is running entirely on CPU with no GPU offloading.

Fix: Check if Ollama is detecting your GPU. Verify driver installation and try OLLAMA_GPU_LAYERS env variable.

OLLAMA_GPU_LAYERS=32 ollama run mistral:7b

Problem: Ollama server not starting

Cause: Port 11434 already in use or Ollama service not running.

Fix: Check what's on port 11434 and restart Ollama.

lsof -i :11434 ollama serve # start manually

Problem: Model gives repetitive or nonsense output

Cause: Context window overflow or bad temperature/top_p settings.

Fix: Start a new conversation to clear the context. Adjust temperature (lower = more deterministic).