What is the easiest way to run a local LLM?

Ollama is the easiest way to run a local LLM in 2026. Install it from ollama.com, then run `ollama run llama3.2` in your terminal. It handles model downloads, GPU detection, and serving automatically — no configuration required.

Can I run a local LLM without a GPU?

Yes. Models with 7B parameters or fewer run acceptably on CPU-only systems, though inference is slower (5–10 tokens/sec vs 50+ on GPU). For best results without a dedicated GPU, Apple Silicon Macs with unified memory are the top choice.

How much RAM do I need to run a local LLM?

8 GB RAM is the minimum for small 7B models. 16 GB is recommended for comfortable use with 7–14B models. 32+ GB opens up 30B models and above. For Apple Silicon, unified memory means your GPU and CPU share the same pool — a 36 GB M3 Pro can run 70B models at Q4 quantization.

Does running a local LLM require internet?

Only for the initial model download. Once downloaded, local LLMs run entirely offline — no internet connection required. This makes them ideal for private, air-gapped, and latency-sensitive environments.

How do I run a local LLM on Mac?

Install Ollama from ollama.com — it natively supports Apple Silicon (M1/M2/M3/M4) via Apple Metal GPU. Run `ollama run mistral:7b` for a lightweight model or `ollama run llama3.3:70b` on M3 Max / M4 Max with 48+ GB memory. No drivers or CUDA needed.

How to Run a Local LLM — Complete 2025 Guide

A complete step-by-step walkthrough for running large language models on your own hardware. Works on macOS, Linux, and Windows — no cloud account required.

Updated June 2026 · 15 min read

In this guide

Prerequisites & hardware requirements
Install Ollama (recommended)
Download and run your first model
Platform-specific setup (macOS / Linux / Windows)
GPU acceleration setup
Alternative tools: LM Studio, llama.cpp, Jan
Troubleshooting common issues

1. Prerequisites & Hardware Requirements

Local LLMs are memory-intensive. The rule of thumb is that you need roughly 1 GB of RAM per billion parameters at Q4 quantization (the most common compression level). Below are the practical tiers:

Hardware	RAM / VRAM	Best Model	Speed
Budget laptop	8–12 GB	Qwen3 8B (thinking mode)	~8 tok/s CPU
Mid-range laptop	16–24 GB	Phi-4-reasoning 14B	~15 tok/s
High-end GPU (RTX 3090)	24 GB VRAM	Gemma 4 31B Q4	~55 tok/s
Apple M3/M4 Max	48–128 GB	Gemma 4 31B (MTP)	~50 tok/s (MTP)
Workstation GPU (RTX 4090)	24 GB VRAM	Gemma 4 31B Q4	~70 tok/s
Dual-GPU / Multi-GPU	48–80+ GB	Qwen3 32B at FP16	~80+ tok/s

Note: If the model does not fit entirely in VRAM/unified memory, Ollama automatically offloads layers to CPU RAM. This works but is slower. A model split across GPU and CPU runs at roughly 50–80% of pure GPU speed.

2. Install Ollama

Ollama is the easiest way to run local LLMs. It handles model downloads, GGUF quantization, GPU offloading, and exposes a simple REST API. Think of it as Docker, but for AI models.

🍎macOS

Download the macOS installer from ollama.com or install via Homebrew. Natively supports Apple Silicon (M1/M2/M3) with Metal GPU acceleration.

# Option A: Homebrew
brew install ollama

# Option B: Direct install script
curl -fsSL https://ollama.com/install.sh | sh

macOS users: after install, Ollama runs as a background menu-bar app.

🐧Linux

One-line install script works on Ubuntu, Debian, Fedora, Arch, and most other distributions. Auto-detects NVIDIA and AMD GPUs.

curl -fsSL https://ollama.com/install.sh | sh

NVIDIA GPUs: requires CUDA drivers. AMD GPUs: requires ROCm 5.7+.

🪟Windows

Download the OllamaSetup.exe installer from ollama.com. WSL2 is not required — the native Windows binary supports NVIDIA and AMD GPUs directly.

# Download from: https://ollama.com/download
# Then run: OllamaSetup.exe

Windows users: restart your terminal after install to update PATH.

3. Download & Run Your First Model

Once Ollama is installed, running a model is a single command. Ollama automatically pulls the model if it isn't downloaded yet.

Verify Ollama is running

Check that the Ollama daemon is active before pulling any models.

ollama --version

Run a model (auto-downloads if needed)

The `ollama run` command downloads the model on first run, then starts an interactive chat session. For a fast first experience, start with Llama 3.2 3B or Mistral 7B.

# Fast & light (recommended first run)
ollama run llama3.2:3b

# Best quality on mid-range hardware
ollama run mistral:7b

Model downloads are cached at ~/.ollama/models — you only download once.

List downloaded models

See all models you've pulled locally.

ollama list

Use Ollama as an API (optional)

Ollama exposes a local REST API on port 11434, compatible with the OpenAI API format.

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2:3b","prompt":"Hello!","stream":false}'

4. Platform-Specific Tips

🍎

macOS (Apple Silicon)

✓Unified memory architecture means GPU and CPU share RAM — a 64 GB M3 Max can run 70B models.
✓Metal GPU is used automatically. No CUDA or ROCm needed.
✓For best performance, close Chrome and other memory-hungry apps before running 30B+ models.
✓Monitor memory with: Activity Monitor → Memory tab.

🐧

Linux (NVIDIA GPU)

✓Install NVIDIA drivers 525+ and CUDA 12.x before installing Ollama.
✓Check VRAM available: nvidia-smi
✓Enable persistence mode for faster startup: sudo nvidia-smi -pm 1
✓Multiple GPUs are auto-detected and used for layer distribution.

🪟

Windows

✓Use Windows Terminal or PowerShell — the native CMD prompt may have encoding issues.
✓NVIDIA GPU: ensure you have the latest Game Ready or Studio drivers.
✓AMD GPU: requires Windows 11 and ROCm for Radeon driver.
✓WSL2 users: Ollama has a native WSL2 integration; NVIDIA drivers installed on Windows are visible inside WSL2.

5. GPU Acceleration

Ollama auto-detects GPUs on install. Here's what to know for each vendor:

NVIDIA (CUDA)

Auto-detected on Linux and Windows. Requires CUDA 12.x and driver 525+.

nvidia-smi # check VRAM & driver

AMD (ROCm)

Supported on Linux via ROCm 5.7+. Windows support via DirectML is experimental.

rocm-smi # check GPU status

Apple Silicon

Metal GPU used automatically. Ollama ships a Metal-optimized binary. No extra setup.

6. Alternative Tools

Ollama is the fastest path, but alternatives exist for different needs:

LM Studio

GUI desktop app with model browser, chat UI, and OpenAI-compatible server. Best for non-technical users.

Best GUI

Jan

Open-source desktop AI assistant built on llama.cpp. Offline-first, plugin system, multiple model hubs.

Best Desktop App

llama.cpp

The C++ runtime that powers most local LLM tools under the hood. Best performance tuning, CLI only.

Best Performance

GPT4All

Cross-platform GUI app focused on privacy with a curated model library and document chat.

Best for Beginners

vLLM

Production-grade Python inference server with PagedAttention. Best for serving models to multiple users.

Best for Production

See our full local LLM tools comparison for a detailed breakdown.

7. Troubleshooting

Problem: Out of memory error / model crashes

Cause: The model is too large to fit in available RAM/VRAM.

Fix: Use a smaller quantization or smaller model. Try a Q4_K_M quantization instead of FP16.

ollama run llama3.3:70b-instruct-q4_K_M

Problem: Extremely slow inference (1–2 tokens/sec)

Cause: Model is running entirely on CPU with no GPU offloading.

Fix: Check if Ollama is detecting your GPU. Verify driver installation and try OLLAMA_GPU_LAYERS env variable.

OLLAMA_GPU_LAYERS=32 ollama run mistral:7b

Problem: Ollama server not starting

Cause: Port 11434 already in use or Ollama service not running.

Fix: Check what's on port 11434 and restart Ollama.

lsof -i :11434
ollama serve  # start manually

Problem: Model gives repetitive or nonsense output

Cause: Context window overflow or bad temperature/top_p settings.

Fix: Start a new conversation to clear the context. Adjust temperature (lower = more deterministic).