Can I Run This Model?
Open Source · Run Locally · No Cloud Required


51 open source AI models — pick your GPU, get instant compatibility scores and real tok/s estimates for every model.


01 All Models

What models can your machine run?
Select your GPU below — or auto-detect for instant results.
0 GB VRAM = CPU-only inference

02 By Hardware Tier

Entry Tier · 8 GB RAM · 7 models

No GPU required. These models run entirely on CPU. Expect 2–6 tokens/sec — slower than a fast reader, but functional for most tasks.

No GPU needed · CPU inference · 4–8 GB download
Mid Tier · 16 GB RAM · 12 models

6–8 GB VRAM GPU recommended. Most gaming laptops qualify. Expect 8–25 tokens/sec — fast enough for comfortable interactive use.

RTX 3060 / RX 6700 · 6–8 GB VRAM
High Tier · 32 GB RAM · 10 models

12–16 GB VRAM GPU. Unlocks 70B-class models at Q4 with partial CPU offload. Expect 15–40 tokens/sec on models that fit fully in VRAM — fast enough to feel instant.

RTX 4080 / M2 Max · 12–16 GB VRAM
Pro Tier · 64+ GB RAM · 6 models

24+ GB VRAM workstation GPU or Apple M2 Max/Ultra. Full-precision 70B models and 405B quantized. Production-quality inference.

RTX 4090 / A100 / M2 Ultra · 24+ GB VRAM

03 By Use Case

Chat & Assistant

General-purpose conversational models. Good for Q&A, writing assistance, summarization, and brainstorming.

Llama 3 8B · Mistral 7B · Gemma 2 9B · Phi-3 Mini
Code Generation

Trained specifically for code. Completion, explanation, debugging, and test generation across major languages.

CodeLlama 7B · DeepSeek Coder · Qwen 2.5 7B
Reasoning & Math

Chain-of-thought reasoning, multi-step problem solving, mathematical proofs, and logical analysis.

DeepSeek R1 · Llama 3 70B · Qwen 2.5 72B
Vision & Multimodal

Image understanding, OCR, chart reading, and visual Q&A. Requires models with a vision encoder.

LLaVA 1.6 7B · LLaVA 1.6 34B
RAG & Documents

Models with long context windows for document Q&A, retrieval-augmented generation, and summarization.

Mistral 7B · Llama 3 8B · Gemma 2 27B
Creative Writing

Story generation, roleplay, character dialogue, poetry. Models fine-tuned for creative and expressive output.

Mistral 7B · Llama 3 8B · Solar 10.7B

04 Hardware Minimums

Min RAM / Min VRAM

The absolute floor — the model will load and respond, but may run at 1–4 t/s. Usable for testing, not comfortable for daily use.

Rec RAM / Rec VRAM

Recommended for comfortable use — typically 10–30 t/s. If you don't meet even the Min figures, the model won't load at all.

Table columns: Model · Params · Min RAM · Min VRAM · Rec RAM · Rec VRAM · Context · Tier
How these numbers are calculated

All sizes are for Q4_K_M quantization — the recommended balance of size and quality. ⚡ LIVE rows use data freshly fetched from Hugging Face. ⚠ ESTIMATED rows use architecture math — usually within 10–20% of real-world usage.


05 Quantization Guide

Quantization reduces model weight precision from 16-bit floats to fewer bits, shrinking file size and VRAM requirements. Lower bits mean smaller, faster files, with only minor quality loss until you drop below Q4. A back-of-the-envelope size formula follows the table below.

Format    | Bits/weight | Size (7B model) | Quality loss | Best for                            | VRAM needed (7B)
F16       | 16          | ~14 GB          | None         | Research, fine-tuning, max accuracy | 16+ GB
Q8_0      | 8           | ~7.7 GB         | Negligible   | Best quality at half size           | 8+ GB
Q5_K_M    | 5           | ~5.1 GB         | Very low     | Good balance, still high quality    | 6+ GB
Q4_K_M ★  | 4           | ~4.1 GB         | Low          | Sweet spot — recommended default    | 4.5+ GB
Q3_K_M    | 3           | ~3.1 GB         | Moderate     | Very constrained hardware only      | 3.5+ GB
Q2_K      | 2           | ~2.2 GB         | High         | Avoid unless desperate for space    | 2.5+ GB
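As a sanity check on the table above, a GGUF file is roughly parameter count × bits per weight ÷ 8, plus a little overhead for embeddings, block scales, and metadata. The sketch below works that arithmetic for a 7B model; the effective bits-per-weight values and the 10% overhead factor are assumptions for illustration, not official specs.

# Back-of-the-envelope GGUF size estimate. The effective bits/weight and the
# 1.10 overhead factor are assumptions for illustration, not official numbers.
def approx_size_gb(params_billions: float, bits_per_weight: float,
                   overhead: float = 1.10) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1024**3

# Effective bits/weight (block scales included) for common GGUF formats.
for fmt, bits in [("F16", 16.0), ("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"{fmt:7s} ~{approx_size_gb(7, bits):.1f} GB for a 7B model")

Running this lands within a few percent of the table, which is about the accuracy to expect from this kind of estimate.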
Which quantization should I use?
Entry tier (4–8 GB VRAM)

Use Q4_K_M. It's the default in Ollama for good reason — best quality-to-size ratio for constrained hardware.

Mid tier (8–12 GB VRAM)

Use Q5_K_M or Q8_0. You have headroom for better quality. Q8 is noticeably better than Q4 for complex reasoning.

High tier (16+ GB VRAM)

Use Q8_0 or F16 for 7B–13B models. Use Q4_K_M for 70B-class models to keep their footprint manageable; even at Q4 they spill into system RAM unless you have roughly 48 GB of VRAM.

CPU-only (no GPU)

Use Q4_K_M. Speed matters more than quality when running on CPU. Stick to ≤ 3B models for usable speed.


06 Runtimes

Ollama
Recommended

The easiest way to run local models. One command to install, one command to pull any model. Exposes a REST API compatible with OpenAI's SDK.

OS: Mac, Win, Linux
GPU: CUDA, Metal, ROCm
API: REST (OpenAI compat)
Ease: ★★★★★
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama3
LM Studio
GUI App

Desktop GUI for discovering, downloading and running models. Best choice for non-technical users. Includes a built-in chat interface.

OS: Mac, Win, Linux
GPU: CUDA, Metal, Vulkan
API: OpenAI-compatible
Ease: ★★★★★
Download from lmstudio.ai
No CLI required
llama.cpp
Power Users

Pure C++ inference engine. Maximum performance and flexibility. Run as CLI or as a local HTTP server. Most quantization formats supported natively.

OS: Mac, Win, Linux
GPU: CUDA, Metal, ROCm
API: CLI / HTTP server
Ease: ★★★☆☆
brew install llama.cpp
llama-server -m model.gguf
Jan
Privacy-first

Open source ChatGPT alternative. Fully offline, no telemetry. Supports multiple model backends. Good for users who want a polished UI without any cloud connection.

OS: Mac, Win, Linux
GPU: CUDA, Metal
API: REST API
Ease: ★★★★☆
Download from jan.ai
100% offline, zero telemetry

07 FAQ

Do I need a powerful GPU to run local AI models?
No. Many small models (1B–3B parameters) run entirely on CPU. A laptop with 8 GB RAM can run Phi-3 Mini or Llama 3.2 3B at 2–6 tokens per second — slow but functional. A GPU dramatically increases speed but is not required to start.
What is the difference between Q4 and Q8 quantization?
Q4 uses 4 bits per model weight; Q8 uses 8 bits. Q8 is roughly 2× the file size of Q4 but has almost no quality loss compared to the original 16-bit weights. Q4 has a small but noticeable quality reduction on complex tasks. For most everyday use, Q4_K_M is the sweet spot — Q8 is worth it if you have the VRAM.
How is "tokens per second" useful as a benchmark?
One token ≈ 0.75 words. Average human reading speed is ~250 words per minute ≈ 5 tokens/sec. So: below 5 t/s feels slow (you're waiting), 10–20 t/s feels comfortable (you can read it as it streams), 40+ t/s feels instant. Tokens/sec depends on your specific hardware and the quantization level used.
Apple Silicon (M1/M2/M3/M4) — is it good for local AI?
Excellent. Apple Silicon uses unified memory shared between CPU and GPU, so a MacBook Pro M2 with 32 GB RAM can devote most of that 32 GB to model inference as "VRAM". Ollama and llama.cpp both support Metal acceleration natively. An M2 Max with 64 GB of unified memory can comfortably run 70B models at Q4; 32 GB machines should stick to roughly 30B-class models and below. M4 chips are fully supported — select your exact chip in the GPU picker above for an accurate tok/s estimate.
How do I actually install and run a model, step by step?
The easiest path: 1. Install Ollama (free, Mac/Win/Linux). 2. Open a terminal and run ollama run llama3.2 — it downloads and starts a chat in one command. 3. For a graphical interface, install LM Studio instead — browse and download models with a UI, no terminal needed. That's it. No Python, no configuration files.
What happens if my GPU doesn't have enough VRAM?
Ollama and llama.cpp handle this automatically in one of two ways: (a) CPU offload — the model loads entirely into system RAM and runs on CPU. This works if you have enough total RAM, but speed drops to 1–5 t/s. (b) Partial GPU offload — layers that fit in VRAM run on the GPU, the rest on CPU. You get some speedup, proportional to how many layers fit. A model with an F grade on your hardware will refuse to load because even system RAM is insufficient.
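For intuition on partial offload, here is a rough sketch of the kind of split a runtime performs. It assumes equal-sized layers and a fixed VRAM reserve; real runtimes (Ollama's automatic placement, llama.cpp's -ngl flag) do finer-grained accounting.

# Illustrative only: estimate how many transformer layers fit in free VRAM,
# assuming all layers are the same size. Real runtimes account per-tensor.
def split_layers(model_size_gb: float, n_layers: int, free_vram_gb: float,
                 reserve_gb: float = 1.0) -> tuple[int, int]:
    """Return (layers_on_gpu, layers_on_cpu)."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)   # leave room for KV cache etc.
    on_gpu = min(n_layers, int(usable_gb / per_layer_gb))
    return on_gpu, n_layers - on_gpu

# A ~4.1 GB Q4 7B model (32 layers) with 3 GB of free VRAM:
# roughly 15 layers run on the GPU, the remaining 17 on the CPU.
print(split_layers(4.1, 32, 3.0))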
What is a context window and why does size matter?
The context window is the maximum amount of text a model can "see" at once — your messages plus the model's replies. One token ≈ ¾ of a word. Quick reference: 4K tokens ≈ a short conversation (3–4 pages). 32K ≈ a short story or a long document. 128K ≈ a full novel or a large codebase. If your conversation exceeds the context limit, the model forgets the earliest messages. Larger context also costs VRAM: the KV cache grows linearly with the window, so a 128K window can need many gigabytes more than a 4K window with the same model (a rough sketch follows).
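To put numbers on that, here is a rough KV-cache estimate. The architecture figures (32 layers, 8 KV heads, head dimension 128, fp16 cache) are assumptions roughly matching a Llama-3-8B-class model; other models and quantized KV caches will differ.

# Rough KV-cache size: 2 tensors (K and V) per layer, each sized
# context_len × n_kv_heads × head_dim, stored at bytes_per_value precision.
def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    total_bytes = 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_value
    return total_bytes / 1024**3

print(f"{kv_cache_gb(4_096):.1f} GB at a 4K context")      # ~0.5 GB
print(f"{kv_cache_gb(32_768):.1f} GB at a 32K context")    # ~4 GB
print(f"{kv_cache_gb(131_072):.1f} GB at a 128K context")  # ~16 GB

The cache scales with the configured window independently of the model weights, which is why long-context settings can dominate memory use.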
What is the difference between a base model and an instruct/chat model?
A base model is trained on raw text to predict the next token — it will complete your input like autocomplete, not answer questions. An instruct or chat model starts from those same base weights and is further fine-tuned (typically on instruction data, with RLHF or DPO) to follow instructions and hold a conversation. Almost all models you'd want to actually use are instruct models. On Hugging Face, look for tags like -Instruct, -Chat, or -IT in the model name. On Ollama, all models served by default are already the instruct variant.
What does "parameter count" actually mean — why does 7B feel so different from 70B?
Parameters are the numerical weights that store everything the model learned during training. More parameters = more capacity to memorize facts, reason across steps, and handle nuance. The jump from 7B to 70B is not just 10× size — 70B models tend to have qualitatively better reasoning, less hallucination, and far stronger instruction-following. In practice: 7B models are great for quick Q&A, summarization, and simple coding. 70B models start to rival GPT-4-class performance on multi-step reasoning and complex tasks. The difference is felt more on hard prompts than easy ones.
What is a Mixture of Experts (MoE) model — and why is Mixtral "47B" but only uses 12B at a time?
MoE models contain many "expert" sub-networks but activate only a few per token. Mixtral 8×7B has 8 expert networks of ~7B each, but a routing layer selects only 2 per forward pass — so inference cost matches a ~12B dense model, while the total stored weights are ~47B. The result: the quality of a much larger model at a fraction of the compute cost. The catch: you still need enough VRAM/RAM to load all the weights, even though only some are active at once. That's why Mixtral 8×7B still needs ~28 GB at Q4, despite inferring at "12B speed."
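A quick back-of-the-envelope of that stored-vs-active split; the per-component parameter counts below are rough approximations for a Mixtral-8x7B-like model, not exact specs.

# Illustrative MoE arithmetic: weights you must store vs. weights used per token.
# Numbers are rough approximations, not Mixtral's exact configuration.
shared_b  = 1.5        # attention, embeddings, router: always active (approx.)
expert_b  = 5.6        # feed-forward parameters per expert (approx.)
n_experts = 8          # experts stored in memory
top_k     = 2          # experts the router activates per token

stored_b = shared_b + n_experts * expert_b   # ~46B: what you must fit in RAM/VRAM
active_b = shared_b + top_k * expert_b       # ~13B: what each token actually uses
print(f"stored ~{stored_b:.0f}B, active per token ~{active_b:.0f}B")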
Can I fine-tune a model on my own data at home?
Yes, for small-to-mid models. Full fine-tuning of a 7B model needs far more VRAM than a consumer GPU offers, but LoRA / QLoRA (parameter-efficient fine-tuning) can fine-tune a 7B model on a single 8 GB GPU. Tools like LLaMA-Factory and Unsloth make this straightforward. For 70B+ models you realistically need a multi-GPU server or cloud. Fine-tuning is best when you have a specific domain or format you need the model to learn — for general instruction-following, prompting a good instruct model is usually sufficient.
Does AMD GPU work for local AI, or is NVIDIA required?
AMD works, but with caveats. Ollama and llama.cpp support AMD via ROCm (Linux) and Vulkan/DirectML (Windows). Linux + ROCm on RX 6800 XT and newer gives near-NVIDIA performance. Windows support is more experimental — Vulkan works but is slower than ROCm. The RX 7900 XTX (24 GB VRAM) is a popular choice for local AI on a budget compared to an RTX 4090. Intel Arc GPUs are supported via SYCL/oneAPI on Linux, though driver maturity lags behind both.
Can I split a model across two GPUs to double my VRAM?
Yes. llama.cpp and Ollama both support splitting a model's layers across multiple GPUs. Two RTX 3090s (24 GB each) give you an effective ~48 GB VRAM pool, enough for a 70B model at Q4. Don't expect double the generation speed, though: with layer splitting the main benefit is the larger memory pool, and each token still passes through the layers in sequence. NVLink reduces the inter-GPU transfer overhead; over PCIe the cards still work together, but the link between them adds some latency. The setup is plug-and-play with Ollama — it auto-detects multiple GPUs and splits layers automatically.
What is the difference between Ollama, LM Studio, llama.cpp, and Jan?
llama.cpp — the low-level inference engine everything else builds on. Command-line, very fast, maximum control. Ollama — wraps llama.cpp with a one-command CLI and a local REST API; no UI, great for developers and scripting. LM Studio — polished desktop GUI for browsing, downloading, and chatting with models; best for non-developers. Jan — open-source desktop app similar to LM Studio with a focus on privacy and extensibility. Open WebUI — browser-based chat UI that connects to an Ollama backend; closest to a self-hosted ChatGPT experience. All of them ultimately run the same GGUF model files (Open WebUI delegates the actual inference to its Ollama backend).
Can local models handle images and vision tasks?
Yes — multimodal (vision-language) models are fully supported locally. LLaVA, Llama 3.2 Vision, and Gemma 3 can describe images, answer questions about screenshots, read charts, and OCR text. Ollama supports vision models natively — just ollama run llama3.2-vision and pass an image file. The 7B vision models handle everyday tasks well; 34B+ models handle complex diagrams and document understanding. Note: vision models require slightly more VRAM than their text-only equivalents due to the image encoder.
Can I call a local model from my own app or script via API?
Yes, and it's straightforward. Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1. This means any code that works with the OpenAI SDK works with a local model — just change the base_url and set a dummy API key. LM Studio also exposes a compatible server on port 1234. You can switch between local and cloud models by changing one environment variable — no code changes needed. This makes local models ideal for development, cost reduction, and offline-capable applications.
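A minimal sketch of that workflow, assuming Ollama is running locally and a model named "llama3" has already been pulled; pointing the same client at a cloud provider only requires changing base_url and the key.

# Call a local Ollama model through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain GGUF quantization in one sentence."}],
)
print(response.choices[0].message.content)

The same client pointed at LM Studio's server on port 1234 works identically.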
What are the best local models for RAG (document question-answering)?
RAG (Retrieval-Augmented Generation) works well with any chat model, but some properties matter more than others: a large context window (to fit retrieved chunks), strong instruction-following, and low hallucination rate. Top choices: Mistral Nemo 12B (128K context, balanced), Llama 3.1 8B (128K context, fast), Qwen2.5-7B (128K context, multilingual), Gemma 3 12B (128K). For embeddings, use a dedicated embedding model like nomic-embed-text or mxbai-embed-large via Ollama — don't use the chat model for embeddings.
Is there any risk of data leaving my device, even with a local model?
With a correctly configured local model, none of your prompts or responses leave your device during inference. The only network activity is the initial one-time model download from Hugging Face or Ollama's registry. However, watch out for: (1) chat frontends that may have optional telemetry — check settings. (2) If you're using a model via a remote Ollama server (not localhost), traffic travels over your network. (3) Ollama itself does not send prompts anywhere, but it does check for new releases on startup — if you want zero outbound traffic, block it with a firewall rule.
How are the tok/s estimates on this page calculated?
When you select a GPU, the page uses a physics-based formula: tok/s = memory bandwidth (GB/s) ÷ model size on disk (GB) × efficiency factor. During autoregressive decoding, the bottleneck is how fast the GPU can stream model weights from VRAM — not compute. An RTX 4090 (1008 GB/s) running a 4-bit 7B model (~4 GB) gives roughly 1008 ÷ 4 × 0.70 ≈ 176 t/s. Efficiency is set to 0.70 for discrete GPUs and 0.65 for Apple unified memory to account for real-world overhead (KV cache, scheduling, quantization overhead). These are estimates — actual speeds vary by framework, quantization type, batch size, and context length.
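Here is the same estimate written out as a small function; the 0.70 and 0.65 efficiency factors are the page's stated assumptions, and the hardware numbers in the example are the ones quoted above.

# Bandwidth-bound decoding estimate: tok/s ≈ memory bandwidth / model size × efficiency.
def estimate_tok_per_s(bandwidth_gb_s: float, model_size_gb: float,
                       unified_memory: bool = False) -> float:
    efficiency = 0.65 if unified_memory else 0.70   # page's assumed overhead factors
    return bandwidth_gb_s / model_size_gb * efficiency

# RTX 4090 (1008 GB/s) streaming a ~4 GB Q4 7B model: ≈ 176 t/s
print(round(estimate_tok_per_s(1008, 4.0)))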
How do the S / A / B / C / F compatibility grades work?
Grades compare a model's memory requirements against your hardware specs. S — fits in VRAM with headroom; expect full GPU-accelerated speed. A — fits in VRAM but close to the limit; may slow down at long contexts. B — minimum VRAM met but below recommended; GPU inference works but at reduced speed, especially at long contexts. C — fits in RAM but barely; expect slow speeds and potential OOM on complex prompts. F — insufficient RAM to load the model at all; will crash or refuse to load. The recommended minimum RAM shown per model is the Q4_K_M quantization size plus ~2 GB of OS/runtime overhead.
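The grading described above roughly reduces to a threshold cascade like the sketch below; the exact headroom margin the page applies is not published here, so treat the 1.2 factor as an illustrative assumption.

# Illustrative approximation of the S/A/B/C/F cascade described above.
def grade(rec_vram_gb: float, min_vram_gb: float, min_ram_gb: float,
          your_vram_gb: float, your_ram_gb: float) -> str:
    if your_vram_gb >= rec_vram_gb * 1.2:   # fits in VRAM with headroom
        return "S"
    if your_vram_gb >= rec_vram_gb:         # fits, but close to the limit
        return "A"
    if your_vram_gb >= min_vram_gb:         # minimum VRAM met, below recommended
        return "B"
    if your_ram_gb >= min_ram_gb:           # only fits in system RAM: slow
        return "C"
    return "F"                              # not enough memory to load at all

# Example: an 8 GB VRAM / 16 GB RAM machine vs. a 7B model at Q4_K_M
print(grade(rec_vram_gb=6.0, min_vram_gb=4.5, min_ram_gb=8.0,
            your_vram_gb=8.0, your_ram_gb=16.0))   # -> "S"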

Specs shown are sourced from the Hugging Face Hub, Ollama registry, official vendor datasheets, and architecture math — they may not always reflect the latest changes. VRAM figures marked ⚠ ESTIMATED are approximations and may vary from real-world usage. Company and model names are used for reference only; this site is not affiliated with or endorsed by any of them.