Local Models
Run models locally without API keys using Ollama, llama.cpp, vLLM, or HuggingFace Transformers.
Why Local Models?
- Privacy: Data never leaves your machine
- Cost: No per-token charges
- Offline: Works without internet
- Customization: Fine-tuned or custom models
Ollama
The easiest way to run local models.
Setup
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start the server (skip if Ollama is already running as a background service)
ollama serve
# Pull a model
ollama pull llama3
ollama pull mistral
ollama pull codellama
Config
models:
  - type: ollama
    args:
      model_name: llama3
      base_url: http://localhost:11434
Python
from insideLLMs.models import OllamaModel
model = OllamaModel(
    model_name="llama3",
    base_url="http://localhost:11434"
)
response = model.generate("Hello!")
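Because every Ollama model is served from the same local endpoint, you can sweep several pulled models with a short loop. A minimal sketch using only the OllamaModel API shown above (it assumes the listed models have already been pulled):

```python
from insideLLMs.models import OllamaModel

# Compare responses from several locally pulled Ollama models.
for name in ["llama3", "mistral", "codellama"]:
    model = OllamaModel(model_name=name, base_url="http://localhost:11434")
    print(f"--- {name} ---")
    print(model.generate("Explain recursion in one sentence."))
```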
Available Models
| Model | Size | Use Case |
|---|---|---|
| llama3 | 8B | General purpose |
| llama3:70b | 70B | High quality |
| mistral | 7B | Fast, good quality |
| codellama | 7B-34B | Code generation |
| gemma | 2B-7B | Lightweight |
List downloaded models with `ollama list`; browse the full model library at https://ollama.com/library
GPU Acceleration
Ollama automatically uses the GPU if one is available:
# Show timing stats; a high eval rate usually means the GPU is being used
ollama run llama3 --verbose
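To confirm whether a loaded model is actually resident in VRAM, you can also query Ollama's /api/ps endpoint. A hedged sketch: the endpoint and its size_vram field exist in recent Ollama releases but may differ in older ones.

```python
import requests

# Ask the local Ollama server which models are loaded and where they live.
resp = requests.get("http://localhost:11434/api/ps", timeout=5)
for m in resp.json().get("models", []):
    on_gpu = m.get("size_vram", 0) > 0
    print(f"{m['name']}: {'GPU' if on_gpu else 'CPU'}")
```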
llama.cpp
CPU-optimised inference with GGUF models.
Setup
pip install llama-cpp-python
# For GPU support (optional); newer llama-cpp-python releases use -DGGML_CUDA=on
# in place of the older -DLLAMA_CUBLAS=on flag
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
Download Models
Get GGUF models from HuggingFace:
- TheBloke’s models
- Look for files ending in `.gguf`
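If you prefer to script the download, huggingface_hub can fetch a single GGUF file. A sketch; the repo and filename below are illustrative, substitute the model you actually want:

```python
from huggingface_hub import hf_hub_download

# Download one quantized GGUF file (repo_id/filename are examples only).
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
)
print(path)  # local cache path; pass this as model_path below
```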
Config
models:
  - type: llamacpp
    args:
      model_path: /path/to/model.gguf
      n_ctx: 2048
      n_gpu_layers: 0  # 0 for CPU, -1 for all GPU
Python
from insideLLMs.models import LlamaCppModel
model = LlamaCppModel(
    model_path="/path/to/llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048
)
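To sanity-check a GGUF file outside insideLLMs, you can load it with llama-cpp-python directly. A minimal sketch using the library's Llama class; the model path is a placeholder:

```python
from llama_cpp import Llama

# Load the GGUF file on CPU and run a short completion.
llm = Llama(model_path="/path/to/llama-2-7b.Q4_K_M.gguf", n_ctx=2048, n_gpu_layers=0)
out = llm("Q: What is 2 + 2? A:", max_tokens=16)
print(out["choices"][0]["text"])
```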
Quantization Levels
| Suffix | Bits | Size | Quality |
|---|---|---|---|
| Q2_K | 2 | Smallest | Lowest |
| Q4_K_M | 4 | Medium | Good |
| Q5_K_M | 5 | Larger | Better |
| Q8_0 | 8 | Large | Best |
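As a rough rule of thumb, a GGUF file takes about parameters × bits / 8 bytes plus some overhead, so you can estimate whether a given quantization fits before downloading. A back-of-the-envelope sketch, not an exact figure:

```python
def approx_gguf_size_gb(params_billions: float, bits: int) -> float:
    """Rough GGUF size: parameters * bits per weight, ignoring metadata overhead."""
    return params_billions * bits / 8

print(approx_gguf_size_gb(7, 4))   # ~3.5 GB for a 7B model at Q4
print(approx_gguf_size_gb(70, 4))  # ~35.0 GB for a 70B model at Q4
```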
vLLM
High-performance inference with PagedAttention.
Setup
pip install vllm
Requires a GPU with CUDA support.
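Before pointing a harness at vLLM, it is worth confirming that PyTorch can see a CUDA device; vLLM will fail at startup otherwise. A quick check:

```python
import torch

# vLLM needs a CUDA-capable GPU; verify before launching a long run.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```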
Config
models:
  - type: vllm
    args:
      model_name: meta-llama/Llama-2-7b-chat-hf
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.9
Python
from insideLLMs.models import VLLMModel
model = VLLMModel(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1
)
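For reference, the equivalent call against the vLLM library itself batches prompts in a single generate call. A sketch using vLLM's own LLM/SamplingParams API rather than the insideLLMs wrapper:

```python
from vllm import LLM, SamplingParams

# vLLM batches these prompts internally using PagedAttention.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=1)
params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["Hello!", "Summarise PagedAttention in one line."], params)
for out in outputs:
    print(out.outputs[0].text)
```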
Multi-GPU
models:
  - type: vllm
    args:
      model_name: meta-llama/Llama-2-70b-chat-hf
      tensor_parallel_size: 4  # Use 4 GPUs
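tensor_parallel_size should not exceed the number of visible GPUs. A quick way to check what is available:

```python
import torch

# tensor_parallel_size in the config must not exceed the visible GPU count.
n_gpus = torch.cuda.device_count()
print("Visible GPUs:", n_gpus)
for i in range(n_gpus):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```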
HuggingFace Transformers
Direct use of HuggingFace models.
Setup
pip install transformers torch accelerate
Config
models:
  - type: huggingface
    args:
      model_name: meta-llama/Llama-2-7b-chat-hf
      device: cuda  # or cpu, mps
      torch_dtype: float16
Python
from insideLLMs.models import HuggingFaceModel
model = HuggingFaceModel(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device="cuda"
)
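Under the hood this is the standard transformers loading path; if you need more control (custom generation arguments, local snapshots), the direct equivalent looks roughly like this. A sketch, not the insideLLMs internals:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo; requires HuggingFace access approval
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```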
Comparison
| Method | Setup | Speed | Memory | GPU Required |
|---|---|---|---|---|
| Ollama | Easy | Good | Medium | Optional |
| llama.cpp | Medium | Good | Low | Optional |
| vLLM | Complex | Best | High | Yes |
| HuggingFace | Medium | Medium | High | Recommended |
Memory Requirements
| Model Size | RAM (CPU) | VRAM (GPU) |
|---|---|---|
| 7B | 8GB | 6GB |
| 13B | 16GB | 10GB |
| 70B | 64GB | 40GB+ |
Quantized models use less memory (Q4 ≈ 50% of above).
Comparing Local to Hosted
Run the same probe against a local and a hosted model from a single harness config (saved below as comparison.yaml):
models:
  - type: ollama
    args:
      model_name: llama3
  - type: openai
    args:
      model_name: gpt-4o-mini
probes:
  - type: logic
dataset:
  format: jsonl
  path: data/test.jsonl
insidellms harness comparison.yaml
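The same comparison can also be sketched in Python. This assumes insideLLMs exposes an OpenAI-backed model class alongside OllamaModel; the OpenAIModel name and constructor below are assumptions, not confirmed API:

```python
from insideLLMs.models import OllamaModel, OpenAIModel  # OpenAIModel is an assumed class name

# Run the same prompt through a local and a hosted model side by side.
prompt = "If all bloops are razzies and all razzies are lazzies, are all bloops lazzies?"
local = OllamaModel(model_name="llama3", base_url="http://localhost:11434")
hosted = OpenAIModel(model_name="gpt-4o-mini")  # assumed constructor; API key from environment

for label, model in [("ollama/llama3", local), ("openai/gpt-4o-mini", hosted)]:
    print(f"--- {label} ---")
    print(model.generate(prompt))
```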
Troubleshooting
Ollama: “connection refused”
# Start the server
ollama serve
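You can verify the server is reachable before re-running a harness. A small check; Ollama's root endpoint normally returns a short status string when the server is up:

```python
import requests

# If this raises ConnectionError, the Ollama server is not running.
try:
    r = requests.get("http://localhost:11434", timeout=2)
    print(r.text)  # typically "Ollama is running"
except requests.ConnectionError:
    print("Ollama server is not reachable; start it with `ollama serve`.")
```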
llama.cpp: “model too large”
Use a smaller quantization, or offload layers to the GPU with n_gpu_layers (a positive number offloads that many layers; -1 offloads them all).
vLLM: “CUDA out of memory”
args:
  gpu_memory_utilization: 0.7  # Lower this
Slow performance
- Enable GPU if available
- Use quantized models
- Reduce context length (n_ctx)
See Also
- Models Catalog - All model configurations
- Model Comparison Tutorial - Compare models