Core Tools and Ecosystem
The local AI ecosystem has matured quickly. A year ago, getting a model running locally required significant manual setup. Today, a handful of tools handle the hard parts well. Here's what's worth knowing.
Hugging Face
Hugging Face is the central hub of the open-weight AI ecosystem. Think of it as GitHub for models: it hosts tens of thousands of model weights, datasets, and model cards that document what each model does, how it was trained, and what it's suited for.
For local use, you'll primarily interact with Hugging Face in two ways. First, as a source for downloading model weights — most of the models you'll run locally (Llama, Mistral, Qwen, Gemma, etc.) are hosted here. Second, through the transformers library, which is the standard Python interface for loading and running models programmatically.
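Programmatic use through transformers can be sketched in a few lines. This is a minimal example, assuming a recent transformers version (and torch) installed; the model id here (Qwen/Qwen2.5-0.5B-Instruct) is just a small instruct model picked for illustration — substitute whatever you've chosen to run:

```python
def build_messages(system: str, user: str) -> list[dict]:
    """Compose a chat in the messages format most instruct models expect."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

def generate(user_prompt: str, model_id: str = "Qwen/Qwen2.5-0.5B-Instruct") -> str:
    # Requires `pip install transformers torch`; downloads weights on first run.
    from transformers import pipeline
    pipe = pipeline("text-generation", model=model_id)
    messages = build_messages("You are a helpful assistant.", user_prompt)
    out = pipe(messages, max_new_tokens=64)
    # The pipeline returns the full conversation; the last message is the reply.
    return out[0]["generated_text"][-1]["content"]
```

The first call is slow because weights are fetched and cached; subsequent runs load from the local Hugging Face cache.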
You don't need a Hugging Face account to download most models, but creating one is worth doing — some models require accepting a license agreement before download, and an account lets you use the huggingface_hub CLI for faster, resumable downloads.
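If you'd rather script downloads than use the CLI, the huggingface_hub library exposes the same machinery. A sketch, assuming huggingface_hub is installed — the `resolve` URL pattern is how the Hub serves individual files, and snapshot_download fetches a whole repo with resumable transfers:

```python
def hub_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Direct-download URL pattern the Hub serves repo files from."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

def download_repo(repo_id: str) -> str:
    # Requires `pip install huggingface_hub`; partial downloads resume automatically.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id)  # returns the local cache directory
```

For gated models (those requiring a license click-through), you'll also need to be logged in, e.g. via `huggingface-cli login`.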
Ollama
Ollama is the fastest way to get a model running locally. It's a single binary that handles model downloads and quantization formats and exposes a local REST API compatible with the OpenAI API spec. Getting started is one command:
```shell
ollama run llama3.2
```
That's genuinely all it takes to pull and run a model. Ollama handles quantization selection automatically, exposes a /v1/chat/completions endpoint locally, and manages model storage sensibly.
Where Ollama shines is simplicity and developer ergonomics. It's ideal for integrating local models into applications — anything that speaks the OpenAI API format works with Ollama with a single endpoint change. The trade-off is that it abstracts away some control; for fine-grained inference configuration, you'll eventually want something lower-level.
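To make the API compatibility concrete, here is a minimal sketch that talks to Ollama's chat endpoint using only the standard library. It assumes Ollama is running on its default port (11434); any OpenAI-style client library would work the same way with the base URL pointed at localhost:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port

def build_request(model: str, messages: list) -> tuple[str, bytes]:
    """Build the URL and JSON body for an OpenAI-style chat completion call."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return OLLAMA_URL, body

def chat(model: str, prompt: str) -> str:
    url, body = build_request(model, [{"role": "user", "content": prompt}])
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("llama3.2", "Why is the sky blue?"))
```

Because the request and response shapes match the OpenAI spec, swapping a cloud backend for a local one really is just a base-URL change.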
LM Studio
LM Studio is a GUI application for running models locally, built on top of llama.cpp. It's particularly useful if you want a conversational interface without writing any code, or if you want to evaluate multiple models quickly without CLI setup.
Features worth knowing about: it has a built-in model browser connected to Hugging Face, a local server mode that exposes an OpenAI-compatible API, and a reasonable interface for adjusting inference parameters (temperature, context length, system prompts) without editing config files.
LM Studio is a good choice for experimentation and for sharing a local AI setup with colleagues who aren't comfortable with the command line. For production-style local deployments or automated pipelines, you'll likely move to Ollama or llama.cpp directly.
llama.cpp
llama.cpp is the foundation much of this ecosystem is built on. It's a C++ inference engine for GGUF-format models, optimised for CPU and GPU inference across platforms. Ollama and LM Studio both use it under the hood.
You may not interact with llama.cpp directly unless you need fine-grained control — custom batching, specific quantization formats, or embedding generation for retrieval workflows. But understanding that it exists, and that the GGUF format is its native model format, helps make sense of why tools like Ollama work the way they do.
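One small, practical consequence of GGUF being the native format: GGUF files announce themselves with a four-byte magic number ("GGUF") followed by a version field, so you can sanity-check a downloaded file without loading it. A stdlib sketch (the header layout here — magic, then a little-endian uint32 version — follows the GGUF spec as I understand it):

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file

def is_gguf(path: str) -> bool:
    """Check the magic number without loading the (potentially huge) file."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC

def gguf_version(path: str) -> int:
    """Read the format version stored right after the magic bytes."""
    with open(path, "rb") as f:
        f.read(4)  # skip magic
        return struct.unpack("<I", f.read(4))[0]
```

For actually running GGUF models from Python without Ollama in the middle, the llama-cpp-python bindings wrap llama.cpp directly and give you the fine-grained control mentioned above.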
Putting them together
A typical local AI stack looks like this: models sourced from Hugging Face, served via Ollama for API-compatible use, with LM Studio available for quick experimentation. The next posts cover what you can build on top of this foundation.