exo: Run Frontier AI Models Across All Your Devices Locally
exo turns every device you own into a single AI cluster. Your MacBook, your Mac Studio, your Linux server, whatever you have. They discover each other automatically, split large models across their combined memory, and run inference as if they were one machine.
The headline result: four M3 Ultra Mac Studios with 512GB each can run DeepSeek v3.1 671B at 8-bit. Or Kimi K2 Thinking at native 4-bit. Or Qwen3-235B. Models that would not fit on any single device become runnable with hardware you probably already own.
How It Works
exo uses MLX as its inference backend and MLX distributed for communication between devices. Devices discover each other on the local network with zero config. Just run exo on each machine and they find each other.
The key pieces:
Topology-aware auto parallel. exo figures out the best way to split any model across your available devices. It accounts for each device’s resources and the network latency between them. You do not configure anything.
Tensor parallelism. Sharding model layers across devices gives measurable speedups. exo reports 1.8x on 2 devices and 3.2x on 4 devices.
RDMA over Thunderbolt 5. This is the differentiator. Remote Direct Memory Access over Thunderbolt 5 cuts latency between devices by 99%. That makes distributed inference actually fast instead of bottlenecked by network overhead. It requires macOS 26.2 and a brief trip to Recovery Mode to enable.
What You Can Run
The Dashboard view shows your cluster and loaded models. Jeff Geerling benchmarked a 4-node M3 Ultra setup running DeepSeek v3.1 671B and confirmed it works at usable speeds. The project’s benchmarks show solid throughput for 235B and 671B parameter models across multiple devices.
Smaller models run on a single MacBook without breaking a sweat. The cluster approach matters most for the models that are too big for any one machine.
API Compatibility
exo exposes OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama APIs. Whatever tool you use, it probably works against the local endpoint. That includes OpenWebUI, existing CLI scripts, and anything built for the OpenAI client libraries.
Installation
On macOS you can either run from source or use the native macOS app (requires macOS Tahoe 26.2). The source path needs Xcode, Homebrew, uv, Node, Rust nightly, and a pinned fork of macmon for Apple Silicon monitoring. There is also a Nix flake that handles most of this automatically.
On Linux, same source approach minus the macOS-specific monitoring tools. Note that GPU support on Linux is still in development. Currently runs on CPU.
The macOS app installs via DMG and runs in the menu bar. It handles cluster joining automatically and includes an uninstaller through the app menu.
Why This Matters
Running frontier models locally means no API costs, no data leaving your network, and no rate limits. The practical trade-off is that you need enough aggregate hardware. But if you already own multiple Apple Silicon devices, exo may let you run models you would otherwise have to rent access to.
The RDMA over Thunderbolt piece is genuinely new. Apple added RDMA support in macOS 26.2, and exo shipped day-0 support. That kind of timing is rare for an open-source project tracking OS-level features.
Links
- GitHub: Source code, issues, discussions
- Discord: Community and support
- exo labs on X: Project updates
- Jeff Geerling benchmark: 15 TB VRAM on Mac Studio cluster
Crepi il lupo! 🐺