A local LLM runtime for Apple Silicon. Rust, MLX backend, OpenAI-compatible HTTP API. One binary. Targeting LM Studio parity.
single-model decode (Qwen3-4B-Instruct-2507-4bit, 300 tokens, greedy)
base-mlx 32.75 tok/s
LM Studio ~35 tok/s
The gap is closing. The point of this project is to close it in the open.
Local inference on Apple Silicon currently splits into:
| tool | runtime | gap |
|---|---|---|
| Ollama | llama.cpp | Not MLX. Leaves perf on the table for native shapes. |
| LM Studio | mlx-lm wrapper | Closed source. Not embeddable. No API control. |
| mlx_lm.server | mlx-lm (Python) | Thin server. No enforced JSON schema. No concurrent reqs. |
| mistral.rs | candle | Not MLX. |
base-mlx is a Rust binary that drives MLX directly, exposes the OpenAI surface properly (streaming, tools, structured output), and aims to be the engine you can either run as a server or link into a Mac-native app.
Working today:
- 127.0.0.1:11435/v1
- chat/completions (one-shot + SSE streaming)
- models
- embeddings
- group_size=64)
- <tool_call>...</tool_call> markup → OpenAI tool_calls array)
- json_schema response format (soft-prompt nudge + post-validate; grammar-constrained sampling is on the roadmap)
- model field
- ~/.lmstudio/models/ if present (no duplicate downloads)
- mlx-sys binding (mlx_ext::slice_update)
- gen done elapsed_ms=… active_mb_*=… cache_mb_*=…
- ?spec=<draft_id> (greedy only currently). Correct, not yet a win; see base.md for the verify-cost analysis.

Not yet:

- json_schema is currently best-effort + validate)
- brew install

See base.md for a detailed status snapshot (what works, what doesn’t, why) and ROADMAP.md for the milestone plan.
Requires Rust (stable) and a Mac with Apple Silicon. MLX is linked through mlx-sys, which builds the C library on first compile (~3 min cold).
git clone https://github.com/flakerimi/base-mlx
cd base-mlx
cargo build --release -p base-mlx-cli
Output binary: target/release/base-mlx.
RUST_LOG=info ./target/release/base-mlx serve
# base-mlx serving addr=127.0.0.1:11435
Port 11435 is chosen so the server can coexist with Ollama on 11434.
The first chat request loads the model on demand. To fetch it ahead of time:
./target/release/base-mlx pull mlx-community/Qwen3-4B-Instruct-2507-4bit
If LM Studio is already on the machine, base-mlx will find its existing copies under ~/.lmstudio/models/ and skip the download.
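The lookup-before-download behaviour described above can be pictured with a short Python sketch. This is not the project's `pull.rs`; the function name `find_local_copy` is hypothetical, and the `<root>/<publisher>/<model>` directory layout is an assumption about how the LM Studio model directory is organised.

```python
import tempfile
from pathlib import Path


def find_local_copy(repo_id, roots):
    """Return the first existing directory for repo_id under any root, else None.

    Assumes (not confirmed by this README) that models are stored as
    <root>/<publisher>/<model>, e.g. ~/.lmstudio/models/mlx-community/<model>.
    """
    for root in roots:
        candidate = Path(root).expanduser() / repo_id
        if candidate.is_dir():
            return candidate
    return None


# Demo against a temporary directory standing in for ~/.lmstudio/models/.
with tempfile.TemporaryDirectory() as root:
    repo = "mlx-community/Qwen3-4B-Instruct-2507-4bit"
    before = find_local_copy(repo, [root])   # nothing on disk yet
    (Path(root) / repo).mkdir(parents=True)  # simulate an existing copy
    after = find_local_copy(repo, [root])    # now resolves locally
```

A caller would only start a download when the lookup returns `None`.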
curl -s http://localhost:11435/v1/chat/completions \
-H 'content-type: application/json' \
-d '{
"model": "qwen3-4b-instruct",
"temperature": 0,
"max_tokens": 300,
"messages": [{"role": "user", "content": "Explain MLX in two sentences."}]
}'
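The same one-shot request can be issued from Python with only the standard library. This is a minimal client sketch, assuming the server is running on the default address and returns the standard OpenAI response shape (`choices[0].message.content`); the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # default address used in this README


def build_chat_request(prompt, model="qwen3-4b-instruct", max_tokens=300, temperature=0):
    """Build (but do not send) a POST request for /chat/completions."""
    body = {
        "model": model,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        reply = json.load(resp)
    # Assumed OpenAI-style response shape.
    return reply["choices"][0]["message"]["content"]
```

With the server up, `chat("Explain MLX in two sentences.")` would correspond to the curl call above.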
curl -sN http://localhost:11435/v1/chat/completions \
-H 'content-type: application/json' \
-d '{
"model": "qwen3-4b-instruct",
"stream": true,
"max_tokens": 200,
"messages": [{"role": "user", "content": "Write a haiku about latency."}]
}'
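A client has to reassemble the streamed text from server-sent events. The sketch below assumes the stream follows the OpenAI chunk format (`data: {json}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`); this README does not show the wire format, and the sample lines are hypothetical rather than captured output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hypothetical sample of what `curl -sN` might print.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Tokens "}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"arrive"}}]}',
    "",
    "data: [DONE]",
]
streamed = "".join(iter_stream_text(sample))
```

In a real client, `lines` would be the decoded lines of the HTTP response body, read as they arrive.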
curl -s http://localhost:11435/v1/chat/completions \
-H 'content-type: application/json' \
-d '{
"model": "qwen3-4b-instruct",
"temperature": 0,
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city.",
"parameters": {
"type": "object",
"properties": { "city": {"type": "string"} },
"required": ["city"]
}
}
}]
}'
The response comes back with finish_reason: "tool_calls" and a populated tool_calls array.
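To make the markup-to-array conversion concrete, here is a rough Python sketch of turning `<tool_call>...</tool_call>` blocks into an OpenAI-style `tool_calls` list. It is not the server's Rust implementation. It assumes the model emits one JSON object with `name` and `arguments` per block, and the `call_<n>` id scheme is hypothetical.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def to_openai_tool_calls(model_text):
    """Convert <tool_call> markup in raw model text to an OpenAI tool_calls list."""
    calls = []
    for i, match in enumerate(TOOL_CALL_RE.finditer(model_text)):
        call = json.loads(match.group(1))  # json.loads tolerates surrounding whitespace
        calls.append({
            "id": f"call_{i}",  # hypothetical id scheme
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI clients expect arguments as a JSON-encoded string.
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
    return calls


# Hypothetical raw model output for the weather request.
raw = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n</tool_call>'
tool_calls = to_openai_tool_calls(raw)
```

On the client side, each entry's `function.arguments` is decoded with `json.loads` before the named function is invoked.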
curl -s http://localhost:11435/v1/chat/completions \
-H 'content-type: application/json' \
-d '{
"model": "qwen3-4b-instruct",
"temperature": 0,
"messages": [{"role": "user", "content": "List three primes."}],
"response_format": {
"type": "json_schema",
"json_schema": {
"schema": {
"type": "object",
"properties": {
"primes": {"type": "array", "items": {"type": "integer"}}
},
"required": ["primes"]
}
}
}
}'
Note: response_format is currently advisory (system-prompt nudge + JSON validation pass). True grammar-constrained sampling is on the roadmap.
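Because the schema is not enforced during sampling, a client may want its own check on the returned text. The following is a minimal sketch of a parse-then-validate pass in Python; it covers only the `type`, `properties`, `required`, and `items` keywords, is not a full JSON Schema validator, and is not the server's validation code.

```python
import json

SCHEMA = {
    "type": "object",
    "properties": {"primes": {"type": "array", "items": {"type": "integer"}}},
    "required": ["primes"],
}

TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def validate(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value conforms."""
    expected = schema.get("type")
    if expected in TYPE_CHECKS and not TYPE_CHECKS[expected](value):
        return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required property")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            errors += validate(item, schema["items"], f"{path}[{i}]")
    return errors


def parse_and_validate(text, schema):
    """Parse model output as JSON, then check it against the schema subset."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    return value, validate(value, schema)
```

A caller can retry the request or surface the error list when validation fails.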
Speculative decoding is opt-in per request and greedy only (temperature must be 0). The server keeps the target and draft models loaded at the same time.
curl -s 'http://localhost:11435/v1/chat/completions?spec=mlx-community/Qwen3-0.6B-4bit' \
-H 'content-type: application/json' \
-d '{
"model": "qwen3-4b-instruct",
"temperature": 0,
"max_tokens": 300,
"messages": [{"role": "user", "content": "Write a 200-word story."}]
}'
Per-iteration timing lands in the server log (spec-dec done draft_ms_mean=… verify_ms_mean=…). See base.md for current numbers and for what’s blocking speculative decoding from being a win.
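For readers unfamiliar with the technique, here is a toy Python sketch of a greedy draft-then-verify loop. It is an illustration under stated assumptions, not the project's `speculative.rs`: the two "models" are hypothetical deterministic functions, the target is called once per position rather than in one batched forward pass, and the usual bonus token after a fully accepted draft is omitted.

```python
def greedy_decode(next_token, prompt, n_new):
    """Plain greedy decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding: draft proposes k tokens, target verifies them."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    stats = {"iterations": 0, "drafted": 0, "accepted": 0}
    while len(tokens) < goal:
        stats["iterations"] += 1
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft_next(tokens + proposal))
        stats["drafted"] += len(proposal)
        # 2. The target checks each proposed position in order.
        for i, guess in enumerate(proposal):
            truth = target_next(tokens + proposal[:i])
            if truth != guess:
                # Keep the agreed prefix, then the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(truth)
                break
            stats["accepted"] += 1
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens, stats


# Hypothetical toy models over a vocabulary of 11 token ids.
def target_next(tokens):
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_next(tokens):
    guess = target_next(tokens)
    # The draft disagrees with the target at every fifth context length.
    return (guess + 1) % 11 if len(tokens) % 5 == 0 else guess


spec_tokens, spec_stats = speculative_decode(target_next, draft_next, [1], 20)
plain_tokens = greedy_decode(target_next, [1], 20)
```

The output matches plain greedy decoding by construction. Whether the scheme is faster depends on how the cost of one verify pass compares with the decode steps it replaces.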
curl -s http://localhost:11435/v1/models | jq
Returns every catalog model that resolves to a local copy plus any LM Studio model on disk, each annotated with loaded: true|false for what’s currently in memory.
crates/
  base-mlx-core/        forward pass, KV cache, sampler, tokenizer, speculative engine
    model/qwen3.rs      Qwen3 architecture + KvCache
    model/kernels.rs    fused qkv / mlp blocks (mlx::compile inputs)
    engine.rs           LoadedModel; greedy / sampled decode loop
    speculative.rs      target+draft verify loop (greedy)
    mlx_ext.rs          direct mlx-sys bindings we needed but mlx-rs doesn't expose
    memory.rs           MLX free-list cap, active/cache byte queries
    pull.rs             HF download + local model discovery (incl. LM Studio cache)
    chat_template.rs    Qwen3 chat template + tool_call markup
  base-mlx-server/      axum HTTP server, OpenAI routes
  base-mlx-cli/         single binary entry point
mlx_sys::mlx_slice_update directly because mlx-rs 0.25 marks the equivalent as pub(crate). MLX’s runtime planner elides the copy when the input array’s refcount is 1, which we guarantee by reassigning the cache slot before the next slice op. This is the same mechanism mlx-lm uses in Python.

gen done tag=stream elapsed_ms=9163 active_mb_before=2160 active_mb_after=2160 cache_mb_before=601 cache_mb_after=601
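The `gen done` line is plain `key=value` text, so it is easy to post-process when collecting perf data. A small Python sketch follows. The helper names are hypothetical, and the token count of 300 is an assumption borrowed from the benchmark settings at the top of this README, since the log line itself does not record it.

```python
import re

LOG_LINE = (
    "gen done tag=stream elapsed_ms=9163 active_mb_before=2160 "
    "active_mb_after=2160 cache_mb_before=601 cache_mb_after=601"
)


def parse_gen_done(line):
    """Parse key=value fields from a log line; numeric values become ints."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    return {k: int(v) if v.isdigit() else v for k, v in fields.items()}


def tokens_per_second(fields, n_tokens):
    """Throughput for a run that generated n_tokens (supplied by the caller)."""
    return n_tokens / (fields["elapsed_ms"] / 1000.0)


fields = parse_gen_done(LOG_LINE)
# Unchanged active memory across a generation suggests no per-token growth.
active_growth_mb = fields["active_mb_after"] - fields["active_mb_before"]
cache_growth_mb = fields["cache_mb_after"] - fields["cache_mb_before"]
tps = tokens_per_second(fields, n_tokens=300)  # 300 is an assumed token count
```

Comparing the `*_before` and `*_after` values across many requests is one way to spot memory that grows with each generation.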
This is early. The structure is settling; bug reports and perf data are the most useful contributions right now. If you have a Mac that isn’t M2-class, dropping a gen done log from a known prompt is valuable.
Conventions:
- cargo fmt and cargo clippy --release before sending a PR.
- perf:, fix:, feat:, doc:). See git log for the existing style.

mlx-community quants. The engine here is a thin layer over the actual work Apple ML did.

tokenizers for the BPE pipeline.

MIT.