
mlx-local-inference


Use when calling local AI on this Mac — text generation, embeddings, speech-to-text, OCR, or image understanding. LLM/VLM via oMLX gateway at localhost:8000/...


About This Skill

# MLX Local Inference Stack

Local AI inference on Apple Silicon. oMLX handles LLM/VLM with continuous batching. Python libraries handle Embedding/ASR/OCR directly via `uv`.

## Architecture

```
┌────────────────────────────────────┐
│ oMLX (localhost:8000/v1)           │
│ - LLM (Qwen3.5-35B, etc.)          │
│ - VLM (vision-language models)     │
│ - Continuous batching + SSD cache  │
└────────────────────────────────────┘

┌────────────────────────────────────┐
│ Python Libraries (via uv run)      │
│ - mlx-lm: Embedding                │
│ - mlx-vlm: OCR (PaddleOCR-VL)      │
│ - mlx-audio: ASR (Qwen3-ASR)       │
└────────────────────────────────────┘
```

## Models

| Capability | Implementation | Model | Size |
|-----------|---------------|-------|------|
| 💬 LLM | oMLX API | `Qwen3.5-35B-A3B-4bit` | ~20 GB |
| 👁️ VLM | oMLX API | Any mlx-vlm model | varies |
| 📐 Embed | mlx-lm (uv) | `Qwen3-Embedding-0.6B-4bit-DWQ` | ~1 GB |
| 🎤 ASR | mlx-audio (uv) | `Qwen3-ASR-1.7B-8bit` | ~1.5 GB |
| 👁️ OCR | mlx-vlm (uv) | `PaddleOCR-VL-1.5-6bit` | ~3.3 GB |

## Usage

### LLM / Vision-Language (via oMLX API)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# Text generation
resp = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```
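
Vision-language requests go through the same OpenAI-compatible endpoint. A minimal sketch, assuming a VLM is loaded under a placeholder model name and that oMLX accepts base64 `image_url` content in the standard OpenAI chat format:

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# Encode a local image as a base64 data URL (standard OpenAI image_url format)
with open("document.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="mlx-community/Qwen2.5-VL-7B-Instruct-4bit",  # placeholder: use whatever VLM oMLX has loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```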

---

### Embeddings (via mlx-lm + uv)

```bash
uv run --with mlx-lm python -c "
import os
from mlx_lm import load

model, tokenizer = load(os.path.expanduser('~/models/Qwen3-Embedding-0.6B-4bit-DWQ'))
text = 'text to embed'
inputs = tokenizer(text, return_tensors='np')
embeddings = model(**inputs).last_hidden_state.mean(axis=1)
print(embeddings.shape)
"
```
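
Once you have embedding vectors, comparing them is just vector arithmetic. A minimal follow-up sketch in plain numpy (independent of how the vectors were produced) for ranking candidate texts against a query by cosine similarity; the vectors here are random placeholders standing in for the model output above:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors; in practice these come from the embedding model above
query_vec = np.random.rand(1024)
doc_vecs = {"doc_a": np.random.rand(1024), "doc_b": np.random.rand(1024)}

# Rank documents by similarity to the query
ranked = sorted(doc_vecs.items(), key=lambda kv: cosine_similarity(query_vec, kv[1]), reverse=True)
for name, vec in ranked:
    print(name, round(cosine_similarity(query_vec, vec), 3))
```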

---

### ASR — Speech-to-Text (via mlx-audio + uv)

> Important: Must run with `--python 3.11` to avoid OpenMP threading issues (`SIGSEGV`).

```bash
uv run --python 3.11 --with mlx-audio python -m mlx_audio.stt.generate \
  --model ~/models/Qwen3-ASR-1.7B-8bit \
  --audio "audio.wav" \
  --output-path /tmp/asr_result \
  --format txt \
  --language zh \
  --verbose
```
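
To drive the same CLI from a script, a minimal sketch that shells out via `subprocess` and reads the transcript back; it assumes the command writes a plain-text file next to the given `--output-path` (the exact output filename is an assumption, check what the CLI actually produces):

```python
import subprocess
from pathlib import Path

AUDIO = "audio.wav"
OUT = Path("/tmp/asr_result")  # same --output-path as the CLI example above

cmd = [
    "uv", "run", "--python", "3.11", "--with", "mlx-audio",
    "python", "-m", "mlx_audio.stt.generate",
    "--model", str(Path("~/models/Qwen3-ASR-1.7B-8bit").expanduser()),
    "--audio", AUDIO,
    "--output-path", str(OUT),
    "--format", "txt",
    "--language", "zh",
]
subprocess.run(cmd, check=True)

# Assumption: the CLI writes <output-path>.txt; adjust if it names the file differently
transcript = OUT.with_suffix(".txt")
if transcript.exists():
    print(transcript.read_text())
```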

---

### OCR (via mlx-vlm + uv)

> Important: The `generate` function parameter order must be `(model, processor, prompt, image)`.

```bash
cat << 'PY_EOF' > run_ocr.py
import os
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit")
model, processor = load(model_path)
prompt = apply_chat_template(processor, config=model.config, prompt="OCR:", num_images=1)

output = generate(model, processor, prompt, "document.jpg", max_tokens=512, temp=0.0)
print(output.text)
PY_EOF

uv run --python 3.11 --with mlx-vlm python run_ocr.py
```
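
For multi-page jobs the model only needs to be loaded once. A minimal sketch that loops the same `load`/`generate` calls over a directory of images; the input and output directory names are placeholders:

```python
# run_ocr_batch.py -- run with: uv run --python 3.11 --with mlx-vlm python run_ocr_batch.py
import os
from pathlib import Path

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit")
model, processor = load(model_path)  # load once, reuse for every page
prompt = apply_chat_template(processor, config=model.config, prompt="OCR:", num_images=1)

out_dir = Path("ocr_out")  # placeholder output directory
out_dir.mkdir(exist_ok=True)

for image in sorted(Path("pages").glob("*.jpg")):  # placeholder input directory
    output = generate(model, processor, prompt, str(image), max_tokens=512, temp=0.0)
    (out_dir / f"{image.stem}.txt").write_text(output.text)
    print(f"{image.name}: {len(output.text)} chars")
```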

---

## Service Management (oMLX only)

```bash
# Check running models
curl http://localhost:8000/v1/models

# Restart oMLX
launchctl kickstart -k gui/$(id -u)/com.omlx-server
```
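
From Python, the same endpoint can serve as a pre-flight check before sending work. A minimal sketch using the OpenAI client's standard `models.list()` call; how a downed gateway surfaces (connection error vs. timeout) is an assumption, so the handler is deliberately broad:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def omlx_models() -> list[str]:
    """Return the model IDs oMLX currently serves, or [] if the gateway is unreachable."""
    try:
        return [m.id for m in client.models.list().data]
    except Exception as exc:  # connection refused, timeout, etc.
        print(f"oMLX not reachable: {exc}")
        return []

models = omlx_models()
if "Qwen3.5-35B-A3B-4bit" not in models:
    print("Expected LLM not loaded; restart with: launchctl kickstart -k gui/$(id -u)/com.omlx-server")
```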

## Model Storage Strategy

All models stored in `~/models/` using oMLX-compatible structure:

```
~/models/
├── Qwen3-Embedding-0.6B-4bit-DWQ/
├── Qwen3-ASR-1.7B-8bit/
├── PaddleOCR-VL-1.5-6bit/
└── Qwen3.5-35B-A3B-4bit/
```
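
One way to populate that directory is `huggingface_hub.snapshot_download` with `local_dir` pointed under `~/models/`. A sketch under the assumption that each model exists as an MLX-quantized repo on Hugging Face; the repo ID below is a placeholder, not a verified name:

```python
from pathlib import Path

from huggingface_hub import snapshot_download

MODELS_DIR = Path("~/models").expanduser()

# Placeholder repo IDs -- substitute the actual MLX-quantized repos you use
repos = {
    "Qwen3-Embedding-0.6B-4bit-DWQ": "mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ",
}

for local_name, repo_id in repos.items():
    snapshot_download(repo_id=repo_id, local_dir=MODELS_DIR / local_name)
    print(f"Downloaded {repo_id} -> {MODELS_DIR / local_name}")
```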

## Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • `uv` installed (`curl -LsSf https://astral.sh/uv/install.sh | sh`)

Use Cases

  • Run local AI inference on Mac using MLX for text, embeddings, and speech-to-text
  • Perform OCR and image understanding locally via oMLX gateway
  • Generate text and embeddings without cloud API calls on Apple Silicon
  • Process speech-to-text locally for privacy-sensitive audio workflows
  • Build Mac-native AI pipelines using MLX framework for local inference

Pros & Cons

Pros

  • +Compatible with multiple platforms, including Claude Code and OpenClaw
  • +Well-documented with detailed usage instructions and examples
  • +Purpose-built for AI & machine learning tasks with focused functionality

Cons

  • -Requires API tokens or authentication setup before first use
  • -No built-in analytics or usage metrics dashboard

FAQ

What does mlx-local-inference do?
Use when calling local AI on this Mac — text generation, embeddings, speech-to-text, OCR, or image understanding. LLM/VLM via oMLX gateway at localhost:8000/...
What platforms support mlx-local-inference?
mlx-local-inference is available on Claude Code and OpenClaw.
What are the use cases for mlx-local-inference?
Run local AI inference on Mac using MLX for text, embeddings, and speech-to-text. Perform OCR and image understanding locally via oMLX gateway. Generate text and embeddings without cloud API calls on Apple Silicon.
