# llmfit-advisor
Hardware-aware local LLM advisor. Detects your system specs (RAM, CPU, GPU/VRAM) and recommends models that actually fit, with optimal quantization and speed estimates.
## When to use (trigger phrases)
Use this skill immediately when the user asks any of:
- "what local models can I run?"
- "which LLMs fit my hardware?"
- "recommend a local model"
- "what's the best model for my GPU?"
- "can I run Llama 70B locally?"
- "configure local models"
- "set up Ollama models"
- "what models fit my VRAM?"
- "help me pick a local model for coding"
Also use this skill when:
- The user wants to configure `models.providers.ollama` or `models.providers.lmstudio`
- The user mentions running models locally and you need to know what fits
- A model recommendation is needed and the user has local inference capability (Ollama, vLLM, LM Studio)
## Quick start
### Detect hardware
```bash
llmfit --json system
```
Returns JSON with CPU, RAM, GPU name, VRAM, multi-GPU info, and whether memory is unified (Apple Silicon).
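To pull a single field out of that JSON, piping through `jq` works; a minimal sketch using the field names from the System JSON section below:
```bash
# Print available VRAM in GB (field name as documented in the System JSON example)
llmfit --json system | jq '.system.gpu_vram_gb'
```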
### Get top recommendations
```bash
llmfit recommend --json --limit 5
```
Returns the top 5 models ranked by a composite score (quality, speed, fit, context) with optimal quantization for the detected hardware.
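To skim just the essentials from that output, a small `jq` pipeline over the `models` array works (field names are documented in the Recommendation JSON section below):
```bash
# Reduce each recommendation to name, composite score, fit level, and quantization
llmfit recommend --json --limit 5 | jq '.models[] | {name, score, fit_level, best_quant}'
```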
### Filter by use case
```bash
llmfit recommend --json --use-case coding --limit 3
llmfit recommend --json --use-case reasoning --limit 3
llmfit recommend --json --use-case chat --limit 3
```
Valid use cases: `general`, `coding`, `reasoning`, `chat`, `multimodal`, `embedding`.
### Filter by minimum fit level
```bash
llmfit recommend --json --min-fit good --limit 10
```
Valid fit levels (best to worst): `perfect`, `good`, `marginal`.
## Understanding the output
### System JSON
```json { "system": { "cpu_name": "Apple M2 Max", "cpu_cores": 12, "total_ram_gb": 32.0, "available_ram_gb": 24.5, "has_gpu": true, "gpu_name": "Apple M2 Max", "gpu_vram_gb": 32.0, "gpu_count": 1, "backend": "Metal", "unified_memory": true } } ```
### Recommendation JSON
Each model in the `models` array includes:
| Field | Meaning |
|---|---|
| `name` | HuggingFace model ID (e.g. `meta-llama/Llama-3.1-8B-Instruct`) |
| `provider` | Model provider (Meta, Alibaba, Google, etc.) |
| `params_b` | Parameter count in billions |
| `score` | Composite score 0–100 (higher is better) |
| `score_components` | Breakdown: `quality`, `speed`, `fit`, `context` (each 0–100) |
| `fit_level` | `Perfect`, `Good`, `Marginal`, or `TooTight` |
| `run_mode` | `GPU`, `CPU+GPU Offload`, or `CPU Only` |
| `best_quant` | Optimal quantization for the hardware (e.g. `Q5_K_M`, `Q4_K_M`) |
| `estimated_tps` | Estimated tokens per second |
| `memory_required_gb` | VRAM/RAM needed at this quantization |
| `memory_available_gb` | Available VRAM/RAM detected |
| `utilization_pct` | How much of available memory the model uses |
| `use_case` | What the model is designed for |
| `context_length` | Maximum context window |
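Put together, a single entry in `models` looks roughly like this (illustrative values only, not actual llmfit output):
```json
{
  "name": "meta-llama/Llama-3.1-8B-Instruct",
  "provider": "Meta",
  "params_b": 8.0,
  "score": 87,
  "score_components": { "quality": 82, "speed": 90, "fit": 95, "context": 80 },
  "fit_level": "Perfect",
  "run_mode": "GPU",
  "best_quant": "Q5_K_M",
  "estimated_tps": 45.0,
  "memory_required_gb": 6.1,
  "memory_available_gb": 32.0,
  "utilization_pct": 19,
  "use_case": "general",
  "context_length": 131072
}
```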
### Fit levels explained
- `Perfect`: the model fits comfortably with room to spare. Ideal choice.
- `Good`: the model fits but uses most available memory. Will work well.
- `Marginal`: the model barely fits. It may work, but expect slower performance or reduced context.
- `TooTight`: the model does not fit. Do not recommend it (see the filter sketch below).
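To enforce that last rule mechanically, a minimal `jq` filter that drops unusable entries (assuming the `models` array shape documented above):
```bash
# Keep only models that actually fit on this machine
llmfit recommend --json --limit 10 | jq '.models | map(select(.fit_level != "TooTight"))'
```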
### Run modes explained
- `GPU`: full GPU inference. Fastest; model weights are loaded entirely into VRAM.
- `CPU+GPU Offload`: some layers on the GPU, the rest in system RAM. Slower than pure GPU.
- `CPU Only`: all inference on the CPU using system RAM. Slowest, but works without a GPU.
## Configuring OpenClaw with results
After getting recommendations, configure the user's local model provider.
### For Ollama
Map the HuggingFace model name to its Ollama tag. Common mappings:
| llmfit name | Ollama tag |
|---|---|
| `meta-llama/Llama-3.1-8B-Instruct` | `llama3.1:8b` |
| `meta-llama/Llama-3.3-70B-Instruct` | `llama3.3:70b` |
| `Qwen/Qwen2.5-Coder-7B-Instruct` | `qwen2.5-coder:7b` |
| `Qwen/Qwen2.5-72B-Instruct` | `qwen2.5:72b` |
| `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct` | `deepseek-coder-v2:16b` |
| `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` | `deepseek-r1:32b` |
| `google/gemma-2-9b-it` | `gemma2:9b` |
| `mistralai/Mistral-7B-Instruct-v0.3` | `mistral:7b` |
| `microsoft/Phi-3-mini-4k-instruct` | `phi3:mini` |
| `microsoft/Phi-4-mini-instruct` | `phi4-mini` |
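Once a tag is chosen, pull it so the model is available locally (standard Ollama usage; `llama3.1:8b` stands in for whichever tag you mapped):
```bash
# Download the model weights before wiring them into the config
ollama pull llama3.1:8b
```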
Then update `openclaw.json`:
```json { "models": { "providers": { "ollama": { "models": ["ollama/<ollama-tag>"] } } } } ```
And optionally set as default:
```json { "agents": { "defaults": { "model": { "primary": "ollama/<ollama-tag>" } } } } ```
### For vLLM / LM Studio
Use the HuggingFace model name directly as the model identifier with the appropriate provider prefix (`vllm/` or `lmstudio/`).
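For example, for LM Studio (a sketch assuming the `models.providers.lmstudio` block mirrors the Ollama structure above; not verified against OpenClaw's schema):
```json
{
  "models": {
    "providers": {
      "lmstudio": {
        "models": ["lmstudio/meta-llama/Llama-3.1-8B-Instruct"]
      }
    }
  }
}
```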
## Workflow example
When a user asks "what local models can I run?":
1. Run `llmfit --json system` to show a hardware summary
2. Run `llmfit recommend --json --limit 5` to get the top picks
3. Present the recommendations with scores and fit levels
4. If the user wants to configure one, map it to the appropriate Ollama/vLLM/LM Studio tag
5. Offer to update `openclaw.json` with the chosen model
When a user asks for a specific use case like "recommend a coding model":
1. Run `llmfit recommend --json --use-case coding --limit 3`
2. Present the coding-specific recommendations
3. Offer to pull via Ollama and configure (see the end-to-end sketch below)
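A minimal end-to-end sketch of that coding flow in shell; the `jq` fields come from the Recommendation JSON section, while the name-to-tag mapping is the hand-maintained part (extend the `case` with entries from the table above):
```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Get the top coding recommendation from llmfit
top=$(llmfit recommend --json --use-case coding --limit 1 | jq -r '.models[0].name')
echo "Top pick: $top"

# 2. Map the HuggingFace name to an Ollama tag (from the mapping table above)
case "$top" in
  "Qwen/Qwen2.5-Coder-7B-Instruct") tag="qwen2.5-coder:7b" ;;
  "meta-llama/Llama-3.1-8B-Instruct") tag="llama3.1:8b" ;;
  *) echo "No known Ollama tag for $top" >&2; exit 1 ;;
esac

# 3. Pull the model and run a quick smoke test
ollama pull "$tag"
ollama run "$tag" "Say hello in one sentence."
```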
## Notes
- llmfit detects NVIDIA GPUs (via `nvidia-smi`), AMD GPUs (via `rocm-smi`), and Apple Silicon (unified memory).
- Multi-GPU setups aggregate VRAM across cards automatically.
- The `best_quant` field gives the optimal quantization for the detected hardware; higher quants (`Q6_K`, `Q8_0`) mean better quality when VRAM allows.
- Speed estimates (`estimated_tps`) are approximate and vary by hardware and quantization.
- Models with `fit_level: "TooTight"` should never be recommended to users.