Vision Sandbox

Agentic Vision via Gemini's native Code Execution sandbox. Use for spatial grounding, visual math, and UI auditing.


About This Skill

# Vision Sandbox 🔭

Leverage Gemini's native code execution to analyze images with high precision. The model writes and runs Python code in a Google-hosted sandbox to verify visual data, making it well suited to UI auditing, spatial grounding, and visual reasoning.

Installation

```bash
clawhub install vision-sandbox
```

Usage

```bash
uv run vision-sandbox --image "path/to/image.png" --prompt "Identify all buttons and provide [x, y] coordinates."
```

Pattern Library

📍 **Spatial Grounding:** Ask the model to find specific items and return coordinates.

* **Prompt:** "Locate the 'Submit' button in this screenshot. Use code execution to verify its center point and return the [x, y] coordinates in a [0, 1000] scale."
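The [0, 1000] scale in the prompt above maps pixel coordinates onto a resolution-independent grid. A minimal sketch of the kind of verification code the sandbox might run (the screenshot size and button box here are hypothetical):

```python
def to_thousandths(x_px, y_px, width, height):
    """Normalize pixel coordinates to a [0, 1000] scale."""
    return [round(x_px / width * 1000), round(y_px / height * 1000)]

# Hypothetical 'Submit' button box in a 1280x800 screenshot:
# left, top, width, height in pixels.
box = {"x": 600, "y": 412, "w": 80, "h": 40}
center_x = box["x"] + box["w"] / 2  # 640.0
center_y = box["y"] + box["h"] / 2  # 432.0
print(to_thousandths(center_x, center_y, 1280, 800))  # [500, 540]
```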

🧮 **Visual Math:** Ask the model to count or calculate based on the image.

* **Prompt:** "Count the number of items in the list. Use Python to sum their values if prices are visible."
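Once the model has read values off the image, the sandbox arithmetic itself is simple Python. A sketch with hypothetical prices the model might extract from a receipt:

```python
# Hypothetical item prices extracted from the image.
items = {"Coffee": 3.50, "Bagel": 2.25, "Juice": 4.00}

count = len(items)
total = round(sum(items.values()), 2)
print(f"{count} items, total ${total}")  # 3 items, total $9.75
```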

🖥️ **UI Audit:** Check layout and readability.

* **Prompt:** "Check if the header text overlaps with any icons. Use the sandbox to calculate the bounding box intersections."
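The bounding-box intersection check mentioned in the prompt reduces to an axis-aligned overlap test. A sketch with hypothetical boxes the model might measure:

```python
def intersects(a, b):
    """Axis-aligned overlap test; boxes are (left, top, right, bottom)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

# Hypothetical boxes measured from a screenshot.
header_text = (20, 10, 300, 42)
search_icon = (280, 14, 312, 46)  # overlaps the text's right edge
menu_icon = (320, 14, 352, 46)    # clear of the text

print(intersects(header_text, search_icon))  # True
print(intersects(header_text, menu_icon))    # False
```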

🖐️ **Counting & Logic:** Solve visual counting tasks with code verification.

* **Prompt:** "Count the number of fingers on this hand. Use code execution to identify the bounding box for each finger and return the total count."

Integration with OpenCode

This skill is designed to provide **Visual Grounding** for automated coding agents like OpenCode.

- **Step 1:** Use `vision-sandbox` to extract UI metadata (coordinates, sizes, colors).
- **Step 2:** Pass the JSON output to OpenCode to generate or fix CSS/HTML.
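The handoff between the two steps is structured JSON. A sketch of consuming a hypothetical Step 1 payload (the field names here are illustrative, not the skill's documented schema):

```python
import json

# Hypothetical JSON emitted by vision-sandbox in Step 1.
metadata = json.loads("""
{
  "elements": [
    {"id": "submit-btn", "center": [500, 540], "color": "#1a73e8"},
    {"id": "cancel-btn", "center": [380, 540], "color": "#e8eaed"}
  ]
}
""")

# Step 2 input: a flat lookup a coding agent can reference by id.
by_id = {el["id"]: el for el in metadata["elements"]}
print(by_id["submit-btn"]["center"])  # [500, 540]
```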

Configuration

- **GEMINI_API_KEY**: Required environment variable.
- **Model**: Defaults to `gemini-3-flash-preview`.
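As a sketch, the skill's entry point presumably reads these settings from the environment. The `VISION_SANDBOX_MODEL` override name below is hypothetical; only `GEMINI_API_KEY` and the default model are documented above.

```python
import os

# Required: requests without an API key will be rejected.
api_key = os.environ.get("GEMINI_API_KEY", "")

# Hypothetical override variable; the skill documents only the default.
model = os.environ.get("VISION_SANDBOX_MODEL", "gemini-3-flash-preview")
print(model)
```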

Use Cases

  • Perform spatial grounding on images — locate and annotate objects with coordinates
  • Solve visual math problems by analyzing diagrams, charts, and equations in images
  • Audit UI designs by detecting layout issues, accessibility problems, and inconsistencies
  • Run code against images in Gemini's native execution sandbox
  • Build visual analysis pipelines that combine image understanding with code execution

Pros & Cons

Pros

  • + Unique combination of vision AI and code execution sandbox capabilities
  • + Three focused use cases — spatial grounding, visual math, and UI auditing
  • + Gemini's native sandbox provides safe execution environment for generated code

Cons

  • - Requires Google Gemini API access with code execution enabled
  • - Sandbox capabilities depend on Gemini's execution environment limitations

Frequently Asked Questions

What does Vision Sandbox do?

Vision Sandbox provides agentic vision via Gemini's native Code Execution sandbox: the model writes and runs Python to verify visual data for spatial grounding, visual math, and UI auditing.

What platforms support Vision Sandbox?

Vision Sandbox is available on Claude Code and OpenClaw.

What are the use cases for Vision Sandbox?

Perform spatial grounding on images — locate and annotate objects with coordinates. Solve visual math problems by analyzing diagrams, charts, and equations in images. Audit UI designs by detecting layout issues, accessibility problems, and inconsistencies.
