Vision Sandbox

Agentic Vision via Gemini's native Code Execution sandbox. Use for spatial grounding, visual math, and UI auditing.


About This Skill

# Vision Sandbox 🔭

Leverage Gemini's native code execution to analyze images with high precision. The model writes and runs Python code in a Google-hosted sandbox to verify visual data, making it well suited to UI auditing, spatial grounding, and visual reasoning.

Installation

```bash
clawhub install vision-sandbox
```

Usage

```bash
uv run vision-sandbox --image "path/to/image.png" --prompt "Identify all buttons and provide [x, y] coordinates."
```

Pattern Library

📍 **Spatial Grounding:** Ask the model to find specific items and return coordinates.

* **Prompt:** "Locate the 'Submit' button in this screenshot. Use code execution to verify its center point and return the [x, y] coordinates in a [0, 1000] scale."
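The [0, 1000] scale in the prompt above maps pixel coordinates onto a resolution-independent grid. A minimal sketch of the kind of verification code the sandbox might run (the screenshot size and button box here are hypothetical):

```python
def to_thousandths(x_px, y_px, width, height):
    """Normalize pixel coordinates to a [0, 1000] scale."""
    return [round(x_px / width * 1000), round(y_px / height * 1000)]

# Hypothetical 'Submit' button box in a 1280x800 screenshot:
# left, top, width, height in pixels.
box = {"x": 600, "y": 412, "w": 80, "h": 40}
center_x = box["x"] + box["w"] / 2  # 640.0
center_y = box["y"] + box["h"] / 2  # 432.0
print(to_thousandths(center_x, center_y, 1280, 800))  # [500, 540]
```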

🧮 **Visual Math:** Ask the model to count or calculate based on the image.

* **Prompt:** "Count the number of items in the list. Use Python to sum their values if prices are visible."
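Once the model has read values off the image, the sandbox arithmetic itself is simple Python. A sketch with hypothetical prices the model might extract from a receipt:

```python
# Hypothetical item prices extracted from the image.
items = {"Coffee": 3.50, "Bagel": 2.25, "Juice": 4.00}

count = len(items)
total = round(sum(items.values()), 2)
print(f"{count} items, total ${total}")  # 3 items, total $9.75
```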

🖥️ **UI Audit:** Check layout and readability.

* **Prompt:** "Check if the header text overlaps with any icons. Use the sandbox to calculate the bounding box intersections."
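The bounding-box intersection check mentioned in the prompt reduces to an axis-aligned overlap test. A sketch with hypothetical boxes the model might measure:

```python
def intersects(a, b):
    """Axis-aligned overlap test; boxes are (left, top, right, bottom)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

# Hypothetical boxes measured from a screenshot.
header_text = (20, 10, 300, 42)
search_icon = (280, 14, 312, 46)  # overlaps the text's right edge
menu_icon = (320, 14, 352, 46)    # clear of the text

print(intersects(header_text, search_icon))  # True
print(intersects(header_text, menu_icon))    # False
```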

🖐️ **Counting & Logic:** Solve visual counting tasks with code verification.

* **Prompt:** "Count the number of fingers on this hand. Use code execution to identify the bounding box for each finger and return the total count."

Integration with OpenCode

This skill is designed to provide **Visual Grounding** for automated coding agents like OpenCode.

- **Step 1:** Use `vision-sandbox` to extract UI metadata (coordinates, sizes, colors).
- **Step 2:** Pass the JSON output to OpenCode to generate or fix CSS/HTML.
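The handoff between the two steps is structured JSON. A sketch of consuming a hypothetical Step 1 payload (the field names here are illustrative, not the skill's documented schema):

```python
import json

# Hypothetical JSON emitted by vision-sandbox in Step 1.
metadata = json.loads("""
{
  "elements": [
    {"id": "submit-btn", "center": [500, 540], "color": "#1a73e8"},
    {"id": "cancel-btn", "center": [380, 540], "color": "#e8eaed"}
  ]
}
""")

# Step 2 input: a flat lookup a coding agent can reference by id.
by_id = {el["id"]: el for el in metadata["elements"]}
print(by_id["submit-btn"]["center"])  # [500, 540]
```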

Configuration

- **GEMINI_API_KEY**: Required environment variable.
- **Model**: Defaults to `gemini-3-flash-preview`.
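As a sketch, the skill's entry point presumably reads these settings from the environment. The `VISION_SANDBOX_MODEL` override name below is hypothetical; only `GEMINI_API_KEY` and the default model are documented above.

```python
import os

# Required: requests without an API key will be rejected.
api_key = os.environ.get("GEMINI_API_KEY", "")

# Hypothetical override variable; the skill documents only the default.
model = os.environ.get("VISION_SANDBOX_MODEL", "gemini-3-flash-preview")
print(model)
```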

Use Cases

  • Perform spatial grounding on images — locate and annotate objects with coordinates
  • Solve visual math problems by analyzing diagrams, charts, and equations in images
  • Audit UI designs by detecting layout issues, accessibility problems, and inconsistencies
  • Run code against images in Gemini's native execution sandbox
  • Build visual analysis pipelines that combine image understanding with code execution

Pros & Cons

Pros

  • + Unique combination of vision AI and code execution sandbox capabilities
  • + Three focused use cases — spatial grounding, visual math, and UI auditing
  • + Gemini's native sandbox provides safe execution environment for generated code

Cons

  • - Requires Google Gemini API access with code execution enabled
  • - Sandbox capabilities depend on Gemini's execution environment limitations

Frequently Asked Questions

What does Vision Sandbox do?

Vision Sandbox provides agentic vision via Gemini's native Code Execution sandbox: the model writes and runs Python to verify visual data for spatial grounding, visual math, and UI auditing.

What platforms support Vision Sandbox?

Vision Sandbox is available on Claude Code and OpenClaw.

What are the use cases for Vision Sandbox?

Perform spatial grounding on images — locate and annotate objects with coordinates. Solve visual math problems by analyzing diagrams, charts, and equations in images. Audit UI designs by detecting layout issues, accessibility problems, and inconsistencies.
