
multimodal-parser


A unified multi-modal content parser for images, PDFs, DOCX, and audio files, with automatic OCR and transcription, that outputs structured text for LLM processing.

$ Add to .claude/skills/

About This Skill


# 📄 Multimodal Content Parser

## Highlights

- 🔄 Unified interface: one API covers all four format families (image/PDF/Word/audio), so there is no need to integrate multiple services
- 🚀 Works out of the box: built-in OCR, audio transcription, and document parsing with zero configuration
- 📝 Multiple output formats: plain text, Markdown, or structured JSON, matching different LLM processing needs
- 💡 Friendly error messages: when a dependency is missing, the install command is printed automatically, so newcomers can get started quickly

## 🎯 Use Cases

- Content-parsing layer for multimodal agents
- File preprocessing for document Q&A and knowledge-base construction
- Image OCR and speech-to-text tasks
- Batch document parsing and structured processing

## 📝 Parameters

| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| file_path | string | yes | - | Path of the file to parse |
| file_type | string | no | auto | File type: image/pdf/docx/audio/auto |
| output_format | string | no | text | Output format: text/markdown/structured |
| options.ocr_lang | string | no | chi_sim+eng | OCR recognition language |
| options.audio_model | string | no | base | Whisper model size (base/small/medium/large) |
| options.pdf_page_range | tuple | no | undefined | PDF page range to parse, e.g. [1, 10] parses pages 1-10 |
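The defaults in the table can be expressed as a small helper. This is a sketch, not part of the skill: the type and function names (`ParserParams`, `withDefaults`) are hypothetical, while the field names and default values come from the table above.

```typescript
// Hypothetical types mirroring the parameter table; not a published API.
type FileType = "image" | "pdf" | "docx" | "audio" | "auto";
type OutputFormat = "text" | "markdown" | "structured";

interface ParserOptions {
  ocr_lang?: string;                              // default: "chi_sim+eng"
  audio_model?: "base" | "small" | "medium" | "large"; // default: "base"
  pdf_page_range?: [number, number];              // e.g. [1, 10]
}

interface ParserParams {
  file_path: string;            // required
  file_type?: FileType;         // default: "auto"
  output_format?: OutputFormat; // default: "text"
  options?: ParserOptions;
}

// Fill in the documented defaults; caller-supplied options win on conflict.
function withDefaults(p: ParserParams) {
  const options: ParserOptions = {
    ocr_lang: "chi_sim+eng",
    audio_model: "base",
    ...p.options,
  };
  return {
    file_path: p.file_path,
    file_type: p.file_type ?? "auto",
    output_format: p.output_format ?? "text",
    options,
  };
}
```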

## 💡 Ready-to-Use Examples

### Image OCR

```typescript
const result = await skills.multimodalParser({
  file_path: "./resume.jpg",
  file_type: "image",
  output_format: "markdown"
});
```

### PDF Parsing (with a page range)

```typescript
const result = await skills.multimodalParser({
  file_path: "./document.pdf",
  output_format: "structured",
  options: {
    pdf_page_range: [1, 50] // parse only the first 50 pages
  }
});
```

### Audio Transcription

```typescript
const result = await skills.multimodalParser({
  file_path: "./meeting.mp3",
  options: {
    audio_model: "small" // the small model is faster
  }
});
```
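Each example above awaits a result object. The skill's documentation does not spell out the return shape, so the following is only a plausible sketch based on the three documented output formats; `ParseResult` and `hasContent` are hypothetical names invented for this page.

```typescript
// Assumption: minimal result shape inferred from the documented output
// formats; the skill's real return type may differ.
type OutputFormat = "text" | "markdown" | "structured";

interface ParseResult {
  format: OutputFormat;
  content: string;                    // extracted text or Markdown
  metadata?: Record<string, unknown>; // e.g. page count, audio duration
}

// Guard against silently empty extractions (blank scans, silent audio).
function hasContent(r: ParseResult): boolean {
  return r.content.trim().length > 0;
}
```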

## 🔧 Installing Dependencies

Install the dependencies that match the file types you need to parse:

```bash
# Full install of all dependencies (recommended)

## macOS
brew install tesseract tesseract-lang poppler pandoc ffmpeg
pip install openai-whisper

## Ubuntu/Debian
apt install tesseract-ocr tesseract-ocr-chi-sim poppler-utils pandoc ffmpeg
pip install openai-whisper
```
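A quick way to see which of the tools above are already on your PATH is a small check script. This is a convenience sketch for this page, not part of the skill; `checkDep` and the hint strings are assumptions based on the macOS install commands shown above, and it assumes a Unix-like system with `which` available.

```typescript
import { spawnSync } from "node:child_process";

// Report whether an external binary from the install list is on PATH,
// printing an install hint when it is missing.
function checkDep(binary: string, hint: string): string {
  const found = spawnSync("which", [binary], { encoding: "utf8" }).status === 0;
  return found ? `ok: ${binary}` : `missing: ${binary} -> ${hint}`;
}

const deps: Array<[string, string]> = [
  ["tesseract", "brew install tesseract tesseract-lang"],
  ["pdftotext", "brew install poppler"],
  ["pandoc", "brew install pandoc"],
  ["ffmpeg", "brew install ffmpeg"],
  ["whisper", "pip install openai-whisper"],
];

for (const [bin, hint] of deps) {
  console.log(checkDep(bin, hint));
}
```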

## Implementation Notes

- Built on a mature open-source toolchain (Tesseract/Poppler/Whisper/Pandoc)
- Automatic file-type detection, so the format does not need to be specified manually
- Modular design, easily extended to support more formats
- Standardized output that can be fed directly to an LLM
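The automatic file-type detection mentioned above can be sketched as a simple extension lookup. The extension table and the `detectFileType` name are assumptions for illustration; the skill's real detection may also inspect file contents (magic bytes) rather than the name alone.

```typescript
// Hypothetical extension-based detector covering the four supported families.
const EXT_MAP: Record<string, "image" | "pdf" | "docx" | "audio"> = {
  jpg: "image", jpeg: "image", png: "image", bmp: "image", tiff: "image",
  pdf: "pdf",
  docx: "docx", doc: "docx",
  mp3: "audio", wav: "audio", m4a: "audio", flac: "audio",
};

function detectFileType(filePath: string): "image" | "pdf" | "docx" | "audio" | "unknown" {
  const ext = filePath.split(".").pop()?.toLowerCase() ?? "";
  return EXT_MAP[ext] ?? "unknown";
}
```

This is the kind of lookup that makes `file_type: "auto"` work without the caller naming a format.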

Use Cases

  • Parse images, PDFs, DOCX, and audio files into structured text for LLMs
  • Apply automatic OCR to extract text from image-based documents
  • Transcribe audio files and convert to structured text format
  • Build unified content ingestion pipelines for multi-format documents
  • Process diverse media types into LLM-ready text representations

Pros & Cons

Pros

  • +Compatible with multiple platforms including claude-code, openclaw
  • +Well-documented with detailed usage instructions and examples
  • +Purpose-built for AI & machine learning tasks with focused functionality

Cons

  • -Documentation primarily in Chinese; may need translation for English users
  • -No built-in analytics or usage metrics dashboard

FAQ

What does multimodal-parser do?
It is a unified multi-modal content parser for images, PDFs, DOCX, and audio files, with automatic OCR and transcription, that outputs structured text for LLM processing.
What platforms support multimodal-parser?
multimodal-parser is available on Claude Code and OpenClaw.
What are the use cases for multimodal-parser?
Parse images, PDFs, DOCX, and audio files into structured text for LLMs. Apply automatic OCR to extract text from image-based documents. Transcribe audio files and convert to structured text format.
