Skip to content

Miliger Playwright Scraper

Verified

Miliger Playwright Scraper — testing and quality assurance tool. Supports Tab, SPA.

121 downloads
$ Add to .claude/skills/

About This Skill

Content available in Chinese

# Playwright 网页爬取技能

功能概述

使用Playwright进行真实浏览器操作,爬取复杂动态网页。

核心能力 - ✅ **真实浏览器操作** - 点击、滚动、输入、等待 - ✅ **处理复杂SPA** - 单页应用、多Tab、懒加载 - ✅ **AI自动生成脚本** - 无需提前准备,实时分析页面结构 - ✅ **持久化Chrome Profile** - 复用登录状态 - ✅ **数据结构化输出** - 自动整理成Markdown/JSON

对比优势

| 工具 | 局限性 | Playwright优势 | |------|--------|----------------| | n8n | 无法处理JS渲染 | ✅ 完整JS渲染支持 | | Apify | 需要现成actor | ✅ AI实时生成脚本 | | Bright Data | 按量计费 | ✅ 本地运行,免费 |

适用场景 - ✅ **公开信息型网站** - 会议日程、展会信息 - ✅ **多Tab懒加载页面** - 需点击切换的内容 - ✅ **SPA单页应用** - JavaScript异步加载

不适用场景 - ⚠️ **反爬机制强的网站** - 需要复杂验证 - ⚠️ **需多轮调试的场景** - 效率不高 - ⚠️ **生产环境** - 需要高度稳定性

---

使用方式

1. 基本爬取(单页面) ``` 官家,请帮我爬取这个页面:https://example.com 等JS渲染完成后提取所有内容 保存成Markdown ```

2. 多Tab爬取 ``` 官家,请爬取MWC巴展议程: - 页面有5个日期Tab(PRE、MON、TUE、WED、THU) - 需要点击每个Tab获取数据 - 按日期分别保存 ```

3. 懒加载内容 ``` 官家,请爬取这个页面: - 需要滚动到底部才能加载完整内容 - 等待所有内容加载完成 - 提取所有session数据 ```

---

技术实现

Playwright核心API

#### 1. 启动浏览器 ```javascript const { chromium } = require('playwright');

const browser = await chromium.launch({ headless: false, // 或true channel: 'chrome' });

const context = await browser.newContext({ viewport: { width: 1920, height: 1080 } });

const page = await context.newPage(); ```

#### 2. 导航和等待 ```javascript await page.goto(url, { waitUntil: 'networkidle' }); await page.waitForLoadState('domcontentloaded'); await page.waitForSelector('selector'); ```

#### 3. DOM操作 ```javascript // 点击 await page.click('button'); await page.click('text=Monday');

// 滚动 await page.evaluate(() => { window.scrollTo(0, document.body.scrollHeight); });

// 等待 await page.waitForTimeout(2000); ```

#### 4. 数据提取 ```javascript const data = await page.evaluate(() => { const items = Array.from(document.querySelectorAll('.item')); return items.map(item => ({ title: item.querySelector('.title').textContent, time: item.querySelector('.time').textContent, location: item.querySelector('.location').textContent })); }); ```

#### 5. 持久化Profile ```javascript const context = await chromium.launchPersistentContext( './user-data', { headless: false, channel: 'chrome' } ); ```

---

实战案例

案例1:MWC巴展议程爬取

  • 需求
  • 5个日期Tab(PRE、MON、TUE、WED、THU)
  • 每个Tab有懒加载
  • 数据通过JS异步请求

实现步骤: ```javascript // 1. 导航到页面 await page.goto('https://mwcbarcelona.com/agenda');

// 2. 等待页面加载 await page.waitForLoadState('networkidle');

// 3. 循环处理每个Tab const tabs = ['PRE', 'MON', 'TUE', 'WED', 'THU']; for (const tab of tabs) { // 点击Tab await page.click(`text=${tab}`); // 等待内容加载 await page.waitForTimeout(2000); // 滚动到底部(触发懒加载) await page.evaluate(() => { window.scrollTo(0, document.body.scrollHeight); }); // 等待懒加载完成 await page.waitForTimeout(3000); // 提取数据 const sessions = await page.evaluate(() => { const items = Array.from(document.querySelectorAll('.session')); return items.map(item => ({ title: item.querySelector('.title').textContent, time: item.querySelector('.time').textContent, location: item.querySelector('.location').textContent, speakers: item.querySelector('.speakers').textContent })); }); // 保存到文件 const markdown = sessions.map(s => `## ${s.title}\n- 时间: ${s.time}\n- 地点: ${s.location}\n- 演讲者: ${s.speakers}\n` ).join('\n'); fs.writeFileSync(`mwc-${tab}.md`, markdown); } ```

  • 关键点
  • ✅ 等待网络空闲(networkidle)
  • ✅ 点击后等待内容加载
  • ✅ 滚动触发懒加载
  • ✅ 等待懒加载完成
  • ✅ 数据结构化提取

---

最佳实践

1. 等待策略 ```javascript // ❌ 不推荐:固定等待 await page.waitForTimeout(5000);

// ✅ 推荐:智能等待 await page.waitForLoadState('networkidle'); await page.waitForSelector('.item', { state: 'visible' }); await page.waitForFunction(() => { return document.querySelectorAll('.item').length > 10; }); ```

2. 错误处理 ```javascript try { await page.click('button'); await page.waitForSelector('.result', { timeout: 10000 }); } catch (error) { console.error('操作失败:', error.message); // 截图调试 await page.screenshot({ path: 'error.png' }); } ```

3. 性能优化 ```javascript // 阻止不必要的资源加载 await page.route('**/*.{png,jpg,jpeg,gif,svg}', route => route.abort()); await page.route('**/*.css', route => route.abort()); ```

4. 反爬应对 ```javascript // 设置用户代理 await page.setExtraHTTPHeaders({ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...' });

// 禁用webdriver标识 await page.evaluateOnNewDocument(() => { Object.defineProperty(navigator, 'webdriver', { get: () => false, }); }); ```

---

调试技巧

1. 截图 ```javascript await page.screenshot({ path: 'debug.png', fullPage: true }); ```

2. 打印HTML ```javascript const html = await page.content(); console.log(html); ```

3. 打印特定元素 ```javascript const element = await page.$('.item'); const html = await element.innerHTML(); console.log(html); ```

4. 控制台日志 ```javascript page.on('console', msg => { console.log('PAGE LOG:', msg.text()); }); ```

---

注意事项

安全提示 - ⚠️ 仅爬取公开信息 - ⚠️ 遵守robots.txt - ⚠️ 不要过度请求,避免给服务器造成压力 - ⚠️ 敏感数据不要保存

法律风险 - ⚠️ 确保爬取行为合法 - ⚠️ 不要爬取版权内容 - ⚠️ 不要绕过付费墙 - ⚠️ 不要爬取个人隐私数据

技术限制 - ⚠️ 反爬机制强的网站可能失败 - ⚠️ 需要复杂验证的平台不适用 - ⚠️ 页面结构变化需重新调试

---

安装依赖

```bash npm install playwright npx playwright install chromium ```

---

参考资料

---

创建时间: 2026-02-27 版本: 1.0.0 维护者: 米粒儿(AI Agent)

Use Cases

  • Scrape dynamic web content using Playwright browser automation
  • Handle single-page applications and JavaScript-rendered content
  • Extract data from tab-based and dynamically loaded web interfaces
  • Build reliable web scraping pipelines for modern web applications
  • Automate data extraction from complex, interactive web pages

Pros & Cons

Pros

  • +Compatible with multiple platforms including claude-code, openclaw
  • +Well-documented with detailed usage instructions and examples
  • +Purpose-built for testing tasks with focused functionality

Cons

  • -Documentation primarily in Chinese — may need translation for English users
  • -No built-in analytics or usage metrics dashboard

FAQ

What does Miliger Playwright Scraper do?
Miliger Playwright Scraper — testing and quality assurance tool. Supports Tab, SPA.
What platforms support Miliger Playwright Scraper?
Miliger Playwright Scraper is available on Claude Code, OpenClaw.
What are the use cases for Miliger Playwright Scraper?
Scrape dynamic web content using Playwright browser automation. Handle single-page applications and JavaScript-rendered content. Extract data from tab-based and dynamically loaded web interfaces.

100+ free AI tools

Writing, PDF, image, and developer tools — all in your browser.

Next Step

Use the skill detail page to evaluate fit and install steps. For a direct browser workflow, move into a focused tool route instead of staying in broader support surfaces.