A/B Testing Framework
Caution
Design and implement A/B tests with proper statistical methodology, sample size calculation, feature flags, and significance testing for conversion optimization.
Install
Claude Code
Copy the SKILL.md file to `.claude/skills/a-b-testing.md`
About This Skill
A/B Testing Framework generates statistically rigorous experimentation infrastructure to avoid the common mistakes that invalidate most A/B tests.
Pre-Experiment Design
Sample size calculator with inputs: baseline conversion rate, minimum detectable effect (MDE), statistical power (80% default), and significance threshold (α=0.05). Runtime estimator based on current traffic volume. Multiple comparison correction (Bonferroni) for multi-variant tests.
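The power analysis above can be sketched with the standard normal-approximation formula for two proportions. This is an illustrative stdlib-only implementation, not the skill's actual code; the function names are my own.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate: float, mde: float,
                            power: float = 0.80, alpha: float = 0.05) -> int:
    """Per-variant sample size for a two-sided test of two proportions
    (normal approximation). `mde` is the absolute lift to detect."""
    p1 = baseline_rate
    p2 = baseline_rate + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha/2
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return math.ceil(n)

def runtime_days(n_per_variant: int, variants: int, daily_traffic: int) -> int:
    """Estimated days to fill every variant at current traffic volume."""
    return math.ceil(n_per_variant * variants / daily_traffic)
```

For example, detecting an absolute lift of 1 percentage point on a 5% baseline at 80% power needs roughly eight thousand users per variant, which is why the runtime estimator matters for low-traffic sites.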
Assignment
Deterministic user bucketing via MurmurHash3 on `user_id + experiment_id`. Ensures users see the same variant on every visit. Traffic allocation by percentage. Holdout groups for long-term effect measurement. Exclusion rules to prevent experiment interference.
Feature Flags
Integrates with LaunchDarkly, Unleash, or a self-hosted flag service. Server-side flag evaluation prevents flickering. SDK wrappers for React (useFlag hook), Python, and Go.
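The server-side wrapper pattern can be sketched as below. This is an illustrative stand-in, not the LaunchDarkly or Unleash API: the flag is resolved on the server before the page renders, so the user never sees one variant flash and then swap to another.

```python
class FlagClient:
    """Minimal server-side flag evaluator (illustrative stand-in for a
    real LaunchDarkly/Unleash client). Each experiment maps to a
    callable that resolves a user to a variant name."""

    def __init__(self, assignments):
        # assignments: experiment_id -> callable(user_id) -> variant name
        self._assignments = assignments

    def variant(self, experiment_id: str, user_id: str,
                default: str = "control") -> str:
        """Resolve the variant server-side; fall back to `default`
        for unknown experiments so a misconfigured flag fails safe."""
        assign = self._assignments.get(experiment_id)
        return assign(user_id) if assign else default
```

A React `useFlag` hook or a Go wrapper would sit on top of the same evaluate-then-render contract.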
Analysis
Frequentist — Z-test for proportions, t-test for continuous metrics, chi-square for multi-category. Confidence intervals. p-value with multiple testing correction.
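The pooled two-proportion Z-test and Bonferroni correction can be sketched with the standard library alone; this is an illustrative version, with the p-value computed from the normal CDF via `erf`:

```python
from math import erf, sqrt

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided pooled Z-test for the difference of two conversion
    rates. Returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

def bonferroni(p_values):
    """Bonferroni correction for multi-variant tests: each p-value is
    multiplied by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```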
Bayesian — Beta-Binomial conjugate model for conversion rates. Probability to be best, expected loss, credible intervals. Thompson Sampling for multi-armed bandit scenarios.
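Under the Beta-Binomial model, "probability to be best" has no simple closed form for comparisons, so it is usually estimated by Monte Carlo. A minimal sketch assuming a uniform Beta(1, 1) prior on each arm (the prior choice is exactly the subjective decision noted in the cons below):

```python
import random

def prob_to_beat(conv_a: int, n_a: int, conv_b: int, n_b: int,
                 draws: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(variant B's true rate > A's) under
    independent Beta(1, 1) priors (Beta-Binomial conjugate model)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + successes, 1 + failures).
        theta_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        theta_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += theta_b > theta_a
    return wins / draws
```

The same posterior draws drive Thompson Sampling: to pick the next arm in a bandit, sample one `theta` per arm and serve the argmax.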
Common Pitfalls Detection
Sample Ratio Mismatch (SRM) detection, novelty effect warnings when an early lift decays over the course of a test, and network effect warnings for social products where variants can contaminate each other.
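SRM detection is a chi-square goodness-of-fit test on observed variant counts against the planned split. A sketch for the two-variant case (df = 1), where the p-value reduces to a complementary error function; the names and the 0.001 threshold are illustrative, though a deliberately strict threshold is conventional for SRM:

```python
from math import erfc, sqrt

def srm_pvalue(count_a: int, count_b: int, expected_ratio: float = 0.5) -> float:
    """Chi-square (df=1) p-value for Sample Ratio Mismatch between two
    variants. `expected_ratio` is the planned traffic share for A."""
    total = count_a + count_b
    exp_a = total * expected_ratio
    exp_b = total * (1 - expected_ratio)
    chi2 = (count_a - exp_a) ** 2 / exp_a + (count_b - exp_b) ** 2 / exp_b
    # For df=1: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))

def has_srm(count_a: int, count_b: int,
            expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    """A tiny p-value here means the assignment itself is broken, so
    the experiment's results should be discarded, not analyzed."""
    return srm_pvalue(count_a, count_b, expected_ratio) < alpha
```

For example, 5,200 vs. 4,800 users on a planned 50/50 split looks harmless but fails the SRM check decisively.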
Use Cases
- Running landing page copy tests with proper power analysis and minimum detectable effect
- Implementing feature flag-based A/B tests with consistent user bucketing
- Analyzing experiment results with frequentist and Bayesian methods
- Designing multi-variate tests with proper traffic allocation across variants
Pros & Cons
Pros
- + Pre-experiment sample size calculator prevents underpowered tests
- + SRM detection catches assignment bugs that would otherwise invalidate results
- + Bayesian analysis provides probability-based decisions, not just p-value cutoffs
- + MurmurHash bucketing ensures consistent assignment without database storage
Cons
- - Minimum sample sizes mean small sites cannot reach significance on rare conversions
- - Bayesian analysis requires choosing priors which introduces subjective decisions
Related AI Tools
Claude Code
Paid
Anthropic's agentic CLI for autonomous terminal-native coding workflows
- Terminal-native autonomous coding agent
- Full file system and shell access for multi-step tasks
- Deep codebase understanding via repository indexing
Cursor
Freemium
AI-native code editor with deep multi-model integration and agentic coding
- AI-native Cmd+K inline editing and generation
- Composer Agent for autonomous multi-file changes
- Full codebase indexing and context awareness
GitHub Copilot
Freemium
AI pair programmer that suggests code in real time across your IDE
- Real-time code completions across 30+ languages
- Copilot Chat for natural language code Q&A
- Pull request description and summary generation
Related Skills
Metrics Dashboard Builder
Verified
Build operational metrics dashboards with Grafana, Prometheus, or Recharts displaying real-time KPIs, time-series charts, and configurable alerts.
Data Validator
Caution
Build data quality validation pipelines with schema enforcement, anomaly detection, referential integrity checks, and data quality reports.