Free tool · 2026

A/B Test Significance Calculator

Stopping a test early because the winner "looks right" is the single most common A/B testing mistake. Drop in your raw numbers — we'll tell you if you've actually seen a real lift or just noise.

Test inputs

Drop in your raw visitors and conversions for each variant.

Variant A (control)

Conversion rate

3.00%

Variant B (test)

Conversion rate

3.70%

Confidence threshold: 95%

Industry standard: 95%. Use 90% for early reads, 99% for high-stakes.

Verdict

B wins with 95.6% confidence
Relative lift
+23.3%
B vs A conversion rate change
Absolute lift
+0.70pp
Percentage-point delta
Z-score
2.02
|z| ≥ 1.96 ≈ 95% confidence
P-value
0.0437
Two-sided
Variant B beats Variant A by 23.3% with 95.6% confidence. Ship it. Run a 7-day post-launch hold-back to confirm the lift persists once 100% of traffic sees B — sometimes the variant-vs-variant signal differs from variant-vs-baseline.

What the math is doing

We run a pooled two-proportion z-test — the standard statistical test for whether two conversion rates are significantly different.

p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
se = √(p_pool × (1 − p_pool) × (1/visitors_a + 1/visitors_b))
z = (rate_b − rate_a) / se
confidence = 2 × Φ(|z|) − 1

At 95% confidence (industry standard), |z| ≥ 1.96 is the threshold. P-values below 0.05 (the complement of 95%) signal "this difference is unlikely to have happened by chance."
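The four formulas above translate directly into code. Here's a minimal Python sketch; the visitor and conversion counts below are hypothetical, chosen only to land near the 3.00% / 3.70% example rates shown in the calculator:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, visitors_a, conv_b, visitors_b):
    """Pooled two-proportion z-test, following the formulas above."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF
    p_value = 2 * (1 - phi(abs(z)))               # two-sided
    confidence = 2 * phi(abs(z)) - 1
    return z, p_value, confidence

# Hypothetical counts: 162/5400 = 3.00% vs 200/5400 ≈ 3.70%
z, p, conf = two_proportion_z_test(162, 5400, 200, 5400)
```

With these made-up counts, z clears the 1.96 threshold and the two-sided p-value falls just under 0.05, which is exactly the "marginally significant" shape of the example verdict.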

Important caveats. (1) Don't peek at the test daily and stop the moment significance is hit — that biases the result. Decide your sample-size target up front and run to it. (2) Significance isn't the same as "big enough to matter." A statistically significant 0.3% lift might be operationally meaningless if your implementation cost is high.

Frequently asked

What does 95% confidence actually mean?

It means: if the variants truly had identical conversion rates, the probability of observing a difference this large (or larger) just by chance is less than 5%. It does NOT mean 'B is 95% likely to be better than A' — that's a common misinterpretation. The honest interpretation: at 95% confidence, you're accepting a 5% false-positive rate over many tests.

When should I use 90% vs 95% vs 99%?

95% is the default for almost everything in performance marketing — it balances false-positive risk against test duration. Use 90% for fast, low-risk creative tests (more false positives but faster shipping) and 99% for high-stakes irreversible changes (pricing, checkout flow, brand positioning).

How many conversions do I need before I read the verdict?

At least 50 per variant for a meaningful read; 100+ per variant for a confident one. Below 50 conversions, even a 30% apparent lift can fail to reach significance — the variance is huge at low volume. If your daily traffic is too low to hit 100 conversions per variant in 14 days, consider running fewer parallel tests or testing bigger changes (more likely to produce large lift).
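If you want a number instead of a rule of thumb, one standard approximation for the required sample size per variant uses the baseline rate, the rate you hope to detect, and z-values for confidence and power. A sketch, assuming 95% confidence (z ≈ 1.96) and 80% power (z ≈ 0.84), both conventional defaults:

```python
from math import ceil

def sample_size_per_variant(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per variant to detect a lift from
    p1 to p2 at 95% confidence with 80% power (standard approximation)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = sample_size_per_variant(0.03, 0.037)  # the example's 3.00% -> 3.70% lift
```

For a 3.00% → 3.70% lift this lands above ten thousand visitors per variant, roughly 300 conversions each, which is why the 100-conversion figure above is a floor for reading the verdict at all, not a target for a properly powered test.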

What's wrong with peeking at a test daily?

Each time you check, you have a small chance of seeing 'significance' by random fluctuation alone. Check 14 times and your effective false-positive rate is around 25%, not 5%. To peek safely, either pre-commit to a sample size up front, or use sequential testing methods (Bayesian A/B testing tools like Optimizely's stats engine, which adjust significance dynamically).
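You can see this inflation directly with a quick Monte Carlo sketch. Under the null (an A/A test with no real difference), the running z-score behaves approximately like a scaled Gaussian random walk, so we can simulate daily peeks without simulating individual visitors; the peek count and trial count below are illustrative choices:

```python
import random
from math import sqrt

def peeking_false_positive_rate(peeks=14, trials=20000, seed=42):
    """Estimate how often an A/A test (no real difference) crosses
    |z| >= 1.96 at ANY of `peeks` equally spaced checks, using the
    normal approximation to the running z-score."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        walk = 0.0
        for day in range(1, peeks + 1):
            walk += rng.gauss(0.0, 1.0)  # one day's standardized noise
            if abs(walk / sqrt(day)) >= 1.96:  # z-score after `day` days
                hits += 1
                break
    return hits / trials
```

With a single peek the estimate sits near the nominal 5%; with 14 daily peeks it climbs well above 20%, in the ballpark of the figure quoted above.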

How is statistical significance different from business significance?

Statistical significance asks 'is this lift real?'. Business significance asks 'is this lift worth shipping?'. A 0.4% lift on a 3% baseline (relative lift +13%) is huge in ecommerce. A 0.4% lift on a 30% baseline (relative lift +1.3%) might not survive minor implementation overhead. Always reason about both — significance is a green light to consider shipping, not a mandate to ship.

Should I use this for incrementality testing too?

Yes, but with caveats. Conversion-lift tests (Meta Conversion Lift, geo holdouts) use the same two-proportion math, but the treatment group is exposed to a campaign and the holdout isn't. The math is identical; the sample-size requirements are much larger because lift is usually smaller than in standard A/B tests. Use the conversion-lift sample-size calculator to plan; use this tool to read the result.

Stop guessing whether a creative actually wins.

Floowzy auto-tests creative variants across your accounts and flags significance once each variant has enough volume — no more eyeballing dashboards.