Free tool · 2026
A/B Test Significance Calculator
Stopping a test early because the winner "looks right" is the single most common A/B testing mistake. Drop in your raw numbers — we'll tell you if you've actually seen a real lift or just noise.
Test inputs
Drop in your raw visitors and conversions for each variant.
Variant A (control): conversion rate 3.00%
Variant B (test): conversion rate 3.70%
Confidence level. Industry standard: 95%. Use 90% for early reads, 99% for high-stakes changes.
Verdict
What the math is doing
We run a pooled two-proportion z-test — the standard statistical test for whether two conversion rates are significantly different.
p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
se = √(p_pool × (1 − p_pool) × (1/v_a + 1/v_b))
z = (rate_b − rate_a) / se
confidence = 2 × Φ(|z|) − 1
At 95% confidence (industry standard), |z| ≥ 1.96 is the threshold. P-values below 0.05 (the complement of 95%) signal "this difference is unlikely to have happened by chance."
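If you want to sanity-check the verdict yourself, here is a minimal Python sketch of the same calculation. The function name and the 10,000-visitors-per-variant figures are ours (hypothetical); the 3.00% and 3.70% rates match the example inputs above.

from math import erf, sqrt

def ab_significance(visitors_a, conv_a, visitors_b, conv_b):
    """Pooled two-proportion z-test; returns (z, confidence)."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    # Pooled conversion rate across both variants
    p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
    # Standard error of the rate difference under the pooled rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se
    # Two-sided confidence via the standard normal CDF: Φ(|z|) = 0.5 * (1 + erf(|z| / √2))
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * phi - 1

# Hypothetical example: 10,000 visitors per variant, 3.00% vs 3.70% conversion
z, confidence = ab_significance(10_000, 300, 10_000, 370)
print(f"z = {z:.2f}, confidence = {confidence:.1%}")  # |z| ≥ 1.96 means significant at 95%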
Important caveats. (1) Don't peek at the test daily and stop the moment significance is hit — that biases the result. Decide your sample-size target up front and run to it. (2) Significance isn't the same as "big enough to matter." A statistically significant 0.3% lift might be operationally meaningless if your implementation cost is high.
Frequently asked
What does 95% confidence actually mean?
It means: if the variants truly had identical conversion rates, the probability of observing a difference this large (or larger) just by chance is less than 5%. It does NOT mean 'B is 95% likely to be better than A' — that's a common misinterpretation. The honest interpretation: at 95% confidence, you're accepting a 5% false-positive rate over many tests.
When should I use 90% vs 95% vs 99%?
95% is the default for almost everything in performance marketing — it balances false-positive risk against test duration. Use 90% for fast, low-risk creative tests (more false positives but faster shipping) and 99% for high-stakes irreversible changes (pricing, checkout flow, brand positioning).
How many conversions do I need before I read the verdict?
At least 50 per variant for a meaningful read; 100+ per variant for a confident one. Below 50 conversions, even a 30% apparent lift can fail to reach significance — the variance is huge at low volume. If your daily traffic is too low to hit 100 conversions per variant in 14 days, consider running fewer parallel tests or testing bigger changes (more likely to produce large lift).
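A rough way to check whether your traffic can get there in time, as a back-of-envelope Python sketch (the 2,000 daily visitors and 3% baseline are hypothetical numbers, not defaults from this tool):

import math

def days_to_target(daily_visitors, baseline_rate, variants=2, target_conversions=100):
    """Rough days needed for each variant to reach target_conversions,
    assuming traffic is split evenly across the variants."""
    conversions_per_variant_per_day = (daily_visitors / variants) * baseline_rate
    return math.ceil(target_conversions / conversions_per_variant_per_day)

print(days_to_target(2_000, 0.03))  # 4 days to reach 100 conversions per variant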
What's wrong with peeking at a test daily?
Each time you check, you have a small chance of seeing 'significance' by random fluctuation alone. Check 14 times and your effective false-positive rate is around 25%, not 5%. To peek safely, either pre-commit to a sample size up front, or use sequential testing methods (Bayesian A/B testing tools like Optimizely's stats engine, which adjust significance dynamically).
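To see the inflation for yourself, here is a small Monte Carlo sketch (our own illustration, not this tool's engine or any vendor's): it simulates A/A tests in which both variants share the same true rate, peeks at the z-test once per day, and counts how often any peek crosses the 95% threshold.

import numpy as np

def z_stat(va, ca, vb, cb):
    """Pooled two-proportion z statistic (0 when there are no conversions yet)."""
    p = (ca + cb) / (va + vb)
    if p == 0 or p == 1:
        return 0.0
    se = np.sqrt(p * (1 - p) * (1 / va + 1 / vb))
    return (cb / vb - ca / va) / se

def peeking_false_positive_rate(days=14, daily_visitors=1_000, rate=0.03, trials=5_000, seed=0):
    """Share of A/A tests (no true difference) called 'significant' on at least one daily peek."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(trials):
        va = vb = ca = cb = 0
        for _ in range(days):
            va += daily_visitors
            vb += daily_visitors
            ca += rng.binomial(daily_visitors, rate)  # simulated conversions, variant A
            cb += rng.binomial(daily_visitors, rate)  # simulated conversions, variant B
            if abs(z_stat(va, ca, vb, cb)) >= 1.96:   # daily peek at the 95% threshold
                false_positives += 1
                break
    return false_positives / trials

print(f"{peeking_false_positive_rate():.0%}")  # far above the nominal 5% false-positive rate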
How is statistical significance different from business significance?
Statistical significance asks 'is this lift real?'. Business significance asks 'is this lift worth shipping?'. A 0.4% lift on a 3% baseline (relative lift +13%) is huge in ecommerce. A 0.4% lift on a 30% baseline (relative lift +1.3%) might not survive minor implementation overhead. Always reason about both — significance is a green light to consider shipping, not a mandate to ship.
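The relative-lift arithmetic behind those two examples, as a quick sketch:

def relative_lift(baseline_rate, absolute_lift):
    """Relative lift = absolute (percentage-point) lift divided by the baseline rate."""
    return absolute_lift / baseline_rate

print(f"{relative_lift(0.03, 0.004):+.1%}")  # 0.4pp on a 3% baseline  -> +13.3% relative
print(f"{relative_lift(0.30, 0.004):+.1%}")  # 0.4pp on a 30% baseline -> +1.3% relative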
Should I use this for incrementality testing too?
Yes, but with caveats. Conversion-lift tests (Meta Conversion Lift, geo holdouts) use the same two-proportion math, but the treatment group is exposed to a campaign and the holdout isn't. The math is identical; the sample-size requirements are much larger because lift is usually smaller than in standard A/B tests. Use the conversion-lift sample-size calculator to plan; use this tool to read the result.
Stop guessing whether a creative actually wins.
Floowzy auto-tests creative variants across your accounts and flags significance once each variant has enough volume — no more eyeballing dashboards.