Free tool · 2026

Conversion Lift Sample Size Calculator

Incrementality testing answers "did this campaign actually cause this revenue?" — the only honest measurement of ad ROI. Plan the test before you spend: how big the sample needs to be, and how long it'll take.

Test design

Plan a Meta Conversion Lift, geo holdout, or any 2-arm test.

Inputs:

  • Baseline (control) conversion rate, %
  • Smallest relative lift you want to be able to detect, %
  • Statistical power: 80% (probability of detecting a real lift; 80% is standard)
  • Confidence (two-sided): 95% (significance threshold; 95% is standard)
  • Daily visitors / users per arm
Results for the example inputs (2.5% baseline, 15% minimum detectable lift):

  • Sample size per group: 29.2K visitors. You'll need this many per arm; total test size: 58.4K.
  • Days to significance: 20 days (19.5 days of accrual per arm, rounded up).
  • Treatment rate (expected): 2.88%, i.e. baseline × (1 + 15%).
  • Minimum detectable absolute lift: +0.37pp (the percentage-point delta the test can detect).
  • Conversions needed per arm: 840 (treatment-side conversion count).

Reasonable test duration (20 days). This is the sweet spot: long enough to capture weekday/weekend patterns, short enough to act on results.

What this is for

Three test designs use the same math. Plan all of them with this tool:

  • Meta Conversion Lift / Snap Brand Lift / X Conversion API lift studies — platform splits your audience randomly into treatment and holdout.
  • Geo holdouts — you turn ads off in selected metros and compare to statistically matched metros where ads continue.
  • User-level holdouts — using a CDP, randomly hold 5-10% of your audience back from all marketing for the test period.

The math:

n per group ≈ ((z_α + z_β)² × variance) / lift²
where variance = p₁(1−p₁) + p₂(1−p₂)
and lift = p₂ − p₁, the absolute (percentage-point) lift

Higher power, tighter significance, smaller minimum detectable lift all push sample size up. Daily traffic per arm decides how long that sample takes to accumulate.
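The formula above can be sketched directly in Python using only the standard library (`statistics.NormalDist` supplies the z-values). The 2.5% baseline and 15% relative lift are the example values from the calculator above:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p1, rel_lift, power=0.80, confidence=0.95):
    """Per-arm n for a two-proportion test at a given relative lift."""
    p2 = p1 * (1 + rel_lift)                                  # expected treatment rate
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    abs_lift = p2 - p1                                        # percentage-point delta
    return ceil((z_alpha + z_beta) ** 2 * variance / abs_lift ** 2)

# The worked example above: 2.5% baseline, 15% relative lift
n = sample_size_per_arm(0.025, 0.15)
print(f"{n:,} per arm")  # ~29.2K, matching the calculator
```

Tightening any input (higher power, higher confidence, smaller lift) grows the result, which is the behavior described above.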

Pair this tool with the incrementality testing guide for the methodology details.

Frequently asked

What is incrementality testing?

Incrementality testing answers: 'would this conversion have happened anyway if we hadn't run the ads?' Standard attribution (last-click, MTA) gives the credit to whichever channel touched the user last. Incrementality measures actual causal lift by comparing a treatment group (sees ads) to a holdout (doesn't) and computing the conversion-rate difference. It's the only methodology that catches channels which are cannibalizing organic conversions you'd have earned anyway.

How is this different from a regular A/B test?

The math is identical — both are two-proportion tests. The setup differs: an A/B test compares two creative variants both shown to ad-exposed users. An incrementality test compares ad-exposed users to a group that sees no ads. Sample sizes are usually much larger for incrementality because lift is typically smaller (5-15% incremental lift is common; A/B winners often lift 10-30% relative to the loser).

What's a good minimum detectable lift to design for?

Industry norm: 10-20% relative lift. Smaller is more rigorous but explodes your sample size. If you're running a Meta Conversion Lift test on a large account, 10% is reasonable. For geo holdouts on a smaller account, you might design for 25% to keep the test affordable. Don't design for 5% unless you have very high daily volume — the test will take months.
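The lift² denominator is why a small minimum detectable lift explodes sample size. A quick sketch of the scaling, assuming a hypothetical 2.5% baseline and the same two-proportion formula as in "The math":

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1, rel_lift, power=0.80, conf=0.95):
    p2 = p1 * (1 + rel_lift)
    z = NormalDist().inv_cdf
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z(1 - (1 - conf) / 2) + z(power)) ** 2 * variance / (p2 - p1) ** 2)

for lift in (0.05, 0.10, 0.15, 0.25):
    print(f"{lift:.0%} MDL -> {n_per_arm(0.025, lift):,} per arm")
# Designing for 5% needs roughly 8-9x the sample of 15%;
# designing for 25% needs well under half of it.
```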

Why 80% statistical power?

Power is the probability of detecting a real lift if one exists. 80% is the industry default (you'll catch 4 out of 5 real lifts). 90% is more conservative but requires roughly a third more sample. 95% power is rare in marketing — it's usually overkill for what we're trying to measure.
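The sample-size cost of power depends only on the (z_α + z_β)² factor in the formula, so the ratio between power levels can be checked directly:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
z_alpha = z(0.975)  # two-sided 95% confidence

def z_factor(power):
    # (z_alpha + z_beta)^2 -- sample size scales linearly with this
    return (z_alpha + z(power)) ** 2

for power in (0.70, 0.80, 0.90, 0.95):
    rel = z_factor(power) / z_factor(0.80)
    print(f"{power:.0%} power -> {rel:.2f}x the sample of 80% power")
```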

Should I always use 95% confidence?

For directional reads (is this channel incremental?), 95% is standard. For high-stakes irreversible decisions (cutting a whole channel), use 99% to reduce false positives. For early-read 'should I bet on this' decisions, 90% is reasonable since you'll re-test later. Match the confidence threshold to the cost of being wrong.

What if my calculated test duration is longer than 8 weeks?

Three ways to shorten. (1) Increase daily traffic per group — for geo tests, this means picking larger / more metros. (2) Widen the minimum detectable lift — if you only care about lifts above 20%, design for that. (3) Reduce power to 70% — you accept more false negatives but finish faster. If none of those work, the test is impractical for this channel; use MMM (media-mix modeling) for measurement instead.

Run a real incrementality test every quarter.

Floowzy makes geo-holdout planning and result readouts straightforward — joined to your Stripe revenue, so you see actual incremental lift, not platform-reported numbers.