P-Values, Type I and Type II Errors Explained
What You’ll Learn
This lesson demystifies p-values and introduces you to the two critical ways A/B tests can mislead you: false positives (Type I errors) and false negatives (Type II errors). For The A/B Test Starter, mastering these concepts means avoiding costly mistakes like implementing winning variants that aren’t actually winners, or dismissing tests too early and missing real improvements.
Key Concepts
The p-value is the probability of observing your test results (or more extreme results) if the null hypothesis were true—that is, if there were genuinely no difference between your control and variant. Type I and Type II errors represent the two ways this probability framework can fail: Type I is a false positive (claiming victory when there’s no real difference), and Type II is a false negative (missing a real winner because you stopped testing too soon or had insufficient sample size). For The A/B Test Starter, understanding these errors helps you set appropriate confidence levels and sample sizes before you launch, rather than discovering problems after implementation.
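To make the definition concrete, here is a minimal sketch of how a p-value can be computed for a conversion-rate A/B test using a pooled two-proportion z-test. The visitor and conversion counts are made-up illustration numbers, not data from this lesson.

```python
# A minimal sketch: two-sided p-value for a conversion-rate A/B test,
# computed with a pooled two-proportion z-test. Counts are hypothetical.
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)           # rate assuming "no difference"
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                               # standard errors between variants
    return 2 * norm.sf(abs(z))                         # P(result at least this extreme | null)

# Hypothetical test: control converts 500/10,000, variant converts 570/10,000
p = two_proportion_p_value(500, 10_000, 570, 10_000)
print(f"p-value = {p:.4f}")                            # below 0.05 -> unlikely under "no difference"
```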
- P-Value Definition: A p-value of 0.05 means there’s only a 5% probability you’d see a difference at least as large as the one you observed if the variants truly performed identically. When your test reports p < 0.05, the observed difference is unlikely to be pure chance (the sketch above shows how this probability is computed for conversion data). Importantly, a p-value is NOT the probability that the winning variant is actually better; it’s the probability of the data given that there’s no real difference.
- Type I Error (False Positive): This occurs when you declare a variant the winner when it’s actually performing the same as the control (or worse). At a 95% confidence level, you’re accepting a 5% Type I error rate: roughly 1 in 20 tests where the variant truly makes no difference will still show a “significant” winner. This is why implementing every “significant” test result without strategic judgment leads to accumulating damage over time.
- Type II Error (False Negative): This is when a variant actually performs better than control, but your test fails to detect it, usually because you stopped testing too early or had too small a sample size. The standard guard is statistical power: aim for at least 80% power, which means accepting no more than a 20% chance of a Type II error. A low-powered test is likely to miss real improvements, especially smaller but meaningful ones.
- The Trade-Off Between Errors: Lowering your confidence threshold (requiring less evidence) increases Type I errors but decreases Type II errors, and vice versa; the simulation sketch after this list shows both rates shifting as the threshold moves. For The A/B Test Starter, the 95% confidence standard represents a reasonable middle ground, but you should consciously choose your threshold based on the cost of each error type: high-stakes changes warrant 99% confidence, while low-risk changes can use 90%.
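The trade-off is easiest to see by simulation. The sketch below runs many simulated tests twice: once with no true difference (so every “winner” is a Type I error) and once with a real lift (so every miss is a Type II error), then reports both error rates at three thresholds. The 5% baseline rate, 10% relative lift, and sample size are assumptions chosen for illustration.

```python
# A rough simulation of the Type I / Type II trade-off. Baseline rate,
# lift, and sample size are assumptions, not numbers from this lesson.
import numpy as np
from math import sqrt
from scipy.stats import norm

rng = np.random.default_rng(42)

def p_value(conv_a, conv_b, n):
    """Two-sided pooled two-proportion z-test p-value (equal sample sizes)."""
    p_a, p_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return 2 * norm.sf(abs((p_b - p_a) / se))

def simulate(rate_a, rate_b, n=10_000, trials=5_000):
    """Return p-values from many simulated tests with the given true rates."""
    conv_a = rng.binomial(n, rate_a, trials)
    conv_b = rng.binomial(n, rate_b, trials)
    return np.array([p_value(a, b, n) for a, b in zip(conv_a, conv_b)])

p_null = simulate(0.05, 0.05)    # variant truly identical to control
p_lift = simulate(0.05, 0.055)   # variant truly 10% better (relative)

for alpha in (0.10, 0.05, 0.01):
    type_1 = np.mean(p_null < alpha)    # false positives when there is no difference
    type_2 = np.mean(p_lift >= alpha)   # missed real winners
    print(f"alpha={alpha:.2f}  Type I rate ~ {type_1:.3f}  Type II rate ~ {type_2:.3f}")
```

As the threshold (alpha) shrinks, the Type I rate falls while the Type II rate rises, which is exactly the trade-off described above.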
Practical Application
Identify one current or recent A/B test and explicitly label whether you’d be making a Type I or Type II error if it were wrong: write “If we implement this ‘winner,’ we risk a Type I error” or “If we stop this test now, we risk a Type II error.” Then use this categorization to decide whether you need additional confirmation testing or larger sample sizes before taking action.
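If your categorization points to a possible Type II error, the practical question is how much more data you need. Below is a minimal sketch of an approximate per-arm sample-size calculation for a two-proportion test; the 5% baseline and 10% relative lift are hypothetical numbers, so substitute your own before acting on the result.

```python
# A minimal sketch of the "do we need a larger sample?" check. The baseline
# rate and target lift are hypothetical illustration numbers.
from math import sqrt, ceil
from scipy.stats import norm

def visitors_per_arm(base_rate, lifted_rate, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # evidence bar set by the confidence level
    z_beta = norm.ppf(power)            # margin needed to keep Type II risk at 1 - power
    variance = base_rate * (1 - base_rate) + lifted_rate * (1 - lifted_rate)
    n = (z_alpha + z_beta) ** 2 * variance / (base_rate - lifted_rate) ** 2
    return ceil(n)

# Example: 5% baseline, hoping to detect a lift to 5.5% (10% relative)
print(visitors_per_arm(0.05, 0.055))               # at 95% confidence
print(visitors_per_arm(0.05, 0.055, alpha=0.01))   # stricter 99% confidence needs more visitors
```

Note how tightening the confidence level from 95% to 99% raises the required sample size, which is the same cost-of-errors trade-off you weigh when choosing a threshold.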