How Statistical Generalizations Fail
Statistical regularities are powerful when the underlying assumptions hold.
They fail — sometimes catastrophically — when those assumptions break.
Here are the main pathways for failure:
1. Sample Size Too Small
Small samples exaggerate noise and suppress the underlying distributional signal.
What fails
Central Limit Theorem approximations
Frequency-based distributions
Power-law and Benford-type distributions
Example
A small town’s incomes won’t follow the log-normal or power-law shapes typical of large populations. Applying Benford-like tests to their financial data will flag it as “non-compliant,” not because of fraud but because the population is too small to generate the law’s dynamics.
General lesson
Many statistical laws are asymptotic: they need large N to work.
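A minimal simulation makes the asymptotic point concrete. The sketch below (illustrative parameters, not real data) draws from a wide log-normal, a classic Benford-friendly process, and compares first-digit frequencies at N = 100,000 versus N = 30:

```python
import math
import random
from collections import Counter

def first_digit(x):
    # Leading digit via scientific notation; x must be positive.
    return int(f"{x:.6e}"[0])

def benford_expected(d):
    return math.log10(1 + 1 / d)

def mad_from_benford(values):
    # Mean absolute deviation of observed first-digit frequencies
    # from Benford's predicted frequencies.
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return sum(abs(counts.get(d, 0) / n - benford_expected(d))
               for d in range(1, 10)) / 9

random.seed(0)
# A Benford-friendly generative process: multiplicative growth spanning
# several orders of magnitude (log-normal with a wide sigma).
big = [random.lognormvariate(0, 4) for _ in range(100_000)]
small = big[:30]  # a tiny sample from the same process

print(mad_from_benford(big))    # close to zero
print(mad_from_benford(small))  # much larger: with 30 points, noise dominates
```

Same process, same law; only the sample size differs. The small sample deviates not because Benford is wrong but because 30 draws cannot resolve frequencies like 0.301 versus 0.046.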
2. Violated Independence or Identical Distribution (non-IID data)
Statistical generalizations often assume:
independence
stationarity
identical underlying process
Break any of these and the generalization collapses.
Example
Polling percentages for head-to-head races are not IID:
they’re bounded (0–100)
they’re correlated over time
they’re influenced by structural factors (demographics, sample weighting)
Because of this, Benford’s Law should not apply to polling percentages: they are not generated by the multiplicative, scale-invariant processes that Benford behavior requires.
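You can see the bounded, additive structure directly. This sketch uses hypothetical polling shares (Gaussian noise around 50%, clipped to the 0–100 range, purely illustrative) and checks where their first digits land:

```python
import random
from collections import Counter

random.seed(1)

def first_digit(x):
    # Leading digit via scientific notation; x must be positive.
    return int(f"{x:.6e}"[0])

# Hypothetical head-to-head polling shares: additive noise around 50%,
# clipped to the (0, 100) bounds -- not a multiplicative process.
polls = [min(max(random.gauss(50.0, 5.0), 0.1), 99.9) for _ in range(10_000)]

counts = Counter(first_digit(p) for p in polls)
freqs = {d: counts.get(d, 0) / len(polls) for d in range(1, 10)}

# Benford predicts about 0.097 + 0.079 = 0.176 for digits 4 and 5 combined;
# bounded polling shares pile almost everything onto those two digits.
print(freqs[4] + freqs[5])
```

A Benford test applied here would scream “anomaly” on almost every poll, which tells you the test’s assumptions were violated, not that the polls are fake.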
3. Distribution Mechanism Doesn’t Match the Law
Many famous statistical generalizations (Benford’s, Zipf’s, Pareto, log-normal distributions) arise only when the generative mechanism satisfies specific properties.
Benford-specific requirements:
Benford’s Law emerges when:
Numbers span several orders of magnitude
Growth is multiplicative
There’s no artificial upper bound
The dataset combines heterogeneous sources (“mixture distributions”)
If even one condition fails, Benford disappears.
Common failure modes:
bounded datasets (e.g., human heights, school test scores)
additive processes instead of multiplicative ones
engineered or human-curated numbers (prices ending in .99)
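The additive-versus-multiplicative contrast is easy to demonstrate. The sketch below uses illustrative parameters: bounded height-like data (almost everything lands in 100–199 cm, so the first digit is nearly always 1) against a product of many random factors, which spans several orders of magnitude:

```python
import math
import random

random.seed(2)

def first_digit(x):
    # Leading digit via scientific notation; x must be positive.
    return int(f"{x:.6e}"[0])

def digit1_share(values):
    return sum(1 for v in values if first_digit(v) == 1) / len(values)

N = 20_000
# Bounded, additive-style data: human heights in cm (illustrative
# parameters). Nearly every value lands in [100, 200), so the first
# digit is almost always 1.
heights_cm = [random.gauss(170, 10) for _ in range(N)]

# Multiplicative data: a product of many random factors spans several
# orders of magnitude and converges toward Benford.
products = [math.prod(random.uniform(1, 10) for _ in range(20)) for _ in range(N)]

print(digit1_share(heights_cm))  # ~1.0, nowhere near Benford's 0.301
print(digit1_share(products))    # ~0.301 = log10(2)
```

One generative mechanism produces Benford behavior; the other, with the same amount of data, cannot.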
4. Survivorship Bias / Truncation / Censoring
If some values are systematically missing, the distribution becomes distorted.
Example
World population by country roughly follows a log-normal shape, but if you:
remove small islands
truncate extremely small values
exclude regions with poor census reporting
…the distribution no longer looks log-normal, even though the underlying generative process is unchanged.
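A small simulation shows the distortion. The sketch below uses log-normal draws as a stand-in for country populations (illustrative parameters, not real census data): the logs of the full sample are symmetric, but censoring the smallest quartile leaves them visibly right-skewed:

```python
import math
import random
import statistics

random.seed(3)

# Stand-in for country populations: log-normal draws (illustrative
# parameters, not real census data).
pops = [random.lognormvariate(15, 2) for _ in range(50_000)]

def log_skew(values):
    # Mean-minus-median of the logs, in units of their stdev.
    # Near zero for a genuinely log-normal sample.
    logs = [math.log(v) for v in values]
    return (statistics.fmean(logs) - statistics.median(logs)) / statistics.stdev(logs)

# Censoring: drop the smallest quartile (small islands, regions with
# poor census coverage).
cutoff = sorted(pops)[len(pops) // 4]
censored = [p for p in pops if p >= cutoff]

print(log_skew(pops))      # near 0: logs are symmetric
print(log_skew(censored))  # clearly positive: left-truncation skews the logs
```

The censored sample fails a log-normality check even though every surviving value came from exactly the same process.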
5. Temporal Instability (non-stationary data)
Statistical laws assume stable generative processes.
When the underlying reality changes (demographics shift, technology adoption accelerates, preferences shift), the statistics begin to “fail” — not because the math is wrong, but because the system changed.
Example
Election polling after 2016:
Nonresponse bias increased
Education weighting changed
Late-deciding voters became more volatile
Traditional polling generalizations began failing because population behavior changed, not because the statistical methods did.
6. Human-Generated Data
Humans introduce biases that destroy naturally emergent statistical patterns:
round numbers
heuristics
conventions (e.g., prices at $19.99)
strategic presentation (e.g., finance numbers chosen to look neat)
This is why:
ZIP codes don’t follow Benford
phone numbers don’t
product prices don’t
polling percentages don’t
They are constructed, not emergent.
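ZIP codes make the “constructed, not emergent” point especially clean. The sketch below simulates ZIP-code-like identifiers drawn uniformly from a fixed range (hypothetical values, not real ZIP codes) and checks the leading digit:

```python
import random
from collections import Counter

random.seed(4)

# Assigned identifiers (ZIP-code-like): drawn uniformly from a fixed
# range. They are constructed labels, not the output of a growth process.
ids = [random.randint(10_000, 99_999) for _ in range(50_000)]

counts = Counter(int(str(i)[0]) for i in ids)
freqs = {d: counts.get(d, 0) / len(ids) for d in range(1, 10)}

# Benford predicts ~0.301 for a leading 1; uniform assignment gives ~1/9.
print(freqs[1])
```

Each leading digit appears about one time in nine, because an administrator allocated the numbers evenly; no multiplicative dynamics ever touched them.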
7. Overfitting or Mis-generalization
Sometimes a statistical rule fits the historical data but not the future, or fits the data without reflecting the underlying mechanism.
This includes:
Simpson’s paradox
spurious correlations
structural breaks
concept drift
Example
A distribution may look exponential for 30 years and then flatten due to new technology, regulation, or demographic turnover.
This is where metaphors like “trend lines approaching zero or infinity post-2050” show their teeth — the map changes.
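Simpson’s paradox, listed above, can be shown in a few lines. This sketch uses the classic kidney-stone treatment counts often cited as the textbook instance of the paradox, formatted as (successes, trials):

```python
# Classic kidney-stone treatment counts, a textbook Simpson's paradox:
# (successes, trials) per treatment per severity stratum.
data = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Within every stratum, treatment A outperforms B...
for stratum, arms in data.items():
    print(stratum, rate(*arms["A"]) > rate(*arms["B"]))  # True, True

# ...yet the pooled totals reverse the ranking.
a = tuple(sum(arms["A"][i] for arms in data.values()) for i in (0, 1))
b = tuple(sum(arms["B"][i] for arms in data.values()) for i in (0, 1))
print(rate(*a) < rate(*b))  # True: B looks better overall
```

The stratified rule and the pooled rule both “fit the data,” yet they generalize in opposite directions; which one to trust depends on the causal structure, not the arithmetic.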

