How Statistical Generalizations Fail
Statistical regularities are powerful when the underlying assumptions hold.
They fail — sometimes catastrophically — when those assumptions break.
Here are the main pathways for failure:
1. Sample Size Too Small
Small samples exaggerate noise and suppress the underlying distributional signal.
What fails
Central Limit Theorem approximations
Frequency-based distributions
Power-law and Benford-type distributions
Example
A small town’s incomes won’t follow the log-normal or power-law shapes typical of large populations. Applying Benford-like tests to their financial data will flag it as “non-compliant,” not because of fraud but because the population is too small to generate the law’s dynamics.
General lesson
Many statistical laws are asymptotic: they need large N to work.
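A minimal simulation makes the asymptotic point concrete. The sketch below (illustrative parameters, not real data) draws from a wide log-normal, a classic Benford-friendly process, and compares first-digit frequencies at N = 100,000 versus N = 30:

```python
import math
import random
from collections import Counter

def first_digit(x):
    # Leading digit via scientific notation; x must be positive.
    return int(f"{x:.6e}"[0])

def benford_expected(d):
    return math.log10(1 + 1 / d)

def mad_from_benford(values):
    # Mean absolute deviation of observed first-digit frequencies
    # from Benford's predicted frequencies.
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return sum(abs(counts.get(d, 0) / n - benford_expected(d))
               for d in range(1, 10)) / 9

random.seed(0)
# A Benford-friendly generative process: multiplicative growth spanning
# several orders of magnitude (log-normal with a wide sigma).
big = [random.lognormvariate(0, 4) for _ in range(100_000)]
small = big[:30]  # a tiny sample from the same process

print(mad_from_benford(big))    # close to zero
print(mad_from_benford(small))  # much larger: with 30 points, noise dominates
```

Same process, same law; only the sample size differs. The small sample deviates not because Benford is wrong but because 30 draws cannot resolve frequencies like 0.301 versus 0.046.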
2. Violated Independence or Identical Distribution (non-IID data)
Statistical generalizations often assume:
independence
stationarity
identical underlying process
Break any of these and the generalization collapses.
Example
Polling percentages for head-to-head races are not IID:
they’re bounded (0–100)
they’re correlated over time
they’re influenced by structural factors (demographics, sample weighting)
Because of this, Benford’s Law should not apply to polling percentages: they are not generated by the multiplicative, scale-invariant processes that Benford behavior requires.
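You can see the bounded, additive structure directly. This sketch uses hypothetical polling shares (Gaussian noise around 50%, clipped to the 0–100 range, purely illustrative) and checks where their first digits land:

```python
import random
from collections import Counter

random.seed(1)

def first_digit(x):
    # Leading digit via scientific notation; x must be positive.
    return int(f"{x:.6e}"[0])

# Hypothetical head-to-head polling shares: additive noise around 50%,
# clipped to the (0, 100) bounds -- not a multiplicative process.
polls = [min(max(random.gauss(50.0, 5.0), 0.1), 99.9) for _ in range(10_000)]

counts = Counter(first_digit(p) for p in polls)
freqs = {d: counts.get(d, 0) / len(polls) for d in range(1, 10)}

# Benford predicts about 0.097 + 0.079 = 0.176 for digits 4 and 5 combined;
# bounded polling shares pile almost everything onto those two digits.
print(freqs[4] + freqs[5])
```

A Benford test applied here would scream “anomaly” on almost every poll, which tells you the test’s assumptions were violated, not that the polls are fake.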
3. Distribution Mechanism Doesn’t Match the Law
Many famous statistical generalizations (Benford’s, Zipf’s, Pareto, log-normal distributions) arise only when the generative mechanism satisfies specific properties.
Benford-specific requirements:
Benford’s Law emerges when:
Numbers span several orders of magnitude
Growth is multiplicative
There’s no artificial upper bound
The dataset combines heterogeneous sources (“mixture distributions”)
If even one condition fails, Benford disappears.
Common failure modes:
bounded datasets (e.g., human heights, school test scores)
additive processes instead of multiplicative ones
engineered or human-curated numbers (prices ending in .99)
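The additive-versus-multiplicative contrast is easy to demonstrate. The sketch below uses illustrative parameters: bounded height-like data (almost everything lands in 100–199 cm, so the first digit is nearly always 1) against a product of many random factors, which spans several orders of magnitude:

```python
import math
import random

random.seed(2)

def first_digit(x):
    # Leading digit via scientific notation; x must be positive.
    return int(f"{x:.6e}"[0])

def digit1_share(values):
    return sum(1 for v in values if first_digit(v) == 1) / len(values)

N = 20_000
# Bounded, additive-style data: human heights in cm (illustrative
# parameters). Nearly every value lands in [100, 200), so the first
# digit is almost always 1.
heights_cm = [random.gauss(170, 10) for _ in range(N)]

# Multiplicative data: a product of many random factors spans several
# orders of magnitude and converges toward Benford.
products = [math.prod(random.uniform(1, 10) for _ in range(20)) for _ in range(N)]

print(digit1_share(heights_cm))  # ~1.0, nowhere near Benford's 0.301
print(digit1_share(products))    # ~0.301 = log10(2)
```

One generative mechanism produces Benford behavior; the other, with the same amount of data, cannot.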
4. Survivorship Bias / Truncation / Censoring
If some values are systematically missing, the distribution becomes distorted.
Example
World population by country roughly follows a log-normal shape, but if you:
remove small islands
truncate extremely small values
exclude regions with poor census reporting
…the distribution no longer looks log-normal, even though the underlying generative process is unchanged.
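A small simulation shows the distortion. The sketch below uses log-normal draws as a stand-in for country populations (illustrative parameters, not real census data): the logs of the full sample are symmetric, but censoring the smallest quartile leaves them visibly right-skewed:

```python
import math
import random
import statistics

random.seed(3)

# Stand-in for country populations: log-normal draws (illustrative
# parameters, not real census data).
pops = [random.lognormvariate(15, 2) for _ in range(50_000)]

def log_skew(values):
    # Mean-minus-median of the logs, in units of their stdev.
    # Near zero for a genuinely log-normal sample.
    logs = [math.log(v) for v in values]
    return (statistics.fmean(logs) - statistics.median(logs)) / statistics.stdev(logs)

# Censoring: drop the smallest quartile (small islands, regions with
# poor census coverage).
cutoff = sorted(pops)[len(pops) // 4]
censored = [p for p in pops if p >= cutoff]

print(log_skew(pops))      # near 0: logs are symmetric
print(log_skew(censored))  # clearly positive: left-truncation skews the logs
```

The censored sample fails a log-normality check even though every surviving value came from exactly the same process.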
5. Temporal Instability (non-stationary data)
Statistical laws assume stable generative processes.
When the underlying reality changes (demographics shift, technology adoption accelerates, preferences shift), the statistics begin to “fail” — not because the math is wrong, but because the system changed.
Example
Election polling after 2016:
Nonresponse bias increased
Education weighting changed
Late-deciding voters became more volatile
Traditional polling generalizations began failing because population behavior changed, not because the statistical methods did.
6. Human-Generated Data
Humans introduce biases that destroy naturally emergent statistical patterns:
round numbers
heuristics
conventions (e.g., prices at $19.99)
strategic presentation (e.g., finance numbers chosen to look neat)
This is why:
ZIP codes don’t follow Benford
phone numbers don’t
product prices don’t
polling percentages don’t
They are constructed, not emergent.
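ZIP codes make the “constructed, not emergent” point especially clean. The sketch below simulates ZIP-code-like identifiers drawn uniformly from a fixed range (hypothetical values, not real ZIP codes) and checks the leading digit:

```python
import random
from collections import Counter

random.seed(4)

# Assigned identifiers (ZIP-code-like): drawn uniformly from a fixed
# range. They are constructed labels, not the output of a growth process.
ids = [random.randint(10_000, 99_999) for _ in range(50_000)]

counts = Counter(int(str(i)[0]) for i in ids)
freqs = {d: counts.get(d, 0) / len(ids) for d in range(1, 10)}

# Benford predicts ~0.301 for a leading 1; uniform assignment gives ~1/9.
print(freqs[1])
```

Each leading digit appears about one time in nine, because an administrator allocated the numbers evenly; no multiplicative dynamics ever touched them.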
7. Overfitting or Mis-generalization
Sometimes a statistical rule fits the historical data but not the future, or fits the data without reflecting the underlying mechanism.
This includes:
Simpson’s paradox
spurious correlations
structural breaks
concept drift
Example
A distribution may look exponential for 30 years and then flatten due to new technology, regulation, or demographic turnover.
This is where metaphors like “trend lines approaching zero or infinity post-2050” show their teeth — the map changes.
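Simpson’s paradox, listed above, can be shown in a few lines. This sketch uses the classic kidney-stone treatment counts often cited as the textbook instance of the paradox, formatted as (successes, trials):

```python
# Classic kidney-stone treatment counts, a textbook Simpson's paradox:
# (successes, trials) per treatment per severity stratum.
data = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Within every stratum, treatment A outperforms B...
for stratum, arms in data.items():
    print(stratum, rate(*arms["A"]) > rate(*arms["B"]))  # True, True

# ...yet the pooled totals reverse the ranking.
a = tuple(sum(arms["A"][i] for arms in data.values()) for i in (0, 1))
b = tuple(sum(arms["B"][i] for arms in data.values()) for i in (0, 1))
print(rate(*a) < rate(*b))  # True: B looks better overall
```

The stratified rule and the pooled rule both “fit the data,” yet they generalize in opposite directions; which one to trust depends on the causal structure, not the arithmetic.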

