Mistaken Thinking
There was a Presidential election in the US in 1936.
Republican Alfred Landon stood against Democrat Franklin Roosevelt.
The Literary Digest conducted a postal opinion poll amongst 10 million people (nearly a quarter of the electorate).
Based on the first 2.4 million responses, it predicted a convincing win for Landon: 55% to 41%.
It was spectacularly wrong.
Roosevelt won the election by 61% to 37%.
How could this have happened with such a big data set?
Especially as George Gallup’s poll, amongst just 3,000 people, came much closer to the actual result.
It’s because too much faith had been put in the scale of the data, rather than what was in it. In other words, The Literary Digest forgot about sample bias.
It had mailed questionnaires to people from lists of car registrations and telephone directories.
Those lists skewed heavily towards wealthier households, making the sample hugely unrepresentative of the Depression-era US electorate.
It was biased.
On top of this, it turned out that Landon supporters were also more likely to mail back their answers, stacking non-response bias on top of the selection bias.
For every person Gallup interviewed (in his unbiased sample), The Literary Digest received 800 responses.
But all it gave them was a very precise estimate of the wrong answer.
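That effect is easy to see in a short simulation (a minimal sketch: the 0.41 Roosevelt share among reachable households is an illustrative number chosen to mirror the Digest’s prediction, not real poll data):

```python
import random

random.seed(0)

TRUE_ROOSEVELT = 0.61  # Roosevelt's actual 1936 vote share

def unbiased_sample(n):
    # every voter has the same chance of being polled,
    # so each respondent backs Roosevelt with the true probability
    return sum(random.random() < TRUE_ROOSEVELT for _ in range(n)) / n

def biased_sample(n, roosevelt_share_among_reachable=0.41):
    # hypothetical: car/phone-list households lean Landon, so
    # Roosevelt support among those actually reached is only 41%
    return sum(random.random() < roosevelt_share_among_reachable
               for _ in range(n)) / n

small_fair = unbiased_sample(3_000)        # Gallup-sized, unbiased
huge_biased = biased_sample(2_400_000)     # Digest-sized, biased

print(f"3,000 unbiased:    {small_fair:.1%} for Roosevelt")
print(f"2.4M biased:       {huge_biased:.1%} for Roosevelt")
```

The huge sample pins its estimate down to a fraction of a percentage point, but around the wrong value; the small fair sample lands within a point or two of the truth. More data shrinks the error bars, not the bias.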
Big data is seductive, but the numbers don’t speak for themselves.
Nor do they remove the need for disciplined sampling.
Size isn’t everything.