On Monday, Troy Hunt published a blog post suggesting that there is a new trend of historical data breaches showing up on dark web markets. The blog post observed a cluster of four advertisements of data from past breaches, all from the same seller on the same market, in the same month. Based on this data, Troy suggested that this was a new trend. This got us thinking about anecdotes and statistical inference. What, if anything, can we conclude from a couple of anecdotes? Here’s our take.
Yes, the plural of anecdote is data.
The negated version of this sentence – ‘the plural of anecdote is not data’ – is often presented as advice in a professional seminar or as an insult in a debate. This is problematic in part because it is wrong, but more because it reflects a misunderstanding of how and why statistical inference works.
There is nothing wrong with anecdotes. A dataset is nothing more than a collection of observations, each anecdotal or non-representative of the population on its own. One can refer to a collection of anecdotes as a ‘random collection of observations’ or even a ‘random sample’ and arrive naturally at the functional unit of experimental science and modern statistics.
This simplification is unfair to the common wisdom. The implicit assumption of the ‘anecdotes are not data’ quote is that anecdotes are not random: that a collection of anecdotes must have been selected for a specific reason and is necessarily biased. The common wisdom assumes that a collection of biased anecdotes cannot be used to make accurate inferences.
This is a false dilemma. It is possible to make useful, even accurate, inferences from a small biased subset of data (yes, even a single anomaly). In fact, statisticians do this all the time.
The anecdote-versus-data dichotomy is flawed because it emphasizes the wrong relationship. The relevant transformation is not between anecdotes and data (which are equivalent). The relevant transformation is between data and evidence.
Data is not itself evidence.
The fact that data is neither inherently meaningful nor objective is largely overlooked amid the hype of our data-driven tech culture. While this is worth a longer discussion, we can attempt to summarize decades of philosophical, statistical, and scientific debates into a single sentence:
Evidence is a quantification of the amount of surprise one should feel at having observed the data given a specific default hypothesis.
A dataset is necessary for evidence, but it is not sufficient. Evidence requires comparing the new observations against the specific predictions of an existing world view.
In practice, this transformation may take the form of comparing new data against a null hypothesis, a Bayesian prior, or informal gut instinct, e.g. ‘Your friend and my friend are BOTH named John and BOTH have a tattoo of a velociraptor. The odds of that happening are low. We must know the same person.’
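To make "surprise given a default hypothesis" concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that listings of historical breaches arrive independently at some constant monthly rate (a Poisson model), and asks how surprising it would be to see four or more in one month. The rates below are hypothetical placeholders, not measurements, and the independence assumption is doubtful for four listings posted by a single seller. The function names are ours.

```python
import math

def poisson_tail(k, rate):
    """Return P(X >= k) for X ~ Poisson(rate)."""
    # Sum P(X = 0) ... P(X = k - 1), then take the complement.
    below_k = sum(math.exp(-rate) * rate**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k

def surprisal_bits(p):
    """Convert a probability into 'surprise' measured in bits."""
    return -math.log2(p)

observed = 4  # listings seen in one month (from the anecdote)

# Hypothetical default hypotheses: assumed listings per month.
for assumed_rate in (0.1, 1.0, 3.0):
    p = poisson_tail(observed, assumed_rate)
    print(f"rate={assumed_rate}: P(>= {observed}) = {p:.4g}, "
          f"surprise = {surprisal_bits(p):.1f} bits")
```

The same four observations are very surprising under one assumed rate and unremarkable under another. The data does not change; only the default hypothesis does, which is why the data alone cannot serve as evidence.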
Anecdotal Evidence: Forgetting to transform data into evidence.
The problem with anecdotes is that any single example, or small group of examples, is unlikely to be sufficiently surprising, and therefore sufficient evidence, to be compelling against anything but the narrowest default hypotheses (e.g. this has NEVER happened before).
The reason we disagree with Troy is not that he was using a small or biased dataset, but that he didn’t turn this data into evidence. In order to claim that this is a new trend, he needed to argue that this is distinctly different from what has happened before. Without an explanation or quantification of what has occurred in the past (how often have data from historical data breaches re-appeared?), it is impossible for us to evaluate the strength of his claim of a new trend. Have historical data breaches never re-appeared previously? If they have, do they tend to appear in clusters? Random events appear to cluster when observed over a long period of time. Why is the relevant time period of a cluster one month? Without context and normalization, the new observations are data; but the data is not evidence.
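The point about random events appearing to cluster can be checked with a small simulation. The sketch below scatters a number of events uniformly at random across a number of months and estimates how often at least one month ends up with a cluster of a given size. All of the numbers are hypothetical, chosen only to illustrate the effect; we have no count of how many historical breach listings actually appear over any period.

```python
import random

def prob_cluster(n_events, n_months, cluster_size, trials=5000, seed=0):
    """Estimate P(some month holds >= cluster_size events) when
    n_events are dropped uniformly at random into n_months."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    hits = 0
    for _ in range(trials):
        counts = [0] * n_months
        for _ in range(n_events):
            counts[rng.randrange(n_months)] += 1
        if max(counts) >= cluster_size:
            hits += 1
    return hits / trials

# Hypothetical scenarios: events spread over three years, cluster of four.
for n_events in (12, 36, 72):
    p = prob_cluster(n_events, n_months=36, cluster_size=4)
    print(f"{n_events} events over 36 months: P(cluster of 4) ~ {p:.2f}")
```

Under these made-up inputs, a four-event month goes from rare to expected as the overall number of events grows, with no trend at all in the underlying process. Whether four listings in a month is remarkable therefore depends on the base rate and on the length of the observation window, neither of which the anecdote supplies.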
To be clear, Troy is neither the first, nor likely the last, to substitute data for evidence. But this is as good a time as any to remember that the problem with anecdotal evidence is the evidence, not the anecdotes. When we draw conclusions, we need to supply evidence, not just data.
Troy did not draw any strong or actionable conclusions that we seek to refute. His post is just a recent example of the type of reasoning that results in well-meaning analysts seeing patterns in noise.