Content area
Full Text
Data-dependent analysis-a "garden of forking paths"- explains why many statistically significant comparisons don't hold up.
There is a growing realization that reported "statistically significant" claims in scientific publications are routinely mistaken. Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation. The value of p (for "probability") is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p-value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear.
The idea is that when p is less than some prespecified value such as 0.05, the null hypothesis is rejected by the data, allowing researchers to claim strong evidence in favor of the alternative. The concept of p-values was originally developed by statistician Ronald Fisher in the 1920s in the context of his research on crop variance in Hertfordshire, England. Fisher offered the idea of p-values as a means of protecting researchers from declaring truth based on patterns in noise. In an ironic twist, p-values are now often manipulated to lend credence to noisy claims based on small samples.
In general, p-values are based on what would have happened under other possible data sets. As a hypothetical example, suppose a researcher is interested in how Democrats and Republicans perform differently in a short mathematics test when it is expressed in two different contexts, involving either healthcare or the military. The question may be framed nonspecifically as an investigation of possible associations between party affiliation and mathematical reasoning across contexts. The null hypothesis is that the political context is irrelevant to the task, and the alternative hypothesis is that context matters and the difference in performance between the two parties would be different in the military and healthcare contexts.
At this point a huge number of possible comparisons could be performed, all consistent with the researcher's theory. For example, the null hypothesis could be rejected (with statistical significance) among men and not among women-explicable under the theory that men are more ideological than women. The pattern could be found among women but not among menexplicable under the theory that women...