P-Hacking

May 22, 2015

During the journal club discussions about p-hacking, we went through the list of six things that Head et al talked about in their paper. The discussion was great (side note: the journal club is a set of three labs that work on different systems and ask different questions, yet we use similar methods, and to the larger world, yeah, well, we do the same thing. The other two PIs are part of the reason I came to almost-MRU. The relationships among the three of us are very good, and sharing a large lab, with dedicated spaces for specific activities, has prompted collaborations among us. Also, our post-docs share an office, and our grad students do, too. It works).

Our first conclusion was that yes, p-hacking happens. Yes, we have all probably done something close to this, not on purpose, but as a function of how we analyze data. The second large conclusion came after we went through the list of six sins, one at a time, and discussed them in the context of the data we were collecting and analyzing. I should add that the three of us are all fairly well trained in data analysis and are the go-to people in the department for stats. What came out of our discussion was that yes, these practices are wrong, but there are subtleties, and blanket statements seldom cover all the issues.

I'd like to talk about these sins (problems, really) here, though maybe not one at a time. Some of the perceptions and insights we came to, or were coming to, by the end of our meeting are worth further discussion.

So... to start, the first issue Head et al raise (I am going to try to include the citations here, even if it makes the post a bit longer):

conducting analyses midway through experiments to decide whether to continue collecting data [15,16];

  • 15. Gadbury GL, Allison DB (2012) Inappropriate fiddling with statistical analyses to obtain a desirable p-value: Tests to detect its presence in published literature. PLoS ONE 7: e46363. doi: 10.1371/journal.pone.0046363
  • 16. John LK, Loewenstein G, Prelec D (2012) Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol Sci 23: 524–532. doi: 10.1177/0956797611430953. pmid: 22508865

Obviously, if one has made a decision, done the power analyses, and determined the number of subjects/trials/experiments, then one should stick with it. Stopping early can produce spuriously significant results. So why would anyone do this? We came up with two reasons, one of which Head et al mention at the end of the paper:

Amazingly, some animal ethics boards even encourage or mandate the termination of research if a significant result is obtained during the study, which is a particularly egregious form of p-hacking (Anonymous reviewer, personal communication).

This is not just animal ethics boards, but also the IRBs that oversee human subjects research. Possibly the most famous example of this is the first aspirin RCT, the Physicians' Health Study. In that case the effect of aspirin (a marked reduction in the risk of a first myocardial infarction) was so remarkable that the aspirin component of the study was terminated three years early. The abstract, from the Annals of Epidemiology (1(5): 395-405):

The Physicians' Health Study is a randomized, double-blind, placebo-controlled prevention trial of 22,071 US physicians, using a factorial design to evaluate the role of aspirin in the prevention of cardiovascular mortality and beta carotene in the reduction of cancer incidence. After approximately 5 years of follow-up, the aspirin component was terminated, 3 years ahead of schedule. Several factors were considered in the decision to terminate, including a cardiovascular mortality rate markedly lower than expected in both aspirin and placebo subjects, precluding the evaluation of the primary aspirin hypothesis, and a highly significant (P < .00001) and impressive (44%) reduction in the risk of first myocardial infarction in the aspirin group. Issues in the decision to terminate are described in this report.

The question that arises here is one of ethics. If a researcher can show that something is valuable, is it ethical to withhold or delay that information? The important phrase is "can show". If you have p-hacked, have you shown? In the aspirin study, the significance level and effect size were so remarkable that there was no question of having shown. But how do you know that your result is remarkable without doing something wrong in terms of data analysis? Even just breaking the blinding (which had to be done to see the effect mid-study) is wrong. Maybe there is more sophisticated reasoning on this of which I am unaware. But the rationale for the animal studies is the concept of "reduction", one of the hallmarks of the ethical use of animals: use no more than you need.
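To make concrete why peeking at the data partway through is statistically "wrong", here is a minimal simulation sketch. It is not from Head et al or from our discussion; the batch size, maximum sample size, and normally distributed data are illustrative assumptions. Under a true null, testing after every new batch of subjects and stopping at the first p < 0.05 pushes the false-positive rate well above the nominal 5%.

```python
# Optional stopping under a true null: two groups with NO real difference,
# but we run a t-test after every batch of new subjects and stop as soon
# as p < 0.05. The nominal 5% false-positive rate inflates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_trial(n_max=100, batch=10, alpha=0.05):
    """Add `batch` subjects per group at a time, testing after each batch."""
    a, b = [], []
    while len(a) < n_max:
        a.extend(rng.normal(0, 1, batch))   # both groups drawn from the
        b.extend(rng.normal(0, 1, batch))   # same distribution (null is true)
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True                     # "significant" -> stop early
    return False

n_sims = 2000
false_positives = sum(peeking_trial() for _ in range(n_sims))
print(f"False-positive rate with peeking: {false_positives / n_sims:.3f}")
# Typically well above 0.05 (roughly 0.15-0.20 for these settings), even
# though every individual test was run at alpha = 0.05.
```

In other words, each individual test is honest; it is the stopping rule that is not.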

From a trainee's or PI's point of view this is a tremendously tempting sin. Saving resources, be they animals, research costs, or time, not to mention animal or human lives, is a powerful motivator. What we agreed on in the discussion is that careful, incisive pre-planning is one way to avoid this problem. Preliminary studies (which are not then folded into the larger study) to estimate sample size or, more importantly, to identify problematic covariates and control for variation in response make it likely that one will arrive at an appropriate final sample size.
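Here is a hedged sketch of what that pre-planning can look like: a simulation-based power analysis run before the confirmatory study, using an effect size estimated from a separate pilot. The pilot effect size of 0.5 SD and the 80% power target are assumptions for the example, not numbers from any of our labs.

```python
# Pre-planning sketch: pick the sample size *before* the study by simulating
# power for an effect size estimated from a (separate) pilot study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(n_per_group, effect_size, alpha=0.05, n_sims=5000):
    """Fraction of simulated fixed-n studies (one test at the end) with p < alpha."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)          # control group
        b = rng.normal(effect_size, 1.0, n_per_group)  # treated group, shifted by d SDs
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

pilot_d = 0.5          # hypothetical effect size from a pilot, in SD units
for n in (30, 50, 64, 80, 100):
    print(f"n = {n:3d} per group -> power ~ {simulated_power(n, pilot_d):.2f}")
# Choose the smallest n that reaches the target (e.g. 0.80) and commit to it;
# no interim significance tests once the confirmatory study starts.
```

The point is that the sample size, the single analysis at the end, and the alpha level are all fixed in advance, so there is nothing left to hack.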

There is another factor that prompts people to do analyses halfway through data collection: the timing of abstract submission for scientific meetings. Abstracts are usually due anywhere from 4 to 12 months before the meeting, and one needs to have Something Important to say in them. One of the (clinical) organizations of which I am (reluctantly) a member requires p-values and statements about significance in abstracts (and they review and select; fewer than 50% of submitters end up presenting at the meeting). Someone mentioned that some societies accept only titles in advance (for program organization), with abstracts submitted right before the meeting. Since abstracts are for the most part "published" in one form or another, get cited, and constitute part of a trainee's portfolio, there is pressure to report something significant rather than to just explore the data.

If we wish to reduce the p-hacking that goes on in our labs (inadvertently, of course), or that we perceive in the people we advise, it will take more than just awareness of the problem; it will take culture change.

6 responses so far

  • DJMH says:

    Re the problem with blanket statements: it also depends what kind of science you're doing, frankly. I've often committed nearly all of these "sins" because frequently the most interesting thing in your data is NOT what you went in looking for. I'm sure these statistical issues are a big deal in clinical studies or other areas where the statistics are king, but that's just not the reality for some other kinds of science. In my kind of data, you can see most effects by eye; the stats are just icing.

    • potnia theron says:

      this is part of what we talked about at length after discussing the individual problems: that they vary (more so than the original article suggests) with subdiscipline.

  • ecologist says:

    Interesting discussion. Part of the trouble with p-hacking is not the hacking, but the "p". I recommend a look at:

    Royall, R. 1997. Statistical Evidence: A Likelihood Paradigm. Chapman and Hall/CRC Monographs on Statistics and Applied Probability.

    It is a very readable discussion of the basis for interpreting the results of experiments or observations as scientific evidence. It has a lengthy treatment of the "stopping rule" problem.

  • Daniel Lakens says:

    Maybe this recently accepted paper, which criticizes the general approach in Head et al, would be interesting for a future journal club: https://dl.dropboxusercontent.com/u/133567/Lakens%20-%202015%20-On%20the%20challenges%20of%20drawing%20conclusions%20from%20p-values%20just%20below%200.05.pdf
