The Subtle Menace of Spurious Correlation
- sahoang
- Jan 23, 2024
- 5 min read
Updated: Jan 24, 2024
Is there a statistical booby trap threatening your big, expensive experiment?
The most pernicious mistakes are easy to make, hard to detect, and have serious consequences. When interpreting the results of an experiment, mistaking spurious correlation for a real effect is exactly that kind of mistake. This post discusses spurious correlation arising from a particular experimental design pattern commonly seen in modern biotech labs.
Spurious correlation occurs when two variables falsely appear to be associated with one another because of a third, unconsidered variable. In a transcriptomic experiment, it might manifest as two treatments that wrongly appear to have similar effects on gene expression. In a compound library screen, a set of compounds may appear active, but the result fails to repeat in a follow-up experiment. The source of these illusions is often spurious correlation caused by shared controls, i.e., when multiple treatments are compared to the same control. Few sources of spurious correlation are so commonly baked unwittingly into the design of an experiment.
Any two treatments that share a control will tend to look similar simply because they share a control. The reason for this is easy to understand if you consider that controls, like any experimental group, are subject to random fluctuations. If a control group bounces high or low, that bounce will be seen in the treatment effect since it is defined relative to the control. Unfortunately, the consequences of this are not always intuitive, particularly when many contrasts are made.
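As a concrete (hypothetical) illustration: suppose the shared control happens to read 10% low on some feature. Every fold change computed against it is then inflated by the same `log_2(1/0.9) ~~ 0.15` on the `log_2` scale, for treatment A and treatment B alike, and it is that shared bump, repeated across features, that shows up as correlation.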
A simulated example
Consider a simulated data set with two treatments, A and B, and two controls, Control 1 and Control 2. Each run of our hypothetical assay measures many features, similar to standard 'omic' assays. In this simulation, the four groups are independent and identically distributed; in particular, they are uncorrelated:

Let's say we want to compare the effects of treatment A to treatment B. It is common to do this comparison using the `log_2` ratio of the treatment over the control (i.e., the `log_2` fold change). The following figure shows scatterplots of the effects of treatment A versus treatment B. The result on the left uses one control for both treatments, and the one on the right uses a different control for each treatment.

The result on the left is counterintuitive—I have seen it surprise several professional statisticians. All three variables that go into the plot are mutually uncorrelated, yet ratios composed of them are correlated. (If the three variables are i.i.d. and their `log_2` values are normally distributed, the expected Pearson correlation for the left-hand plot is 0.5. See the technical note at the end for a derivation.) The plot on the right is what most people would expect to see—uncorrelated effects.
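For the curious, here is a minimal sketch of this kind of simulation in Python/NumPy. The variable names, feature count, and distribution parameters are illustrative choices, not the exact settings behind the figures above; the point is simply that a shared control induces roughly 0.5 correlation between otherwise unrelated fold changes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 2000  # number of measured features (e.g., genes); illustrative

# Four independent, identically distributed groups of per-feature intensities.
# Log-normal draws keep the values positive so the log2 ratios are defined.
treatment_a = rng.lognormal(mean=5.0, sigma=0.5, size=n_features)
treatment_b = rng.lognormal(mean=5.0, sigma=0.5, size=n_features)
control_1 = rng.lognormal(mean=5.0, sigma=0.5, size=n_features)
control_2 = rng.lognormal(mean=5.0, sigma=0.5, size=n_features)

# Shared control: both treatments are compared against control_1.
lfc_a_shared = np.log2(treatment_a / control_1)
lfc_b_shared = np.log2(treatment_b / control_1)

# Separate controls: each treatment is compared against its own control.
lfc_a_own = np.log2(treatment_a / control_1)
lfc_b_own = np.log2(treatment_b / control_2)

print(np.corrcoef(lfc_a_shared, lfc_b_shared)[0, 1])  # ~0.5: spurious correlation from the shared control
print(np.corrcoef(lfc_a_own, lfc_b_own)[0, 1])        # ~0.0: separate controls, no spurious correlation
```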
What are the implications, and what can be done to protect your experiment?
The example above is inspired by assays like RNA-seq, metabolomics, proteomics, etc., that generate many measurements per sample. For these assays, spurious correlation can make it look like two treatments have similar global effects, even if they have no actual similarity. Closely inspect any analysis that involves treatment similarity based on fold change values. Sometimes, these analyses are explicit similarity comparisons, like in the case of cluster analysis. Sometimes, the notion of similarity enters the picture more subtly, like in supervised machine learning models that use fold changes as part of their feature vectors.
A straightforward remedy to the control-induced correlation problem is ensuring that every treatment in your experiment has its own controls. Unfortunately, this is often impractical due to resource or technical constraints. Other specific recommendations are challenging to give since the impact of the problem depends on your particular experimental goals and constraints. Maybe you can tolerate a bit of spurious correlation because you are focused on very large effects. Perhaps treatment-to-treatment comparisons are not a primary goal for your study. However, here are some things to consider:
- Randomization can be helpful. For example, don't cluster similar molecules in a library screen such that they are all compared to the same control; mix it up.
- If you have a crucial contrast in your experimental design, strive to give each treatment in the contrast its own control.
- The amount of control-induced correlation depends on how much overlap exists in the control groups. You will reduce the unwanted correlation if you reduce the overlap in control replicates for a pair of treatment-vs-control contrasts (see the sketch after this list).
- Repeating your experiment with different treatment-control pairings will help expose spurious correlation issues.
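To make the overlap point concrete, here is a hypothetical sketch (the replicate counts and parameters are made up for illustration) that varies how many control replicates two treatment-vs-control contrasts share. The correlation between their `log_2` fold changes shrinks toward zero as the shared fraction shrinks; its absolute size also depends on how many control replicates are averaged:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_reps = 5000, 4  # illustrative sizes

def fold_change_corr(n_shared):
    """Correlation between two treatment-vs-control log2 fold changes when the
    two contrasts share `n_shared` of their `n_reps` control replicates."""
    a = rng.lognormal(5.0, 0.5, size=n_features)
    b = rng.lognormal(5.0, 0.5, size=n_features)
    shared = rng.lognormal(5.0, 0.5, size=(n_features, n_shared))
    fresh_a = rng.lognormal(5.0, 0.5, size=(n_features, n_reps - n_shared))
    fresh_b = rng.lognormal(5.0, 0.5, size=(n_features, n_reps - n_shared))
    ctrl_a = np.hstack([shared, fresh_a]).mean(axis=1)  # control mean for contrast A
    ctrl_b = np.hstack([shared, fresh_b]).mean(axis=1)  # control mean for contrast B
    return np.corrcoef(np.log2(a / ctrl_a), np.log2(b / ctrl_b))[0, 1]

for k in range(n_reps + 1):
    print(f"{k}/{n_reps} control replicates shared: r = {fold_change_corr(k):.2f}")
```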
Awareness of the issue is the best protection against its ill effects. You are less likely to be fooled if you know what to look for. If you are designing an experiment, raise the issue with a data analysis expert. They won't be able to salvage your data if a fundamental flaw is baked into the design. If you are responsible for analyzing the results, get involved in the experimental design as early as possible. Ensure that spurious correlation will not undermine the experiment's goals. It's almost always worth your time to carefully analyze the impacts of this effect on your data. These experiments are expensive, and so are red herrings.
Technical notes
Read on to understand why the common control scatterplot in the figure above has a Pearson correlation of approximately 0.5.
Consider three i.i.d., positive random variables, A, B, and C, whose `log_2` values are normally distributed (e.g., log-normal intensities). We would like to calculate the Pearson correlation between `log_2 (A/C)` and `log_2 (B/C)`.
Since A, B, and C are identically distributed,
`E[log_2 (A) - log_2 (C)] = 0 quad quad and`
`E[log_2 (B) - log_2 (C)] = 0`
the covariance of `log_2 (A/C)` and `log_2 (B/C)` can be calculated as follows:
`Cov (log_2 (A/C), log_2 (B/C)) = E[(log_2 (A) - log_2 (C))(log_2 (B) - log_2 (C))] quad quad quad quad (1)`
Expanding the RHS, we get:
`E[log_2 (A) log_2 (B)] - E[log_2 (A) log_2 (C)] - E[log_2 (B) log_2 (C)] + E[log_2(C)^2] quad quad quad quad (2)`
Since A, B, and C are i.i.d., the cross terms in (2) factor into products of expectations, e.g.,
`E[log_2 (A) log_2 (B)] = E[log_2 (A)]E[log_2 (B)] quad quad and`
`E[log_2 (A)] = E[log_2 (B)] = E[log_2 (C)]`
Therefore (2) simplifies to:
`E[log_2 (C)^2] - (E[log_2(C)])^2 quad quad quad quad (3)`
which is the variance of `log_2 (C)`. To get to the Pearson correlation, we need to divide the covariance (3) by the product of the standard deviations of `log_2 (A/C)` and `log_2 (B/C)`. Since the variance of the difference of two independent random variables is the sum of their individual variances, we have:
`Var(log_2 (A/C)) = Var(log_2 (A) - log_2 (C)) = Var(log_2 (A)) + Var(log_2 (C))`
Again, since A, B, and C are i.i.d., we have:
`Var(log_2 (A/C)) = Var(log_2 (B/C)) = 2Var(log_2 (C)) quad quad quad quad (4)`
Finally, combining (3) and (4), we end up with an expression for the Pearson correlation:
`Corr(log_2 (A/C), log_2 (B/C)) = (Var(log_2 (C))) / (2Var(log_2 (C))) = 1/2`
This is why we see a Pearson correlation close to 0.5 in the simulated data set above.
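As a quick numerical sanity check of (3), (4), and the final result, one can draw the `log_2` values of A, B, and C directly as i.i.d. standard normals (the sample size here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000  # arbitrary large sample
log2_a, log2_b, log2_c = rng.normal(0.0, 1.0, size=(3, n))  # i.i.d. log2 values of A, B, C

lfc_ac = log2_a - log2_c  # log2(A/C)
lfc_bc = log2_b - log2_c  # log2(B/C)

print(np.cov(lfc_ac, lfc_bc)[0, 1], np.var(log2_c))   # (3): covariance ≈ Var(log2(C))
print(np.var(lfc_ac), 2 * np.var(log2_c))             # (4): Var(log2(A/C)) ≈ 2 Var(log2(C))
print(np.corrcoef(lfc_ac, lfc_bc)[0, 1])              # Pearson correlation ≈ 0.5
```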