Simplifying Multi-omics by Combining Probabilities
- sahoang
- Feb 16, 2024
- 9 min read
Do you have a pile of multi-omics data? Are you looking for tactics to make sense of it?
If your answers are 'yes' and 'yes,' this post is for you. It describes a framework to help you deal with challenging tasks frequently encountered when analyzing omics data. Examples include:
Combining results across multiple studies (meta-analysis)
Combining results from various types of omics data (multi-omics analysis)
Performing complex comparisons within and between studies.
These tasks are becoming more common and fundamental as omics technologies proliferate. Simple, expressive frameworks for extracting meaning from these kinds of data are extremely valuable. As such, I developed an R package that will allow readers to get started right away with the types of analyses described here.
What types of data are we talking about?
Modern molecular biology is rich with various types of "omics" studies—genomics, transcriptomics, epigenomics, proteomics, metabolomics, and lipidomics, to name a few. These assays involve comprehensive parallel measurement of a particular class of molecular entity, for example, the expression level of each gene in the genome. Analyzing these data often involves many hypothesis tests, for example, treatment vs. control contrasts across all genes. The primary output is a ream of p-values and fold-changes (there are other important statistics, but we will ignore them for our purposes). Once these statistics are calculated, the analysis proceeds into the garden of forking paths, where the route depends on the research questions at hand. This post discusses concepts that will help researchers navigate the garden with freedom and clarity.
Many research questions call for comparative analysis, like across-study contrasts, multi-way comparisons of treatment effects, and integration of multiple data types. For omics datasets, comparative analysis can be a knotty task, full of opportunities for discovery but also error. A few of the pitfalls that are commonly encountered are failure to account for batch effects, spurious correlation, confounding variables, inconsistent/inappropriate normalization, and many others. Each of these topics deserves treatment far beyond the scope of this post, so we'll set them aside for now. Comparative analyses often involve questions like:
"What genes are regulated by treatments A and B but not C?"
"Is gene regulation consistent with protein regulation in my study? How do they diverge?"
"Are the results from my study and a previously published study consistent? How are they different?"
"In what ways are treatments A and B similar? How are they different?"
"Does a treatment in my experiment cause a reversal of changes seen in a disease?"
"What associated genes, enzymes, and metabolites are simultaneously dysregulated in a disease state?"
And so forth. What these questions have in common is that each can be packaged into a query built from the Boolean operators `AND`, `OR`, and `NOT`. The goal of this post is to describe a framework that enables comparative analysis questions to be posed as Boolean statements and answered with straightforward probability calculations. The inner workings of this framework turn on the reams of p-values described earlier.
So you've got a bunch of p-values...
As a preamble to our primary goal, we must discuss some details of a simple and underrated exploratory data analysis tool: the humble p-value histogram.
The p-value histogram is a valuable tool for summarizing omics analyses for several reasons:
It conveys the extent of treatment effects across the entire analysis.
It can be used to quickly identify the presence of problems with the underlying hypothesis testing framework.
It can be used as the basis for meta-analysis that combines multiple disparate datasets and/or data types.
To fully understand these points, we need to take a key fact on board: p-values are, by definition, uniformly distributed between zero and one when the null hypothesis is true. This fact is often underappreciated by scientists who consume omics data. Several valuable insights and analysis opportunities flow from this single fact.
Let's consider the analysis of a straightforward 'treatment vs. control' gene expression experiment, where the treatment is inert; there are no differentially expressed genes. From the genome-wide collection of treatment vs. control contrasts, we expect a flat distribution of p-values ranging from zero to one—every possible p-value is equally likely. Here is a simulation that illustrates the point:
[Figure 1: Histogram of 10,000 p-values from t-tests in which the null hypothesis is true in every case; the distribution is approximately uniform on [0, 1].]
The histogram contains 10,000 p-values from 10,000 t-tests comparing normal i.i.d. samples. The null hypothesis, `H_0`, is true in all 10,000 cases. (Careful consideration of this result indicates why multiple testing is a problem. In this hypothetical null experiment, we would expect about 5% of the tests to have a p-value < 0.05, since the interval [0, 0.05] spans 5% of the interval [0, 1]. If we used a 0.05 p-value threshold on this contrived example, we would end up with about 500 false positives.)
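This null simulation is easy to reproduce. Here is a minimal sketch in Python (the figures in this post were presumably made in R; `scipy.stats.ttest_ind` stands in here, and the choice of 20 samples per group is illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_per_group = 10_000, 20

# Two groups of i.i.d. standard normal samples per test;
# the null hypothesis is true in every one of the 10,000 tests.
group_a = rng.normal(size=(n_tests, n_per_group))
group_b = rng.normal(size=(n_tests, n_per_group))
p_null = stats.ttest_ind(group_a, group_b, axis=1).pvalue

# Under H0 the p-values are uniform on [0, 1], so about 5% land below 0.05.
false_positives = int((p_null < 0.05).sum())
print(f"{false_positives} of {n_tests} tests 'significant' at p < 0.05")
```

Plotting `p_null` as a histogram (e.g., `plt.hist(p_null, bins=50)`) reproduces the flat shape described above.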
Now let's look at a simulation where there is a real effect in 20% of the data:
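As a sketch of how such data might be simulated in Python (the 1.5-unit mean shift and 20-sample groups are arbitrary illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_tests, n_per_group = 10_000, 20
n_true = int(0.20 * n_tests)  # a real effect in 20% of the tests

# Mean shift of 1.5 for the affected "genes", zero for the rest.
shifts = np.zeros(n_tests)
shifts[:n_true] = 1.5
treated = rng.normal(loc=shifts[:, None], size=(n_tests, n_per_group))
control = rng.normal(size=(n_tests, n_per_group))
p_mix = stats.ttest_ind(treated, control, axis=1).pvalue

# The histogram of p_mix shows a peak near zero (true effects)
# sitting on top of a uniform background (nulls).
frac_sig = (p_mix < 0.05).mean()
```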
[Figure 2: Histogram of p-values when 20% of the tests have a true effect; a peak near zero sits on top of a uniform background.]
Notice how the distribution resembles a mixture of the original uniform distribution and a new component that peaks near zero and quickly decays as p-values get larger. The latter bit contains the exciting stuff: the genes where the treatment induced an effect. You can see the extent of the treatment effect by inspecting this plot. The bigger the peak on the left, the greater the effect (i.e., the number of genes affected). Figure 2 shows what a sound analysis typically looks like. Figure 3 is an example of how a problematic analysis might look:
[Figure 3: A non-uniform, "wavy" p-value histogram produced by a misspecified test.]
This histogram was generated the same way as the example in Figure 1, except the data were sampled from a lognormal rather than a normal distribution. This violates the t-test's assumption that the data are normally distributed, which results in a non-uniform distribution of p-values. This kind of "wavy" distribution of p-values can indicate some incompatibility between the data and the assumptions of your test. If you see a histogram that looks like this, you may need to check the upstream steps in your analysis and the quality of the raw data.
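Here is a sketch of a simulation along these lines, with the skew and sample size exaggerated (lognormal `sigma = 2`, four samples per group) so that the distortion is easy to detect programmatically rather than by eye:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n_per_group = 10_000, 4  # tiny, heavily skewed samples

# Same t-test setup as a null experiment, but the data are lognormal, not normal.
group_a = rng.lognormal(sigma=2.0, size=(n_tests, n_per_group))
group_b = rng.lognormal(sigma=2.0, size=(n_tests, n_per_group))
p_bad = stats.ttest_ind(group_a, group_b, axis=1).pvalue

# The null hypothesis is true everywhere, yet the p-values are not uniform.
# A Kolmogorov-Smirnov test against Uniform(0, 1) flags the departure.
ks = stats.kstest(p_bad, "uniform")
print(f"KS statistic = {ks.statistic:.3f}, p = {ks.pvalue:.2e}")
```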
How to query your results by calculating and combining probabilities
We are ready to move on now that we've seen the basics of p-value distributions. Recall that before our digression, we were discussing comparative analyses in terms of Boolean queries. Suppose we have a gene expression dataset and a complementary metabolomics dataset from a single experiment, and we want to integrate the two. Let's use `G_i` to refer to the event that gene `i` is regulated in our experiment, and `M_j` to refer to the event that the level of metabolite `j` changed. A simple question we might ask is: what genes and metabolites were co-regulated in the experiment? (Presumably, you would ask this question for functionally related gene-metabolite pairs.) As a Boolean statement, we are looking for cases where `G_i` AND `M_j` is true. Assuming independence of `G_i` and `M_j`, we can evaluate this as `P(G_i) \times P(M_j)`: the joint probability that gene `i` and metabolite `j` were both affected in our experiment.
In general, for two independent events, `A` and `B`, the fundamental Boolean operations implemented as probability calculations are as follows:
AND (intersection): `P(A) \times P(B)`
OR (union): `P(A) + P(B) - P(A) \times P(B)`
NOT (complement): `1 - P(A)` or `1 - P(B)`
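These three operations translate directly into code. As a sketch (the probabilities in the example query are made-up values standing in for per-gene probabilities of regulation under treatments A, B, and C):

```python
def p_and(pa: float, pb: float) -> float:
    """P(A AND B) for independent events A and B."""
    return pa * pb

def p_or(pa: float, pb: float) -> float:
    """P(A OR B) for independent events, by inclusion-exclusion."""
    return pa + pb - pa * pb

def p_not(pa: float) -> float:
    """P(NOT A)."""
    return 1.0 - pa

# "Regulated by treatments A and B but not C", with illustrative
# probabilities P(A) = 0.95, P(B) = 0.90, P(C) = 0.08:
p_query = p_and(p_and(0.95, 0.90), p_not(0.08))  # = 0.95 * 0.90 * 0.92
```

Arbitrarily complex queries are built by composing these three functions, exactly as the text describes.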
We can construct Boolean queries of arbitrary complexity and evaluate them using compositions of these three calculations. There is, however, an important caveat to underscore: these calculations assume the independence of `A` and `B`. That assumption rarely holds strictly in biological contexts. Returning to our example, a gene and a metabolite may have correlated regulation, and indeed, they often must. If we remove the independence assumption, the AND calculation becomes `P(A|B) \times P(B)`. If `A` and `B` are positively correlated, then `P(A|B) \times P(B) > P(A) \times P(B)`. So, for positively correlated events, the independent AND calculation produces conservative probability estimates, leading to potential false negatives. By similar logic, the independence assumption leads to anti-conservative OR probabilities. The degree of non-independence determines the degree of bias in the combined probabilities. This caveat notwithstanding, the recipe for combining the probabilities of independent events is practical for a vast number of analysis scenarios.
We now know the kinds of probabilities we want, but what specific probabilities should we use in practice? P-values are probabilities, but do their properties suit our needs? A precise interpretation of their meaning is notoriously elusive for many. A commonly heard but erroneous definition of a p-value is "the probability that the null hypothesis is true." We could combine p-values to compute our queries if this were the correct definition. Alas, the actual definition of a p-value is more tortuous: "the probability of obtaining test results at least as extreme as the observed results, under the assumption that the null hypothesis is true." Non-experts can be forgiven for not having that on the tips of their tongues. Combining p-values with Boolean probability calculations would produce offspring who would mightily resist plain English descriptions. Fortunately, a probability with the interpretation we seek can be extracted from the p-value histogram.
We observed that the p-value distribution in Figure 2 appears to have uniform and non-uniform components. It is possible to model it as a mixture of a uniform distribution and one or more beta distributions (which are flexible distributions on the interval [0, 1]), as in the following figure:
[Figure: A p-value histogram overlaid with a fitted uniform-plus-beta mixture model.]
Using the fitted mixture model, one can calculate the Bayesian posterior probability that a given p-value is associated with a non-uniform component of the mixture (see the articles in 'Further reading' for details of this calculation). Recall the fact emphasized earlier: p-values are, by definition, uniformly distributed between zero and one when the null hypothesis is true. Thus, the posterior probability can be interpreted as "the probability that the null hypothesis is false" or, equivalently, "the probability that the alternative hypothesis is true." The upshot is that this method maps p-values onto the probabilities we want for our Boolean queries.
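The pbayes package implements these calculations in R. As a rough illustration of the idea, here is a Python sketch that fits the simplest version of such a model — one uniform component plus a single `Beta(a, 1)` component — by maximum likelihood, then converts p-values to posterior probabilities. The two-parameter form and the simulated inputs are illustrative simplifications, not the package's actual algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def fit_mixture(p):
    """Fit f(p) = lam + (1 - lam) * a * p**(a - 1): a uniform component with
    weight lam plus a Beta(a, 1) component, by maximum likelihood."""
    def nll(theta):
        lam, a = theta
        return -np.sum(np.log(lam + (1 - lam) * a * p ** (a - 1)))
    res = minimize(nll, x0=[0.5, 0.5],
                   bounds=[(1e-6, 1 - 1e-6), (1e-6, 1 - 1e-6)])
    return res.x  # fitted (lam, a)

def posterior_alt(p, lam, a):
    """Posterior probability that a p-value came from the non-uniform
    (alternative) component -- i.e., that the null hypothesis is false."""
    beta_part = (1 - lam) * a * p ** (a - 1)
    return beta_part / (lam + beta_part)

# Simulated inputs: 80% uniform nulls, 20% alternatives piled up near zero.
rng = np.random.default_rng(0)
p_values = np.concatenate([rng.uniform(size=8_000),
                           rng.beta(0.1, 1.0, size=2_000)])
lam, a = fit_mixture(p_values)
post = posterior_alt(p_values, lam, a)
```

Small p-values map to posterior probabilities near one and large p-values map near zero — exactly the "probability that the alternative hypothesis is true" needed for the Boolean queries above.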
The technique is powerful because it creates a near one-to-one mapping between the statement of a research question and the computational steps necessary to answer it. This can be incredibly liberating, since many questions that are easy to pose in Boolean terms are awkward to answer using alternative analysis frameworks. A surprising and happy outcome of working in this framework is that impactful questions naturally arise that might otherwise go unconsidered. We should not, however, forget the independence assumptions baked into the method. They highlight the importance of a nuanced approach to its application. Specifically, the framework is exceptionally well-suited as an exploratory tool. Since one of its advantages is enabling rapid answers to complex questions, one practical strategy for integrating it into established workflows is to use it to rapidly identify and prioritize signals of interest and then validate them with complementary methods.
If you want to apply these ideas to your data, this R package will get you started by performing the p-value-to-posterior-probability calculations: https://github.com/stevehoang/pbayes.
A practical application: The Response Similarity Index
A useful application of this method is in the pairwise comparison of treatments. Imagine, again, that we have a gene expression experiment, this time with two treatments, A and B. We want to identify effects that are similar and different between them. A conventional approach is to scoop up all genes below a given significance threshold for each treatment and inspect the resulting Venn diagram. An alternative approach is to calculate the signed joint posterior probability `p` for each gene `g`, a value I refer to as the Response Similarity Index (RSI):
`RSI_g^{AB} = \mathrm{sgn}(\phi_g^A) \times \mathrm{sgn}(\phi_g^B) \times p_g^A \times p_g^B`
where `\phi_g^X` is the log fold change of gene `g` due to treatment `X`, and `p_g^X` is the posterior probability that gene `g` responded to treatment `X`. Genes with RSI values near 1 are significantly regulated in the same direction; values near -1 indicate regulation in opposing directions. This strategy for comparing two treatments has several advantages over conventional approaches:
It provides information on the degree of similarity/difference for each gene.
It enables rank-based analysis methods (e.g., pathway analysis) on joint effects.
It enables new and valuable visualization opportunities.
It achieves all of the above without the need to select significance thresholds.
Regarding point 1, quickly identifying effects with strong joint evidence across multiple conditions or experiments can help mitigate the negative impact of spurious correlations. Have a look at my previous post on the topic for more details. Regarding point 3, a visualization that I find resonates strongly with many scientists is a fold change vs. fold change scatterplot colored by RSI:
[Figure: Fold change vs. fold change scatterplot, one point per gene, colored by RSI.]
This figure makes it easy to get a gestalt sense of how similar or different the two treatments are. If the treatments had identical effects, we would see purple points along the line y=x; if they had opposite effects, we would see green points along y=-x. Note that the greyed-out points (RSI ~ 0) represent genes that are not regulated in one condition or the other.
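In code, the RSI formula is essentially a one-liner. Here is a sketch with made-up toy values for the fold changes and posterior probabilities:

```python
import numpy as np

def rsi(lfc_a, lfc_b, post_a, post_b):
    """Response Similarity Index: the signed joint posterior probability
    that a gene responds to both treatments."""
    return np.sign(lfc_a) * np.sign(lfc_b) * post_a * post_b

# Three toy genes: concordant, discordant, and unaffected.
lfc_a = np.array([2.0, -1.5, 0.10])    # log fold changes, treatment A
lfc_b = np.array([1.8, 1.2, -0.05])    # log fold changes, treatment B
post_a = np.array([0.99, 0.95, 0.10])  # posterior probabilities, treatment A
post_b = np.array([0.97, 0.90, 0.12])  # posterior probabilities, treatment B

scores = rsi(lfc_a, lfc_b, post_a, post_b)
# gene 1: regulated the same way in both treatments -> RSI near 1
# gene 2: confidently regulated in opposite directions -> RSI near -1
# gene 3: little evidence of regulation in either treatment -> RSI near 0
```

Because RSI is continuous, `scores` can feed directly into rank-based downstream analyses rather than thresholded gene lists.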
RSI is just one example of how probabilities can be combined creatively to consume the results of an omics experiment. There are many other possibilities; for instance, the RSI concept can be modified to identify effects that are exclusive to each treatment.
In general, I advocate combining posterior probabilities because of the opportunities for creative analysis strategies it enables and for the framework's simplicity. Yes, calculating the posterior probabilities is tricky; however, I hope the R package helps. And yes, there are things like independence assumptions that the user should be aware of. I encourage anyone interested in learning more to look at the references below. Although there are limitations, the benefits are many and great. In summary, by harnessing the power of Boolean logic and Bayesian probability, this framework simplifies the complexity of multi-omics data analysis and gives researchers new avenues for discovery.
Further reading
Those looking for a deeper background in the theoretical rationale behind this approach should have a look at the following articles:
Allison, D. B., et al. (2002). A mixture model approach for the analysis of microarray gene expression data. Computational Statistics & Data Analysis, 39(1), 1-20. https://doi.org/10.1016/S0167-9473(01)00046-9
Erickson, S., et al. (2010). Composite hypothesis testing: an approach built on intersection-union tests and Bayesian posterior probabilities. In Guerra, R., & Goldstein, D. R. (Eds.), Meta-analysis and Combining Information in Genetics and Genomics (pp. 83-93). Chapman & Hall/CRC.