Category Archives: Multi-Omics

Multi-omics is an approach that combines biological data from different sources. Its main challenges are therefore data integration and combination. While there are many statistical methods available for this, there's no silver bullet. That's why multi-omics is an area of research in need of clever people.

Unfortunately, for now this area is still in its infancy. My goal here is to present some multi-omics approaches and the statistical methods and algorithms involved in them.
Furthermore, I'll try to show you what the difficulties of such approaches can be. Also, always keep in mind that some simplifications are necessary when dealing with complex contexts.

Network-Based Integration Through Heat Diffusion

Re-Implementing netICS Heat Diffusion In R

OK, I have successfully procrastinated this blog entry about heat diffusion for too long now, even though the code for it has been waiting for over a month. :'D So it's time for it now! I promised it. And I guess it doesn't need to be perfect, because the whole idea of this blog is to put things out now, rather than waiting for some future moment when I can do it perfectly.

So that being said, let’s start. 🙂

In my last post I already talked about the integration of multi-omics data. This time I will show you one method, called netICS, from a more or less pure coding perspective. It uses heat diffusion to find so-called mediator genes in multi-omics data.
You can find the original code written in Matlab here and the corresponding paper here, which I would suggest reading if I have aroused your interest in the topic.

So let's just dive straight into the matter!

Loading Data The Cool Way

Because we are cool kids we don't download the data "manually", but do it right away in R.
And of course we do this with the super fast and comfy fread function from the data.table package, which I would suggest learning when working with R.

We load the data directly from the Github of the original netICS:

library(data.table)

mutationData <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/mutation_data_breast.txt", 
                      col.names = c("Gene", "Sample"), header = FALSE)
networkGenes <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/network_genes.txt", 
                      header = FALSE, col.names = "Gene")

rnaDiffExp <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/RNA_diff_expr_breast.txt", 
                    header = FALSE, col.names = c("Gene", "pval"))
rppa <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/protein_diff_expr_breast.txt", 
              header = FALSE, col.names = c("Gene", "pval"))

The network used in the method is stored as a Matlab object. But of course there's also a package that does this job for us! After loading it we extract the adjacency matrix, because we will need it later.

library(R.matlab)
# read the Matlab object and extract the adjacency matrix from it
tmp <- readMat("https://github.com/cbg-ethz/netics/blob/master/data/adj_lar_com.mat?raw=true")
adjacencyMatrix <- tmp$adj.lar.com
rm(tmp)

And that's already it for loading the data!

Combining p-values

In my last post I already talked about combining p-values. Now we need to implement it to combine the p-values from rnaDiffExp and rppa.

Maybe you will also see now, why I like data.tables so much. I’ve commented the code for you.

combineDifferentialExpressions<-function(diffExp1, diffExp2){
    # Combine both data.tables through an outer join
    mergedDiff<-merge(diffExp1, diffExp2, by ="Gene", all=TRUE)
    # if the first p-value is missing take the second one
    mergedDiff[is.na(pval.x) & !is.na(pval.y), pval := pval.y]
    # if the second p-value is missing take the first one
    mergedDiff[!is.na(pval.x) & is.na(pval.y), pval := pval.x]
    # if both are present, combine them via Fisher's method (chi-square with 2k = 4 degrees of freedom)
    mergedDiff[!is.na(pval.x) & !is.na(pval.y), pval := pchisq(-2 * (log(pval.x) + log(pval.y)), 4)]
    return(mergedDiff[,.(Gene, pval)])
}

I really love the syntax of data.tables in R, I have to say. 😀
Now let's execute it, adjust the p-values with the FDR method afterwards and then take a look at the result:

diffExp <- combineDifferentialExpressions(rnaDiffExp, rppa)
diffExp[, p.adjusted:=p.adjust(pval, method = "fdr")]
knitr::kable(diffExp[1:5])
Gene     pval        p.adjusted
A1BG     0.0000000   0.0000000
A1CF     0.8851107   0.9505269
A2BP1    0.2221647   0.2792292
A2LD1    0.9103269   0.9695794
A2M      0.0000000   0.0000000

Heat Diffusion

I won't go into detail about heat diffusion and its theoretical background here. Let me just briefly summarise the idea, or rather the goal. First we transform the adjacency matrix into a transition matrix, which you might be familiar with from Markov chains.
Basically, each cell in this matrix stands for the probability of a transition between two genes (in our case).
Here R can also shine with its natural Linear Algebra capabilities.

normaliseAdjacencyMatrix <- function(adjacencyMatrix){
    return(adjacencyMatrix %*% diag(1/colSums(adjacencyMatrix)))
}

You see… Isn’t that basically a beautiful one-liner?
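If you want to convince yourself that this really produces a column-stochastic matrix, you can try it on a small toy matrix (the values here are made up, of course):

toyAdjacency <- matrix(c(0, 1, 1,
                         1, 0, 0,
                         1, 1, 0), nrow = 3, byrow = TRUE)
# every column of the normalised matrix should now sum to 1
colSums(normaliseAdjacencyMatrix(toyAdjacency))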

In the next step I implement the transformation of this matrix into a diffusion matrix. This matrix basically represents the connectivity and network topology.
Multiplying a vector of starting probabilities with it gives us back a stationary distribution, which basically means: after many transitions in the network we will end up at a certain set of vertices with a certain probability.
The restart probability indicates how likely it is that we are "teleported" back to our starting point instead of making a transition in a given step.
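In formula terms, what the function below computes is the diffusion matrix

\( F = \beta \cdot (I - (1 - \beta) \cdot W)^{-1} \)

where \( W \) is the normalised transition matrix from above and \( \beta \) the restart probability.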

So let’s implement it:

performInsulatedHeatDiffusion <- function(adjacencyMatrix, restartProbability){
    temperature <- diag(dim(adjacencyMatrix)[1]) - (1 - restartProbability) * adjacencyMatrix
    return(restartProbability * solve(temperature))
}

We will run through this network in both directions: once from the direction of the mutated genes and once from the direction of the differentially expressed ones.
So let's wrap this up in another function. We set the default restart probability to 0.4.

netICS <- function(adjacencyMatrix, networkGenes, mutationData, diffExpGenes,
                   restartProbability = 0.4){

    # only keep mutated and differentially expressed genes, that are present in the network
    mutationInNetwork <- mutationData[networkGenes, on="Gene", nomatch = 0]
    diffExpInNetwork <- diffExpGenes[networkGenes, on="Gene", nomatch = 0]

    # calculating the regular diffusion matrix
    connectivity <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(adjacencyMatrix),
                                                  restartProbability)
    # calculating the diffusion matrix in the opposite direction
    connectivityBackward <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(t(adjacencyMatrix)),
                                                          restartProbability)

    # ranking mediator genes
    result <- prioritize(connectivity, connectivityBackward, networkGenes,
                           mutationData, mutationInNetwork, diffExpInNetwork)

    return(result)

}

At this point you might have noticed that we haven't actually done anything with the diffusion matrix yet. That's what the prioritize function is for, which we will implement next.

Gene Prioritization

First we implement a helper function for the diffusion to avoid redundant code:

diffuseSample <- function(connectivity, networkGenes, S){
    # find the positions of the input genes in the matrix
    positions <- networkGenes$Gene %in% S$Gene
    # multiply them, with equal starting probability, with the diffusion matrix
    weights <- rep(1/sum(positions), sum(positions)) %*% connectivity[positions,]
    return(as.vector(weights))
}

S here is the set of genes with which we multiply the diffusion matrix. The starting probability is uniformly distributed among them.

Afterwards we write our actual prioritization function. The diffusion is executed only once for the differentially expressed genes, but sample-wise for the mutated genes.

Subsequently, we combine both probabilities sample-wise by multiplying them and then rank the genes.
Those ranks are then aggregated by three different methods: median, sum and rho scores.
For the last one the RobustRankAggreg package is used:

prioritize<-function(connectivity, connectivityBackward, networkGenes,
                         mutationData, mutationInNetwork, diffExpInNetwork){

    # diffuse for differential expression, if differential expression isn't sample-wise
    if(!"Sample"  %in% colnames(diffExpInNetwork)){
        Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork)
    }

    scores <- data.table()

    for(sample in unique(mutationData$Sample)){
        # diffuse for mutated genes
        Em<-diffuseSample(connectivity, networkGenes, mutationInNetwork[Sample == sample])

        # diffuse for differential expression, if differential expression is sample-wise
        if("Sample"  %in% colnames(diffExpInNetwork)){
            Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork[Sample == sample])
        }

        # Multiply scores
        E <- Em * Ed
        scores <- rbind(scores, data.table(Gene = networkGenes$Gene, Sample = sample, Score = E))
    }

    # rank scores
    scores[, rank.in.sample:=rank(-Score), by=.(Sample)]

    # combine ranks
    library(RobustRankAggreg)
    ranks<-scores[, .(med=median(rank.in.sample), sum=sum(rank.in.sample),
                      rho = rhoScores(rank.in.sample/max(rank.in.sample))),
                  by=.(Gene)]

    return(ranks)
}

Phew… The work is done! Now it's time to execute the code. If you do this at home you might notice that it takes a while to finish. That's because the calculation of the diffusion matrix is very expensive in terms of processing time: your computer has to solve a system of linear equations, and that doesn't scale linearly with the number of genes. But don't ask me in which O it exactly lies. 🙂

result<-netICS(adjacencyMatrix = adjacencyMatrix, networkGenes,
               mutationData = mutationData, diffExpGenes = diffExp[p.adjusted < 0.05, .(Gene)])
knitr::kable(result[order(sum)][1:10])
Gene       med    sum       rho
RPS6KB1     968   74489.5   0e+00
AKT3        770   77330.5   1e-07
AKT1        944   78520.5   0e+00
SERPINE1    818   81454.5   0e+00
TP53       1026   84433.5   1e-07
MAP2K2      843   86530.5   1e-07
CDKN2D      758   86534.5   0e+00
CREB1      1080   87179.5   1e-07
VEGFA       987   87678.5   0e+00
E2F1        885   88245.5   0e+00

I will probably use this code again at some point in the future, as network theory is one of my areas of interest and it's pretty nice to have it lying around.
This method, heat diffusion, is by the way very similar to PageRank… The stuff that made Google big. 🙂

Some Stuff About The Near Future

Next week I will probably do a post about making map plots, as I have to make one anyway and I figured it would be nice to document it here, since it could be quite useful to some of you.
And in the first week of April I will be at a hackathon (my very first!) about metabolomics in R. I will keep you posted about that, and I'm really looking forward to it.

So see you hopefully soon!

Availability Of My Code

You can access a maintained version of the code in my GitHub repository Raspository under R/NetICS.R.


Integration of Multi-Omics Data

Or the Devil is in the Preprocessing

It's been a while, eh? So I guess it's time for another contribution to the Raspository. I didn't really find the time to write something sensible last month, as I had a lot of things to do. More specifically, I had to finish the essay I talked about in my last post, study for my exams and sort out some things regarding my old apartment, but now I have time again.

The topic of my essay will also be the topic of this post… More or less. The method I had to write about, and for which I'm preparing a presentation right now, is called netICS (10.1093/bioinformatics/bty148). It is a graph-based method for the integration of multi-omics data.
I won't go into detail about the core method itself in this post, but that will be a topic for a coming post… I just need to think about how to present the code I have already written! Instead, I will tell you about the non-trivial step of preprocessing for the integration of multi-omics data. This is probably the part you'll be spending most of your time on, if you're planning to do it.

Necessary simplifications

If you're integrating data from various sources, what you need to do is implement some sensible simplifications. This does not only hold for multi-omics data, but basically for every case where you combine different kinds of sources.
This is necessary because you don't have unlimited resources. To elaborate on this: with resources I mean everything from samples to computing power and verified interactions in your network.

I'm talking about networks because networks are one way of making sense of those data-sets. In this manner you can treat your features as nodes in the network, which can interact with each other in certain ways.
Of course, nodes of different kinds interact with each other in different ways. That's something you either have to model explicitly, or it is the part where you simplify your model. And it is a reasonable simplification to treat all nodes in a network the same. That is more or less what was done in netICS.
In more detail, in the network all different products of a gene are represented by the same node.
This is reasonable because we may know a lot about interactions between certain kinds of gene products, but we don't know enough about all kinds of possible interactions. You always have to keep your simplifications in mind, though.

Combination of different data types

Then there is the question of how to combine different data types, because in the case of the above-mentioned simplification you have to combine the data from the different gene products into one.

One example used in netICS is the combination of p-values according to the following formula, which is known as Fisher's method (10.2307/2681650):

\( X = -2 \cdot \sum^{k}_{i = 1} \log(p_{i}) \)

Under the null hypothesis X follows a chi-square distribution with 2k degrees of freedom, so you can apply a chi-square test on X to get a new combined p-value. Of course this combination method comes with its own limitations.
One of them is that you can only apply it when the p-values you're combining are independent. Independence is often assumed, but there are also methods for the dependent case; it just gets more complicated then.
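For illustration, here is a minimal sketch of how this could look in R (the function name fisherCombine and the example p-values are made up for this post):

fisherCombine <- function(pvals){
    # Fisher's method: combine k independent p-values into one statistic
    X <- -2 * sum(log(pvals))
    # the combined p-value is the upper tail of a chi-square with 2k degrees of freedom
    pchisq(X, df = 2 * length(pvals), lower.tail = FALSE)
}

# e.g. combine a p-value from RNA expression with one from protein expression
fisherCombine(c(0.03, 0.2))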

Depending on what kind of omics data you want to combine, it might also be sensible to combine them into an indicator variable. For example, this indicator variable could take on the value 1 if either one or both of two features from different data-sets are present in a sample, and 0 otherwise.
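As a tiny, made-up sketch of what that could look like (the vectors mutated and methylated are hypothetical presence calls for one gene across four samples):

mutated    <- c(TRUE, FALSE, FALSE, TRUE)
methylated <- c(FALSE, FALSE, TRUE, TRUE)
# 1 if the gene is mutated, methylated or both in a sample, 0 otherwise
indicator  <- as.integer(mutated | methylated)
indicator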

Integration of Multi-Omics Data is hard work

There is however no recipe that you can always just apply… Each data-set is different, and so is each combination of data-sets. That's why data science needs industrious people who don't shy away from racking their brains.

But this work also holds its own rewards, because if you do it you will be able to use a plethora of sweet algorithms on your now preprocessed data-set. netICS, for example, uses an approach related to PageRank, the thing our friend Google does. And I will talk more about it very soon.
For now I hope I could give you a little impression of the struggle of data integration. Of course there are more things to keep in mind, like unclean data. But we will save the cleaning of data-sets for another time. 😛

So see you soon!

