Re-Implementing netICS Heat Diffusion In R

OK, I successfully procrastinated this blog entry about heat diffusion now for too long, even though the code for it has been waiting for over a month. :’D So it’s now time for it! I promised it! And it’s time for it now. And I guess it doesn’t need to be perfect, because the whole idea of this blog is to put things out now, rather than waiting for some future moment, where I can do it perfectly.

So that being said, let’s start. 🙂

In my last post I already talked about the integration of multi-omics data. This time I will show you one method called netICS from a more or less pure coding perspective. It uses Heat Diffusion for finding so-called mediator genes in multi-omics data.
You can find the original code written in Matlab here and the corresponding paper here, which I would suggest you reading, if I aroused your interest in the topic.

Thus let’s just dive straight into the matter!

Because we are cool kids we don’t download the data “manually”, but do it right away in R.
And of course we do this with the super fast and comfy fread function from the data.table package, which I would suggest you to learn, when doing R.

We load the data directly from the Github of the original netICS:

library(data.table)

col.names = c("Gene", "Sample"), header = FALSE)
header = FALSE, col.names = "Gene")

header = FALSE, col.names = c("Gene", "pval"))
header = FALSE, col.names = c("Gene", "pval"))


The network used in the method is stored as a Matlab object. But ff course there’s also a package that does this job for us! After that we have to extract the adjacency matrix from it. Because we will need this adjacency matrix later.

library(R.matlab)
adjacencyMatrix<-tmp$adj.lar.com rm(tmp)  And that already was it with loading the data! Combing p-values In my last post I already talked about combining p-values. We need to implement it to combine the p-values from the rnaDiffExp and rppa. Maybe you will also see now, why I like data.tables so much. I’ve commented the code for you. combineDifferentialExpressions<-function(diffExp1, diffExp2){ # Combine both data.tables through an outer join mergedDiff<-merge(diffExp1, diffExp2, by ="Gene", all=TRUE) # if the first p-value is missing take the second one mergedDiff[is.na(pval.x) & !is.na(pval.y), pval := pval.y] # if the second p-value is missing take the first one mergedDiff[!is.na(pval.x) & is.na(pval.y), pval := pval.x] # if both are present combine them by Fisher's method and perform a chi-square test mergedDiff[!is.na(pval.x) & !is.na(pval.y), pval := pchisq(-2 * (log(pval.x) + log(pval.y)) , 4)] return(mergedDiff[,.(Gene, pval)]) }  I really love the syntax of data.tables in R I have to say. 😀 Now let’s execute it and afterwards adjust the p-values with the FDR method and then take a look at it: diffExp <- combineDifferentialExpressions(rnaDiffExp, rppa) diffExp[, p.adjusted:=p.adjust(pval, method = "fdr")] knitr::kable(diffExp[1:5])  Genepvalp.adjusted A1BG0.00000000.0000000 A1CF0.88511070.9505269 A2BP10.22216470.2792292 A2LD10.91032690.9695794 A2M0.00000000.0000000 Heat Diffusion I won’t go into detail about heat diffusion and its theoretical background at this place. Let me just summarize the idea or rather goal short. We’re transforming the adjacency matrix first to a transition matrix, with which you might be familiar from Markov Chains. Basically each cell in the matrix stands for a transition between two genes (in our case). Here R can also shine with its natural Linear Algebra capabilities. normaliseAdjacencyMatrix <- function(adjacencyMatrix){ return(adjacencyMatrix %*% diag(1/colSums(adjacencyMatrix))) }  You see… Isn’t that basically a beautiful one-liner? In the next step I implement the transformation of this matrix to a diffusion matrix. This matrix basically represents the connectivity and network topology. Multiplying a vector of starting probabilities with it, gives us back a stationary distribution. Which basically means after initially many transitions in the network we will end up at a certain set of vertices in the network with a certain probability. The restart probability indicates how likely it is that we will be “reported” back to our starting point instead of a transition in a step. So let’s implement it: performInsulatedHeatDiffusion <- function(adjacencyMatrix, restartProbability){ temperature <- diag(dim(adjacencyMatrix)[1]) - (1 - restartProbability) * adjacencyMatrix return(restartProbability * solve(temperature)) }  We will run through this network in both directions. Basically, one time from the direction of mutated genes and the other time from the differentially expressed ones. So let’s wrap this up in another function. We set the default restart probability to 0.4. netICS <- function(adjacencyMatrix, networkGenes, mutationData, diffExpGenes, restartProbability = 0.4){ # only keep mutated and differentially expressed genes, that are present in the network mutationInNetwork <- mutationData[networkGenes, on="Gene", nomatch = 0] diffExpInNetwork <- diffExpGenes[networkGenes, on="Gene", nomatch = 0] # calculating the regular diffusion matrix connectivity <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(adjacencyMatrix), restartProbability) # calculating the diffusion matrix in the opposite direction connectivityBackward <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(t(adjacencyMatrix)), restartProbability) # ranking mediator genes result <- prioritize(connectivity, connectivityBackward, networkGenes, mutationData, mutationInNetwork, diffExpInNetwork) return(result) } At this point you might have already noticed that we haven’t done anything yet with the diffusion matrix. For this reason we will implement the prioritize function, which will do this for us. Gene Prioritization First we implement a helper function for the diffusion to avoid redundant code: diffuseSample<-function(connectivity, networkGenes, S){ # find the positions of the input genes in the matrix positions <- networkGenes$Gene %in% S$Gene #multiply them with equal probability with the diffusion matrix weights <- rep(1/sum(positions), sum(positions)) %*% connectivity[positions,] return(as.vector(weights)) } S here is the list of genes with which we multiply the diffusion matrix. The starting probability is for that matter uniformally distributed between them. Afterwards we write our actual prioritization function. The diffusion in this case is only once executed for the differentially expressed genes, but sample-wise for the mutated genes. Subsequently, we combine both probabilities sample-wise by multiplying and rank them. Those ranks are then combined by three different methods, them being median, sums and rho-scores. For the last one the RobustRankAggreg package is used: prioritize<-function(connectivity, connectivityBackward, networkGenes, mutationData, mutationInNetwork, diffExpInNetwork){ # diffuse for differential expression, if differential expression isn't sample-wise if(!"Sample" %in% colnames(diffExpInNetwork)){ Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork) } scores <- data.table() for(sample in unique(mutationData$Sample)){
# diffuse for mutated genes
Em<-diffuseSample(connectivity, networkGenes, mutationInNetwork[Sample == sample])

# diffuse for differential expression, if differential expression is sample-wise
if("Sample"  %in% colnames(diffExpInNetwork)){
Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork[Sample == sample])
}

# Multiply scores
E <- Em * Ed
scores <- rbind(scores, data.table(Gene = networkGenes\$Gene, Sample = sample, Score = E))
}

# rank scores
scores[, rank.in.sample:=rank(-Score), by=.(Sample)]

# combine ranks
library(RobustRankAggreg)
ranks<-scores[, .(med=median(rank.in.sample), sum=sum(rank.in.sample),
rho = rhoScores(rank.in.sample/max(rank.in.sample))),
by=.(Gene)]

return(ranks)
}

Phew… The work is done! Now it’s time to execute the code. If you do this at home you might notice, that it will take a while until it’s finished. That’s because the calculation of the diffusion matrix is very expensive, when it comes to processing time. Your computer has to solve a system of linear equations for it, which is not linear. But don’t ask me in which O it exactly lies. 🙂

result<-netICS(adjacencyMatrix = adjacencyMatrix, networkGenes,
mutationData = mutationData, diffExpGenes = diffExp[p.adjusted < 0.05, .(Gene)])
knitr::kable(result[order(sum)][1:10])

Genemedsumrho
RPS6KB196874489.50e+00
AKT377077330.51e-07
AKT194478520.50e+00
SERPINE181881454.50e+00
TP53102684433.51e-07
MAP2K284386530.51e-07
CDKN2D75886534.50e+00
CREB1108087179.51e-07
VEGFA98787678.50e+00
E2F188588245.50e+00

I will probably use this code at some point of the future again, as Network theory is one of my areas of interest and it’s pretty nice to have that lying around.
This method, heat diffusion, is by the way very similar to Pageranking… The stuff that made Google big. 🙂

Some Stuff About The Near Future

I will probably next week do a post about making map plots. As I have to do one anyway and I figured it would be nice to document it here, as it could be quite useful to some of you.
And in the first week of April I will be on a Hackathon (my very first!) about Metabolomics in R. I will probably keep you posted about that then and I’m looking so forward to it.

So see you hopefully soon!

Availability Of My Code

You can access a maintained version of the code of the correct version in my GitHub repository Raspository under R/NetICS.R.