Category Archives: Networks

Network theory is the scientific discipline studying networks. Whereas networks consist out of nodes and vertices. Those nodes represent entities and the vertices the relationship between those entities.
While there are many ways to represent a network, the most seen one are graphs. And therefore network science can profit much from the mathematical field of graph theory. Furthermore it can also draw on statistical methods and linear algebra.
In this way there are many available methods for investigating them.

Different partical fields like biology and social interactions and the internet apply network theory. In biology it can be used e.g. to study the interaction of proteins. Gene-regulatory networks are another prominent application.

During my Bachelor’s degree I engaged a lot in topics related to network theory and read a lot of papers doing so. While I don’t think, that networks are an one size fits all anymore, they are still useful models.

Bookmark-Coloring Algorithm

And Writing Documentations In R

I wrote this function months ago, while writing the report for my seminar and I wanted to my make a post about this Bookmark-Coloring Algorithm ever since, but I never found the right chance. As I yesterday in the evening wrote some documentation for the Raspository, I thought it would be nice to make a post about this specific topic. Therefor I thought that using the Bookmark-Coloring Algorithm as an example would be nice.

The Bookmark-Coloring Algorithm

This specific algorithm’s use is to generate a personal pagerank. In more detail this algorithm calculates given a starting website the chance to end up at other websites. But this idea is applicable to other fields as well. In my post about network-based integration through heat diffusion I showed you a similar method applied to a network of multi-omic data. On the same data-set you could use the Bookmark-Coloring Algorithm.
The basic idea behind the Bookmark-Coloring Algorithm is that you have some color that diffuses through a network, which is in my opinion equivalent to heat diffusing in a network. Correct me, if I’m wrong.

I implemented the algorithm following the paper about it by Pavel Berkhin(10.1080/15427951.2006.10129116). More precisely I implemented the Algorithm 2 from the paper. So let me show my implementation:

require(igraph)
require(dequer)
BCA<- function(graph, v, retentionCoefficient = 0.4, tol = 0.001){
    # initialise vector of transition chances
    p<-c()
    p[V(graph)] <- 0

    q <- queue()
    pushback(q, v)

    # initialise vector that indicates how much color is in one node
    colorInVertex <- c()
    colorInVertex[v] <- 1

    # execute as long queque q has elements
    while(length(q) > 0){
        i <- pop(q)
        w <- colorInVertex[i]
        # use up the color in node
        colorInVertex[i] <- NA

        p[i] <- p[i] + retentionCoefficient * w

        # if all color is used up continuew to next element in queque
        if(w < tol){

            next
        }

        # execute for all neighbors
        for(j in neighbors(graph, i, mode = "out")){
            if(!is.na(colorInVertex[j])){
                # add color to neighbor
                colorInVertex[j] <- colorInVertex[j] +
                    ((1 - retentionCoefficient) * w/degree(graph, i, mode = "out"))
            }else{
                # initialise color in neighbor
                pushback(q, j)
                colorInVertex[j] <- (1 - retentionCoefficient) * w/degree(graph, i, mode = "out")
            }
        }

    }

    return(p)
}

I wrote some comments, that hopefully help you to understand, what’s going on. That’s also the first part of documentation. It’s the step you’ll probably do, while writing your code and in my opinion it’s always useful.

So are we done with the documentation? No. Not, if we want to this function into a package.

roxygen2 documentation

roxygen2 is a nice package allowing you to write in-line documentation in R. I won’t go to much into detail about it here as there are lot of online sources for it1, but I will show you a short example, how to do it!

Now let me show, how to write a documentation for the BCA function, I will go over all the specified tags.

#' Bookmark Coloring Algorithm
#' 
#' @aliases BookmarkColoringAlgorithm
#' 
#' @description This function calculates a teleportation vector from a given 
#' starting node to other nodes in a given network.
#'
#' @export BCA
#' @import dequer
#' @import igraph
#'
#' @param graph an object of type \code{\link[igraph]{igraph}}.
#' @param v a starting vertex from the above graph. Can be either its identifier 
#' or a igraph.vs object.
#' @param retentionCoefficient the restart probability for each node.
#' @param tol a tolerance treshold, indicating what the smalltest value of color 
#' is, that should propagate further
#'
#' @return a preference/teleportation vector
#'
#' @references \insertRef{Berkhin2006}{Raspository}
#'
#' @examples
#' library(igraph)
#' g <- make_ring(5)
#' preferenceVector <- BCA(g, 1)
BCA <- function(graph, v, retentionCoefficient = 0.4, tol = 0.001){

OK, let’s go over some of the tags…
In the first line of course you have the title of the function. Additional to the description, you can also add a details tag, where description should give a short overview over the method, you could include theoretical background in the details.
Then needed packages are imported. roxygen2 will convert them into lines in your NAMESPACE file.

With params and return you should shortly describe their types and what they should contain.

For the following use of the references tag, you also need to import the Rdpack package and include a “REFERENCES.bib” in the inst folder with the regarding BibTeX entries. In my opinion you should always use this, when implementing some method from some source… Be it a book or a paper. Rdpack imports those references automatically from your BibTeX file into the documentation files.

Last, but not least I included a runnable example. This is important to give your user a starting point on how to use your function, but furthermore it is a test for the function. Each time your package is built, example code is run. So you will be notified, if there are any errors.
But we will go more into automated testing another time. Because there is of course more you can do. But you should always write example code, if your function is visible to the user.

After writing this documentation in your file, you have to compile it by writing:

roxygen2::roxygenise()
BCA(graph, v, retentionCoefficient = 0.4, tol = 0.001)
Arguments

graph	
an object of type igraph.
v	
a starting vertex from the above graph. Can be either its identifier or a igraph.vs object.
retentionCoefficient	
the restart probability for each node.
tol	
a tolerance treshold, indicating what the smalltest value of color is, that should propagate further
Value

a preference/teleportation vector

References

Berkhin P (2006). “Bookmark-Coloring Algorithm for Personalized PageRank Computing.” Internet Mathematics, 3(1), 41–62. ISSN 1542-7951, doi: 10.1080/15427951.2006.10129116, http://www.internetmathematicsjournal.com/article/1412.
An excerpt from the generated documentation as it appears in RStudio

I hope that this short example will help you writing and documenting your own functions! 🙂

Availability Of The Code

You can access a maintained version of the code in my GitHub repository Raspository under R/BCA.R.

Please follow and like us:
error0

Bibliography

Network-Based Integration Through Heat Diffusion

Re-Implementing netICS Heat Diffusion In R

OK, I successfully procrastinated this blog entry about heat diffusion now for too long, even though the code for it has been waiting for over a month. :’D So it’s now time for it! I promised it! And it’s time for it now. And I guess it doesn’t need to be perfect, because the whole idea of this blog is to put things out now, rather than waiting for some future moment, where I can do it perfectly.

So that being said, let’s start. 🙂

In my last post I already talked about the integration of multi-omics data. This time I will show you one method called netICS from a more or less pure coding perspective. It uses Heat Diffusion for finding so-called mediator genes in multi-omics data.
You can find the original code written in Matlab here and the corresponding paper here, which I would suggest you reading, if I aroused your interest in the topic.

Thus let’s just dive straight into the matter!

Loading Data The Cool Way

Because we are cool kids we don’t download the data “manually”, but do it right away in R.
And of course we do this with the super fast and comfy fread function from the data.table package, which I would suggest you to learn, when doing R.

We load the data directly from the Github of the original netICS:

library(data.table)

 mutationData <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/mutation_data_breast.txt", 
                       col.names = c("Gene", "Sample"), header = FALSE)
networkGenes <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/network_genes.txt", 
                      header = FALSE, col.names = "Gene")

rnaDiffExp <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/RNA_diff_expr_breast.txt", 
                    header = FALSE, col.names = c("Gene", "pval"))
rppa <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/protein_diff_expr_breast.txt", 
              header = FALSE, col.names = c("Gene", "pval"))

The network used in the method is stored as a Matlab object. But ff course there’s also a package that does this job for us! After that we have to extract the adjacency matrix from it. Because we will need this adjacency matrix later.

library(R.matlab)
tmp<-readMat("https://github.com/cbg-ethz/netics/blob/master/data/adj_lar_com.mat?raw=true")
adjacencyMatrix<-tmp$adj.lar.com
rm(tmp)

And that already was it with loading the data!

Combing p-values

In my last post I already talked about combining p-values. We need to implement it to combine the p-values from the rnaDiffExp and rppa.

Maybe you will also see now, why I like data.tables so much. I’ve commented the code for you.

combineDifferentialExpressions<-function(diffExp1, diffExp2){
    # Combine both data.tables through an outer join
    mergedDiff<-merge(diffExp1, diffExp2, by ="Gene", all=TRUE)
    # if the first p-value is missing take the second one
    mergedDiff[is.na(pval.x) & !is.na(pval.y), pval := pval.y]
    # if the second p-value is missing take the first one
    mergedDiff[!is.na(pval.x) & is.na(pval.y), pval := pval.x]
    # if both are present combine them by Fisher's method and perform a chi-square test
    mergedDiff[!is.na(pval.x) & !is.na(pval.y), pval := pchisq(-2 * (log(pval.x) + log(pval.y)) , 4)]
    return(mergedDiff[,.(Gene, pval)])
}

I really love the syntax of data.tables in R I have to say. 😀
Now let’s execute it and afterwards adjust the p-values with the FDR method and then take a look at it:

diffExp <- combineDifferentialExpressions(rnaDiffExp, rppa)
diffExp[, p.adjusted:=p.adjust(pval, method = "fdr")]
knitr::kable(diffExp[1:5])
Genepvalp.adjusted
A1BG0.00000000.0000000
A1CF0.88511070.9505269
A2BP10.22216470.2792292
A2LD10.91032690.9695794
A2M0.00000000.0000000

Heat Diffusion

I won’t go into detail about heat diffusion and its theoretical background at this place. Let me just summarize the idea or rather goal short. We’re transforming the adjacency matrix first to a transition matrix, with which you might be familiar from Markov Chains.
Basically each cell in the matrix stands for a transition between two genes (in our case).

Here R can also shine with its natural Linear Algebra capabilities.

normaliseAdjacencyMatrix <- function(adjacencyMatrix){
    return(adjacencyMatrix %*% diag(1/colSums(adjacencyMatrix)))
}

You see… Isn’t that basically a beautiful one-liner?

In the next step I implement the transformation of this matrix to a diffusion matrix. This matrix basically represents the connectivity and network topology.
Multiplying a vector of starting probabilities with it, gives us back a stationary distribution. Which basically means after initially many transitions in the network we will end up at a certain set of vertices in the network with a certain probability.
The restart probability indicates how likely it is that we will be “reported” back to our starting point instead of a transition in a step.

So let’s implement it:

performInsulatedHeatDiffusion <- function(adjacencyMatrix, restartProbability){
    temperature <- diag(dim(adjacencyMatrix)[1]) - (1 - restartProbability) * adjacencyMatrix
    return(restartProbability * solve(temperature))
}

We will run through this network in both directions. Basically, one time from the direction of mutated genes and the other time from the differentially expressed ones.
So let’s wrap this up in another function. We set the default restart probability to 0.4.

netICS <- function(adjacencyMatrix, networkGenes, mutationData, diffExpGenes,
                   restartProbability = 0.4){

    # only keep mutated and differentially expressed genes, that are present in the network
    mutationInNetwork <- mutationData[networkGenes, on="Gene", nomatch = 0]
    diffExpInNetwork <- diffExpGenes[networkGenes, on="Gene", nomatch = 0]

    # calculating the regular diffusion matrix
    connectivity <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(adjacencyMatrix),
                                                  restartProbability)
    # calculating the diffusion matrix in the opposite direction
    connectivityBackward <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(t(adjacencyMatrix)),
                                                          restartProbability)

    # ranking mediator genes
    result <- prioritize(connectivity, connectivityBackward, networkGenes,
                           mutationData, mutationInNetwork, diffExpInNetwork)

    return(result)

}

At this point you might have already noticed that we haven’t done anything yet with the diffusion matrix. For this reason we will implement the prioritize function, which will do this for us.

Gene Prioritization

First we implement a helper function for the diffusion to avoid redundant code:

diffuseSample<-function(connectivity, networkGenes, S){
    # find the positions of the input genes in the matrix
    positions <- networkGenes$Gene %in% S$Gene
    #multiply them with equal probability with the diffusion matrix
    weights <- rep(1/sum(positions), sum(positions)) %*% connectivity[positions,]
    return(as.vector(weights))
}

S here is the list of genes with which we multiply the diffusion matrix. The starting probability is for that matter uniformally distributed between them.

Afterwards we write our actual prioritization function. The diffusion in this case is only once executed for the differentially expressed genes, but sample-wise for the mutated genes.

Subsequently, we combine both probabilities sample-wise by multiplying and rank them.
Those ranks are then combined by three different methods, them being median, sums and rho-scores.
For the last one the RobustRankAggreg package is used:

prioritize<-function(connectivity, connectivityBackward, networkGenes,
                         mutationData, mutationInNetwork, diffExpInNetwork){

    # diffuse for differential expression, if differential expression isn't sample-wise
    if(!"Sample"  %in% colnames(diffExpInNetwork)){
        Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork)
    }

    scores <- data.table()

    for(sample in unique(mutationData$Sample)){
        # diffuse for mutated genes
        Em<-diffuseSample(connectivity, networkGenes, mutationInNetwork[Sample == sample])

        # diffuse for differential expression, if differential expression is sample-wise
        if("Sample"  %in% colnames(diffExpInNetwork)){
            Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork[Sample == sample])
        }

        # Multiply scores
        E <- Em * Ed
        scores <- rbind(scores, data.table(Gene = networkGenes$Gene, Sample = sample, Score = E))
    }

    # rank scores
    scores[, rank.in.sample:=rank(-Score), by=.(Sample)]

    # combine ranks
    library(RobustRankAggreg)
    ranks<-scores[, .(med=median(rank.in.sample), sum=sum(rank.in.sample),
                      rho = rhoScores(rank.in.sample/max(rank.in.sample))),
                  by=.(Gene)]

    return(ranks)
}

Phew… The work is done! Now it’s time to execute the code. If you do this at home you might notice, that it will take a while until it’s finished. That’s because the calculation of the diffusion matrix is very expensive, when it comes to processing time. Your computer has to solve a system of linear equations for it, which is not linear. But don’t ask me in which O it exactly lies. 🙂

result<-netICS(adjacencyMatrix = adjacencyMatrix, networkGenes,
               mutationData = mutationData, diffExpGenes = diffExp[p.adjusted < 0.05, .(Gene)])
knitr::kable(result[order(sum)][1:10])
Genemedsumrho
RPS6KB196874489.50e+00
AKT377077330.51e-07
AKT194478520.50e+00
SERPINE181881454.50e+00
TP53102684433.51e-07
MAP2K284386530.51e-07
CDKN2D75886534.50e+00
CREB1108087179.51e-07
VEGFA98787678.50e+00
E2F188588245.50e+00

I will probably use this code at some point of the future again, as Network theory is one of my areas of interest and it’s pretty nice to have that lying around.
This method, heat diffusion, is by the way very similar to Pageranking… The stuff that made Google big. 🙂

Some Stuff About The Near Future

I will probably next week do a post about making map plots. As I have to do one anyway and I figured it would be nice to document it here, as it could be quite useful to some of you.
And in the first week of April I will be on a Hackathon (my very first!) about Metabolomics in R. I will probably keep you posted about that then and I’m looking so forward to it.

So see you hopefully soon!

Availability Of My Code

You can access a maintained version of the code of the correct version in my GitHub repository Raspository under R/NetICS.R.

Please follow and like us:
error0