Category Archives: Programming

Programming is my life! Well, as I'm working as a programmer, it more or less is. And I'm also coding in my free time.
Languages I'm proficient in include R, Java, Python and C++. But I'm also capable of writing Bash or Perl scripts, even though I don't like the latter that much.

In this category of my blog I will generally present code to you, be it the implementation of an algorithm or an application. The two languages you will probably find most here are R and C++, also because they are easy to integrate with each other. But this might change over time, as programming languages and their use change.

If you want some algorithm explained, just write me. I will see what I can do!

Bookmark-Coloring Algorithm

And Writing Documentation In R

I wrote this function months ago while writing the report for my seminar, and I have wanted to make a post about this Bookmark-Coloring Algorithm ever since, but I never found the right chance. As I wrote some documentation for the Raspository yesterday evening, I thought it would be nice to make a post about this specific topic. Therefore I thought that using the Bookmark-Coloring Algorithm as an example would be nice.

The Bookmark-Coloring Algorithm

This specific algorithm is used to generate a personalized PageRank. In more detail: given a starting website, the algorithm calculates the chance of ending up at other websites. But this idea is applicable to other fields as well. In my post about network-based integration through heat diffusion I showed you a similar method applied to a network of multi-omic data. On the same data-set you could use the Bookmark-Coloring Algorithm.
The basic idea behind the Bookmark-Coloring Algorithm is that you have some color that diffuses through a network, which is in my opinion equivalent to heat diffusing through a network. Correct me if I'm wrong.

I implemented the algorithm following the paper about it by Pavel Berkhin (10.1080/15427951.2006.10129116). More precisely, I implemented Algorithm 2 from the paper. So let me show you my implementation:

require(igraph)
require(dequer)
BCA <- function(graph, v, retentionCoefficient = 0.4, tol = 0.001){
    # initialise vector of transition chances
    p <- c()
    p[V(graph)] <- 0

    q <- queue()
    pushback(q, v)

    # initialise vector that indicates how much color is in one node
    colorInVertex <- c()
    colorInVertex[v] <- 1

    # execute as long as queue q has elements
    while(length(q) > 0){
        i <- pop(q)
        w <- colorInVertex[i]
        # use up the color in node
        colorInVertex[i] <- NA

        p[i] <- p[i] + retentionCoefficient * w

        # if the remaining color is below the tolerance, continue with the next element in the queue
        if(w < tol){

            next
        }

        # execute for all neighbors
        for(j in neighbors(graph, i, mode = "out")){
            if(!is.na(colorInVertex[j])){
                # add color to neighbor
                colorInVertex[j] <- colorInVertex[j] +
                    ((1 - retentionCoefficient) * w/degree(graph, i, mode = "out"))
            }else{
                # initialise color in neighbor
                pushback(q, j)
                colorInVertex[j] <- (1 - retentionCoefficient) * w/degree(graph, i, mode = "out")
            }
        }

    }

    return(p)
}
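If you want to give the function a quick spin before we get to the documentation part, something along these lines should work (a toy example I made up just for this post, the exact values aren't important):

library(igraph)
# a small undirected ring graph as toy input
g <- make_ring(5)
p <- BCA(g, 1)
p
# the retained color should roughly sum up to the color we started with
sum(p)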

As for the implementation itself, I wrote some comments that hopefully help you understand what's going on. That's also the first part of the documentation: it's a step you'll probably do while writing your code anyway, and in my opinion it's always useful.

So are we done with the documentation? No. Not if we want to put this function into a package.

roxygen2 documentation

roxygen2 is a nice package that allows you to write in-line documentation in R. I won't go too much into detail about it here, as there are lots of online sources for it, but I will show you a short example of how to do it!

Now let me show you how to write the documentation for the BCA function. I will go over all the specified tags.

#' Bookmark Coloring Algorithm
#' 
#' @aliases BookmarkColoringAlgorithm
#' 
#' @description This function calculates a teleportation vector from a given 
#' starting node to other nodes in a given network.
#'
#' @export BCA
#' @import dequer
#' @import igraph
#'
#' @param graph an object of type \code{\link[igraph]{igraph}}.
#' @param v a starting vertex from the above graph. Can be either its identifier 
#' or an igraph.vs object.
#' @param retentionCoefficient the restart probability for each node.
#' @param tol a tolerance threshold, indicating the smallest amount of color 
#' that should still propagate further.
#'
#' @return a preference/teleportation vector
#'
#' @references \insertRef{Berkhin2006}{Raspository}
#'
#' @examples
#' library(igraph)
#' g <- make_ring(5)
#' preferenceVector <- BCA(g, 1)
BCA <- function(graph, v, retentionCoefficient = 0.4, tol = 0.001){

OK, let’s go over some of the tags…
In the first line you of course have the title of the function. In addition to the description you can also add a details tag: while the description should give a short overview of the method, you could include theoretical background in the details.
Then the needed packages are imported. roxygen2 will convert these tags into lines in your NAMESPACE file.

With params and return you should briefly describe the types of the arguments and the return value and what they should contain.

For the following use of the references tag you also need to import the Rdpack package and include a "REFERENCES.bib" file with the corresponding BibTeX entries in the inst folder. In my opinion you should always do this when implementing a method from some source, be it a book or a paper. Rdpack inserts those references automatically from your BibTeX file into the documentation files.

Last, but not least, I included a runnable example. This is important to give your users a starting point on how to use your function, but it also serves as a test of the function. Each time your package is checked, the example code is run, so you will be notified if there are any errors.
But we will go more into automated testing another time, because there is of course more you can do. In any case, you should always write example code if your function is visible to the user.
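For instance, running the check via devtools will also execute the example code:

devtools::check()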

After writing this documentation in your file, you have to compile it by writing:

roxygen2::roxygenise()
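If you prefer devtools, the following should do the same job:

devtools::document()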
An excerpt from the generated documentation, as it appears in RStudio:

Usage

BCA(graph, v, retentionCoefficient = 0.4, tol = 0.001)

Arguments

graph: an object of type igraph.
v: a starting vertex from the above graph. Can be either its identifier or an igraph.vs object.
retentionCoefficient: the restart probability for each node.
tol: a tolerance threshold, indicating the smallest amount of color that should still propagate further.

Value

a preference/teleportation vector

References

Berkhin P (2006). “Bookmark-Coloring Algorithm for Personalized PageRank Computing.” Internet Mathematics, 3(1), 41–62. ISSN 1542-7951, doi: 10.1080/15427951.2006.10129116, http://www.internetmathematicsjournal.com/article/1412.

I hope that this short example will help you write and document your own functions! 🙂

Availability Of The Code

You can access a maintained version of the code in my GitHub repository Raspository under R/BCA.R.



Salt And Pepper Noise

And Measuring Noise

Using the nomenclature developed in yesterday's post, today I will implement a method for creating salt-and-pepper noise in images. This noise simulates dead pixels by setting them either to the lowest or the highest grey value, in our case 0 or 1.

First let's install and load the Raspository, in which the methods and the class developed yesterday are located.

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("David-J-R/Raspository")
library(Raspository)

Without further ado let me show you how to implement the salt and pepper noise.

saltAndPepperNoise <- function(object, percentage = .2){
    # select the indices to set to 0 or 1 at random
    indices <- sample(length(object@current), length(object@current) * percentage)
    # draw zeros and ones from a binomial distribution
    values <- rbinom(length(indices), 1, 0.5)

    object@current[indices] <- values
    return(object)
}

OK, our animal test subject today will be an owl.

set.seed(1)
owl <- imageBWFromJpeg("owl.jpg")
plot(owl)
A black and white picture of a snowy owl that looks somewhat surprised.
A picture of a snowy owl I took at the Saarbrücker Wildpark

It looks surprised, doesn't it? It probably knows about the salt-and-pepper noise we are about to add.

owlNoise <- saltAndPepperNoise(owl)
plot(owlNoise)
The same picture with a lot of grainy noise added.
The same picture with salt and pepper noise added

Uhm, this looks really annoying now. The other kinds of noise weren't that disruptive. Just like dead pixels on your screen, these make you feel bad.

Introducing Two Measures of Noise


What would be nice now as well is some measure of the noise. Let's just use the mean squared error for it. I'll implement a function that can compare either two pictures or the current and the original picture in an imageBW object, depending on the input parameters.
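Just as a reminder, for two images \(x\) and \(y\) with \(N\) pixels each, the mean squared error is:

\(
MSE(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_{i} - y_{i})^{2}
\)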

MSE <- function(object, other = NULL){
    if(is.null(other)){
        errorMatrix <- object@original - object@current
    }else{
        errorMatrix <- other@current - object@current
    }
    squaredErrorMatrix <- errorMatrix ^ 2
    return(mean(squaredErrorMatrix))
}

Nice! Now we finally have a measure for our noise. Of course this will be especially helpful later, when we want to evaluate the performance of reconstruction methods.
So let’s test it.

MSE(owlNoise)
## [1] 0.05545031
MSE(owlNoise, owl)
## [1] 0.05545031

As expected, both function calls leave us with the same result.

Let's also implement the peak signal-to-noise ratio, which gives us back a value in decibel. The lower the value, the more noise you have in the picture. You can calculate it according to the following formula, which I will also "simplify" a bit.

\(
PSNR(x, y) = 10 \cdot \log_{10} \left( \frac{MAX^{2}}{MSE(x, y)} \right)
\)

I hope you know the logarithm rules. 😛

\(
PSNR(x, y) = 10 \cdot \log_{10}(MAX^{2}) - 10 \cdot \log_{10}(MSE(x, y))
\)
\(
PSNR(x, y) = 20 \cdot \log_{10}(MAX) - 10 \cdot \log_{10}(MSE(x, y))
\)

Where x and y are the two pictures to compare and MAX is the maximum value a pixel can take on.

And for MAX being one in our case we get:

\(
PSNR(x, y) = - 10 \cdot \log_{10}(MSE(x, y))
\)

Subsequently, it’s now time to implement this measure as well:

PSNR <- function(object, other = NULL){
    mse <- MSE(object, other)
    return(-10 * log(mse, base = 10))
}

That was pretty straightforward. So let’s test it!

PSNR(owlNoise)
## [1] 12.56096

It's a relatively low value, which indicates a lot of noise. In my opinion this value is more intuitive than the MSE, probably because of the logarithm; the MSE value alone didn't seem that big.

But that's it for now. There will probably be further stuff next week, though probably not that much.
Thanks for reading and have a nice remaining weekend. 🙂

Availability Of The Code

You can access a maintained version of the code in my GitHub repository Raspository under R/imageNoise.R.
But as I will expand the code, it will also probably grow there.


Multiplicative Noise And S4 Classes

Let’s Add Some More Noise

As mentioned in yesterday's post, there are more kinds of noise. So let's first load today's picture and then add some multiplicative noise!

library(jpeg)
pigeon <- readJPEG("pigeon.jpg")
pigeonBW <- pigeon[,,1]
plot(as.raster(pigeonBW))
Black and white picture of a pigeon sitting on a street light
The original image of a pigeon on a street light

Oh look, it's a pigeon on a lantern. I wonder what this little guy was thinking? But let's not dwell on this. Rather, let's think about how we can scramble up this picture!
I will first try multiplicative, normally distributed noise, which is pretty straightforward if you followed yesterday's post.

set.seed(1)
pigeonNormalNoise <- pigeonBW * (1 + rnorm(length(pigeonBW), sd = sd(pigeonBW)))
pigeonNormalNoise[pigeonNormalNoise > 1] <- 1
pigeonNormalNoise[pigeonNormalNoise < 0] <- 0
plot(as.raster(pigeonNormalNoise))
Black and white picture of a pigeon sitting on a street light with some noise on the brighter areas
The same picture with multiplicative noise added

Can you see the difference from the additive noise from yesterday? Indeed, the darker areas of the picture seem nearly untouched by the noise. That's because the intensity of the noise now depends on the intensity of the pixel. And intensity is what dark pixels are lacking, so to speak.

But it's really annoying and redundant to have to write the same stuff over and over again, isn't it? So let's do what a computer scientist does best. Let's generalize some methods.
But where to start?

In my view it would be useful to have some kind of (black and white) image class, which stores the image and the operations on it. That way the process would also be reproducible, which is always nice.

imageBW <- setClass("imageBW", slots = list(original = "matrix", current = "matrix", operations = "list"))

imageBWFromJpeg <- function(pathToJpeg){
    require(jpeg)
    image <- readJPEG(pathToJpeg)
    imageBW <- image[,,1]
    return(new("imageBW", original = imageBW, current = imageBW, operations = list()))
}

plot.imageBW <- function(object){
    plot(as.raster(object@current))
}

As an illustration I also overrode the default plot function, which makes our lives even easier. In the next step I'll implement the different noise functions. There are probably more general ways of doing so, but for now this will suffice. I also haven't come up with a good way of storing the performed operations yet, so I'll probably do that another day.
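As a side note, the more S4-idiomatic way would probably be to register the plot method via setMethod. A rough, untested sketch of how that could look (the S3-style version above does the job just as well for our purposes):

# sketch: registering an S4 plot method for the imageBW class
setMethod("plot", signature(x = "imageBW"), function(x, y, ...){
    plot(as.raster(x@current), ...)
})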

cropPixels <- function(object){
    object@current[object@current > 1] <- 1
    object@current[object@current < 0] <- 0
    return(object)
}
addUnifNoise <- function(object){
    slot(object, "current") <- slot(object, "current") + 
        runif(length(slot(object, "current")), min = -1, max = 1)
    return(cropPixels(object))
}
addNormalNoise <- function(object, sd = NULL){
    if(is.null(sd)){
        object@current <- object@current + rnorm(length(object@current), sd = sd(object@current))
    }else{
        object@current <- object@current + rnorm(length(object@current), sd = sd)
    }
    return(cropPixels(object))
}
multiplyUnifNoise <- function(object){
    object@current <- object@current * (1 + runif(length(object@current), min = -1, max = 1))
    return(cropPixels(object))
}
multiplyNormalNoise <- function(object, sd = NULL){
    if(is.null(sd)){
        object@current <- object@current * ( 1 + rnorm(length(object@current), 
                                                       sd = sd(object@current)))
    }else{
        object@current <- object@current * ( 1 + rnorm(length(object@current), 
                                                       sd = sd))
    }
    return(cropPixels(object))
}

In the future I would also like to have this working with infix operators, meaning that I could write stuff like image <- image + normalNoise(...) or image <- image * (1 + normalNoise(...)), so that I have a neat little grammar for working with images. However, for the moment those functions will do the job.
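Just to sketch where this could go: something along the following lines might work, with normalNoise being a hypothetical helper that only returns a noise matrix of matching size (untested, and the signatures would certainly need more thought):

# hypothetical helper: returns a noise matrix with the image's dimensions
normalNoise <- function(object, sd = 0.1){
    matrix(rnorm(length(object@current), sd = sd), nrow = nrow(object@current))
}

# sketch of a `+` method, so that image + normalNoise(image) adds the noise and crops
setMethod("+", signature("imageBW", "matrix"), function(e1, e2){
    e1@current <- e1@current + e2
    return(cropPixels(e1))
})

Now let us make use of the newly implemented methods and add a lot of noise to the picture.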

image <- imageBWFromJpeg("pigeon.jpg")
image <- addNormalNoise(image)
image <- multiplyUnifNoise(image)
image <- addUnifNoise(image)
plot(image)
Black and white picture of a pigeon sitting on a street light with so much noise that you can only recognize the outlines.
The same picture with several kinds of noise added

Haha, there's not much left of this poor pigeon. But you can still see its outline and the lamp's. And anyway, I'm quite certain that there are algorithms out there that could reconstruct a lot of the original image. You will see that later on. 🙂

Availability Of The Code

You can access a maintained version of the code in my GitHub repository Raspository under R/imageOneChannel.R.
But as I will expand the code, it will also probably grow there.


How to Add Noise To Images

Or Let’s Make Some Noise!!

As I wrote in one of my last posts, there will be some content about image processing and computer vision. And following my second lecture on it, here we go! Fortunately R has some capabilities for handling pictures, of which I will make use.

For now the topic will just be how to add noise to images, which can be useful, e.g. if you want to test the performance of an algorithm for handling noise. So we're simulating noise. Natural noise in images can come from different sources; the most common would be high ISO values and/or long exposure times, so in general low-light conditions.
However, I'll only work with black and white pictures for today. Maybe color comes at some point in the future. 🙂

Loading Pictures Into R

First I'll show you how to load pictures into R and how to plot them again. It's quite easy tbh:

library(jpeg)
goose <- readJPEG("goose.jpg")
class(goose)
## [1] "array"
dim(goose)
## [1] 600 897   3

As the picture is black and white, we only need one channel of it. And we can just plot it by transforming it into a raster object:

gooseBW <- goose[,,1]
plot(as.raster(gooseBW))
A Canada goose grooming itself
A picture of a Canada goose I took in Saarbrücken

Cute, this goose, isn't it? 🙂 Now let's destroy this picture with noise! Nyehehehehe!

Adding The Noise

To begin with, I will add some uniformly distributed noise. As the values in the matrix are between 0 and 1, I will add values between -1 and 1.

set.seed(1)
gooseUniformNoise <- gooseBW + runif(length(gooseBW), min = -1, max = 1)

But we can't plot this right away, as the matrix now also contains values above 1 and below 0. I'll fix this by cropping to the boundaries and then plot.

gooseUniformNoise[gooseUniformNoise > 1] <- 1
gooseUniformNoise[gooseUniformNoise < 0] <- 0
plot(as.raster(gooseUniformNoise))
The same picture with uniform noise added. It now looks grainy
The same picture with uniform noise added

Interesting how you can still make out the goose in spite of the noise, isn't it? Now let's do the same with normally distributed noise. I'll take the standard deviation of the black and white picture.

gooseNormalNoise <- gooseBW + rnorm(length(gooseBW), sd = sd(gooseBW))
gooseNormalNoise[gooseNormalNoise > 1] <- 1
gooseNormalNoise[gooseNormalNoise < 0] <- 0
plot(as.raster(gooseNormalNoise))
The same picture with normally distributed noise added. It now looks like an old black and white photograph
The same picture with normally distributed noise added


In the lecture we also covered multiplicative and impulse noise. However, I won't do them in this post, maybe another time.
Anyway, do you already see why I like R so much? 🙂 If you do linear algebra and statistics stuff, it's just super comfy!
I'll probably show you even more reasons why when I demonstrate how to denoise pictures again. But that's it for today.

Have a nice weekend!


Map Plots About the Global Burden of Disease

A practical example

As promised in another post, I will show you how to do map plots with R. For this purpose I will use the maps package together with ggplot2, which makes this a relatively easy task.
But before I begin with the actual code, let me give you a short motivation.

Why use map plots

Motivations for using map plots can be various. For example, if you're a journalist and you want to visualize certain kinds of events (say, earthquakes) in a regional context, this is a very illustrative way of presenting your data.
Or if you want to present some kind of data about places or countries, map plots are always a good option.

The first time I did a map plot was actually as part of an awesome lecture I had back in Munich at the TUM. Afterwards I got the chance to use this skill right away in the next semester for my Bachelor's thesis.
As some of you might know, the area where I applied the algorithm I improved and implemented for my thesis was mental disorders. During the writing process I found it a good starting point for the thesis and the accompanying presentation to emphasize the prevalence of mental disorders in the world. In order to do so I used a map plot.
That's basically also what I will do now, but this time for the case of cancer.

On a side note, I'm not saying that you should do these kinds of plots for every thesis or presentation regarding a disease topic. It's just one possible starting point and not necessarily the best one.
So please don't just mindlessly copy what I'm doing here. 🙂

Getting the data for disease-related map plots

First let’s load all the packages we will need for this little exercise.

library(XLConnect)
library(data.table)
library(ggplot2)
library(ggthemes)
library(maps)

XLConnect is a package for loading Excel sheets, which we will need here.
That I like to use data.table you have probably already noticed. It's just super fast and comfy for some procedures and it has some nice synergies with ggplot2.
The maps package contains, as the name suggests, map data which can be plotted. Alternatively one could also use ggmap.
And ggthemes contains a neat theme for maps, which I will use.

First let's load our world map. This data.table contains region names and the boundaries of those regions as longitudes and latitudes, which ggplot can plot as polygons.

mapdata <- data.table(map_data("world"))
knitr::kable(mapdata[1:5])
|      long |      lat | group | order | region | subregion |
|----------:|---------:|------:|------:|:-------|:----------|
| -69.89912 | 12.45200 |     1 |     1 | Aruba  | NA        |
| -69.89571 | 12.42300 |     1 |     2 | Aruba  | NA        |
| -69.94219 | 12.43853 |     1 |     3 | Aruba  | NA        |
| -70.00415 | 12.50049 |     1 |     4 | Aruba  | NA        |
| -70.06612 | 12.54697 |     1 |     5 | Aruba  | NA        |

OK, done. Now we need to download the data on 2004’s mortality from the WHO.

download.file("www.who.int/entity/healthinfo/global_burden_disease/gbddeathdalycountryestimates2004.xls", "gbd.xls")
tmp <- readWorksheetFromFile(file = "gbd.xls", sheet = "Deaths 2004")

causes <- tmp$Col1[14:143]

countries <- unname(tmp[6,7:198])
deathRates <- tmp[14:143,7:198]

You should probably take a look at the Excel file yourself to understand it and what I'm doing later. The file is made for humans to look at, not directly for machines to read, which is why we have to do some cleaning and transforming. In my experience as a Bioinformatics student this is something you have to do almost always. Even if you have a machine-readable format, there's no perfect data-set. You will always have some missing data or have to transform your data in some way.
And this isn't necessarily a trivial step. Often you will spend a lot of time here. And that's fine: if cleaning data were trivial, we wouldn't need data scientists.

Cleaning data

To begin with, we have to transform the death rates to numeric values, because they're characters (strings) right now. For this purpose we also have to remove the commas used as thousands separators. You see? What's done to make the data more human-readable makes it less machine-readable. That's often the case.
Then we set the column names to the countries and transform the matrix together with the vector of causes into a data.table.

deathRatesNum <- matrix(as.numeric(gsub(",", "", as.matrix(deathRates))), nrow = dim(deathRates)[1])
## Warning in matrix(as.numeric(gsub(",", "", as.matrix(deathRates))), nrow =
## dim(deathRates)[1]): NAs introduced by coercion
colnames(deathRatesNum) <- countries
DT <- data.table(causes = causes, deathRatesNum)

Now we want a clean, also called long, data-set. In this new data-set we will have only three columns: two variables (causes and region), which uniquely identify the value, the death rate.
Similar to a database we can also set those variable columns as keys, which makes the table very fast to search.

DTclean <- melt(DT, id.vars = "causes",  variable.name = "region", value.name = "deathRate")
setkey(DTclean, causes, region)

Next let us see if we have some regions in our data.table that aren't in our map.

DTclean[!region %in% mapdata$region, unique(region)]
##  [1] Antigua and Barbuda                      
##  [2] Brunei Darussalam                        
##  [3] Congo                                    
##  [4] Côte d'Ivoire                            
##  [5] Democratic People's Republic of Korea    
##  [6] Iran (Islamic Republic of)               
##  [7] Lao People's Democratic Republic         
##  [8] Libyan Arab Jamahiriya                   
##  [9] Micronesia (Federated States of)         
## [10] Republic of Korea                        
## [11] Republic of Moldova                      
## [12] Russian Federation                       
## [13] Saint Kitts and Nevis                    
## [14] Saint Vincent and the Grenadines         
## [15] Serbia and Montenegro                    
## [16] Syrian Arab Republic                     
## [17] The former Yugoslav Republic of Macedonia
## [18] Trinidad and Tobago                      
## [19] Tuvalu                                   
## [20] United Kingdom                           
## [21] United Republic of Tanzania              
## [22] United States of America                 
## [23] Venezuela (Bolivarian Republic of)       
## [24] Viet Nam                                 
## 192 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe

As expected, there are 24 regions from the WHO sheet that aren't in the mapdata. Even though there's probably a more elegant solution, I will change them manually. It's work that has to be done only once, and it's probably only necessary for the big countries, so this is bearable.

DTclean[region == "Brunei Darussalam", region := "Brunei"]
DTclean[region == "Congo", region := "Republic of Congo"]
DTclean[region == "Democratic People's Republic of Korea", region := "North Korea"]
DTclean[region == "Iran (Islamic Republic of)", region := "Iran"]
DTclean[region == "Côte d'Ivoire", region := "Ivory Coast"]
DTclean[region == "Lao People's Democratic Republic", region := "Laos"]
DTclean[region == "Libyan Arab Jamahiriya", region := "Libya"]
DTclean[region == "The former Yugoslav Republic of Macedonia", region := "Macedonia"]
DTclean[region == "Micronesia (Federated States of)", region := "Micronesia"]
DTclean[region == "Republic of Moldova", region := "Moldova"]
DTclean[region == "Republic of Korea", region := "South Korea"]
DTclean[region == "Russian Federation", region := "Russia"]
DTclean[region == "Serbia and Montenegro", region := "Serbia"]
DTclean[region == "Syrian Arab Republic", region := "Syria"]
DTclean[region == "United Republic of Tanzania", region := "Tanzania"]
DTclean[region == "United Kingdom", region := "UK"]
DTclean[region == "United States of America", region := "USA"]
DTclean[region == "Venezuela (Bolivarian Republic of)", region := "Venezuela"]
DTclean[region == "Viet Nam", region := "Vietnam"]

And yeah, of course the work isn't completely done yet. We should also check whether there are regions in the mapdata that aren't in the WHO data-set. This could be due to various reasons, one being that a region isn't a member of the WHO and therefore the WHO doesn't publish data on it.
Or, more likely, that a country from the WHO data-set spans more than one region on the map, Serbia and Montenegro being such a case.
However, I'm lazy now and won't do this today. How about you doing it and writing me a comment? 😛 Let it be a team effort.
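To get you started, the check itself is just the mirror image of the one above; the renaming would be the actual work:

mapdata[!region %in% DTclean$region, unique(region)]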

Making the map plots

OK, before we do the actual plotting, let's first calculate what percentage of all deaths in each country is caused by cancer. I do this by joining the data.table with itself.
On a side note: W000 is the WHO code for all death causes combined and W060 the one for malignant neoplasms, which is a more formal name for cancer.

Then we need to join the data.table with the map on the region name.

DTcaused <- DTclean[causes == "W000"][DTclean[causes == "W060"], on = "region"][, .(region, percentageCaused = i.deathRate / deathRate)]

deathrateMap <- mapdata[DTcaused, on = "region", allow.cartesian=TRUE, nomatch = 0]

And finally we can do our plot. For this purpose we first plot all regions in grey, and as an overlay we fill the countries we have data on with a color between grey and red, depending on how frequent cancer is as a cause of death.

g <- ggplot() + geom_polygon(data = mapdata,  aes(long, lat, group = group), fill = "grey")  
g <- g +  geom_polygon(data = deathrateMap,  aes(long, lat, group = group, fill = percentageCaused)) 
g <- g + scale_fill_gradient(low = "grey", high = "red", aesthetics = "fill", name = "Percentage of\ndeaths caused\nby cancer") 
g + ggthemes::theme_map()
World map showing, on a scale from grey to red, the percentage of deaths cancer is responsible for in different countries. According to it, the USA, Canada, Australia and Europe have the highest share of deaths due to cancer, at up to 30 percent.
World map showing on a scale from grey to red the percentage of deaths cancer is responsible for in different countries

And of course there's one thing about this plot that could be misleading. Given that regions with missing data and regions with a very low share of cancer deaths will both be grey, you hopefully see the potential problem here?
It's not necessarily wrong or bad to do it this way. But I hope you recognize how someone could make a plot like this to mislead their audience. That's why I recommend, when looking at plots, to think not only about what is shown, but also about what isn't shown, since no large data-set is complete. So ask the person who presents it to you how they handled missing data points.

So what does this map actually say? From my perspective nothing surprising. At the time this data set was collected, cancer was (and probably still is) mostly a problem of industrialized countries, and it doesn't seem to be connected to geography primarily (can you see how Israel, Japan and South Korea pop up?).
Although the difference between the USA and Canada could be something interesting.
But this map, in my opinion, shows very clearly that cancer is one of the leading causes of death in the developed world, which is also the reason why we spend so much money on researching it.

However, the main purpose of this post was to show you how to make such plots, not to discuss the reasons for different causes of mortality.
Ultimately I hope that this post has helped you.

Cite/mention your sources

Of course it is important that you mention your sources (cite them if you write a paper). This is because your approach has to be reproducible and you have to give the people who did the preliminary work credit for it.

In R you can get the proper citations for the packages you used in the following way:

citation("ggmap")
## 
## To cite ggmap in publications, please use:
## 
##   D. Kahle and H. Wickham. ggmap: Spatial Visualization with
##   ggplot2. The R Journal, 5(1), 144-161. URL
##   http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     author = {David Kahle and Hadley Wickham},
##     title = {ggmap: Spatial Visualization with ggplot2},
##     journal = {The R Journal},
##     year = {2013},
##     volume = {5},
##     number = {1},
##     pages = {144--161},
##     url = {https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf},
##   }
citation("maps")
## 
## To cite package 'maps' in publications use:
## 
##   Original S code by Richard A. Becker, Allan R. Wilks. R version
##   by Ray Brownrigg. Enhancements by Thomas P Minka and Alex
##   Deckmyn. (2018). maps: Draw Geographical Maps. R package version
##   3.3.0. https://CRAN.R-project.org/package=maps
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {maps: Draw Geographical Maps},
##     author = {Original S code by Richard A. Becker and Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn.},
##     year = {2018},
##     note = {R package version 3.3.0},
##     url = {https://CRAN.R-project.org/package=maps},
##   }
## 
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

You get the idea. Also cite the other packages if you use them in your publication or thesis.
The output is in BibTeX format, so I hope you know what to do with it. 😛

Of course you have to cite the data on the global burden of disease as well. So here is the formatted citation for it:

WHO. (2004). The global burden of disease: 2004 update: causes of death. 2004 Update, 8–26.

And last, but not least, please also mention me. This, however, is not a necessity, but a sign of respect towards my work. After all, respect is an important thing, unfortunately not given often enough in our society.


Network-Based Integration Through Heat Diffusion

Re-Implementing netICS Heat Diffusion In R

OK, I have successfully procrastinated this blog entry about heat diffusion for too long now, even though the code for it has been waiting for over a month. :'D So it's time for it now, I promised it! And I guess it doesn't need to be perfect, because the whole idea of this blog is to put things out now rather than waiting for some future moment where I can do them perfectly.

So that being said, let’s start. 🙂

In my last post I already talked about the integration of multi-omics data. This time I will show you one method for it called netICS, from a more or less pure coding perspective. It uses heat diffusion for finding so-called mediator genes in multi-omics data.
You can find the original code written in Matlab here and the corresponding paper here, which I would suggest reading if I have aroused your interest in the topic.

Thus let’s just dive straight into the matter!

Loading Data The Cool Way

Because we are cool kids, we don't download the data "manually", but do it right away in R.
And of course we do this with the super fast and comfy fread function from the data.table package, which I would suggest you learn when working with R.

We load the data directly from the Github of the original netICS:

library(data.table)

mutationData <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/mutation_data_breast.txt", 
                      col.names = c("Gene", "Sample"), header = FALSE)
networkGenes <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/network_genes.txt", 
                      header = FALSE, col.names = "Gene")

rnaDiffExp <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/RNA_diff_expr_breast.txt", 
                    header = FALSE, col.names = c("Gene", "pval"))
rppa <- fread("https://raw.githubusercontent.com/cbg-ethz/netics/master/data/protein_diff_expr_breast.txt", 
              header = FALSE, col.names = c("Gene", "pval"))

The network used in the method is stored as a Matlab object. But of course there's also a package that does this job for us! After loading it we have to extract the adjacency matrix, because we will need it later.

library(R.matlab)
tmp<-readMat("https://github.com/cbg-ethz/netics/blob/master/data/adj_lar_com.mat?raw=true")
adjacencyMatrix<-tmp$adj.lar.com
rm(tmp)

And that already was it with loading the data!

Combining p-values

In my last post I already talked about combining p-values. We need to implement this to combine the p-values from rnaDiffExp and rppa.
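As a reminder, the combination is done with Fisher's method: under the null hypothesis the test statistic

\(
X = -2 \left( \ln(p_{1}) + \ln(p_{2}) \right)
\)

follows a chi-square distribution with 4 degrees of freedom (two per combined p-value), which is where the pchisq call with df = 4 below comes from.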

Maybe you will now also see why I like data.tables so much. I've commented the code for you.

combineDifferentialExpressions<-function(diffExp1, diffExp2){
    # Combine both data.tables through an outer join
    mergedDiff<-merge(diffExp1, diffExp2, by ="Gene", all=TRUE)
    # if the first p-value is missing take the second one
    mergedDiff[is.na(pval.x) & !is.na(pval.y), pval := pval.y]
    # if the second p-value is missing take the first one
    mergedDiff[!is.na(pval.x) & is.na(pval.y), pval := pval.x]
    # if both are present combine them by Fisher's method and perform a chi-square test
    mergedDiff[!is.na(pval.x) & !is.na(pval.y), pval := pchisq(-2 * (log(pval.x) + log(pval.y)) , 4)]
    return(mergedDiff[,.(Gene, pval)])
}

I really love the syntax of data.table in R, I have to say. 😀
Now let's execute it, adjust the p-values with the FDR method afterwards and then take a look at the result:

diffExp <- combineDifferentialExpressions(rnaDiffExp, rppa)
diffExp[, p.adjusted:=p.adjust(pval, method = "fdr")]
knitr::kable(diffExp[1:5])
| Gene  |      pval | p.adjusted |
|:------|----------:|-----------:|
| A1BG  | 0.0000000 |  0.0000000 |
| A1CF  | 0.8851107 |  0.9505269 |
| A2BP1 | 0.2221647 |  0.2792292 |
| A2LD1 | 0.9103269 |  0.9695794 |
| A2M   | 0.0000000 |  0.0000000 |

Heat Diffusion

I won't go into detail about heat diffusion and its theoretical background at this place. Let me just briefly summarize the idea, or rather the goal. We first transform the adjacency matrix into a transition matrix, which you might be familiar with from Markov chains.
Basically, each cell in the matrix stands for a transition between two genes (in our case).

Here R can also shine with its natural Linear Algebra capabilities.

normaliseAdjacencyMatrix <- function(adjacencyMatrix){
    return(adjacencyMatrix %*% diag(1/colSums(adjacencyMatrix)))
}

You see… Isn’t that basically a beautiful one-liner?

In the next step I implement the transformation of this matrix into a diffusion matrix. This matrix basically represents the connectivity and the network topology.
Multiplying a vector of starting probabilities with it gives us back a stationary distribution, which basically means: after many transitions in the network we will end up at a certain set of vertices with a certain probability.
The restart probability indicates how likely it is that in a step we will be teleported back to our starting point instead of making a transition.
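Written out, the matrix we are about to compute is (with \(r\) being the restart probability and \(W\) the normalized adjacency matrix from above):

\(
F = r \cdot \left( I - (1 - r) \cdot W \right)^{-1}
\)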

So let’s implement it:

performInsulatedHeatDiffusion <- function(adjacencyMatrix, restartProbability){
    temperature <- diag(dim(adjacencyMatrix)[1]) - (1 - restartProbability) * adjacencyMatrix
    return(restartProbability * solve(temperature))
}

We will run through this network in both directions. Basically, one time from the direction of mutated genes and the other time from the differentially expressed ones.
So let’s wrap this up in another function. We set the default restart probability to 0.4.

netICS <- function(adjacencyMatrix, networkGenes, mutationData, diffExpGenes,
                   restartProbability = 0.4){

    # only keep mutated and differentially expressed genes, that are present in the network
    mutationInNetwork <- mutationData[networkGenes, on="Gene", nomatch = 0]
    diffExpInNetwork <- diffExpGenes[networkGenes, on="Gene", nomatch = 0]

    # calculating the regular diffusion matrix
    connectivity <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(adjacencyMatrix),
                                                  restartProbability)
    # calculating the diffusion matrix in the opposite direction
    connectivityBackward <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(t(adjacencyMatrix)),
                                                          restartProbability)

    # ranking mediator genes
    result <- prioritize(connectivity, connectivityBackward, networkGenes,
                           mutationData, mutationInNetwork, diffExpInNetwork)

    return(result)

}

At this point you might have already noticed that we haven’t done anything yet with the diffusion matrix. For this reason we will implement the prioritize function, which will do this for us.

Gene Prioritization

First we implement a helper function for the diffusion to avoid redundant code:

diffuseSample<-function(connectivity, networkGenes, S){
    # find the positions of the input genes in the matrix
    positions <- networkGenes$Gene %in% S$Gene
    #multiply them with equal probability with the diffusion matrix
    weights <- rep(1/sum(positions), sum(positions)) %*% connectivity[positions,]
    return(as.vector(weights))
}

S here is the set of genes with which we multiply the diffusion matrix. The starting probability is uniformly distributed over them.

Afterwards we write our actual prioritization function. The diffusion is executed only once for the differentially expressed genes, but sample-wise for the mutated genes.

Subsequently, we combine both probabilities sample-wise by multiplying them and then rank the results.
Those ranks are then combined by three different methods: median, sum and rho scores.
For the last one the RobustRankAggreg package is used:

prioritize<-function(connectivity, connectivityBackward, networkGenes,
                         mutationData, mutationInNetwork, diffExpInNetwork){

    # diffuse for differential expression, if differential expression isn't sample-wise
    if(!"Sample"  %in% colnames(diffExpInNetwork)){
        Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork)
    }

    scores <- data.table()

    for(sample in unique(mutationData$Sample)){
        # diffuse for mutated genes
        Em<-diffuseSample(connectivity, networkGenes, mutationInNetwork[Sample == sample])

        # diffuse for differential expression, if differential expression is sample-wise
        if("Sample"  %in% colnames(diffExpInNetwork)){
            Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork[Sample == sample])
        }

        # Multiply scores
        E <- Em * Ed
        scores <- rbind(scores, data.table(Gene = networkGenes$Gene, Sample = sample, Score = E))
    }

    # rank scores
    scores[, rank.in.sample:=rank(-Score), by=.(Sample)]

    # combine ranks
    library(RobustRankAggreg)
    ranks<-scores[, .(med=median(rank.in.sample), sum=sum(rank.in.sample),
                      rho = rhoScores(rank.in.sample/max(rank.in.sample))),
                  by=.(Gene)]

    return(ranks)
}

Phew… The work is done! Now it's time to execute the code. If you do this at home you might notice that it takes a while until it's finished. That's because calculating the diffusion matrix is very expensive when it comes to processing time: your computer has to solve a system of linear equations, which doesn't scale linearly. But don't ask me in which O it exactly lies. 🙂

result<-netICS(adjacencyMatrix = adjacencyMatrix, networkGenes,
               mutationData = mutationData, diffExpGenes = diffExp[p.adjusted < 0.05, .(Gene)])
knitr::kable(result[order(sum)][1:10])
| Gene     |  med |     sum |   rho |
|:---------|-----:|--------:|------:|
| RPS6KB1  |  968 | 74489.5 | 0e+00 |
| AKT3     |  770 | 77330.5 | 1e-07 |
| AKT1     |  944 | 78520.5 | 0e+00 |
| SERPINE1 |  818 | 81454.5 | 0e+00 |
| TP53     | 1026 | 84433.5 | 1e-07 |
| MAP2K2   |  843 | 86530.5 | 1e-07 |
| CDKN2D   |  758 | 86534.5 | 0e+00 |
| CREB1    | 1080 | 87179.5 | 1e-07 |
| VEGFA    |  987 | 87678.5 | 0e+00 |
| E2F1     |  885 | 88245.5 | 0e+00 |

I will probably use this code again at some point in the future, as network theory is one of my areas of interest and it's pretty nice to have it lying around.
This method, heat diffusion, is by the way very similar to PageRank… The stuff that made Google big. 🙂

Some Stuff About The Near Future

Next week I will probably do a post about making map plots, as I have to make one anyway and I figured it would be nice to document it here; it could be quite useful to some of you.
And in the first week of April I will be at a hackathon (my very first!) about metabolomics in R. I will keep you posted about that and I'm looking very much forward to it.

So see you hopefully soon!

Availability Of My Code

You can access a maintained version of the code in my GitHub repository Raspository under R/NetICS.R.


Alternating Least Squares

How To Implement Alternating Least Squares In R And How Not To Do It

So it's time for my first actual content. And as predicted in my Hello World post, it will be something implemented in the R programming language. More precisely, it's a machine learning algorithm called Alternating Least Squares. But first, before we indulge ourselves in code, let me tell you why this algorithm is of interest to me and what it does.

Introduction

I've been working for a few months now, as part of my research assistant job, on the Bioconductor package BEclear (10.1371/journal.pone.0159921). I won't go into detail about the package; you only need to know that it uses something called a latent factor model to impute missing data in data-sets.

Let's say you have a matrix D containing missing values, where the rows stand for features and the columns for samples. Then you could assume that each entry \(D_{ij}\) is modeled by both feature- and sample-specific effects in the following way:

\(D_{ij} = L_{i}^{T} \times R_{j}\)

Here \(L_{i}\) is the feature-specific matrix and \(R_{j}\) the sample-specific one. For the imputation of missing values you now try to estimate those two matrices, the latent factors, from the existing values. Methods based on this assumption are already applied in a variety of fields: with batch effects in DNA methylation in the case of the BEclear package, or in recommender systems, e.g. for Netflix, as described in a paper (10.1109/MC.2009.263) which helped me a lot in understanding this topic.

To estimate the latent factors there are different methods. One of them, implemented in the BEclear package, is gradient descent. Another method is Alternating Least Squares (ALS), which I wanted to implement on my own.

The lecture notes of Hastie et al. (1410.2596) served as my source for implementing this method. I highly recommend reading both the paper by Koren et al. and those lecture notes if you want to know more about the theoretical background and the applications of those methods. You just need to know some linear algebra, then they should be easy enough to understand in my opinion.

But as a short summary: ALS tries to estimate the feature and sample matrices by alternately fixing one of them and then calculating the other one by solving the corresponding system of equations, and you repeat this in general until convergence. In gradient descent, on the other hand, both matrices are estimated at the same time.
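For the regularized version I implement further below, this means: with \(R\) held fixed, each row \(L_{i}\) is the solution of a small ridge regression over the revealed entries \(\Omega_{i}\) of row \(i\) (and symmetrically for the columns of \(R\)):

\(
L_{i} = \left( R_{\Omega_{i}} R_{\Omega_{i}}^{T} + \lambda I \right)^{-1} R_{\Omega_{i}} D_{i, \Omega_{i}}^{T}
\)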

How Not To Implement Alternating Least Squares

As a start I will show you my first faulty try in implementing the Alternating Least Squares. Maybe you will learn something by me sharing it with you. And you should try to guess what my mistake was.

As my implementation reuses a function from BEclear, you have to install this package first. For this purpose I guess it's easiest if you just install it from GitHub via the following lines of code:

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("David-J-R/BEclear")

And now let’s come to the first implementation of the Alternating Least Squares algorithm. See if you can find its problem. I tried to comment all those steps so that the code should be comprehensible. But if not, please let me know! 🙂

imputeALSfaulty<- function(data, lambda = 0.75, r = 10, iters = 20){

    # We require the BEclear package, because we're using its loss function
    require(BEclear)

    D <- data
    D[is.na(D)] <- 0

    # We initialise L and R with random values
    L <- matrix(rnorm(nrow(data) * r), nrow(data), r) / sqrt(r)
    R <- matrix(rnorm(r * ncol(data)), r, ncol(data)) / sqrt(r)

    currLoss <- BEclear:::loss(L,R, 1, data)$loss

    print(currLoss)

    for(i in 1:iters){
        # L and R are determined by solving the following systems of equations
        # We repeat this step iters times
        L <- t(solve(R %*% t(R) + diag(lambda,r), R %*% t(D)))
        R <- solve(t(L) %*% L + diag(lambda,r), t(L) %*% D)

        currLoss <- BEclear:::loss(L,R, 1, data)$loss

        print(currLoss)
    }

    # L and R are multiplied to get the estimated values
    D <- L %*% R

    # Missing values are replaced with estimated value
    for (i in seq_len(nrow(data)))
        for (j in seq_len(ncol(data)))
        {
            if (is.na(data[i, j])) {
                data[i, j] <- D[i, j]
            }
        }

    return(data)
}

Now let me show you the problem with this implementation, if you haven’t recognized it on your own by now.
First we will load the example data and functions from the BEclear package to generate ourselves a sample data-set with missing values:

library(BEclear)
data("BEclearData")
batchEffect <- calcBatchEffects(
    data = ex.data, samples = ex.samples,
    adjusted = TRUE, method = "fdr")

mdifs <- batchEffect$med
pvals <- batchEffect$pval

summary <- calcSummary(mdifs, pvals)
cleared.data <- clearBEgenes(ex.data, ex.samples, summary)

And then we run the faulty version of ALS on it. The printed output of the function is the loss of the current solution during each iteration.

result <- imputeALSfaulty(cleared.data, iters = 10)
## [1] 2586.68
## [1] 101.8086
## [1] 95.60281
## [1] 95.29458
## [1] 95.21404
## [1] 95.20139
## [1] 95.20632
## [1] 95.21329
## [1] 95.21869
## [1] 95.22233
## [1] 95.2247

If we now take a look at the imputed values, you can see what’s wrong:

boxplot(result[is.na(cleared.data)])
A boxplot with a median at about 0.02 and the upper quartile at about 0.07

They're all pretty close to zero. That's because we set the missing values to zero: this way the solve method "tries" to generate R and L such that the estimates at the missing positions are also very close to zero.
Of course we don't want that… If that were all we needed, we could just set the missing values to zero right away.

How To Implement Alternating Least Squares

Finally let me show you an implementation that actually does what it should do. And again, if something is unclear, don't hesitate to ask me!

imputeALScorrect<- function(data, lambda = 0.75, r = 10, iters = 80){

    # We require the BEclear package, because we're using its loss function
    require(BEclear)

    # We initialise L and R with random values
    L <- matrix(rnorm(nrow(data) * r), nrow(data), r) / sqrt(r)
    R <- matrix(rnorm(r * ncol(data)), r, ncol(data)) / sqrt(r)

    currLoss <- BEclear:::loss(L,R, 1, data)$loss
    print(currLoss)

    for(iter in 1:iters){

        # Now we iterate over the feature dimension of L
        for(i in 1:dim(L)[[1]]){
            # We determine the revealed entries for the feature
            # And subset the data and R so to only retain the revealed entries
            revealedEntries <- !is.na(data[i,])
            y <- as.matrix(data[i, revealedEntries])
            x <- R[,revealedEntries]
            # We solve the linear equation for the feature
            L[i,] <- as.vector(solve(x %*% t(x) + diag(lambda, r), x %*% y))
        }

        # We iterate over the sample dimension of R
        for(j in 1:dim(R)[[2]]){
            # We determine the revealed entries for the sample
            # And subset the data and L so to only retain the revealed entries
            revealedEntries <- !is.na(data[,j])
            y <- as.matrix(data[revealedEntries, j])
            x <- L[revealedEntries,]
            # We solve the linear equation for the sample
            R[,j] <- as.vector(solve(t(x) %*% x + diag(lambda, r), t(x) %*% y))
        }
        currLoss <- BEclear:::loss(L,R, 1, data)$loss

        print(currLoss)
    }

    # L and R are multiplied to get the estimated values
    D <- L %*% R

    # Missing values are replaced with estimated value

    for (i in seq_len(nrow(data)))
        for (j in seq_len(ncol(data)))
        {
            if (is.na(data[i, j])) {
                data[i, j] <- D[i, j]
            }
        }

    return(data)
}

A further advantage of this implementation is that it is relatively easy to write a parallelised version of it. Maybe I will show you that in one of my next posts; after overhearing a conversation at the university that R is apparently bad for this, I feel almost challenged to do so.
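Just to sketch the idea (untested here, and on Windows you would have to use parLapply or mc.cores = 1 instead), the loop over the features inside each iteration could be replaced by something like the following with the parallel package:

library(parallel)
# update all rows of L in parallel, one ridge regression per feature
rowsOfL <- mclapply(seq_len(nrow(data)), function(i){
    revealedEntries <- !is.na(data[i, ])
    y <- as.matrix(data[i, revealedEntries])
    x <- R[, revealedEntries]
    as.vector(solve(x %*% t(x) + diag(lambda, r), x %*% y))
}, mc.cores = 2)
L <- do.call(rbind, rowsOfL)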

Now let's take a look at the imputed values. We just take the sample data-set from before for this purpose.

result <- imputeALScorrect(cleared.data, iters = 10)
## [1] 2571.072
## [1] 109.301
## [1] 99.38027
## [1] 97.17519
## [1] 95.42625
## [1] 94.00547
## [1] 92.83838
## [1] 91.87368
## [1] 91.07338
## [1] 90.40794
## [1] 89.85372
boxplot(result[is.na(cleared.data)])
A boxplot with a median at about 0.4 and the upper quartile at about 0.7

Now that looks more like real data… Doesn't it? But to be sure, let's compare it to the values predicted by the BEclear package. For the comparison we calculate the root mean squared error:

library(Metrics)
result.BEclear <- imputeMissingData(cleared.data)
## INFO [2019-02-08 12:17:10] Starting the imputation of missing values.
## INFO [2019-02-08 12:17:10] This might take a while.
## INFO [2019-02-08 12:17:10] BEclear imputation is started:
## INFO [2019-02-08 12:17:10] block size: 60  x  60
## INFO [2019-02-08 12:17:10] Impute missing data for block 1 of 4
## INFO [2019-02-08 12:17:10] Impute missing data for block 2 of 4
## INFO [2019-02-08 12:17:11] Impute missing data for block 3 of 4
## INFO [2019-02-08 12:17:11] Impute missing data for block 4 of 4
rmse(result.BEclear[is.na(cleared.data)], result[is.na(cleared.data)])
## [1] 0.03196931

Well, the difference isn't that big. But of course an elaborate evaluation would be needed to assess the accuracy of the method. However, for something I coded just for fun, I'm satisfied with this first look.

Addendum: Biases

Just for fun, let's also add biases to our model, as described by Koren et al., to further improve our algorithm.

The idea behind the biases is to capture the variability of the data that arises from the features or samples alone, while the two matrices L and R capture the variability that arises from the interaction of features and samples together.
In other words, by introducing the biases we unburden L and R a bit.
I use a simple bias model, where the bias for each entry in the data-set is the sum of the overall average over all values and the average deviations from this average of the corresponding row and column. And to save valuable computation time we just subtract this bias from each value in a copy of the data and use this transformed matrix for the further calculations. Of course we have to add the bias back again later.
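In formulas, the bias I subtract from each entry is

\(
b_{ij} = \mu + b_{i} + b_{j}
\)

where \(\mu\) is the mean over all observed values, \(b_{i}\) the average deviation of row \(i\) from \(\mu\) and \(b_{j}\) the one of column \(j\). The latent factors are then fitted on \(D_{ij} - b_{ij}\) and the bias is added back in the end.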

And here we go with the improved implementation:

imputeALSBias<- function(data, lambda = 0.75, r = 5, iters = 10, use.biases=TRUE){

    # We require the BEclear package, because we're using its loss function
    require(BEclear)

    # copy the data
    D <- data

    # We initialise L and R with random values
    L <- matrix(rnorm(nrow(data) * r), nrow(data), r) / sqrt(r)
    R <- matrix(rnorm(r * ncol(data)), r, ncol(data)) / sqrt(r)

    currLoss <- BEclear:::loss(L,R, 1, D)$loss
    print(currLoss)

    if(use.biases){
        # we calculate the biases
        biasData<-mean(data, na.rm = TRUE)
        biasRows<-rowMeans(data - biasData, na.rm= TRUE)
        biasCols<-colMeans(data - biasData, na.rm= TRUE)

        # subtract the biases from the data
        D <- D - biasData - biasRows
        D <- t(t(D) - biasCols)
    }

    for(iter in 1:iters){

        # Now we iterate over the feature dimension of L
        for(i in 1:dim(L)[[1]]){
            # We determine the revealed entries for the feature
            # And subset the data and R so to only retain the revealed entries
            revealedEntries <- !is.na(D[i,])
            y <- as.matrix(D[i, revealedEntries])
            x <- R[,revealedEntries]
            # We solve the linear equation for the feature
            L[i,] <- as.vector(solve(x %*% t(x) + diag(lambda, r), x %*% y))
        }

        # We iterate over the sample dimension of R
        for(j in 1:dim(R)[[2]]){
            # We determine the revealed entries for the sample
            # And subset the data and L so to only retain the revealed entries
            revealedEntries <- !is.na(D[,j])
            y <- as.matrix(D[revealedEntries, j])
            x <- L[revealedEntries,]
            # We solve the linear equation for the sample
            R[,j] <- as.vector(solve(t(x) %*% x + diag(lambda, r), t(x) %*% y))
        }
        currLoss <- BEclear:::loss(L,R, 1, D)$loss

        print(currLoss)
    }

    # L and R are multiplied to get the estimated values
    D <- L %*% R

    if(use.biases){
        # we add the biases again
        D <- t(t(D) + biasCols)
        D <- D + biasData + biasRows
    }

    # Missing values are replaced with estimated value

    for (i in seq_len(nrow(data)))
        for (j in seq_len(ncol(data)))
        {
            if (is.na(data[i, j])) {
                data[i, j] <- D[i, j]
            }
        }

    return(data)
}

Testing this implementation, if you wish, is now your turn! 🙂
Maybe at some later point I will compare the performance and correctness of the various different settings of these functions. But for now that's enough. Of course there are more sophisticated bias models we could think of, for example biases that are also determined by the Alternating Least Squares method itself during each iteration.
So we won't run out of things to do any time soon.

Conclusion

Yeah, what's the conclusion? I think it's quite simple… Don't be lazy while coding!

OK, OK… I will say a bit more. I think what you can learn from the faulty example is that you should always think about what your code is actually doing and take a look at the results to see if something is fishy. Other than that, I hope that you learned something and that I could show you that some methods used in machine learning aren't that complicated.

For now my implementation of ALS is still worse than the gradient descent implemented in the BEclear package when it comes to run time. But I also spent a lot of time optimizing the latter, and maybe I will show you in a future post how to optimize it. As this is my first blog post, I would highly welcome feedback from you! 🙂

So have a nice day and until next time!

Availability Of The Code

You can access a maintained version of the code of the correct version in my GitHub repository Raspository under R/imputeALS.R.

