# Tag Archives: Rstats

Rstats is a common hastag for the programming language called R. The stats comes from the part, that the language is used for statistical computing. But in my opinion the language can also find application to other areas.
I like to use it for its convinience of installing packages and seamlessly integrating C++ and other languages into it. Furthermore the language provides you with some useful features for ensuring reproducability. This is especially useful for data science, but also other areas of science.

As you guys might know, I’m studying Bioinformatics. For Bioinformaticians Bioconductor is a very useful project. For it contains various tools for this area of research written in R. And I’m even the maintainer of one of the packages there.

So let me show you all the nice things, you can do in R!

# Or Let’s Make Some Noise!!

As I wrote in one of my last posts there will be some content about image processing and computer vision. And following my second lecture of it, here we go! Fortunately R has some capabilities of handling pictures, of which I will take use.

For now the topic will just be, how to add noise to images, which can be useful to you, e.g. if you want to prove the performance of an algorithm for handling noise. So we’re simulating noise. Natural noise in images can come from different sources… The most common of it would be due to high ISO values or/and long exposure times. So in general in low light conditions.
However I’ll only work with black and white pictures for today. Maybe color comes at some point in the future. 🙂

First I’ll show you how to load pictures into R and how to print them again. It’s quite easy tbh:

library(jpeg)
class(goose)

## [1] "array"

dim(goose)

## [1] 600 897   3


As the picture is black and white we only need one dimension of it. And we can just plot it by transforming it into a raster object:

gooseBW <- goose[,,1]
plot(as.raster(gooseBW))


Cute this goose, isn’t it? 🙂 Now let’s destroy this picture with noise! Nyehehehehe1.!

To begin with I will add some uniform distributed noise. As the values in the matrix are between 0 and 1, I will add values between -1 and 1

set.seed(1)
gooseUniformNoise <- gooseBW + runif(length(gooseBW), min = -1, max = 1)


But we can’t plot this right, as the matrix now also contains values above 1 and below 0. I’ll fix this by cropping the boundaries and then plot.

gooseUniformNoise[gooseUniformNoise > 1] <- 1
gooseUniformNoise[gooseUniformNoise < 0] <- 0
plot(as.raster(gooseUniformNoise))


Interesting how you can still make out the goose in spite of the noise? Now let’s do the same with normal distributed noise. I’ll take the standard deviation of the black and white picture.

gooseNormalNoise <- gooseBW + rnorm(length(gooseBW), sd = sd(gooseBW))
gooseNormalNoise[gooseNormalNoise > 1] <- 1
gooseNormalNoise[gooseNormalNoise < 0] <- 0
plot(as.raster(gooseNormalNoise))


In the lecture we also did multiplicative and impulse noise. However I won’t do them in this post, maybe another time.
Anyway, do you already see, why I like R so much? 🙂 If you do Linear Algebra and Statistics stuff it’s just super comfy!
I’ll show you probably even more reasons why, when I demonstrate you, how to denoise picture again. But that’s it for today.

Have a nice weekend!

# A practical example

Like promised in another post I will show you how to do a map plots with R. For this purpose I will use the ggmap package, which makes this a relatively easy task.
But before I begin with the actual code, let me give you a short motivation

## Why to use map plots

Motivations for using map plots can be various. For example if you’re a journalist and let’s say you want to visualize a kind of events (like say earthquakes) in a regional context, this is a very demonstrative way of visualizing your data.
Or if you want to present some kind of data about places or countries map plots are always a good option.

The first time I did a map plot was actually part of an awesome lecture I had back in Munich at the TUM. Afterwards I got the chance to use this skill right away in the next semester for my Bachelor’s thesis.
As some of you might know the area, where I applied the algorithm which I improved and implemented for my thesis, was mental disorders. During the writing process of my thesis, I found it a good starting point in my thesis and the accompanying presentation to emphasize the prevalence of mental disorders in the world. In order to do so I used a map plot.
That’s basically also what I will do now, but this time with the case of cancer.

But on a side note I’m not saying that you should do those kinds of plots for each thesis or presentation regarding a disease topic. It’s just one possible starting point for it and not necessarily the best.
So please don’t just mindlessly copy, what I’m doing here. 🙂

## Getting the data for dissease related map plots

First let’s load all the packages we will need for this little exercise.

library(XLConnect)
library(data.table)
library(ggplot2)
library(ggthemes)
library(maps)

XLConnect is a package for loading excel sheet, which we will need.
That I like to use data.table you probably already noticed. It’s just super fast and comfy for some procedures and it has some nice synergies with ggplot2.
The maps package contains as the name suggests map data, which can be used to plot. Alternatively one could also use ggmap.
And ggthemes contains a neat theme for maps, which I will use.

First let’s load our world map. This data.table contains region names and the boundaries of those regions as longitudes and latitudes. ggplot can plot those as polygons.

mapdata <- data.table(map_data("world"))
knitr::kable(mapdata[1:5])
longlatgrouporderregionsubregion
-69.8991212.4520011ArubaNA
-69.8957112.4230012ArubaNA
-69.9421912.4385313ArubaNA
-70.0041512.5004914ArubaNA
-70.0661212.5469715ArubaNA

OK, done. Now we need to download the data on 2004’s mortality from the WHO.

download.file("www.who.int/entity/healthinfo/global_burden_disease/gbddeathdalycountryestimates2004.xls", "gbd.xls")
tmp <- readWorksheetFromFile(file = "gbd.xls", sheet = "Deaths 2004")

causes <- tmp$Col1[14:143] countries <- unname(tmp[6,7:198]) deathRates <- tmp[14:143,7:198] You should probably take a look at the Excel file yourself to understand it and what I’m doing later. The file is made for humans to look at it and not directly for machines to read it. Which is why we have to do some cleaning and transforming. In my experience as a Bioinformatics students this is something you have to do almost always. Even if you have a machine readable format, there’s no perfect data-set. You will always have some missing data or have to transform your data in some way. And this isn’t necessarily a trivial step. Often you will spend a lot of time here. And that’s good. If cleaning data was trivial, then we wouldn’t need data scientist. ## Cleaning data To begin with we have to transform the death rates to numeric values… Because they’re characters (strings) right now. For this purpose we have to also replace the separating comma at the thousand position. You see? What’s done to make the data more human readable, makes it less machine readable. That’s often the case. Then we set the column names to the countries and transform the matrix together with the vector of causes to a data.table. deathRatesNum <- matrix(as.numeric(gsub(",", "", as.matrix(deathRates))), nrow = dim(deathRates)[1]) ## Warning in matrix(as.numeric(gsub(",", "", as.matrix(deathRates))), nrow = ## dim(deathRates)[1]): NAs introduced by coercion  colnames(deathRatesNum) <- countries DT <- data.table(causes = causes, deathRatesNum) Now we want a clean or also called long data-set. In this new data set we will have only three columns. Two variables (causes and region), which uniquely identify the value death rate. Similar to a database we can also set those variable columns as keys, which makes it very fast searchable. DTclean <- melt(DT, id.vars = "causes", variable.name = "region", value.name = "deathRate") setkey(DTclean, causes, region) Next let us see, if we have some regions in our data.table that aren’t in our map. DTclean[!region %in% mapdata$region, unique(region)]
##  [1] Antigua and Barbuda
##  [2] Brunei Darussalam
##  [3] Congo
##  [4] Côte d'Ivoire
##  [5] Democratic People's Republic of Korea
##  [6] Iran (Islamic Republic of)
##  [7] Lao People's Democratic Republic
##  [8] Libyan Arab Jamahiriya
##  [9] Micronesia (Federated States of)
## [10] Republic of Korea
## [11] Republic of Moldova
## [12] Russian Federation
## [13] Saint Kitts and Nevis
## [14] Saint Vincent and the Grenadines
## [15] Serbia and Montenegro
## [16] Syrian Arab Republic
## [17] The former Yugoslav Republic of Macedonia
## [19] Tuvalu
## [20] United Kingdom
## [21] United Republic of Tanzania
## [22] United States of America
## [23] Venezuela (Bolivarian Republic of)
## [24] Viet Nam
## 192 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe

As expected, there are 24 regions from the WHO sheet not in the mapdata. Even though there’s probably a more elegant solution, I will change them manually. It’s a work that has to be done once. For this purpose it’s probably only necessary to fill it in for the big countries. So this is bearable.

DTclean[region == "Brunei Darussalam", region := "Brunei"]
DTclean[region == "Congo", region := "Republic of Congo"]
DTclean[region == "Democratic People's Republic of Korea", region := "North Korea"]
DTclean[region == "Iran (Islamic Republic of)", region := "Iran"]
DTclean[region == "Côte d'Ivoire", region := "Ivory Coast"]
DTclean[region == "Lao People's Democratic Republic", region := "Laos"]
DTclean[region == "Libyan Arab Jamahiriya", region := "Libya"]
DTclean[region == "The former Yugoslav Republic of Macedonia", region := "Macedonia"]
DTclean[region == "Micronesia (Federated States of)", region := "Micronesia"]
DTclean[region == "Republic of Moldova", region := "Moldova"]
DTclean[region == "Republic of Korea", region := "South Korea"]
DTclean[region == "Russian Federation", region := "Russia"]
DTclean[region == "Serbia and Montenegro", region := "Serbia"]
DTclean[region == "Syrian Arab Republic", region := "Syria"]
DTclean[region == "United Republic of Tanzania", region := "Tanzania"]
DTclean[region == "United Kingdom", region := "UK"]
DTclean[region == "United States of America", region := "USA"]
DTclean[region == "Venezuela (Bolivarian Republic of)", region := "Venezuela"]
DTclean[region == "Viet Nam", region := "Vietnam"]

And yea of course the work isn’t done completely yet. We also should check if there are regions in the mapdata, that aren’t in the WHO data-set. This could be due to various reasons… One being, that a region isn’t a member of the WHO and therefore the WHO doesn’t publish data on them.
Or more likely that a country from the WHO data-set span more than one region on the map, Serbia and Montenegro being such a case.
However I’m lazy now and I won’t do this today. How about you doing it and writing me a comment? 😛 Let it be a team1 effort.

## Making the map plots

OK, before we do the actual plotting let’s first calculate for how much percentage of all deaths in each country cancer is the cause. In detail I do this by joining the data.table with itself.
On a side note: W000 is the WHO code for all death causes combined and W060 for Malignant neoplasms, which is a more formal name for cancer.

Then we need to join the data.table with the map on the region name.

DTcaused <- DTclean[causes == "W000"][DTclean[causes == "W060"], on = "region"][, .(region, percentageCaused = i.deathRate / deathRate)]

deathrateMap <- mapdata[DTcaused, on = "region", allow.cartesian=TRUE, nomatch = 0]

And finally we can do our plot. For this purpose we first plot all regions in grey and as overlay we fill the countries, that we have data on, with a color between grey and red depending on how frequent cancer as a death cause is.

g <- ggplot() + geom_polygon(data = mapdata,  aes(long, lat, group = group), fill = "grey")
g <- g +  geom_polygon(data = deathrateMap,  aes(long, lat, group = group, fill = percentageCaused))
g <- g + scale_fill_gradient(low = "grey", high = "red", aesthetics = "fill", name = "Percentage of\ndeaths caused\nby cancer")
g + ggthemes::theme_map()

And of course there’s one thing about this plot that could be misleading. Given that regions with missing data and very low prevalence of cancer deaths will both be grey, you hopefully see the potential problem here?
It’s not necessarily wrong or bad to do so. But I hope you recognize how someone could make a plot this way to mislead his audience. That’s why I recommend when it comes to looking at plots not only to think about, what is shown, but also what isn’t shown. Since no large data-set is complete… So ask the person who presents it to you, how she/he handled missing data points.

So what does this map actually say? From my perspective I don’t think anything surprising. At the moment, this data set captured, cancer was (and probably still is) mostly a problem of industrialized countries and it doesn’t seem to be connected to geography primarily (Can you see how Israel, Japan and South Korea pop up?).
Although the difference between the USA and Canada could be something interesting.
But this map, in my opinion, shows very clearly that cancer is one of the leading causes of death in the developed world, which also is the reason, why we also spend so much money on researching it.

However the main purpose of this post was to show you, how to make such plots and not discuss the reasons of different causes of mortality.
Ultimately I hope that this post has helped you.

Of course it is important that you mention your sources (cite them if you write a paper). This is because your approach has to be reproducible and you have to give those people, who did the preliminary work, credit for it.

In R you can get the proper citations for the packages you used the following way:

citation("ggmap")
##
## To cite ggmap in publications, please use:
##
##   D. Kahle and H. Wickham. ggmap: Spatial Visualization with
##   ggplot2. The R Journal, 5(1), 144-161. URL
##   http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
##
## A BibTeX entry for LaTeX users is
##
##   @Article{,
##     author = {David Kahle and Hadley Wickham},
##     title = {ggmap: Spatial Visualization with ggplot2},
##     journal = {The R Journal},
##     year = {2013},
##     volume = {5},
##     number = {1},
##     pages = {144--161},
##     url = {https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf},
##   }
citation("maps")
##
## To cite package 'maps' in publications use:
##
##   Original S code by Richard A. Becker, Allan R. Wilks. R version
##   by Ray Brownrigg. Enhancements by Thomas P Minka and Alex
##   Deckmyn. (2018). maps: Draw Geographical Maps. R package version
##   3.3.0. https://CRAN.R-project.org/package=maps
##
## A BibTeX entry for LaTeX users is
##
##   @Manual{,
##     title = {maps: Draw Geographical Maps},
##     author = {Original S code by Richard A. Becker and Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn.},
##     year = {2018},
##     note = {R package version 3.3.0},
##     url = {https://CRAN.R-project.org/package=maps},
##   }
##
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

You get the idea. Also cite the other packages, if you use them in your publication or thesis.
The output is in bibtex format. So I hope you know what to do with it. 😛

Of course the data on the global burden of disease you have to cite as well. Thus I’ll give you the formatted citation for it:

WHO. (2004). The global burden of disease: 2004 update: causes of death. 2004 Update, 8–26.

And last, but not least, please also mention me. This however is not a necessity, but a sign of respect towards my work. By all means respect is an important thing, unfortunately not often enough given in our society.

# Re-Implementing netICS Heat Diffusion In R

OK, I successfully procrastinated this blog entry about heat diffusion now for too long, even though the code for it has been waiting for over a month. :’D So it’s now time for it! I promised it! And it’s time for it now. And I guess it doesn’t need to be perfect, because the whole idea of this blog is to put things out now, rather than waiting for some future moment, where I can do it perfectly.

So that being said, let’s start. 🙂

In my last post I already talked about the integration of multi-omics data. This time I will show you one method called netICS from a more or less pure coding perspective. It uses Heat Diffusion for finding so-called mediator genes in multi-omics data.
You can find the original code written in Matlab here and the corresponding paper here, which I would suggest you reading, if I aroused your interest in the topic.

Thus let’s just dive straight into the matter!

Because we are cool kids we don’t download the data “manually”, but do it right away in R.
And of course we do this with the super fast and comfy fread function from the data.table package, which I would suggest you to learn, when doing R.

We load the data directly from the Github of the original netICS:

library(data.table)

col.names = c("Gene", "Sample"), header = FALSE)
header = FALSE, col.names = "Gene")

header = FALSE, col.names = c("Gene", "pval"))
header = FALSE, col.names = c("Gene", "pval"))


The network used in the method is stored as a Matlab object. But ff course there’s also a package that does this job for us! After that we have to extract the adjacency matrix from it. Because we will need this adjacency matrix later.

library(R.matlab)
adjacencyMatrix<-tmp$adj.lar.com rm(tmp)  And that already was it with loading the data! ## Combing p-values In my last post I already talked about combining p-values. We need to implement it to combine the p-values from the rnaDiffExp and rppa. Maybe you will also see now, why I like data.tables so much. I’ve commented the code for you. combineDifferentialExpressions<-function(diffExp1, diffExp2){ # Combine both data.tables through an outer join mergedDiff<-merge(diffExp1, diffExp2, by ="Gene", all=TRUE) # if the first p-value is missing take the second one mergedDiff[is.na(pval.x) & !is.na(pval.y), pval := pval.y] # if the second p-value is missing take the first one mergedDiff[!is.na(pval.x) & is.na(pval.y), pval := pval.x] # if both are present combine them by Fisher's method and perform a chi-square test mergedDiff[!is.na(pval.x) & !is.na(pval.y), pval := pchisq(-2 * (log(pval.x) + log(pval.y)) , 4)] return(mergedDiff[,.(Gene, pval)]) }  I really love the syntax of data.tables in R I have to say. 😀 Now let’s execute it and afterwards adjust the p-values with the FDR method and then take a look at it: diffExp <- combineDifferentialExpressions(rnaDiffExp, rppa) diffExp[, p.adjusted:=p.adjust(pval, method = "fdr")] knitr::kable(diffExp[1:5])  Genepvalp.adjusted A1BG0.00000000.0000000 A1CF0.88511070.9505269 A2BP10.22216470.2792292 A2LD10.91032690.9695794 A2M0.00000000.0000000 ## Heat Diffusion I won’t go into detail about heat diffusion and its theoretical background at this place. Let me just summarize the idea or rather goal short. We’re transforming the adjacency matrix first to a transition matrix, with which you might be familiar from Markov Chains. Basically each cell in the matrix stands for a transition between two genes (in our case). Here R can also shine with its natural Linear Algebra capabilities. normaliseAdjacencyMatrix <- function(adjacencyMatrix){ return(adjacencyMatrix %*% diag(1/colSums(adjacencyMatrix))) }  You see… Isn’t that basically a beautiful one-liner? In the next step I implement the transformation of this matrix to a diffusion matrix. This matrix basically represents the connectivity and network topology. Multiplying a vector of starting probabilities with it, gives us back a stationary distribution. Which basically means after initially many transitions in the network we will end up at a certain set of vertices in the network with a certain probability. The restart probability indicates how likely it is that we will be “reported” back to our starting point instead of a transition in a step. So let’s implement it: performInsulatedHeatDiffusion <- function(adjacencyMatrix, restartProbability){ temperature <- diag(dim(adjacencyMatrix)[1]) - (1 - restartProbability) * adjacencyMatrix return(restartProbability * solve(temperature)) }  We will run through this network in both directions. Basically, one time from the direction of mutated genes and the other time from the differentially expressed ones. So let’s wrap this up in another function. We set the default restart probability to 0.4. netICS <- function(adjacencyMatrix, networkGenes, mutationData, diffExpGenes, restartProbability = 0.4){ # only keep mutated and differentially expressed genes, that are present in the network mutationInNetwork <- mutationData[networkGenes, on="Gene", nomatch = 0] diffExpInNetwork <- diffExpGenes[networkGenes, on="Gene", nomatch = 0] # calculating the regular diffusion matrix connectivity <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(adjacencyMatrix), restartProbability) # calculating the diffusion matrix in the opposite direction connectivityBackward <- performInsulatedHeatDiffusion(normaliseAdjacencyMatrix(t(adjacencyMatrix)), restartProbability) # ranking mediator genes result <- prioritize(connectivity, connectivityBackward, networkGenes, mutationData, mutationInNetwork, diffExpInNetwork) return(result) } At this point you might have already noticed that we haven’t done anything yet with the diffusion matrix. For this reason we will implement the prioritize function, which will do this for us. ## Gene Prioritization First we implement a helper function for the diffusion to avoid redundant code: diffuseSample<-function(connectivity, networkGenes, S){ # find the positions of the input genes in the matrix positions <- networkGenes$Gene %in% S$Gene #multiply them with equal probability with the diffusion matrix weights <- rep(1/sum(positions), sum(positions)) %*% connectivity[positions,] return(as.vector(weights)) } S here is the list of genes with which we multiply the diffusion matrix. The starting probability is for that matter uniformally distributed between them. Afterwards we write our actual prioritization function. The diffusion in this case is only once executed for the differentially expressed genes, but sample-wise for the mutated genes. Subsequently, we combine both probabilities sample-wise by multiplying and rank them. Those ranks are then combined by three different methods, them being median, sums and rho-scores. For the last one the RobustRankAggreg package is used: prioritize<-function(connectivity, connectivityBackward, networkGenes, mutationData, mutationInNetwork, diffExpInNetwork){ # diffuse for differential expression, if differential expression isn't sample-wise if(!"Sample" %in% colnames(diffExpInNetwork)){ Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork) } scores <- data.table() for(sample in unique(mutationData$Sample)){
# diffuse for mutated genes
Em<-diffuseSample(connectivity, networkGenes, mutationInNetwork[Sample == sample])

# diffuse for differential expression, if differential expression is sample-wise
if("Sample"  %in% colnames(diffExpInNetwork)){
Ed<-diffuseSample(connectivityBackward, networkGenes, diffExpInNetwork[Sample == sample])
}

# Multiply scores
E <- Em * Ed
scores <- rbind(scores, data.table(Gene = networkGenes$Gene, Sample = sample, Score = E)) } # rank scores scores[, rank.in.sample:=rank(-Score), by=.(Sample)] # combine ranks library(RobustRankAggreg) ranks<-scores[, .(med=median(rank.in.sample), sum=sum(rank.in.sample), rho = rhoScores(rank.in.sample/max(rank.in.sample))), by=.(Gene)] return(ranks) } Phew… The work is done! Now it’s time to execute the code. If you do this at home you might notice, that it will take a while until it’s finished. That’s because the calculation of the diffusion matrix is very expensive, when it comes to processing time. Your computer has to solve a system of linear equations for it, which is not linear. But don’t ask me in which O it exactly lies. 🙂 result<-netICS(adjacencyMatrix = adjacencyMatrix, networkGenes, mutationData = mutationData, diffExpGenes = diffExp[p.adjusted < 0.05, .(Gene)]) knitr::kable(result[order(sum)][1:10])  Genemedsumrho RPS6KB196874489.50e+00 AKT377077330.51e-07 AKT194478520.50e+00 SERPINE181881454.50e+00 TP53102684433.51e-07 MAP2K284386530.51e-07 CDKN2D75886534.50e+00 CREB1108087179.51e-07 VEGFA98787678.50e+00 E2F188588245.50e+00 I will probably use this code at some point of the future again, as Network theory is one of my areas of interest and it’s pretty nice to have that lying around. This method, heat diffusion, is by the way very similar to Pageranking… The stuff that made Google big. 🙂 ## Some Stuff About The Near Future I will probably next week do a post about making map plots. As I have to do one anyway and I figured it would be nice to document it here, as it could be quite useful to some of you. And in the first week of April I will be on a Hackathon (my very first!) about Metabolomics in R. I will probably keep you posted about that then and I’m looking so forward to it. So see you hopefully soon! ## Availability Of My Code You can access a maintained version of the code of the correct version in my GitHub repository Raspository under R/NetICS.R. Please follow and like us: ## Alternating Least Squares # How To Implement Alternating Least Squares In R And How Not To Do It So it’s time for my first actual content. And like predicted in my Hello World post, it will be something implemented in the R programming language. More precisely it’s a Machine learning algorithm called Alternating Least Squares. But first, before we indulge ourselves in code, let me tell you why this algorithm is of interest for me and what it does. ## Introduction I’ve been working now for a few months as part of my research assistant job on the Bioconductor package BEclear . I won’t go into detail about the package, you only need to know that it uses something called a Latent Factor Model to impute1 missing data in data-sets. Let’s say you have a matrix D containing missing values. The rows in the matrix stand for features and the columns for samples. Then you could assume that the matrix $$D_ {ij}$$ is modeled by both features and sample specific effects in the following way: $$D_ {ij} = L_ {i}^ {T} \times R_ {j}$$ Where $$L_i$$is the feature specific matrix and $$R_j$$ the sample specific matrix. For the imputation of missing values you now try to estimate those two matrices, the latent factors, from the existing values. Methods based on this assumption are already applied in variety of fields. Like with batch effects in DNA Methylation in the case of the BEclear package or in recommender systems for e.g. Netflix like described in a paper, which helped me a lot in understanding this topic. To estimate the latent factors there are different methods. One of them, implemented in the BEclear package is a gradient descent. Another method for it is Alternating Least Squares (ALS), which I wanted to implement on my own.2 The lecture notes of Hastie et al served as my source for implementing this method. I highly recommend you reading both the paper from Koren et al and those lecture notes, if you want to know more about the theoretical background and also the applications of those methods. You just need to know some Linear Algebra, then they should be easy enough to understand in my opinion. But as a short summary… ALS tries to estimate the feature and sample matrix by alternating fixing one of them and then calculating the other one by solving the the system of equations and you do this in general until convergence. In Gradient Descent on the other hand both matrices are estimated at the same time. ## How Not To Implement Alternating Least Squares As a start I will show you my first faulty try in implementing the Alternating Least Squares. Maybe you will learn something by me sharing it with you. And you should try to guess what my mistake was. As my implementation reuses a function from BEclear34, you have to install this package first. For this purpose I guess it’s easiest if you just install it from GitHub via the following lines of code: if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") devtools::install_github("David-J-R/BEclear")  And now let’s come to the first implementation of the Alternating Least Squares algorithm. See if you can find its problem. I tried to comment all those steps so that the code should be comprehensible. But if not, please let me know! 🙂 imputeALSfaulty<- function(data, lambda = 0.75, r = 10, iters = 20){ # We require the BEclear package, because we're using its loss function require(BEclear) D <- data D[is.na(D)] <- 0 # We initialise L and R with random values L <- matrix(rnorm(nrow(data) * r), nrow(data), r) / sqrt(r) R <- matrix(rnorm(r * ncol(data)), r, ncol(data)) / sqrt(r) currLoss <- BEclear:::loss(L,R, 1, data)$loss

print(currLoss)

for(i in 1:iters){
# L and R are determined by solving the following system of the equations
# We repeat this step iters-times
L <- t(solve(R %*% t(R) + diag(lambda,r), R %*% t(D)))
R <- solve(t(L) %*% L + diag(lambda,r), t(L) %*% D)

currLoss <- BEclear:::loss(L,R, 1, data)$loss print(currLoss) } # L and R are multiplied to get the estimated values D <- L %*% R # Missing values are replaced with estimated value for (i in seq_len(nrow(data))) for (j in seq_len(ncol(data))) { if (is.na(data[i, j])) { data[i, j] <- D[i, j] } } return(data) }  Now let me show you the problem with this implementation, if you haven’t recognized it on your own by now. First we will load the example data and functions from the BEclear package to generate ourselves a sample data-set with missing values: library(BEclear) data("BEclearData") batchEffect <- calcBatchEffects( data = ex.data, samples = ex.samples, adjusted = TRUE, method = "fdr") mdifs <- batchEffect$med
pvals <- batchEffect$pval summary <-calcSummary(mdifs, pvals) cleared.data <- clearBEgenes(ex.data, ex.samples, summary)  And then we run the faulty version of ALS on it. The printed output of the function is the loss of the current solution during each iteration. result <- imputeALSfaulty(cleared.data, iters = 10)  ## [1] 2586.68 ## [1] 101.8086 ## [1] 95.60281 ## [1] 95.29458 ## [1] 95.21404 ## [1] 95.20139 ## [1] 95.20632 ## [1] 95.21329 ## [1] 95.21869 ## [1] 95.22233 ## [1] 95.2247  If we now take a look at the imputed values, you can see what’s wrong: boxplot(result[is.na(cleared.data)])  They’re all pretty close to zero. That’s because we set the missing values to zero. This way the solve method “tries” to generate R and L the way that the missing values are also very close to zero. Of course we don’t want that… This way we could just set the missing values right away to zero. ## How To Implement Alternating Least Squares Finally let me show you an implementation that actually does, what it should do. And again if something is unclear, don’t hesitate to ask me! imputeALScorrect<- function(data, lambda = 0.75, r = 10, iters = 80){ # We require the BEclear package, because we're using its loss function require(BEclear) # We initialise L and R with random values L <- matrix(rnorm(nrow(data) * r), nrow(data), r) / sqrt(r) R <- matrix(rnorm(r * ncol(data)), r, ncol(data)) / sqrt(r) currLoss <- BEclear:::loss(L,R, 1, data)$loss
print(currLoss)

for(iter in 1:iters){

# Now we iterate over the feature dimmension of L
for(i in 1:dim(L)[[1]]){
# We determine the revealed entries for the feature
# And subset the data and R so to only retain the revealed entries
revealedEntries <- !is.na(data[i,])
y <- as.matrix(data[i, revealedEntries])
x <- R[,revealedEntries]
# We solve the linear equation for the feature
L[i,] <- as.vector(solve(x %*% t(x) + diag(lambda, r), x %*% y))
}

# We iterate over the sample dimmension of R
for(j in 1:dim(R)[[2]]){
# We determine the revealed entries for the sample
# And subset the data and L so to only retain the revealed entries
revealedEntries <- !is.na(data[,j])
y <- as.matrix(data[revealedEntries, j])
x <- L[revealedEntries,]
# We solve the linear equation for the sample
R[,j] <- as.vector(solve(t(x) %*% x + diag(lambda, r), t(x) %*% y))
}
currLoss <- BEclear:::loss(L,R, 1, data)$loss print(currLoss) } # L and R are multiplied to get the estimated values D <- L %*% R # Missing values are replaced with estimated value for (i in seq_len(nrow(data))) for (j in seq_len(ncol(data))) { if (is.na(data[i, j])) { data[i, j] <- D[i, j] } } return(data) }  A further advantage of this implementation is, that it is relatively easy to write a parallelised version of it. Maybe I will show you that in one of my next posts. After I overheard a conversation at the university that R is apparently bad for this, I feel almost challenged to do so. Now let’s take a look at the imputed values. We just take the sample data-set from before for this cause. result <- imputeALScorrect(cleared.data, iters = 10)  ## [1] 2571.072 ## [1] 109.301 ## [1] 99.38027 ## [1] 97.17519 ## [1] 95.42625 ## [1] 94.00547 ## [1] 92.83838 ## [1] 91.87368 ## [1] 91.07338 ## [1] 90.40794 ## [1] 89.85372  boxplot(result[is.na(cleared.data)])  Now that looks more like real data… Doesn’t it? But to be sure let’s compare it to the predicted values by the BEclear package. For the comparison we calculated the Root Mean Squared Error: library(Metrics) result.BEclear <- imputeMissingData(cleared.data)  ## INFO [2019-02-08 12:17:10] Starting the imputation of missing values. ## INFO [2019-02-08 12:17:10] This might take a while. ## INFO [2019-02-08 12:17:10] BEclear imputation is started: ## INFO [2019-02-08 12:17:10] block size: 60 x 60 ## INFO [2019-02-08 12:17:10] Impute missing data for block 1 of 4 ## INFO [2019-02-08 12:17:10] Impute missing data for block 2 of 4 ## INFO [2019-02-08 12:17:11] Impute missing data for block 3 of 4 ## INFO [2019-02-08 12:17:11] Impute missing data for block 4 of 4  rmse(result.BEclear[is.na(cleared.data)], result[is.na(cleared.data)])  ## [1] 0.03196931  Well the difference isn’t that big. But of course for assessing the accuracy of the method an elaborate evaluation would be needed. However for something I coded just for fun I’m satisfied with this first look. ## Addendum: Biases Just for fun let’s also add biases to our model, like described by Koren et al, to further improve our algorithm. The idea behind the bias is to capture the variability of the data that arises from the features or samples alone, while the two matrices L and R capture the variability that arises from the interaction of features and samples together. In other words by introducing the biases we unburden L and R a bit. We use a method, where the biases for each entry in the data-set are the sum of the overall average over all values and the average difference from this average of the corresponding column and row. And to save valuable computation time we just subtract this bias for each value from a copy of each value and use this transformed matrix for further calculations. Of course we have to add the bias later again. And here we go with the improved implementation: imputeALSBias<- function(data, lambda = 0.75, r = 5, iters = 10, use.biases=TRUE){ # We require the BEclear package, because we're using its loss function require(BEclear) # copy the data D <- data # We initialise L and R with random values L <- matrix(rnorm(nrow(data) * r), nrow(data), r) / sqrt(r) R <- matrix(rnorm(r * ncol(data)), r, ncol(data)) / sqrt(r) currLoss <- BEclear:::loss(L,R, 1, D)$loss
print(currLoss)

if(use.biases){
# we calculate the biases
biasData<-mean(data, na.rm = TRUE)
biasRows<-rowMeans(data - biasData, na.rm= TRUE)
biasCols<-colMeans(data - biasData, na.rm= TRUE)

# subtract the biases from the data
D <- D - biasData - biasRows
D <- t(t(D) - biasCols)
}

for(iter in 1:iters){

# Now we iterate over the feature dimmension of L
for(i in 1:dim(L)[[1]]){
# We determine the revealed entries for the feature
# And subset the data and R so to only retain the revealed entries
revealedEntries <- !is.na(D[i,])
y <- as.matrix(D[i, revealedEntries])
x <- R[,revealedEntries]
# We solve the linear equation for the feature
L[i,] <- as.vector(solve(x %*% t(x) + diag(lambda, r), x %*% y))
}

# We iterate over the sample dimmension of R
for(j in 1:dim(R)[[2]]){
# We determine the revealed entries for the sample
# And subset the data and L so to only retain the revealed entries
revealedEntries <- !is.na(D[,j])
y <- as.matrix(D[revealedEntries, j])
x <- L[revealedEntries,]
# We solve the linear equation for the sample
R[,j] <- as.vector(solve(t(x) %*% x + diag(lambda, r), t(x) %*% y))
}
currLoss <- BEclear:::loss(L,R, 1, D)\$loss

print(currLoss)
}

# L and R are multiplied to get the estimated values
D <- L %*% R

if(use.biases){
# we add the biases again
D <- t(t(D) + biasCols)
D <- D + biasData + biasRows
}

# Missing values are replaced with estimated value

for (i in seq_len(nrow(data)))
for (j in seq_len(ncol(data)))
{
if (is.na(data[i, j])) {
data[i, j] <- D[i, j]
}
}

return(data)
}


Testing this implementation, if you wish, is now your turn! 🙂
Maybe at some later point I will compare the performance and correctness of various different settings of this functions5. But for now that’s enough. Of course there are more sophisticated bias models we could think of. But one could even think of bias models like biases that are also determined by the Alternating Least Squares method during each iteration.
So we won’t run out of things to do any time soon.

## Conclusion

Yea, what’s the conclusion? I think it’s quite simple… Don’t be lazy, while coding!

OK, OK… I will say a bit more. I think what you can learn from the faulty example is that you should always think what your code is actually doing and take a look at the results to see, if something is fishy. Other than that I hope that you learned something and I could show you that some methods used in Machine Learning aren’t that complicated.

For now my implementation of ALS is still worse, when it comes to run time, in comparison to the Gradient Descent implemented in the BEclear package. But I also spend a lot of time optimizing the second. And maybe I will show you in a future blog how to optimize it. As this is my first blog post I would highly welcome feedback from you! 🙂

So have a nice day and until next time!

## Availability Of The Code

You can access a maintained version of the code of the correct version in my GitHub repository Raspository under R/imputeALS.R.