# How To Implement Alternating Least Squares In R And How Not To Do It

So it’s time for my first actual content. And like predicted in my Hello World post, it will be something implemented in the R programming language. More precisely it’s a Machine learning algorithm called Alternating Least Squares. But first, before we indulge ourselves in code, let me tell you why this algorithm is of interest for me and what it does.

## Introduction

I’ve been working now for a few months as part of my research assistant job on the Bioconductor package *BEclear* (10.1371/journal.pone.0159921). I won’t go into detail about the package, you only need to know that it uses something called a Latent Factor Model to impute^{1} missing data in data-sets.

Let’s say you have a matrix D containing missing values. The rows in the matrix stand for features and the columns for samples. Then you could assume that the matrix \(D_ {ij}\) is modeled by both features and sample specific effects in the following way:

\(D_ {ij} = L_ {i}^ {T} \times R_ {j}\)Where \(L_i\)is the feature specific matrix and \(R_j\) the sample specific matrix. For the imputation of missing values you now try to estimate those two matrices, the latent factors, from the existing values. Methods based on this assumption are already applied in variety of fields. Like with batch effects in DNA Methylation in the case of the *BEclear* package or in recommender systems for e.g. Netflix like described in a paper(10.1109/MC.2009.263), which helped me a lot in understanding this topic.

To estimate the latent factors there are different methods. One of them, implemented in the *BEclear* package is a gradient descent. Another method for it is Alternating Least Squares (ALS), which I wanted to implement on my own.^{2}

The lecture notes of Hastie et al(1410.2596) served as my source for implementing this method. I highly recommend you reading both the paper from Koren et al and those lecture notes, if you want to know more about the theoretical background and also the applications of those methods. You just need to know some Linear Algebra, then they should be easy enough to understand in my opinion.

But as a short summary… ALS tries to estimate the feature and sample matrix by alternating fixing one of them and then calculating the other one by solving the the system of equations and you do this in general until convergence. In Gradient Descent on the other hand both matrices are estimated at the same time.

## How Not To Implement Alternating Least Squares

As a start I will show you my first faulty try in implementing the Alternating Least Squares. Maybe you will learn something by me sharing it with you. And you should try to guess what my mistake was.

As my implementation reuses a function from *BEclear*^{3}^{4}, you have to install this package first. For this purpose I guess it’s easiest if you just install it from GitHub via the following lines of code:

```
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("David-J-R/BEclear")
```

And now let’s come to the first implementation of the Alternating Least Squares algorithm. See if you can find its problem. I tried to comment all those steps so that the code should be comprehensible. But if not, please let me know! 🙂

```
imputeALSfaulty<- function(data, lambda = 0.75, r = 10, iters = 20){
# We require the BEclear package, because we're using its loss function
require(BEclear)
D <- data
D[is.na(D)] <- 0
# We initialise L and R with random values
L <- matrix(rnorm(nrow(data) * r), nrow(data), r) / sqrt(r)
R <- matrix(rnorm(r * ncol(data)), r, ncol(data)) / sqrt(r)
currLoss <- BEclear:::loss(L,R, 1, data)$loss
print(currLoss)
for(i in 1:iters){
# L and R are determined by solving the following system of the equations
# We repeat this step iters-times
L <- t(solve(R %*% t(R) + diag(lambda,r), R %*% t(D)))
R <- solve(t(L) %*% L + diag(lambda,r), t(L) %*% D)
currLoss <- BEclear:::loss(L,R, 1, data)$loss
print(currLoss)
}
# L and R are multiplied to get the estimated values
D <- L %*% R
# Missing values are replaced with estimated value
for (i in seq_len(nrow(data)))
for (j in seq_len(ncol(data)))
{
if (is.na(data[i, j])) {
data[i, j] <- D[i, j]
}
}
return(data)
}
```

Now let me show you the problem with this implementation, if you haven’t recognized it on your own by now.

First we will load the example data and functions from the *BEclear* package to generate ourselves a sample data-set with missing values:

```
library(BEclear)
data("BEclearData")
batchEffect <- calcBatchEffects(
data = ex.data, samples = ex.samples,
adjusted = TRUE, method = "fdr")
mdifs <- batchEffect$med
pvals <- batchEffect$pval
summary <-calcSummary(mdifs, pvals)
cleared.data <- clearBEgenes(ex.data, ex.samples, summary)
```

And then we run the faulty version of ALS on it. The printed output of the function is the loss of the current solution during each iteration.

```
result <- imputeALSfaulty(cleared.data, iters = 10)
```

## [1] 2586.68 ## [1] 101.8086 ## [1] 95.60281 ## [1] 95.29458 ## [1] 95.21404 ## [1] 95.20139 ## [1] 95.20632 ## [1] 95.21329 ## [1] 95.21869 ## [1] 95.22233 ## [1] 95.2247

If we now take a look at the imputed values, you can see what’s wrong:

```
boxplot(result[is.na(cleared.data)])
```

They’re all pretty close to zero. That’s because we set the missing values to zero. This way the `solve`

method “tries” to generate R and L the way that the missing values are also very close to zero.

Of course we don’t want that… This way we could just set the missing values right away to zero.

## How To Implement Alternating Least Squares

Finally let me show you an implementation that actually does, what it should do. And again if something is unclear, don’t hesitate to ask me!

```
imputeALScorrect<- function(data, lambda = 0.75, r = 10, iters = 80){
# We require the BEclear package, because we're using its loss function
require(BEclear)
# We initialise L and R with random values
L <- matrix(rnorm(nrow(data) * r), nrow(data), r) / sqrt(r)
R <- matrix(rnorm(r * ncol(data)), r, ncol(data)) / sqrt(r)
currLoss <- BEclear:::loss(L,R, 1, data)$loss
print(currLoss)
for(iter in 1:iters){
# Now we iterate over the feature dimmension of L
for(i in 1:dim(L)[[1]]){
# We determine the revealed entries for the feature
# And subset the data and R so to only retain the revealed entries
revealedEntries <- !is.na(data[i,])
y <- as.matrix(data[i, revealedEntries])
x <- R[,revealedEntries]
# We solve the linear equation for the feature
L[i,] <- as.vector(solve(x %*% t(x) + diag(lambda, r), x %*% y))
}
# We iterate over the sample dimmension of R
for(j in 1:dim(R)[[2]]){
# We determine the revealed entries for the sample
# And subset the data and L so to only retain the revealed entries
revealedEntries <- !is.na(data[,j])
y <- as.matrix(data[revealedEntries, j])
x <- L[revealedEntries,]
# We solve the linear equation for the sample
R[,j] <- as.vector(solve(t(x) %*% x + diag(lambda, r), t(x) %*% y))
}
currLoss <- BEclear:::loss(L,R, 1, data)$loss
print(currLoss)
}
# L and R are multiplied to get the estimated values
D <- L %*% R
# Missing values are replaced with estimated value
for (i in seq_len(nrow(data)))
for (j in seq_len(ncol(data)))
{
if (is.na(data[i, j])) {
data[i, j] <- D[i, j]
}
}
return(data)
}
```

A further advantage of this implementation is, that it is relatively easy to write a parallelised version of it. Maybe I will show you that in one of my next posts. After I overheard a conversation at the university that R is apparently bad for this, I feel almost challenged to do so.

Now let’s take a look at the imputed values. We just take the sample data-set from before for this cause.

```
result <- imputeALScorrect(cleared.data, iters = 10)
```

## [1] 2571.072 ## [1] 109.301 ## [1] 99.38027 ## [1] 97.17519 ## [1] 95.42625 ## [1] 94.00547 ## [1] 92.83838 ## [1] 91.87368 ## [1] 91.07338 ## [1] 90.40794 ## [1] 89.85372

```
boxplot(result[is.na(cleared.data)])
```

Now that looks more like real data… Doesn’t it? But to be sure let’s compare it to the predicted values by the *BEclear* package. For the comparison we calculated the Root Mean Squared Error:

```
library(Metrics)
result.BEclear <- imputeMissingData(cleared.data)
```

## INFO [2019-02-08 12:17:10] Starting the imputation of missing values. ## INFO [2019-02-08 12:17:10] This might take a while. ## INFO [2019-02-08 12:17:10] BEclear imputation is started: ## INFO [2019-02-08 12:17:10] block size: 60 x 60 ## INFO [2019-02-08 12:17:10] Impute missing data for block 1 of 4 ## INFO [2019-02-08 12:17:10] Impute missing data for block 2 of 4 ## INFO [2019-02-08 12:17:11] Impute missing data for block 3 of 4 ## INFO [2019-02-08 12:17:11] Impute missing data for block 4 of 4

```
rmse(result.BEclear[is.na(cleared.data)], result[is.na(cleared.data)])
```

## [1] 0.03196931

Well the difference isn’t that big. But of course for assessing the accuracy of the method an elaborate evaluation would be needed. However for something I coded just for fun I’m satisfied with this first look.

## Addendum: Biases

Just for fun let’s also add biases to our model, like described by Koren et al, to further improve our algorithm.

The idea behind the bias is to capture the variability of the data that arises from the features or samples alone, while the two matrices L and R capture the variability that arises from the interaction of features and samples together.

In other words by introducing the biases we unburden L and R a bit.

We use a method, where the biases for each entry in the data-set are the sum of the overall average over all values and the average difference from this average of the corresponding column and row. And to save valuable computation time we just subtract this bias for each value from a copy of each value and use this transformed matrix for further calculations. Of course we have to add the bias later again.

And here we go with the improved implementation:

```
imputeALSBias<- function(data, lambda = 0.75, r = 5, iters = 10, use.biases=TRUE){
# We require the BEclear package, because we're using its loss function
require(BEclear)
# copy the data
D <- data
# We initialise L and R with random values
L <- matrix(rnorm(nrow(data) * r), nrow(data), r) / sqrt(r)
R <- matrix(rnorm(r * ncol(data)), r, ncol(data)) / sqrt(r)
currLoss <- BEclear:::loss(L,R, 1, D)$loss
print(currLoss)
if(use.biases){
# we calculate the biases
biasData<-mean(data, na.rm = TRUE)
biasRows<-rowMeans(data - biasData, na.rm= TRUE)
biasCols<-colMeans(data - biasData, na.rm= TRUE)
# subtract the biases from the data
D <- D - biasData - biasRows
D <- t(t(D) - biasCols)
}
for(iter in 1:iters){
# Now we iterate over the feature dimmension of L
for(i in 1:dim(L)[[1]]){
# We determine the revealed entries for the feature
# And subset the data and R so to only retain the revealed entries
revealedEntries <- !is.na(D[i,])
y <- as.matrix(D[i, revealedEntries])
x <- R[,revealedEntries]
# We solve the linear equation for the feature
L[i,] <- as.vector(solve(x %*% t(x) + diag(lambda, r), x %*% y))
}
# We iterate over the sample dimmension of R
for(j in 1:dim(R)[[2]]){
# We determine the revealed entries for the sample
# And subset the data and L so to only retain the revealed entries
revealedEntries <- !is.na(D[,j])
y <- as.matrix(D[revealedEntries, j])
x <- L[revealedEntries,]
# We solve the linear equation for the sample
R[,j] <- as.vector(solve(t(x) %*% x + diag(lambda, r), t(x) %*% y))
}
currLoss <- BEclear:::loss(L,R, 1, D)$loss
print(currLoss)
}
# L and R are multiplied to get the estimated values
D <- L %*% R
if(use.biases){
# we add the biases again
D <- t(t(D) + biasCols)
D <- D + biasData + biasRows
}
# Missing values are replaced with estimated value
for (i in seq_len(nrow(data)))
for (j in seq_len(ncol(data)))
{
if (is.na(data[i, j])) {
data[i, j] <- D[i, j]
}
}
return(data)
}
```

Testing this implementation, if you wish, is now your turn! 🙂

Maybe at some later point I will compare the performance and correctness of various different settings of this functions^{5}. But for now that’s enough. Of course there are more sophisticated bias models we could think of. But one could even think of bias models like biases that are also determined by the Alternating Least Squares method during each iteration.

So we won’t run out of things to do any time soon.

## Conclusion

Yea, what’s the conclusion? I think it’s quite simple… Don’t be lazy, while coding!

OK, OK… I will say a bit more. I think what you can learn from the faulty example is that you should always think what your code is actually doing and take a look at the results to see, if something is fishy. Other than that I hope that you learned something and I could show you that some methods used in Machine Learning aren’t that complicated.

For now my implementation of ALS is still worse, when it comes to run time, in comparison to the Gradient Descent implemented in the *BEclear* package. But I also spend a lot of time optimizing the second. And maybe I will show you in a future blog how to optimize it. As this is my first blog post I would highly welcome feedback from you! 🙂

So have a nice day and until next time!

## Availability Of The Code

You can access a maintained version of the code of the correct version in my GitHub repository *Raspository* under R/imputeALS.R.