# Or the Devil is in the Preprocessing

It’s been a while, eh? So I guess it’s time for another contribution to the Raspository. I didn’t really find the time to write something sensible last month, as I had a lot on my plate. More specifically, I had to finish the essay I talked about in my last post. Then I had to study for my exams and sort out some things regarding my old apartment, but now I have time again.

The topic of my essay will also be the topic of this post… more or less. The method I had to write about, and for which I’m preparing a presentation right now, is called netICS (10.1093/bioinformatics/bty148). It is a graph-based method for the integration of multi-omics data.

I won’t go into detail about the core method itself in this post; that will be the topic of a coming post… I just need to think about how to present the code I have already written! Instead, I will tell you about the non-trivial preprocessing step for the integration of multi-omics data. This is probably the part you’ll spend most of your time on, if you’re planning to do so.

## Necessary simplifications

If you’re integrating data from various sources, you need to implement some sensible simplifications. This does not only hold for multi-omics data^{1}, but for basically every case where you combine different kinds of sources.

This is necessary because you don’t have unlimited resources. By resources I mean everything from samples to computing power and verified^{2} interactions in your network.

I’m talking about networks because networks are one way of making sense of such data-sets. In this setting you treat your features as nodes in the network, which can interact with each other in certain ways.

Of course, nodes of different kinds interact with each other in different ways. That’s something you either have to model explicitly, or it is the part where you simplify your model. And it is a reasonable simplification to treat all nodes in a network the same. That is more or less what was done in netICS.

In more detail, in the network all different products of a gene^{3} are represented by the same node.

This is reasonable, as we may know a lot about interactions between certain kinds of gene products^{4}, but we don’t know enough about all kinds of possible interactions. You always have to keep your simplifications in mind.
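To make this gene-level simplification concrete, here is a small hypothetical sketch in Python. The mapping table and the helper function are entirely my own illustration (netICS itself is not reproduced here); the idea is simply a lookup that collapses different gene products onto one gene-level network node.

```python
# Illustrative mapping from gene-product IDs to gene-level nodes.
# The IDs shown are examples; in practice you would build this table
# from an annotation resource.
product_to_gene = {
    "ENST00000269305": "TP53",   # a transcript ID
    "P04637": "TP53",            # a protein ID
    "ENST00000371953": "PTEN",
    "P60484": "PTEN",
}

def gene_node(product_id):
    """Map any gene product to its gene-level network node."""
    return product_to_gene[product_id]

# Measurements on different product types now land on the same node:
print(gene_node("ENST00000269305"), gene_node("P04637"))  # TP53 TP53
```

With that, omics layers that measure transcripts and layers that measure proteins can both feed into one node per gene, which is exactly where the combination problem of the next section comes from.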

## Combination of different data types

Then there is the question of how to combine different data types. Because in the case of the simplification mentioned above, you have to combine the data from the different gene products into one value per gene.

One example used in netICS is the combination of p-values according to the following formula, which is known as Fisher’s method (10.2307/2681650):

\( X = -2 \cdot \sum^{k}_{i = 1} \log(p_{i}) \)

Under the null hypothesis, X follows a chi-squared distribution with 2k degrees of freedom, so you can look up X in that distribution to get a new combined p-value. Of course, this combination method comes with its own limitations.

One of them is that you can only apply it when the different p-values you’re combining are independent. Independence is often assumed, but there are also methods for dependent cases; it just gets more complicated then.
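Fisher’s method is simple enough to sketch without any statistics library. For an even number of degrees of freedom (and 2k is always even) the chi-squared survival function has a closed form, so a minimal stand-alone Python version looks like this (the function name is my own, not from netICS):

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    X = -2 * sum(log(p_i)) follows a chi-squared distribution with
    2k degrees of freedom under the null; for even degrees of freedom
    the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)
    half = x / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

# A single p-value passes through unchanged:
print(round(fisher_combine([0.05]), 4))        # 0.05
# Two moderately small p-values reinforce each other:
print(round(fisher_combine([0.05, 0.05]), 4))  # 0.0175
```

Note how two p-values of 0.05 combine to something smaller than either of them — consistent evidence across layers strengthens the signal. But again: this is only valid if the p-values really are independent.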

Depending on what kind of omics data you want to combine, it might also be sensible to combine them into an indicator variable. For example, this indicator variable could take the value 1 if either one or both of two features from different data-sets are present in a sample, and 0 otherwise.

## Integration of Multi-Omics Data is hard work

There is, however, no recipe that you can always just apply… Each data-set is different. And so is each combination of data-sets. That’s why Data Science needs industrious people who don’t shy away from boggling their minds.

But this work also holds its own rewards. If you put it in, you will be able to use a plethora of sweet^{5} algorithms on your now preprocessed data-set. netICS, for example, uses an approach based on PageRank — the thing our friend Google does. And I will talk more about it very soon.

For now, I hope I could give you a little impression of the struggles of data integration. Of course there are more things to keep in mind, like unclean data. But we will save the cleaning of data-sets for another time. 😛

So see you soon!