Category Archives: Data Analysis

Data analysis is one of the big topics of the century. While its big brother data science engages with larger chunks of data. Data analysis on the other hand itself also occupies itself with normal data-sets.

I will show you here, how to conduct data analytics. I especially enjoy the visualisation of data. In my opinion there is nothing more pleasing than a nice plot! Othwise I will show you how to clean data and do other operations on it. Most of the time I will use the R programming laguage for this purpose, as it is well suited for the task of data analysis. Sometimes I will also make a detour to machine learning and data mining.
Albeit this category is called data analysis, I might also cover topics from big data in the future. As I already worked with big data in form of RNA sequencing data I might be qualified to do so.

Although I’m originally from Bioinformatics, I also plan to take a look at other fields of apllication. The mangy statistics are nevetheless always the same.

If you want me to cover a specific topic, just write me. One of the topics I already covered on demand was seat allocation methods in election. I like to play the explainer for mathematical concepts. So try me!

D’Hondt Method

And Also Adams

In my last post I showed you the seat allocation method, called Sainte-Laguë/Schepers method. I recommend you reading it, before you continue with this post, as this post about the D’Hondt method heavily builds upon it.

Yepp, this vulture is cute!

There are actually two seat allocation methods, that are pretty similar to Sainte-Laguë/Schepers. One of them being the D’Hondt 1 method. The other one being the Adam’s method2.
Both are really remarkably similar to the method from my last post. With really the only difference being the way they round the seats. While Sainte-Laguë/Schepers uses standard round, D’Hondt uses the floor function. Meaning that it always round to the next lower integer. And Adams uses the ceiling function, which round to the next higher integer.

Now immediately you should scream: “STOP! What?! Adams uses the ceiling function?! So does this mean, that every party, that gets votes, gets at least one seat?!” The answer would be yesish. Yes, if your election doesn’t have an election threshold, every party, that gets votes, would at least get one seat.
“Well isn’t that incredibly unfavorable?” Yepp… But there are cases, where an allocation method like this could make sense. Not for regular election in my opinion, but for elections in parliaments. Let’s say, you have 20 mandates in a parliament, you want to distribute in a parliament with 300 elected politicians. Then the consideration, that it would be fair, if every party will get at least one mandate, could be made.

However the Adams method is incredibly uncommon. This Wikipedia article, which also served as my source for the method, only mentions the French parliament as example.

The D’Hondt method on the other side is pretty common. It’s actually the most common one in this year’s EU election.
And my source was also the corresponding German Wikipedia article.

Implementation D’Hondt Method

Luckily I don’t have to do much to implement those two methods. I just have to change a little bit about my function from last time.

seatAllocation <- function(votes, seats, roundMethod = round){
    ## calculate the initial divisor
    divisor <- roundMethod(sum(votes) / seats)

    ## get the initial seats per party
    seatsPerParty <- roundMethod(votes / divisor)

    ## if they already satisfy the seats to be assigned, return the seat allocation
    if(sum(seatsPerParty) == seats){
        return(seatsPerParty)
    }

    ## otherwise increment or decrement the divisor until 
    ## the result fits and then return it
    if(sum(seatsPerParty) < seats){
        while(sum(seatsPerParty) < seats){
            divisor = divisor - 1
            seatsPerParty <- roundMethod(votes / divisor)
        }
        return(seatsPerParty)
    }else{
         while(sum(seatsPerParty) > seats){
            divisor = divisor + 1
            seatsPerParty <- roundMethod(votes / divisor)
        }
        return(seatsPerParty = seatsPerParty)
    }

}

You see, what I did there? And why I love functional programming? Now by default, it’s the Sainte-Laguë/Schepers method and through giving the parameter roundMethod either the floor or ceiling function, we can make the D’Hondt and respectively Adams method out of it.
And we could even come up with some other rounding function in the future and use it.

The classic!

Test and Compare The Methods

And without further a due let’s test and compare the methods on our previous example.

votes <- c(AP = 11345, CVP = 563342, EP = 618713, OSP = 305952, PDP = 95001)
seatsSLS <- seatAllocation(votes, seats = 310, roundMethod = round)
seatsDH <- seatAllocation(votes, seats = 310, roundMethod = floor)
seatsA <- seatAllocation(votes, seats = 310, roundMethod = ceiling)

library(data.table)
DT <- rbind(data.table(party = names(seatsA), seats = seatsA, method = "Adams"), 
            data.table(party = names(seatsSLS), seats = seatsSLS, method = "Sainte-Laguë/Schepers"),
            data.table(party = names(seatsDH), seats = seatsDH, method = "D'Hondt"))

library(ggplot2)
g <- ggplot(DT, aes(x = party, y = seats, fill = method)) 
g <- g + geom_bar(stat = "identity", position = "dodge") 
g <- g + geom_text(aes(label=seats), position=position_dodge(width=0.9), vjust=-0.25)
g
Comparison of the different methods

Thanks, stackoverflow!
And you see… The actual difference isn’t big at all. The only thing one could say, is that Adams give a bonus to the small parties. D’Hondt method favors the big ones a bit. And Sainte-Laguë/Schepers is somehow in the middle.

And for me at least it’s really hard to say, which one is favorable. Sainte-Laguë/Schepers seems like a good compromise. However the differences more or less only affect small parties. But for them the difference is important. This doesn’t mean, that there’s no difference for large parties. On seat could mean the difference between majority and well… Not majority. Especially if you factor coalitions into the mix.
Maybe we will talk about possible problems in one of my next posts. I’m beginning to like this topic. I’m already thinking about becoming a lobbyist… lol.

Please follow and like us:
error0

Sainte-Laguë/Schepers method

For Allocation of Seats in the EU Parliament

On Monday I had a talk over Discord with Boris Biba, who himself runs a blog. We wanted to do a cooperation for some time. The focus of his blog are philosophy and politics. And as I told him, that I’m interested in crunching numbers, the comming EU elections are the perfect opportunity for a cooperation.
First we talked about doing something regarding the Wahl-O-Mat. Now in hindsight it was probably good that we decided for something else, as the Wahl-O-Mat was taken offline just today.
Then Boris brought up that he wanted to a post about the seat allocation method, which is called Sainte-Laguë/Schepers method, for German votes in the EU election. And I thought to myself, that this is wonderful, as voting is basically a paradigm for statistics. So I would be able to implement a small algorithm.

So be also sure to check out the post, which you can find here, from Boris, if you’re able to read German!

More otter selfies means more creative common selfies! Like this one.

What I’ll be doing in this post, is explain you the seat allocation method called Sainte-Laguë/Schepers and then give you a demonstrative example for it. And as an easteregg I throw in some election posters for the imaginary parties, I’ll use in the example. I created those posters with Adobe Spark.

As a main source for my post, I took the corresponding article from the German Wahl-Lexikon.

Description of the Method

So there are basically three variants of this method, which all deliver the same result.
Two of them work by ranking the voting result. The other one by simple division, which is the one used for the German part of the EU election. It is either called iterative or divisor method.

The simple idea behind this divisor method is to find a divisor for the voting result, which delivers you the right amount of total seats, if you divide the voting results by it and then round them by standard rounding.

To find the right divisor, first the total amount of votes is divided by the number of seats to be assigned.

\(
divisor = \frac{\#votesTotal}{\#seats}
\)

The for each party the number of votes is divided by this divisor.

\(
seatsParty_{i} = \frac{\#votesParty_{i}}{divisor}
\)

And if the sum of the seats of all parties matches up with the amount to be assigned, we’re already done!
If not, we have to either increment or decrement the divisor depending on, if we have to few or to many seats.

Just think about that… If you increase the divisor, the amount of seats shrinks. And vice versa if you decrease the divisor, the amount of seats increases.

And so the divisor is adjusted and the final seats per party are obtained.

Who wouldn’t be proud of such a beautiful plumage?

Implementation of the Sainte-Laguë/Schepers method

And of course it wouldn’t be me, if I wouldn’t also implement the method.
Here we go…

seatAllocation <- function(votes, seats){
    ## calculate the initial divisor
    divisor <- round(sum(votes) / seats)

    ## get the initial seats per party
    seatsPerParty <- round(votes / divisor)

    ## if they already satisfy the seats to be assigned, return the seat allocation
    if(sum(seatsPerParty) == seats){
        return(list(divisor = divisor, seatsPerParty = seatsPerParty))
    }

    ## otherwise increment or decrement the divisor until 
    ## the result fits and then return it
    if(sum(seatsPerParty) < seats){
        while(sum(seatsPerParty) < seats){
            divisor = divisor - 1
            seatsPerParty <- round(votes / divisor)
        }
        return(list(divisor = divisor, seatsPerParty = seatsPerParty))
    }else{
         while(sum(seatsPerParty) > seats){
            divisor = divisor + 1
            seatsPerParty <- round(votes / divisor)
        }
        return(list(divisor = divisor, seatsPerParty = seatsPerParty))
    }

}

The function is basically the same as what I described under the last point in plain text. As always, if you have some questions or remarks regarding my implementation feel free to write me a comment!

Poor agenda or marketing gag? Hmmmm….

Example with the Sainte-Laguë/Schepers method

Now to test the method, let’s just come up with some arbitrary voting result for our imaginary parties introduced earlier. And of course plot them as a pie chart!

votes <- c(AP = 11345, CVP = 563342, EP = 618713, OSP = 305952, PDP = 95001)
votesSorted <- sort(votes, decreasing = TRUE)
names(votesSorted) <- paste0(names(votesSorted), " (", 
                               format(votesSorted/sum(votesSorted) * 100, digits = 2), "%)")
pie(votesSorted)
Pie chart of the voting result

Subsequently, let’s test what result the method delivers and if the percentages match up approximately.

result <- seatAllocation(votes, 310)

OK, first let’s visualize the result. But let’s not use a pie chart again. Because to be honest they can be misleading. This time we will use a waffle chart, which displays the actual seats.
Of course we also need to do some preprocessing. We want the parties ordered after their size and we won’t their percentage of seats in the legend.

seatsPerParty <- result$seatsPerParty
seatsPerParty <- sort(seatsPerParty, decreasing = TRUE)
names(seatsPerParty) <- paste0(names(seatsPerParty), " (", 
                               format(seatsPerParty/sum(seatsPerParty) * 100, digits = 2), "%)")
waffle::waffle(seatsPerParty)
Waffle diagram for the seat allocation

Well, there’s some difference in the percentage, but that’s to be expected as you can’t distribute fractions of seats between the parties.

Outlook

Of course there are many other methods for allocating seats in an election. Some that are equivalent to this one and others that are not. And if you’re interesting in them, I would encourage you to write me.
If you like, we can look at a bunch of them an then compare them. And we could also take a look at things like overhang seat or different kinds of voting. I think it’s a nice topic for making plots.

By the way if you also wanna read this post in German, check the following link out!

Please follow and like us:
error0

Map Plots About the Global Burden of Disease

A practical example

Like promised in another post I will show you how to do a map plots with R. For this purpose I will use the ggmap package, which makes this a relatively easy task.
But before I begin with the actual code, let me give you a short motivation

Why to use map plots

Motivations for using map plots can be various. For example if you’re a journalist and let’s say you want to visualize a kind of events (like say earthquakes) in a regional context, this is a very demonstrative way of visualizing your data.
Or if you want to present some kind of data about places or countries map plots are always a good option.

The first time I did a map plot was actually part of an awesome lecture I had back in Munich at the TUM. Afterwards I got the chance to use this skill right away in the next semester for my Bachelor’s thesis.
As some of you might know the area, where I applied the algorithm which I improved and implemented for my thesis, was mental disorders. During the writing process of my thesis, I found it a good starting point in my thesis and the accompanying presentation to emphasize the prevalence of mental disorders in the world. In order to do so I used a map plot.
That’s basically also what I will do now, but this time with the case of cancer.

But on a side note I’m not saying that you should do those kinds of plots for each thesis or presentation regarding a disease topic. It’s just one possible starting point for it and not necessarily the best.
So please don’t just mindlessly copy, what I’m doing here. 🙂

Getting the data for dissease related map plots

First let’s load all the packages we will need for this little exercise.

library(XLConnect)
library(data.table)
library(ggplot2)
library(ggthemes)
library(maps)

XLConnect is a package for loading excel sheet, which we will need.
That I like to use data.table you probably already noticed. It’s just super fast and comfy for some procedures and it has some nice synergies with ggplot2.
The maps package contains as the name suggests map data, which can be used to plot. Alternatively one could also use ggmap.
And ggthemes contains a neat theme for maps, which I will use.

First let’s load our world map. This data.table contains region names and the boundaries of those regions as longitudes and latitudes. ggplot can plot those as polygons.

mapdata <- data.table(map_data("world"))
knitr::kable(mapdata[1:5])
longlatgrouporderregionsubregion
-69.8991212.4520011ArubaNA
-69.8957112.4230012ArubaNA
-69.9421912.4385313ArubaNA
-70.0041512.5004914ArubaNA
-70.0661212.5469715ArubaNA

OK, done. Now we need to download the data on 2004’s mortality from the WHO.

download.file("www.who.int/entity/healthinfo/global_burden_disease/gbddeathdalycountryestimates2004.xls", "gbd.xls")
tmp <- readWorksheetFromFile(file = "gbd.xls", sheet = "Deaths 2004")

causes <- tmp$Col1[14:143]

countries <- unname(tmp[6,7:198])
deathRates <- tmp[14:143,7:198]

You should probably take a look at the Excel file yourself to understand it and what I’m doing later. The file is made for humans to look at it and not directly for machines to read it. Which is why we have to do some cleaning and transforming. In my experience as a Bioinformatics students this is something you have to do almost always. Even if you have a machine readable format, there’s no perfect data-set. You will always have some missing data or have to transform your data in some way.
And this isn’t necessarily a trivial step. Often you will spend a lot of time here. And that’s good. If cleaning data was trivial, then we wouldn’t need data scientist.

Cleaning data

To begin with we have to transform the death rates to numeric values… Because they’re characters (strings) right now. For this purpose we have to also replace the separating comma at the thousand position. You see? What’s done to make the data more human readable, makes it less machine readable. That’s often the case.
Then we set the column names to the countries and transform the matrix together with the vector of causes to a data.table.

deathRatesNum <- matrix(as.numeric(gsub(",", "", as.matrix(deathRates))), nrow = dim(deathRates)[1])
## Warning in matrix(as.numeric(gsub(",", "", as.matrix(deathRates))), nrow =
## dim(deathRates)[1]): NAs introduced by coercion
colnames(deathRatesNum) <- countries
DT <- data.table(causes = causes, deathRatesNum)

Now we want a clean or also called long data-set. In this new data set we will have only three columns. Two variables (causes and region), which uniquely identify the value death rate.
Similar to a database we can also set those variable columns as keys, which makes it very fast searchable.

DTclean <- melt(DT, id.vars = "causes",  variable.name = "region", value.name = "deathRate")
setkey(DTclean, causes, region)

Next let us see, if we have some regions in our data.table that aren’t in our map.

DTclean[!region %in% mapdata$region, unique(region)]
##  [1] Antigua and Barbuda                      
##  [2] Brunei Darussalam                        
##  [3] Congo                                    
##  [4] Côte d'Ivoire                            
##  [5] Democratic People's Republic of Korea    
##  [6] Iran (Islamic Republic of)               
##  [7] Lao People's Democratic Republic         
##  [8] Libyan Arab Jamahiriya                   
##  [9] Micronesia (Federated States of)         
## [10] Republic of Korea                        
## [11] Republic of Moldova                      
## [12] Russian Federation                       
## [13] Saint Kitts and Nevis                    
## [14] Saint Vincent and the Grenadines         
## [15] Serbia and Montenegro                    
## [16] Syrian Arab Republic                     
## [17] The former Yugoslav Republic of Macedonia
## [18] Trinidad and Tobago                      
## [19] Tuvalu                                   
## [20] United Kingdom                           
## [21] United Republic of Tanzania              
## [22] United States of America                 
## [23] Venezuela (Bolivarian Republic of)       
## [24] Viet Nam                                 
## 192 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe

As expected, there are 24 regions from the WHO sheet not in the mapdata. Even though there’s probably a more elegant solution, I will change them manually. It’s a work that has to be done once. For this purpose it’s probably only necessary to fill it in for the big countries. So this is bearable.

DTclean[region == "Brunei Darussalam", region := "Brunei"]
DTclean[region == "Congo", region := "Republic of Congo"]
DTclean[region == "Democratic People's Republic of Korea", region := "North Korea"]
DTclean[region == "Iran (Islamic Republic of)", region := "Iran"]
DTclean[region == "Côte d'Ivoire", region := "Ivory Coast"]
DTclean[region == "Lao People's Democratic Republic", region := "Laos"]
DTclean[region == "Libyan Arab Jamahiriya", region := "Libya"]
DTclean[region == "The former Yugoslav Republic of Macedonia", region := "Macedonia"]
DTclean[region == "Micronesia (Federated States of)", region := "Micronesia"]
DTclean[region == "Republic of Moldova", region := "Moldova"]
DTclean[region == "Republic of Korea", region := "South Korea"]
DTclean[region == "Russian Federation", region := "Russia"]
DTclean[region == "Serbia and Montenegro", region := "Serbia"]
DTclean[region == "Syrian Arab Republic", region := "Syria"]
DTclean[region == "United Republic of Tanzania", region := "Tanzania"]
DTclean[region == "United Kingdom", region := "UK"]
DTclean[region == "United States of America", region := "USA"]
DTclean[region == "Venezuela (Bolivarian Republic of)", region := "Venezuela"]
DTclean[region == "Viet Nam", region := "Vietnam"]

And yea of course the work isn’t done completely yet. We also should check if there are regions in the mapdata, that aren’t in the WHO data-set. This could be due to various reasons… One being, that a region isn’t a member of the WHO and therefore the WHO doesn’t publish data on them.
Or more likely that a country from the WHO data-set span more than one region on the map, Serbia and Montenegro being such a case.
However I’m lazy now and I won’t do this today. How about you doing it and writing me a comment? 😛 Let it be a team1 effort.

Making the map plots

OK, before we do the actual plotting let’s first calculate for how much percentage of all deaths in each country cancer is the cause. In detail I do this by joining the data.table with itself.
On a side note: W000 is the WHO code for all death causes combined and W060 for Malignant neoplasms, which is a more formal name for cancer.

Then we need to join the data.table with the map on the region name.

DTcaused <- DTclean[causes == "W000"][DTclean[causes == "W060"], on = "region"][, .(region, percentageCaused = i.deathRate / deathRate)]

deathrateMap <- mapdata[DTcaused, on = "region", allow.cartesian=TRUE, nomatch = 0]

And finally we can do our plot. For this purpose we first plot all regions in grey and as overlay we fill the countries, that we have data on, with a color between grey and red depending on how frequent cancer as a death cause is.

g <- ggplot() + geom_polygon(data = mapdata,  aes(long, lat, group = group), fill = "grey")  
g <- g +  geom_polygon(data = deathrateMap,  aes(long, lat, group = group, fill = percentageCaused)) 
g <- g + scale_fill_gradient(low = "grey", high = "red", aesthetics = "fill", name = "Percentage of\ndeaths caused\nby cancer") 
g + ggthemes::theme_map()
World map showing on a scale from grey to red the percentage of deaths cancer is responsible for in different countries. The USA, Canada, Australia and Europe according to it have the highest death rates due to cancer, with up to 30 percentage.
World map showing on a scale from grey to red the percentage of deaths cancer is responsible for in different countries

And of course there’s one thing about this plot that could be misleading. Given that regions with missing data and very low prevalence of cancer deaths will both be grey, you hopefully see the potential problem here?
It’s not necessarily wrong or bad to do so. But I hope you recognize how someone could make a plot this way to mislead his audience. That’s why I recommend when it comes to looking at plots not only to think about, what is shown, but also what isn’t shown. Since no large data-set is complete… So ask the person who presents it to you, how she/he handled missing data points.

So what does this map actually say? From my perspective I don’t think anything surprising. At the moment, this data set captured, cancer was (and probably still is) mostly a problem of industrialized countries and it doesn’t seem to be connected to geography primarily (Can you see how Israel, Japan and South Korea pop up?).
Although the difference between the USA and Canada could be something interesting.
But this map, in my opinion, shows very clearly that cancer is one of the leading causes of death in the developed world, which also is the reason, why we also spend so much money on researching it.

However the main purpose of this post was to show you, how to make such plots and not discuss the reasons of different causes of mortality.
Ultimately I hope that this post has helped you.

Cite/mention your sources

Of course it is important that you mention your sources (cite them if you write a paper). This is because your approach has to be reproducible and you have to give those people, who did the preliminary work, credit for it.

In R you can get the proper citations for the packages you used the following way:

citation("ggmap")
## 
## To cite ggmap in publications, please use:
## 
##   D. Kahle and H. Wickham. ggmap: Spatial Visualization with
##   ggplot2. The R Journal, 5(1), 144-161. URL
##   http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     author = {David Kahle and Hadley Wickham},
##     title = {ggmap: Spatial Visualization with ggplot2},
##     journal = {The R Journal},
##     year = {2013},
##     volume = {5},
##     number = {1},
##     pages = {144--161},
##     url = {https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf},
##   }
citation("maps")
## 
## To cite package 'maps' in publications use:
## 
##   Original S code by Richard A. Becker, Allan R. Wilks. R version
##   by Ray Brownrigg. Enhancements by Thomas P Minka and Alex
##   Deckmyn. (2018). maps: Draw Geographical Maps. R package version
##   3.3.0. https://CRAN.R-project.org/package=maps
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {maps: Draw Geographical Maps},
##     author = {Original S code by Richard A. Becker and Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn.},
##     year = {2018},
##     note = {R package version 3.3.0},
##     url = {https://CRAN.R-project.org/package=maps},
##   }
## 
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

You get the idea. Also cite the other packages, if you use them in your publication or thesis.
The output is in bibtex format. So I hope you know what to do with it. 😛

Of course the data on the global burden of disease you have to cite as well. Thus I’ll give you the formatted citation for it:

WHO. (2004). The global burden of disease: 2004 update: causes of death. 2004 Update, 8–26.

And last, but not least, please also mention me. This however is not a necessity, but a sign of respect towards my work. By all means respect is an important thing, unfortunately not often enough given in our society.

Please follow and like us:
error0