# A practical example

Like promised in another post I will show you how to do a map plots with R. For this purpose I will use the ggmap package, which makes this a relatively easy task.
But before I begin with the actual code, let me give you a short motivation

## Why to use map plots

Motivations for using map plots can be various. For example if you’re a journalist and let’s say you want to visualize a kind of events (like say earthquakes) in a regional context, this is a very demonstrative way of visualizing your data.
Or if you want to present some kind of data about places or countries map plots are always a good option.

The first time I did a map plot was actually part of an awesome lecture I had back in Munich at the TUM. Afterwards I got the chance to use this skill right away in the next semester for my Bachelor’s thesis.
As some of you might know the area, where I applied the algorithm which I improved and implemented for my thesis, was mental disorders. During the writing process of my thesis, I found it a good starting point in my thesis and the accompanying presentation to emphasize the prevalence of mental disorders in the world. In order to do so I used a map plot.
That’s basically also what I will do now, but this time with the case of cancer.

But on a side note I’m not saying that you should do those kinds of plots for each thesis or presentation regarding a disease topic. It’s just one possible starting point for it and not necessarily the best.
So please don’t just mindlessly copy, what I’m doing here. 🙂

## Getting the data for dissease related map plots

First let’s load all the packages we will need for this little exercise.

library(XLConnect)
library(data.table)
library(ggplot2)
library(ggthemes)
library(maps)


XLConnect is a package for loading excel sheet, which we will need.
That I like to use data.table you probably already noticed. It’s just super fast and comfy for some procedures and it has some nice synergies with ggplot2.
The maps package contains as the name suggests map data, which can be used to plot. Alternatively one could also use ggmap.
And ggthemes contains a neat theme for maps, which I will use.

First let’s load our world map. This data.table contains region names and the boundaries of those regions as longitudes and latitudes. ggplot can plot those as polygons.

mapdata <- data.table(map_data("world"))
knitr::kable(mapdata[1:5])

longlatgrouporderregionsubregion
-69.8991212.4520011ArubaNA
-69.8957112.4230012ArubaNA
-69.9421912.4385313ArubaNA
-70.0041512.5004914ArubaNA
-70.0661212.5469715ArubaNA

OK, done. Now we need to download the data on 2004’s mortality from the WHO.

download.file("www.who.int/entity/healthinfo/global_burden_disease/gbddeathdalycountryestimates2004.xls", "gbd.xls")
tmp <- readWorksheetFromFile(file = "gbd.xls", sheet = "Deaths 2004")

causes <- tmp$Col1[14:143] countries <- unname(tmp[6,7:198]) deathRates <- tmp[14:143,7:198]  You should probably take a look at the Excel file yourself to understand it and what I’m doing later. The file is made for humans to look at it and not directly for machines to read it. Which is why we have to do some cleaning and transforming. In my experience as a Bioinformatics students this is something you have to do almost always. Even if you have a machine readable format, there’s no perfect data-set. You will always have some missing data or have to transform your data in some way. And this isn’t necessarily a trivial step. Often you will spend a lot of time here. And that’s good. If cleaning data was trivial, then we wouldn’t need data scientist. ## Cleaning data To begin with we have to transform the death rates to numeric values… Because they’re characters (strings) right now. For this purpose we have to also replace the separating comma at the thousand position. You see? What’s done to make the data more human readable, makes it less machine readable. That’s often the case. Then we set the column names to the countries and transform the matrix together with the vector of causes to a data.table. deathRatesNum <- matrix(as.numeric(gsub(",", "", as.matrix(deathRates))), nrow = dim(deathRates)[1])  ## Warning in matrix(as.numeric(gsub(",", "", as.matrix(deathRates))), nrow = ## dim(deathRates)[1]): NAs introduced by coercion  colnames(deathRatesNum) <- countries DT <- data.table(causes = causes, deathRatesNum)  Now we want a clean or also called long data-set. In this new data set we will have only three columns. Two variables (causes and region), which uniquely identify the value death rate. Similar to a database we can also set those variable columns as keys, which makes it very fast searchable. DTclean <- melt(DT, id.vars = "causes", variable.name = "region", value.name = "deathRate") setkey(DTclean, causes, region)  Next let us see, if we have some regions in our data.table that aren’t in our map. DTclean[!region %in% mapdata$region, unique(region)]

##  [1] Antigua and Barbuda
##  [2] Brunei Darussalam
##  [3] Congo
##  [4] Côte d'Ivoire
##  [5] Democratic People's Republic of Korea
##  [6] Iran (Islamic Republic of)
##  [7] Lao People's Democratic Republic
##  [8] Libyan Arab Jamahiriya
##  [9] Micronesia (Federated States of)
## [10] Republic of Korea
## [11] Republic of Moldova
## [12] Russian Federation
## [13] Saint Kitts and Nevis
## [14] Saint Vincent and the Grenadines
## [15] Serbia and Montenegro
## [16] Syrian Arab Republic
## [17] The former Yugoslav Republic of Macedonia
## [19] Tuvalu
## [20] United Kingdom
## [21] United Republic of Tanzania
## [22] United States of America
## [23] Venezuela (Bolivarian Republic of)
## [24] Viet Nam
## 192 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe


As expected, there are 24 regions from the WHO sheet not in the mapdata. Even though there’s probably a more elegant solution, I will change them manually. It’s a work that has to be done once. For this purpose it’s probably only necessary to fill it in for the big countries. So this is bearable.

DTclean[region == "Brunei Darussalam", region := "Brunei"]
DTclean[region == "Congo", region := "Republic of Congo"]
DTclean[region == "Democratic People's Republic of Korea", region := "North Korea"]
DTclean[region == "Iran (Islamic Republic of)", region := "Iran"]
DTclean[region == "Côte d'Ivoire", region := "Ivory Coast"]
DTclean[region == "Lao People's Democratic Republic", region := "Laos"]
DTclean[region == "Libyan Arab Jamahiriya", region := "Libya"]
DTclean[region == "The former Yugoslav Republic of Macedonia", region := "Macedonia"]
DTclean[region == "Micronesia (Federated States of)", region := "Micronesia"]
DTclean[region == "Republic of Moldova", region := "Moldova"]
DTclean[region == "Republic of Korea", region := "South Korea"]
DTclean[region == "Russian Federation", region := "Russia"]
DTclean[region == "Serbia and Montenegro", region := "Serbia"]
DTclean[region == "Syrian Arab Republic", region := "Syria"]
DTclean[region == "United Republic of Tanzania", region := "Tanzania"]
DTclean[region == "United Kingdom", region := "UK"]
DTclean[region == "United States of America", region := "USA"]
DTclean[region == "Venezuela (Bolivarian Republic of)", region := "Venezuela"]
DTclean[region == "Viet Nam", region := "Vietnam"]


And yea of course the work isn’t done completely yet. We also should check if there are regions in the mapdata, that aren’t in the WHO data-set. This could be due to various reasons… One being, that a region isn’t a member of the WHO and therefore the WHO doesn’t publish data on them.
Or more likely that a country from the WHO data-set span more than one region on the map, Serbia and Montenegro being such a case.
However I’m lazy now and I won’t do this today. How about you doing it and writing me a comment? 😛 Let it be a team1 effort.

## Making the map plots

OK, before we do the actual plotting let’s first calculate for how much percentage of all deaths in each country cancer is the cause. In detail I do this by joining the data.table with itself.
On a side note: W000 is the WHO code for all death causes combined and W060 for Malignant neoplasms, which is a more formal name for cancer.

Then we need to join the data.table with the map on the region name.

DTcaused <- DTclean[causes == "W000"][DTclean[causes == "W060"], on = "region"][, .(region, percentageCaused = i.deathRate / deathRate)]

deathrateMap <- mapdata[DTcaused, on = "region", allow.cartesian=TRUE, nomatch = 0]


And finally we can do our plot. For this purpose we first plot all regions in grey and as overlay we fill the countries, that we have data on, with a color between grey and red depending on how frequent cancer as a death cause is.

g <- ggplot() + geom_polygon(data = mapdata,  aes(long, lat, group = group), fill = "grey")
g <- g +  geom_polygon(data = deathrateMap,  aes(long, lat, group = group, fill = percentageCaused))
g <- g + scale_fill_gradient(low = "grey", high = "red", aesthetics = "fill", name = "Percentage of\ndeaths caused\nby cancer")
g + ggthemes::theme_map()


And of course there’s one thing about this plot that could be misleading. Given that regions with missing data and very low prevalence of cancer deaths will both be grey, you hopefully see the potential problem here?
It’s not necessarily wrong or bad to do so. But I hope you recognize how someone could make a plot this way to mislead his audience. That’s why I recommend when it comes to looking at plots not only to think about, what is shown, but also what isn’t shown. Since no large data-set is complete… So ask the person who presents it to you, how she/he handled missing data points.

So what does this map actually say? From my perspective I don’t think anything surprising. At the moment, this data set captured, cancer was (and probably still is) mostly a problem of industrialized countries and it doesn’t seem to be connected to geography primarily (Can you see how Israel, Japan and South Korea pop up?).
Although the difference between the USA and Canada could be something interesting.
But this map, in my opinion, shows very clearly that cancer is one of the leading causes of death in the developed world, which also is the reason, why we also spend so much money on researching it.

However the main purpose of this post was to show you, how to make such plots and not discuss the reasons of different causes of mortality.
Ultimately I hope that this post has helped you.

Of course it is important that you mention your sources (cite them if you write a paper). This is because your approach has to be reproducible and you have to give those people, who did the preliminary work, credit for it.

In R you can get the proper citations for the packages you used the following way:

citation("ggmap")

##
## To cite ggmap in publications, please use:
##
##   D. Kahle and H. Wickham. ggmap: Spatial Visualization with
##   ggplot2. The R Journal, 5(1), 144-161. URL
##   http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
##
## A BibTeX entry for LaTeX users is
##
##   @Article{,
##     author = {David Kahle and Hadley Wickham},
##     title = {ggmap: Spatial Visualization with ggplot2},
##     journal = {The R Journal},
##     year = {2013},
##     volume = {5},
##     number = {1},
##     pages = {144--161},
##     url = {https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf},
##   }

citation("maps")

##
## To cite package 'maps' in publications use:
##
##   Original S code by Richard A. Becker, Allan R. Wilks. R version
##   by Ray Brownrigg. Enhancements by Thomas P Minka and Alex
##   Deckmyn. (2018). maps: Draw Geographical Maps. R package version
##   3.3.0. https://CRAN.R-project.org/package=maps
##
## A BibTeX entry for LaTeX users is
##
##   @Manual{,
##     title = {maps: Draw Geographical Maps},
##     author = {Original S code by Richard A. Becker and Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn.},
##     year = {2018},
##     note = {R package version 3.3.0},
##     url = {https://CRAN.R-project.org/package=maps},
##   }
##
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.


You get the idea. Also cite the other packages, if you use them in your publication or thesis.
The output is in bibtex format. So I hope you know what to do with it. 😛

Of course the data on the global burden of disease you have to cite as well. Thus I’ll give you the formatted citation for it:

WHO. (2004). The global burden of disease: 2004 update: causes of death. 2004 Update, 8–26.

And last, but not least, please also mention me. This however is not a necessity, but a sign of respect towards my work. By all means respect is an important thing, unfortunately not often enough given in our society.