We all like R: a free statistical package with loads of supporting libraries, one of which is ggplot. I’m fairly new to ggplot but it does seem to produce *very* pretty visualisations. It has a built-in way to add a ribbon to a line plot, say if you want to show standard deviations on a series of points. Unfortunately this only works if each point on the line has a matching pair of values specifying how far away the ribbon should be. What if you have 3 sets of values (lower, middle and upper) that don’t necessarily have matching numbers of data points? I had just this issue. Here’s how I solved it, and made a visualisation like this:
While writing up a piece of work on multi-objective optimisation recently, I needed to present summary attainment curves graphically. The curves I had were generated from my data using Joshua Knowles’ tool at the link above (I’m calling these curves rather than surfaces because there are only two objectives). These are like Pareto fronts, but subtly different. Say you have done 30 repeat optimisation runs: the 30th attainment curve shows the area of the objective space reached by all 30 runs (so the worst-case performance for your algorithm). The 1st attainment curve shows the region reached by only 1 run (the best case), and the 15th attainment curve shows the region reached by 15 runs (something like a median performance). I wanted to show these as ribbons on a plot: the median curve represented by a line and the min/max curves shown by a shaded area around the line. Most of the automated ways to do this assume a data frame with standard deviations in it, so the boundaries of the ribbon have the same number of points as the line. Different attainment curves, however, can have different numbers of points.
The solution was to draw two shaded polygons, bounded by the min/median curves and the median/max curves. To do this, we need to reverse the points in one of the curves, so we can go “down+right” with one curve, then back “up+left” with the other one. We then add the median curve as an ordinary line plot on top afterwards.
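To make the idea concrete, here is a minimal sketch using made-up toy curves rather than the real attainment data. The names median_curve, best_curve and band are just for illustration. The two curves deliberately have different numbers of points, which is exactly the case a ribbon can’t handle.

```r
library(ggplot2)

# Toy curves (made-up values): the median has 4 points, the best-case curve has 3.
median_curve <- data.frame(x = c(1, 2, 3, 4), y = c(8, 6, 4, 3))
best_curve   <- data.frame(x = c(0.5, 1.5, 3.5), y = c(7, 4, 2))

# Walk along the median curve, then back along the other curve in reverse,
# so the points trace one closed outline.
band <- rbind(median_curve, best_curve[rev(seq_len(nrow(best_curve))), ])

p <- ggplot() +
  geom_polygon(data = band, aes(x = x, y = y), fill = "blue", alpha = 0.3) +
  geom_path(data = median_curve, aes(x = x, y = y), colour = "blue")
print(p)
```

geom_polygon() closes the outline itself by joining the last point back to the first, so there is no need to repeat the starting point.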
First up, here’s the full example code. If you want to try this yourself you’ll need the example data I’ve provided: attcurve_data.
library(ggplot2)

# load data
seeds <- read.table("seeds.txt", sep="\t")
Twostage_att_1 <- read.table("Twostage_att_1.dat")
Naive_att_1 <- read.table("Naive_att_1.dat")
Twostage_att_15 <- read.table("Twostage_att_15.dat")
Naive_att_15 <- read.table("Naive_att_15.dat")
Twostage_att_30 <- read.table("Twostage_att_30.dat")
Naive_att_30 <- read.table("Naive_att_30.dat")

# sorting - see http://www.ats.ucla.edu/stat/r/faq/sort.htm
sorted_seeds <- seeds[order(seeds$V1),]
sorted_Twostage_att_1 <- Twostage_att_1[order(Twostage_att_1$V1,-Twostage_att_1$V2),]
sorted_Naive_att_1 <- Naive_att_1[order(Naive_att_1$V1,-Naive_att_1$V2),]
sorted_Twostage_att_15 <- Twostage_att_15[order(Twostage_att_15$V1,-Twostage_att_15$V2),]
sorted_Naive_att_15 <- Naive_att_15[order(Naive_att_15$V1,-Naive_att_15$V2),]
sorted_Twostage_att_30 <- Twostage_att_30[order(Twostage_att_30$V1,-Twostage_att_30$V2),]
sorted_Naive_att_30 <- Naive_att_30[order(Naive_att_30$V1,-Naive_att_30$V2),]

# combine coordinates for lower and upper polygons
poly_df_lower_sorted_Twostage_att_all <- rbind(
  setNames(sorted_Twostage_att_15[,1:2],c('x','y')),
  setNames(sorted_Twostage_att_1[length(sorted_Twostage_att_1[,1]):1,1:2],c('x','y')))
poly_df_lower_sorted_Naive_att_all <- rbind(
  setNames(sorted_Naive_att_15[,1:2],c('x','y')),
  setNames(sorted_Naive_att_1[length(sorted_Naive_att_1[,1]):1,1:2],c('x','y')))
poly_df_upper_sorted_Twostage_att_all <- rbind(
  setNames(sorted_Twostage_att_15[,1:2],c('x','y')),
  setNames(sorted_Twostage_att_30[length(sorted_Twostage_att_30[,1]):1,1:2],c('x','y')))
poly_df_upper_sorted_Naive_att_all <- rbind(
  setNames(sorted_Naive_att_15[,1:2],c('x','y')),
  setNames(sorted_Naive_att_30[length(sorted_Naive_att_30[,1]):1,1:2],c('x','y')))

# line with shading on bounds and various line styles
# dashed = moead
# orange = seeded naive, blue = seeded 2stage
ggplot() +
  geom_polygon(data = poly_df_lower_sorted_Twostage_att_all,aes(x = x, y = y),fill = "blue", alpha=0.3) +
  geom_polygon(data = poly_df_lower_sorted_Naive_att_all,aes(x = x, y = y),fill = "orange", alpha=0.3) +
  geom_polygon(data = poly_df_upper_sorted_Twostage_att_all,aes(x = x, y = y),fill = "blue", alpha=0.3) +
  geom_polygon(data = poly_df_upper_sorted_Naive_att_all,aes(x = x, y = y),fill = "orange", alpha=0.3) +
  geom_path(aes(sorted_Twostage_att_15$V1,sorted_Twostage_att_15$V2,colour="bms",linetype="bms")) +
  geom_path(aes(sorted_Naive_att_15$V1,sorted_Naive_att_15$V2,colour="nms",linetype="nms")) +
  geom_point(aes(sorted_seeds$V1,sorted_seeds$V2,shape="seeds")) +
  xlab("Energy (GWh)") +
  ylab("Cost (£million)") +
  theme(legend.position = c(0.7, 0.7), legend.box='horizontal') +
  scale_linetype_manual(name = 'Attainment Curves',
    values=c('bms'='dashed','nms'='dashed'),
    labels = c('bms'='2-stage MOEA/D Seeded','nms'='Naive MOEA/D Seeded')) +
  scale_colour_manual(name = 'Attainment Curves',
    values =c('bms'='blue','nms'='orange'),
    labels = c('bms'='2-stage MOEA/D Seeded','nms'='Naive MOEA/D Seeded')) +
  scale_shape_manual(name = 'Seeds', values=c('seeds'=16),labels=c('seeds'='Seeds group 1')) +
  scale_x_continuous(labels=function(x)x/1000000) +
  scale_y_continuous(labels=function(y)y/1000000)
What’s going on? Well, the first section is easy: it just loads the files into data frames.
Next up, we sort each of the data sets. As there might be some data points that are tied in one of the objectives, we sort ascending by the first objective, then descending by the second, like this:
sorted_Twostage_att_1 <- Twostage_att_1[order(Twostage_att_1$V1,-Twostage_att_1$V2),]
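To see the tie-break in isolation, here is a small sketch on made-up data. The data frame pts is hypothetical, though the V1/V2 column names match what read.table() produces by default for files without a header.

```r
# Toy data with a tie in the first objective (V1 = 2 appears twice).
pts <- data.frame(V1 = c(3, 2, 1, 2), V2 = c(1, 5, 9, 4))

# Ascending by V1; ties in V1 are broken by descending V2.
sorted_pts <- pts[order(pts$V1, -pts$V2), ]
print(sorted_pts)
```

With the tie broken this way, the two points at V1 = 2 come out as V2 = 5 and then V2 = 4, so a path drawn through the sorted points keeps heading downwards instead of doubling back on itself.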
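The rbind command is then used to join the pairs of top/median and median/bottom curves together into a series of points that we’ll use for the polygons. The second set of points added uses length(data[,1]):1 to select those points in reverse order, like this:
poly_df_lower_sorted_Twostage_att_all <- rbind(setNames(sorted_Twostage_att_15[,1:2],c('x','y')), setNames(sorted_Twostage_att_1[length(sorted_Twostage_att_1[,1]):1,1:2],c('x','y')))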
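Since the same rbind pattern is repeated four times, it could be wrapped in a small function. The sketch below is one possible way to do that. band_polygon is a hypothetical helper, not part of the original script, and it assumes the first two columns of each data frame hold the two objectives.

```r
# Hypothetical helper: join two sorted curves into one polygon outline.
# Assumes columns 1 and 2 of each data frame hold the two objectives.
band_polygon <- function(mid, bound) {
  forward  <- setNames(mid[, 1:2], c("x", "y"))
  # rev(seq_len(n)) reverses the rows and, unlike n:1, is safe when n is 0.
  backward <- setNames(bound[rev(seq_len(nrow(bound))), 1:2], c("x", "y"))
  rbind(forward, backward)
}

# Toy stand-ins (made-up values) for the sorted data frames.
mid   <- data.frame(V1 = c(1, 2, 3), V2 = c(6, 4, 2))
bound <- data.frame(V1 = c(2, 4), V2 = c(7, 3))
poly  <- band_polygon(mid, bound)
print(poly)
```

With the real data, each polygon would then be a one-liner such as band_polygon(sorted_Twostage_att_15, sorted_Twostage_att_1).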
Finally, we build up our ggplot layers, using geom_polygon() to draw the polygons, geom_path() to draw the lines for the median curves, and geom_point() to add a separate plot of points for comparison. Last of all, the axes and legends are configured.
Also a few things I learned mainly by trial and error:
- You’ll see I have two separate legends. This is maybe unnecessary for the example data but the plot I’m putting into my paper did need it. The legends are actually used by ggplot to specify colour, shading and line types as well – so one of the legends (for the attainment curves) is specified using two separate lines of code. Getting these right was a nightmare! You MUST:
- Specify the names of legends to be combined as one exactly the same (case and whitespace too)
- Specify the keys in the “values” and “labels” part of each legend definition perfectly: they must be identical to each other, and to those in the colour/linetype attributes further up, and the set of keys for the legends to be merged must be identical in each legend.
- Likewise if a polygon/line appears in multiple legends that are to be merged, the labels themselves must be identical (‘2-stage MOEA/D Seeded’ and ‘Naive MOEA/D Seeded’ in my case). In practice this means that the labels attribute for different legends to be merged will be completely identical.
- It doesn’t seem to be possible to add the shading colour for the polygons to the legend without messing something else up. Maybe there’s a way but I’ve not found it yet.
- Use an alpha <1 so that overlapping polygons don’t obscure each other
- Use scale_x_continuous(labels=function(x)x/1000000) (and the scale_y_continuous equivalent) to transform the scales
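- Moving the legends around with the code below seems to move them both together. legend.box controls whether they are side-by-side or arranged vertically. (From ggplot2 3.5.0 onwards a numeric legend.position is deprecated in favour of legend.position = "inside" together with legend.position.inside = c(0.7, 0.7).)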
legend.position = c(0.7, 0.7), legend.box='horizontal'
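The legend-merging rules above can be tried out on a small example. This is only a sketch with made-up data. The keys "a" and "b" and the labels "Method A" and "Method B" are placeholders, and storing the shared name and labels in variables is just one way to keep them identical across the two scales.

```r
library(ggplot2)

# Toy data (made-up values); "a" and "b" are hypothetical legend keys.
d <- data.frame(x = c(1, 2, 3, 1, 2, 3),
                y = c(3, 2, 1, 4, 3, 2),
                key = rep(c("a", "b"), each = 3))

# Define the name and labels once so both scales get exactly the same values.
legend_name   <- "Attainment Curves"
legend_labels <- c(a = "Method A", b = "Method B")

p <- ggplot(d, aes(x = x, y = y, colour = key, linetype = key)) +
  geom_path() +
  scale_colour_manual(name = legend_name, labels = legend_labels,
                      values = c(a = "blue", b = "orange")) +
  scale_linetype_manual(name = legend_name, labels = legend_labels,
                        values = c(a = "dashed", b = "dotted"))
print(p)
```

Because the name, keys and labels match exactly, ggplot should draw a single legend showing both colour and line type. Changing one character in either name splits it back into two.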
Finally, some credit is due to these other pages which got me most of the way to solving this problem: