Sometimes you come across a plot that looks like the following:
And you might think:
Something does not look right but I have no idea what is going on here
And that’s OK.
So, what’s the problem with the plot, and how do you solve it?
Well, the problem is we have these “sawtooth” patterns in the data, where the data goes up and down.
Typically, we can solve this problem by including some grouping characteristic in the data visualisation.
It is also worth noting that this doesn’t always mean a plot is bad - this could actually be the exact type of plot that you might expect to see (for example in a time series with very high periodicity, perhaps).
But in our case, we first need to understand what our data is and what we expect to see. We are looking at ozbabynames, the names given at birth to people in Australia, so we are plotting how many babies were given a name in each year. In our example we can look at the occurrences of the name “Kim”, like so:
library(ggplot2)
# oz_kim holds the ozbabynames rows for the name "Kim"
ggplot(oz_kim,
       aes(x = year,
           y = count)) +
  geom_line()
We don’t expect the name “Kim” to suddenly crash down each year, especially since this looks to be an exact vertical drop.
So what do we do?
This visualisation problem often means that some grouping characteristic is missing from the graphic. In this case, “sex” is in the data but is not shown in the plot. Showing it as colour, we get:
library(colorspace)
ggplot(oz_kim,
       aes(x = year,
           y = count,
           colour = sex)) +
  geom_line() +
  scale_colour_discrete_qualitative()
So we see that there are still some sawtooth patterns going on. Let’s look at the data to see if there are other variables we are missing:
oz_kim
#> # A tibble: 164 x 5
#>    name  sex     year count state
#>    <chr> <chr>  <int> <int> <chr>
#>  1 Kim   Female  2017     2 South Australia
#>  2 Kim   Female  2016     1 South Australia
#>  3 Kim   Female  2015     1 South Australia
#>  4 Kim   Female  2014     2 South Australia
#>  5 Kim   Male    2014     1 South Australia
#>  6 Kim   Female  2012     1 South Australia
#>  7 Kim   Female  2011     3 South Australia
#>  8 Kim   Female  2010     1 South Australia
#>  9 Kim   Female  2009     1 South Australia
#> 10 Kim   Female  2008     3 South Australia
#> # … with 154 more rows
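Aha! We can see that there is another grouping characteristic going on: state. Each year can have one row per sex and per state, so colouring by sex alone still leaves several points per year on the same line. Let’s facet the graph for each state, giving us: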
ggplot(oz_kim,
       aes(x = year,
           y = count,
           colour = sex)) +
  geom_line() +
  facet_wrap(~state) +
  scale_colour_discrete_qualitative()
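A quick way to confirm that a grouping variable is missing, before plotting anything, is to count how many rows share each x value. The sketch below is only an illustration: toy_names is a small made-up data frame with hypothetical counts (not the real ozbabynames data), and it uses base R only. If any year has more than one row, geom_line() has to join several points at the same x position, which is what produces the vertical jumps.

```r
# Toy data with made-up counts: two sexes and two states for each year.
# toy_names is a hypothetical stand-in, not the real ozbabynames data.
toy_names <- data.frame(
  year  = rep(2015:2017, each = 4),
  sex   = rep(c("Female", "Male"), times = 6),
  state = rep(rep(c("South Australia", "Victoria"), each = 2), times = 3),
  count = c(2, 1, 5, 3, 1, 1, 4, 2, 2, 1, 6, 2)
)

# Rows per year, ignoring everything else.
# More than one row for a year means a single line will zig-zag.
rows_per_year <- table(toy_names$year)
any(rows_per_year > 1)

# Rows per year within each sex and state combination.
# If no cell exceeds 1, sex and state together identify one line each.
rows_per_group <- table(toy_names$year, toy_names$sex, toy_names$state)
all(rows_per_group <= 1)
```

If the first check is TRUE and the second is TRUE as well, the variables used in the second table are good candidates for colour, group, or facets.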
Setting group correctly (Addition as of 2020/06/22)
Emma Vitz shared an interesting example of another sawtooth-type problem on Twitter:
Hey #rstats, why do I get this weird single line that's made up of 2 colours rather than 2 separate lines when I use ggplot? If I filter the data to one gender it works fine (2nd screenshot). If I remove the group = 1 it gives me an error and a blank plot. pic.twitter.com/cNZDrVv1pZ
- Emma Vitz (@EmmaVitz), June 15, 2020
The solution was discussed in the thread, but let’s unpack this. Let’s first recreate the data used (taken by eyeballing the graphic):
library(tibble)
pageviews <- tibble(
  age = factor(rep(c("18-24",
                     "25-34",
                     "35-44",
                     "45-54",
                     "55-64",
                     "65"), 2)),
  gender = factor(x = c(rep("Female", 6),
                        rep("Male", 6))),
  pageviews = c(2750, 4200, 1750, 750, 450, 500,
                2500, 4200, 900, 350, 180, 150)
)
pageviews
#> # A tibble: 12 x 3
#>    age   gender pageviews
#>    <fct> <fct>      <dbl>
#>  1 18-24 Female      2750
#>  2 25-34 Female      4200
#>  3 35-44 Female      1750
#>  4 45-54 Female       750
#>  5 55-64 Female       450
#>  6 65    Female       500
#>  7 18-24 Male        2500
#>  8 25-34 Male        4200
#>  9 35-44 Male         900
#> 10 45-54 Male         350
#> 11 55-64 Male         180
#> 12 65    Male         150
So here is the warning given for the first of Emma’s plots:
ggplot(pageviews,
       aes(x = age,
           y = pageviews,
           colour = gender)) +
  geom_line()
#> geom_path: Each group consists of only one observation. Do you need to adjust
#> the group aesthetic?
What to do? One way to get the lines to appear is to set group = 1:
ggplot(pageviews,
       aes(x = age,
           y = pageviews,
           colour = gender,
           group = 1)) +
  geom_line()
But then we get this: a single line made up of two colours. That isn’t ideal. The solution proposed on Twitter was to set group = gender as well as colour = gender.
ggplot(pageviews,
       aes(x = age,
           y = pageviews,
           colour = gender,
           group = gender)) +
  geom_line()
The answer was provided by Peter Green, who said:
Looks like since x=age is a factor, ggplot is “helpfully” making age the group instead of gender? Which would explain why fixing it with the explicit group=gender works?
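My take on this is that the two factors together cause the problem. By default, ggplot groups the data by the interaction of all discrete variables in the plot. Here both age (on x) and gender (on colour) are factors, so every combination of age and gender becomes its own group with a single observation, and geom_line() has nothing to connect. That is also why the message above asks about the group aesthetic. By setting group explicitly, you get the right plot.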
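To see why the explicit group helps, it can be useful to count the groups by hand. The ggplot2 documentation describes the default group as the interaction of all discrete variables in the plot. The sketch below assumes that behaviour and mimics it with base R's interaction(). It is an illustration of the idea, not a look at ggplot2's internals, and it rebuilds the data with data.frame() so that it runs without any packages.

```r
# Recreate the pageviews data from above using base R only.
pageviews <- data.frame(
  age = factor(rep(c("18-24", "25-34", "35-44", "45-54", "55-64", "65"), 2)),
  gender = factor(rep(c("Female", "Male"), each = 6)),
  pageviews = c(2750, 4200, 1750, 750, 450, 500,
                2500, 4200, 900, 350, 180, 150)
)

# Assumed default grouping: one group per combination of the discrete
# variables (age on x, gender on colour).
default_groups <- interaction(pageviews$age, pageviews$gender)
n_default_groups <- nlevels(default_groups)       # how many groups there are
largest_default  <- max(table(default_groups))    # rows in the biggest group

# Explicit grouping by gender only: how many rows each line would get.
rows_per_gender <- table(pageviews$gender)

n_default_groups
largest_default
rows_per_gender
```

When the largest default group holds only one row, no group has two points to join, which matches the message about each group consisting of only one observation. Grouping by gender alone gives each line one point per age band.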
Wrapping Up
So, how to remove sawtooth patterns in a plot?
- Understand what sort of graphic you are expecting
- Explore and potentially include all grouping features in the graphic
- Ensure that if you have factors in some of your aesthetics (x, y, colour, size), you set group to the right variable in your dataset.