Credibly Curious

Nick Tierney's (mostly) rstats blog

2020-06-14

Just Quickly: Removing Sawtooth Patterns in Line Graphs

Nicholas Tierney

Categories: rstats data visualisation Tags: rstats

Sometimes you come across a plot that looks like the following:

And you might think:

Something does not look right but I have no idea what is going on here

And that’s OK.

So, what’s the problem with the plot, and how do you solve it?

Well, the problem is we have these “sawtooth” patterns in the data, where the data goes up and down.

Typically, we can solve this problem by including a grouping characteristic in the data visualisation.

It is also worth noting that this doesn’t always mean a plot is bad - this could actually be the exact type of plot that you might expect to see (for example in a time series with very high periodicity, perhaps).

But, in our case, we need to understand what our data is first, and what we expect. We are looking at ozbabynames - the names at birth of people in Australia. So we are plotting the number of babies given a particular name in each year. In our example we can look at the occurrences of the name, “Kim”, like so:

library(ggplot2)

# oz_kim is assumed to hold the ozbabynames rows for the name "Kim"
ggplot(oz_kim,
       aes(x = year,
           y = count)) + 
  geom_line()

We don’t expect the name “Kim” to suddenly crash down each year - especially since this looks to be an exact vertical drop.
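A quick way to confirm this kind of problem is to count how many rows share each x value. The sketch below uses a small made-up tibble (toy_names is a hypothetical name, and the numbers are not taken from ozbabynames) with similar columns to the data in this post. Any year that appears on more than one row gives geom_line() two points at the same x position, which it joins with a vertical segment.

```r
library(dplyr)
library(tibble)

# Made-up data for illustration only (not taken from ozbabynames)
toy_names <- tibble(
  year  = c(2010, 2010, 2011, 2011, 2012),
  sex   = c("Female", "Male", "Female", "Male", "Female"),
  count = c(3, 1, 2, 1, 4)
)

# Keep only the years that appear on more than one row
rows_per_year <- toy_names %>%
  count(year, name = "n_rows") %>%
  filter(n_rows > 1)

rows_per_year
```

If this returns any rows, there is probably a variable in the data that distinguishes the repeated years and is missing from the plot.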

So what do we do?

This visualisation problem often means there is some grouping characteristic missing from the graphic. For example, in this case, “sex” is not shown in the plot. In showing it, we get:

library(colorspace)

ggplot(oz_kim,
       aes(x = year,
           y = count,
           colour = sex)) + 
  geom_line() +
  scale_colour_discrete_qualitative()

So we see that there are still some sawtooth patterns going on. Let’s look at the data to see if there are other variables we are missing:

oz_kim
#> # A tibble: 164 x 5
#>    name  sex     year count state          
#>    <chr> <chr>  <int> <int> <chr>          
#>  1 Kim   Female  2017     2 South Australia
#>  2 Kim   Female  2016     1 South Australia
#>  3 Kim   Female  2015     1 South Australia
#>  4 Kim   Female  2014     2 South Australia
#>  5 Kim   Male    2014     1 South Australia
#>  6 Kim   Female  2012     1 South Australia
#>  7 Kim   Female  2011     3 South Australia
#>  8 Kim   Female  2010     1 South Australia
#>  9 Kim   Female  2009     1 South Australia
#> 10 Kim   Female  2008     3 South Australia
#> # … with 154 more rows

Aha! We can see that there is another grouping characteristic going on - State. Let’s facet the graph for each state, giving us:

ggplot(oz_kim,
       aes(x = year,
           y = count,
           colour = sex)) + 
  geom_line() + 
  facet_wrap(~state) +
  scale_colour_discrete_qualitative()
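Faceting is one option. If a single national picture is wanted instead, another option is to add the counts up across states before plotting, so that there is only one row per year and sex. Here is a minimal sketch of that idea, again with made-up data (toy_states and toy_national are hypothetical names that only mimic the columns of oz_kim):

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Made-up data for illustration only (not taken from ozbabynames)
toy_states <- tibble(
  year  = c(2010, 2010, 2010, 2011, 2011, 2011),
  sex   = c("Female", "Female", "Male", "Female", "Female", "Male"),
  state = c("South Australia", "Victoria", "Victoria",
            "South Australia", "Victoria", "Victoria"),
  count = c(3, 5, 1, 2, 4, 2)
)

# Collapse over state so each year and sex has exactly one row
toy_national <- toy_states %>%
  group_by(year, sex) %>%
  summarise(count = sum(count), .groups = "drop")

ggplot(toy_national,
       aes(x = year,
           y = count,
           colour = sex)) +
  geom_line()
```

Which one to use depends on the question: facets keep the state-level detail, while the summed version trades that detail for a single line per sex.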

Setting group correctly (Addition as of 2020/06/22)

Emma Vitz shared an interesting example of another sawtooth-type problem on Twitter:

Hey #rstats, why do I get this weird single line that's made up of 2 colours rather than 2 separate lines when I use ggplot?
If I filter the data to one gender it works fine (2nd screenshot). If I remove the group = 1 it gives me an error and a blank plot. pic.twitter.com/cNZDrVv1pZ

— Emma Vitz (@EmmaVitz) June 15, 2020

The solution was discussed in the thread, but let’s unpack this. Let’s first recreate the data used (taken by eyeballing the graphic):

library(tibble)

pageviews <- tibble(
  age = factor(rep(c("18-24",
                     "25-34",
                     "35-44",
                     "45-54",
                     "55-64",
                     "65"), 2)),
  gender = factor(x = c(rep("Female", 6),
                        rep("Male", 6))),
  pageviews = c(2750, 4200, 1750, 750, 450, 500,
                2500, 4200, 900, 350, 180, 150)
)

pageviews
#> # A tibble: 12 x 3
#>    age   gender pageviews
#>    <fct> <fct>      <dbl>
#>  1 18-24 Female      2750
#>  2 25-34 Female      4200
#>  3 35-44 Female      1750
#>  4 45-54 Female       750
#>  5 55-64 Female       450
#>  6 65    Female       500
#>  7 18-24 Male        2500
#>  8 25-34 Male        4200
#>  9 35-44 Male         900
#> 10 45-54 Male         350
#> 11 55-64 Male         180
#> 12 65    Male         150

So here is the warning given for the first of Emma’s plots:

ggplot(pageviews,
       aes(x = age,
           y = pageviews,
           colour = gender)) + 
  geom_line()
#> geom_path: Each group consists of only one observation. Do you need to adjust
#> the group aesthetic?

What to do? One way to get the lines to appear is to set group = 1:

ggplot(pageviews,
       aes(x = age,
           y = pageviews,
           colour = gender,
           group = 1)) + 
  geom_line()

But then we get the single line made up of two colours from Emma’s tweet! That isn’t ideal. The solution proposed on Twitter was to set group = gender as well as colour = gender.

ggplot(pageviews,
       aes(x = age,
           y = pageviews,
           colour = gender,
           group = gender)) + 
  geom_line()

The answer was provided by Peter Green, who said:

Looks like since x=age is a factor, ggplot is “helpfully” making age the group instead of gender? Which would explain why fixing it with the explicit group=gender works?

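My take on this is that having two factors here is what trips ggplot up. By default, ggplot groups the data by the combination of all the discrete variables in the plot. Usually colour is the only discrete variable, so the groups match the colours. In this case age is a factor as well, so every age and gender combination becomes its own group with a single observation, and there is nothing to join into a line. By setting the group explicitly, you get the right plot.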
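One way to check how ggplot2 has grouped the data is to look at the group column that layer_data() returns for a plot. The ggplot2 documentation describes the default group as the interaction of all discrete variables in the plot, so the sketch below counts the distinct groups with and without an explicit group aesthetic. It uses a cut-down version of the recreated pageviews data (pv, n_default and n_gender are hypothetical names introduced here).

```r
library(ggplot2)

# Cut-down version of the recreated pageviews data, for illustration only
pv <- data.frame(
  age = factor(rep(c("18-24", "25-34", "35-44"), 2)),
  gender = factor(rep(c("Female", "Male"), each = 3)),
  pageviews = c(2750, 4200, 1750, 2500, 4200, 900)
)

# Default grouping: age and gender are both discrete
p_default <- ggplot(pv,
                    aes(x = age,
                        y = pageviews,
                        colour = gender)) +
  geom_line()
n_default <- length(unique(layer_data(p_default)$group))

# Explicit grouping by gender
p_gender <- ggplot(pv,
                   aes(x = age,
                       y = pageviews,
                       colour = gender,
                       group = gender)) +
  geom_line()
n_gender <- length(unique(layer_data(p_gender)$group))

c(default = n_default, by_gender = n_gender)
```

When the default count equals the number of rows, every observation is in a group of its own, which is the situation the geom_path message describes.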

Wrapping Up

So, how to remove sawtooth patterns in a plot?

  1. Understand what sort of graphic you are expecting
  2. Explore and potentially include all grouping features in the graphic
  3. If you have factors in some of your aesthetics (x, y, colour, size), ensure that you set group to the right variable in your dataset.