if ifelse() had more if’s, AND an else

Problem

The case_when() function in dplyr is great for dealing with multiple complex conditions (if’s). But how do you specify an “else” condition in case_when()?

Context

Last month, I was super excited to discover the case_when() function in dplyr. But when I showed my blog post to a friend, he pointed out a problem: there seemed to be no way to specify a “background” case, like the “else” in ifelse(). In the previous post, I gave an example with three outcomes based on test results. The implication was that there would be roughly equal numbers of people in each group. But what if the vast majority of people failed both tests, and we really just wanted to filter out the ones who didn’t?

Today, I came across exactly this problem in my research. I’m analyzing morphometric data for about 500 tadpoles, and I made a PCA score plot that looked like this:

[Figure: PCA score plot, PC1 vs. PC2]

Before continuing my analysis, I wanted to take a closer look at those outlier points, to make sure they represent real measurements and not mistakes in the data. Specifically, I wanted to take a look at these ones:

[Figure: the same PCA score plot, with the outlier points of interest marked]

To figure out which tadpoles to investigate, I’d have to pull out their names based on their scores on the PC1 and PC2 axes.

Solution

I decided to add a column called investigate to the PCA scores data frame, set to “investigate” or “ok” depending on whether the observation in question needed to be looked at.

library(dplyr)

scores <- scores %>%
  mutate(investigate = case_when(PC1 > 0.2 ~ "investigate",
                                 PC2 > 0.15 ~ "investigate",
                                 PC1 < -0.1 & PC2 > 0.1 ~ "investigate",
                                 TRUE ~ "ok"))  # catch-all: anything not matched above

What’s up with that weird TRUE ~ "ok" line at the end of the case_when() statement? Basically, that’s the equivalent of else. It translates, roughly, to “assign anything that’s left to ‘ok.’”

I’m really not sure why the equivalent of else here is TRUE, and the case_when() documentation doesn’t really explain it. The only way I figured out that this worked was by reading through the examples in the documentation and noticing that they all seemed to end with this TRUE ~ statement, so I tried it, and voilà. If anyone has an understanding of why this works, under the hood, I’d love to know!

One thing to note is that the order of arguments matters here. If we had started off with the TRUE ~ "ok" statement and then specified the other conditions, it wouldn’t have worked: everything would just get assigned to “ok.”
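Here's a tiny sketch of that ordering effect. The three PC1 values are made up for illustration and are not from my tadpole data; the variable names are just placeholders.

```r
library(dplyr)

# Hypothetical PC1 scores, invented for this example
pc1 <- c(0.25, 0.05, -0.15)

# Catch-all last: the specific condition gets first pick
catch_all_last <- case_when(pc1 > 0.2 ~ "investigate",
                            TRUE ~ "ok")

# Catch-all first: TRUE claims every element before pc1 > 0.2 is consulted
catch_all_first <- case_when(TRUE ~ "ok",
                             pc1 > 0.2 ~ "investigate")

print(catch_all_last)   # "investigate" "ok" "ok"
print(catch_all_first)  # "ok" "ok" "ok"
```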

I’m really glad I figured out how to add an else to case_when()! Before I started using dplyr, I would have attempted this problem like this:

scores$investigate <- "ok" # Create a whole column filled with "ok"
scores$investigate[scores$PC1 > 0.2] <- "investigate"
scores$investigate[scores$PC2 > 0.15] <- "investigate"
scores$investigate[scores$PC1 < -0.1 & scores$PC2 > 0.1] <- "investigate"

Or maybe I would have used some really long and complex boolean statement to get all those conditions in one line of code. Or nested ifelse() calls. But that’s annoying and hard to read. The case_when() version is so much neater, and saves typing!
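For comparison, here's roughly what those two alternatives might look like in base R. This is only a sketch: the four rows of scores are invented so the example runs on its own, and the names nested and one_liner are placeholders.

```r
# Hypothetical scores, invented for this example
scores <- data.frame(PC1 = c(0.25, 0.05, -0.15, 0.00),
                     PC2 = c(0.00, 0.20, 0.12, 0.05))

# Nested ifelse(): one level per condition, with "ok" buried at the very end
nested <- ifelse(scores$PC1 > 0.2, "investigate",
          ifelse(scores$PC2 > 0.15, "investigate",
          ifelse(scores$PC1 < -0.1 & scores$PC2 > 0.1, "investigate", "ok")))

# One long boolean: all three conditions joined with |
one_liner <- ifelse(scores$PC1 > 0.2 | scores$PC2 > 0.15 |
                      (scores$PC1 < -0.1 & scores$PC2 > 0.1),
                    "investigate", "ok")

print(nested)                        # "investigate" "investigate" "investigate" "ok"
print(identical(nested, one_liner))  # TRUE
```

Both give the same answer, but neither makes the individual rules as easy to scan as one condition per line.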

Outcome

It turns out that if you read the documentation closely, case_when() is a fully functioning version of ifelse() that allows for multiple if statements AND a background condition (else). The more I learn about the tidyverse, the more I love it.

 

15 thoughts on “if ifelse() had more if’s, AND an else”

  1. The other day I had to consult the case_when documentation as well, because I had forgotten about the ‘TRUE ~ expr’ to trigger the else equivalent. It would really be helpful if that essential part of case_when would be explained in a more prominent position in the documentation. You should post an issue on the dplyr github or make a pull request – I think this would be a great improvement for everyone learning about case_when!

    I agree that the TRUE is semantically not intuitive in the context of re-coding a variable, but whenever you use TRUE as a condition, that condition is always true, no matter what. And as you noticed, when you have the ‘TRUE ~ “ok”’ construct as the first argument, all values are “ok”. Again, this behavior could be documented better, though it is described, somewhat, in one of the examples: “Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general.” Since the TRUE condition is always true, it is the most general condition possible, so it has to be the last argument.

    After reading your post I got curious about why case_when works this way and looked at the source code. These are the lines that are causing this behavior.

    for (i in seq_len(n)) {
      out <- replace_with(out, query[[i]] & !replaced, value[[i]], NULL)
      replaced <- replaced | (query[[i]] & !is.na(query[[i]]))
    }

    This loops through all 'cond ~ expr' arguments in the order they are typed, and replaces those values for which the condition is true AND which have not been replaced yet. So when 'TRUE ~ "ok"' is the first argument, all values are replaced in the first run of the loop, hence the remaining runs have nothing left to replace.
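The same first-match-wins idea can be sketched in base R. This is a simplified stand-in, not dplyr's actual code: case_when_sketch is a hypothetical helper, it assumes every condition vector has the same length, and it only handles character values.

```r
# Hypothetical helper: a simplified imitation of the loop quoted above
case_when_sketch <- function(conditions, values) {
  n <- length(conditions[[1]])
  out <- rep(NA_character_, n)   # nothing assigned yet
  replaced <- rep(FALSE, n)      # tracks which elements are already claimed

  for (i in seq_along(conditions)) {
    # An element is hit if its condition is TRUE (not NA) and it is unclaimed
    hit <- conditions[[i]] & !is.na(conditions[[i]]) & !replaced
    out[hit] <- values[[i]]
    replaced <- replaced | hit
  }
  out
}

# Invented scores; the NA shows that a missing condition falls through
x <- c(0.25, 0.05, NA)
result <- case_when_sketch(list(x > 0.2, rep(TRUE, 3)),
                           c("investigate", "ok"))
print(result)  # "investigate" "ok" "ok"
```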

  2. Each of the statements (i.e., PC1 > 0.2) tests whether a condition is TRUE or FALSE. The last statement does as well, but TRUE is always TRUE, so it will always evaluate when nothing else does…just like “else”!

  3. Surely there’s a case for a case_when_else function that would be more intuitive and simply replace TRUE ~ with an ELSE parameter to be set?
    Would enhance readability and intuitive understanding.
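Something along these lines does exist in later releases: dplyr 1.1.0 added a .default argument to case_when(). A minimal sketch, assuming dplyr 1.1.0 or newer is installed, with made-up PC1 values:

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where `.default` is available

# Hypothetical PC1 scores, invented for this example
pc1 <- c(0.25, 0.05, -0.15)

# The catch-all written as a final TRUE ~ condition
with_true <- case_when(pc1 > 0.2 ~ "investigate",
                       TRUE ~ "ok")

# The same catch-all written with the named argument
with_default <- case_when(pc1 > 0.2 ~ "investigate",
                          .default = "ok")

print(identical(with_true, with_default))  # TRUE
```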

  4. It’s just a natural consequence of the logic of case_when(): if the expression on the LHS evaluates to TRUE fill in the value on the RHS, if not move on to the next expression. TRUE at the last expression will always evaluate to TRUE and therefore act like an `else`.

  5. It did seem a little bit like a magical incantation at first to me as well. I make sense of it by thinking that the left hand side of each expression is just something that has to evaluate to TRUE for the right hand side to be selected. So simply having TRUE for the left hand side guarantees the right side will be selected, unconditionally. i.e. it’s not really a special “else” clause in any way, it’s just like all the expressions above it, but formulated to always apply unless one of the preceding ones has.

  6. Each of the case_when lines is an if-then statement. The last line is used only if all the lines preceding it evaluate to false.
    In order for the last line to function like the else part of an if else statement, you have to make sure that it evaluates to true. You could use “4 == 4”, since that evaluates to true, or “16 < 17”. But the simplest thing is to just use “TRUE”, which evaluates to … true. (Try typing “TRUE” at the command line, and you’ll see that it’s evaluated like any logical expression.)
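A quick way to check this is to swap TRUE for other always-true expressions and compare the results. The values of x and the labels below are invented for illustration.

```r
library(dplyr)

# Hypothetical values, invented for this example
x <- c(1, 5, 10)

with_true    <- case_when(x > 7 ~ "big", TRUE ~ "small")
with_equal   <- case_when(x > 7 ~ "big", 4 == 4 ~ "small")
with_less    <- case_when(x > 7 ~ "big", 16 < 17 ~ "small")

print(with_true)                         # "small" "small" "big"
print(identical(with_true, with_equal))  # TRUE
print(identical(with_true, with_less))   # TRUE
```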

  7. The TRUE line works because the left formulae are evaluated in order, and the logic drops out of the case_when when it finds a formula that evaluates to TRUE. TRUE obviously always evaluates to TRUE, so by placing it last it acts as an else condition – nothing else before was TRUE, so do this.

    The documentation uses a fizzbuzz example for this:
    x <- 1:50
    case_when(
      x %% 35 == 0 ~ "fizz buzz",
      x %% 5 == 0 ~ "fizz",
      x %% 7 == 0 ~ "buzz",
      TRUE ~ as.character(x)
    )

    When x is 35, all 4 conditions will evaluate to TRUE. Because x %% 35 == 0 is the first condition, it will return "fizz buzz", not any of the other options.
    When x is 1, the first 3 conditions are FALSE, so it returns as.character(x) – that is, "1". It can't just return x because all the right-hand formulae have to produce the same type – character in this case.
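The type requirement can be seen in a small sketch. The example below is invented for illustration; the exact error message varies between dplyr versions, so it only checks that an error occurs.

```r
library(dplyr)

x <- 1:3

# Mixing a character value with the integer x is rejected
mixed <- tryCatch(
  case_when(x %% 2 == 0 ~ "even", TRUE ~ x),
  error = function(e) e
)
print(inherits(mixed, "error"))  # TRUE

# Converting x to character makes every right-hand side the same type
fixed <- case_when(x %% 2 == 0 ~ "even", TRUE ~ as.character(x))
print(fixed)  # "1" "even" "3"
```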

  8. Why iterate at all though?

    # (make up some data):
    PC1 = rexp(100) / 12
    PC2 = rexp(100) / 12
    data <- data.frame(PC1, PC2)

    # add some rules:
    rules <- data$PC1 < .05 | data$PC2 < .1

    # get the subset:
    investigate <- data[rules,]

    Cheers!! -Jess Sullivan
