Problem
The ifelse() function only allows for one “if” statement, two cases. You could add nested “if” statements, but that’s just a pain, especially if the 3+ conditions you want to use are all on the same level, conceptually. Is there a way to specify multiple conditions at the same time?
Context
I was recently given some survey data to clean up. It looked something like this (but obviously much larger):
I needed to classify people in this data set based on whether they had passed or failed certain tests.
I wanted to separate the people into three groups:
- People who passed both tests: Group A
- People who passed one test: Group B
- People who passed neither test: Group C
I thought about using a nested ifelse statement, and I certainly could have done that. But that approach didn’t make sense to me. The tests are equivalent and not given in any order; I simply want to sort the people into three equal groups. Any nesting of “if” statements would seem to imply a hierarchy that doesn’t really exist in the data. Not to mention that I hate nesting functions. It’s confusing and hard to read.
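For comparison, here is a rough sketch of what the nested version might look like. The data frame is a made-up toy (four hypothetical people, with logical columns test1 and test2 as in the code further down), not the real survey data.

```r
# Toy data: hypothetical values standing in for the real survey data
df <- data.frame(
  id    = 1:4,
  test1 = c(TRUE, TRUE, FALSE, FALSE),
  test2 = c(TRUE, FALSE, TRUE, FALSE)
)

# Nested ifelse(): the second call sits inside the "no" branch of the first,
# so "passed one test" is only checked once "passed both" has failed
df$group <- ifelse(df$test1 & df$test2, "A",
                   ifelse(df$test1 | df$test2, "B", "C"))

df$group
```

It works, but the three groups end up at different nesting depths even though they are conceptually on the same level.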
Solution
Once again, dplyr to the rescue! I’m becoming more and more of a tidyverse fan with each passing day.
Turns out, dplyr has a function for exactly this purpose: case_when(). It’s also known as “a general vectorised if,” but I like to think of it as “if ifelse() had more if’s.”
Here’s the syntax:
library(dplyr)

df <- df %>%
  mutate(group = case_when(
    test1 & test2     ~ "A",  # both tests: group A
    xor(test1, test2) ~ "B",  # one test: group B
    !test1 & !test2   ~ "C"   # neither test: group C
  ))
Output:
Let me translate the above into English. After loading the package, I reassign df, the name of my data frame, to a modified version of the old df. Then (%>%), I use the mutate function to add a new column called group. The contents of the column will be defined by the case_when() function.
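If you want to run the whole thing end to end, here is a minimal sketch using a toy data frame. The id column and the four rows are hypothetical values I made up for illustration; only the column names test1 and test2 come from the code above.

```r
library(dplyr)

# Toy data: hypothetical values, one row per combination of test results
df <- data.frame(
  id    = 1:4,
  test1 = c(TRUE, TRUE, FALSE, FALSE),
  test2 = c(TRUE, FALSE, TRUE, FALSE)
)

# Add a group column; each row gets the label of the condition it satisfies
df <- df %>%
  mutate(group = case_when(
    test1 & test2     ~ "A",  # both tests: group A
    xor(test1, test2) ~ "B",  # one test: group B
    !test1 & !test2   ~ "C"   # neither test: group C
  ))

df$group
```

With these four rows, every person satisfies exactly one of the three conditions, so nobody is left without a group.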
case_when(), in this example, took three conditions, which I’ve lined up so you can read them more easily. The condition is on the left side of the ~, and the resulting category (A, B, or C) is on the right. I used logical operators for my conditions. The newest one to me was the xor() function, which is an exclusive or: it is TRUE when exactly one of the two conditions in the parentheses is TRUE, but not when both are.
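A quick way to see the difference between xor() and the ordinary "or" operator is to run both over every combination of two logical values. The vectors a and b below are just illustrative names.

```r
# All four combinations of two logical values
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

exclusive <- xor(a, b)  # TRUE only when exactly one of a, b is TRUE
inclusive <- a | b      # TRUE when at least one of a, b is TRUE

# Side-by-side comparison of the two operators
data.frame(a, b, exclusive, inclusive)
```

The two columns differ only in the first row, where both inputs are TRUE.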
Outcome
Easily make conditional assignments within a data frame. This function is a little less succinct than ifelse(), so I’m probably not going to use it for applications with only two cases, where ifelse() would work fine. But for three or more cases, it can’t be beat. Notice that I could have added any number of conditions to my case_when() statement.
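One thing worth knowing when you add more conditions: case_when() checks them in order and uses the first one that matches, and anything that matches none of them gets NA. Here is a small sketch with a hypothetical numeric score vector (not from my survey data), where the conditions overlap and the order matters.

```r
library(dplyr)

# Hypothetical scores, including a missing value
score <- c(95, 72, 40, NA)

band <- case_when(
  score >= 90 ~ "high",
  score >= 50 ~ "middle",  # only reached when the first condition is not TRUE
  score <  50 ~ "low"
)

band  # the missing score matches no condition, so its band is NA
```

If you would rather not get NA, a final catch-all condition such as TRUE ~ "unknown" picks up whatever is left.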
I love this function, and I think we should all be using it.
Hi kaijagahm,
In the given example, wouldn’t it be easier to use rowSums() on the two columns?
LETTERS[1:3][rowSums(df[ , 2:3])+1]
It’s a one-liner, no need to use any add-on library, surely much faster, and easy to extend to more columns.
Thanks for the suggestion! Yes, that would definitely have worked. I like that case_when can be extended to cases that don’t involve logicals, and that it’s integrable with other dplyr commands, since I use dplyr for a lot of data cleaning. In the actual data, too, there were lots of columns interspersed with the ones I needed to refer to, and I think the indexing would have become hard to follow. It’s a good idea to keep both alternatives in mind!
That works in a simple example but not in a complex case, such as survey data or disease classification