%$% : upping your pipe game


What do I do when %>% doesn’t work?


I love the %>% pipe. Originally from magrittr, it’s now characteristic of tidy code. Using %>% has revolutionized how I write code in R (pssst! coming soon: an interactive pipe tutorial!). But sometimes the basic pipe falls short.

table() is one of my favorite functions for exploring data in R: it creates a frequency table of values in a vector. I use table() to do sanity checks on my data, make sure that all factor levels are present, and generally get a sense of how my observations are distributed.

A while back, though, I noticed that table() didn’t play nice with the %>% pipe.

I’ve collected some data on my friends’ pets. Here it is (using pseudonyms, in case anyone has a secret pet they don’t want the world to know about…).

This is one of the cats in the data frame below. She would like to hold your hand.
# Load magrittr
> library(magrittr)

# Create data
> pets <- data.frame(friend = c("Mark", "Mark", "Kyle", "Kyle", "Miranda", "Kayla", "Kayla", "Kayla", "Adriana", "Adriana", "Alex", "Randy", "Nancy"),
                   pet = c("cat", "cat", "cat", "cat", "cat", "dog", "cat", "lizard", "cat", "cat", "dog", "dog", "woodpecker"),
                   main_pet_color = c("brown", "brown", "multi", "multi", "brown", "brown", "brown", "orange", "black", "white", "multi", "white", "multi"))

# Look at the data
> pets
    friend        pet main_pet_color
1     Mark        cat          brown
2     Mark        cat          brown
3     Kyle        cat          multi
4     Kyle        cat          multi
5  Miranda        cat          brown
6    Kayla        dog          brown
7    Kayla        cat          brown
8    Kayla     lizard         orange
9  Adriana        cat          black
10 Adriana        cat          white
11    Alex        dog          multi
12   Randy        dog          white
13   Nancy woodpecker          multi

Unsurprisingly, it looks like there are a lot of cats and dogs! There are also a lot of brown pets and a lot of multicolored ones. Let’s say I want to see a frequency table of the pet colors. I know that I can do this with table(), like so:

# Make a frequency table of pet colors
> table(pets$main_pet_color)
 black  brown  multi orange  white 
     1      5      4      1      2

But if I want to use tidy syntax, I’d try to do it this way instead:

# Make a frequency table of pet colors
> pets %>% table(main_pet_color)
Error in table(., main_pet_color) : object 'main_pet_color' not found

What’s up with this? It seems like the syntax should work: main_pet_color is definitely a valid column name in the data frame pets, and if I had used a different function, like dplyr’s arrange(), I would have had no problems:

# Arrange the data frame by pet color (arrange() comes from dplyr)
> library(dplyr)
> pets %>% arrange(main_pet_color)
    friend        pet main_pet_color
1  Adriana        cat          black
2     Mark        cat          brown
3     Mark        cat          brown
4  Miranda        cat          brown
5    Kayla        dog          brown
6    Kayla        cat          brown
7     Kyle        cat          multi
8     Kyle        cat          multi
9     Alex        dog          multi
10   Nancy woodpecker          multi
11   Kayla     lizard         orange
12 Adriana        cat          white
13   Randy        dog          white

So why doesn’t this work with table()?? This problem has driven me crazy on several occasions. I always ended up reverting to the table(pets$main_pet_color) syntax, but I was not happy about it.

Turns out, there’s a simple fix.
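The post title gives it away: magrittr also ships the exposition pipe, %$%, which exposes a data frame’s columns to the next function the way with() does. A minimal sketch, rebuilding just the column the table needs:

```r
library(magrittr)

# Just the color column from the pets data frame above
pets <- data.frame(
  main_pet_color = c("brown", "brown", "multi", "multi", "brown", "brown",
                     "brown", "orange", "black", "white", "multi", "white", "multi")
)

# %$% "exposes" the columns of pets to table(), so this now works:
pets %$% table(main_pet_color)
#>  black  brown  multi orange  white 
#>      1      5      4      1      2
```

Unlike %>%, which passes the whole data frame along as the first argument, %$% makes each column available by name.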

Continue reading “%$% : upping your pipe game”

Some lessons from rstudio::conf

Today I’m departing a little from the problem/context/solution format of these posts to share some things I learned from last week’s rstudio::conf.

When I started in R a few years ago, I never thought I would have any place at a coding conference for computer people. But thanks to some help from my lab and my college, last week I flew out to sunny San Francisco for the 2020 rstudio::conf, and I had a blast! Here are some things I learned and some avenues I’m excited to explore in the future.

1. The R community is (actually) awesome.

This art is by RStudio artist-in-residence Allison Horst. She does amazing illustrations! You can find her work on her GitHub page, and she’s on twitter @allison_horst.

I was pretty nervous to attend this conference. I’ve never really considered myself a computer person, and I learned R pretty much accidentally through my work in biology. In my experience, people who do a lot of computer programming tend to use a lot of intimidating jargon, and I was scared I’d flounder.

I was surprised at how genuinely welcome I felt. At meals and talks, I sat down next to random people and asked what they did or how they knew R. I met people with so many diverse interests! Among others:

  • A linguist now working for a nonprofit
  • A neuroscience PhD now working for Facebook
  • Several young professors teaching data science at their schools
  • A woman who flew from Brazil (!!) to attend the conference
  • Someone just getting to know R
  • Two young people who interned at RStudio, despite not being experts in R
  • An RStudio bigwig (swoon!)
  • A professor working on genomic data
  • A researcher at the Smithsonian
  • An aquatic ecologist who uses R for work

So many people! You get the gist.

And they were all happy to talk to me!

2. So much happens on Twitter.

I joined Twitter on a whim this fall, and it has been awesome. I learned about this conference through Twitter. I’ve found some internships through Twitter. And by following the #rstats hashtag and some key people in the R community, I’ve learned all sorts of tips and tricks about the kinds of things you can do with R.

Apart from that, lots of people were live-tweeting the rstudio::conf, and I tried my hand at that, too! A highlight was when the one and only Hadley Wickham liked one of my tweets.

3. I need to start making Shiny apps.

As I said in my tweet (the one that Hadley liked!), this conference convinced me that I really should start building apps with RShiny as soon as possible.

Why haven’t I done this already?

The main reason is that the word “app” strikes fear into my heart. Surely I can’t develop an app??

Shiny apps are web apps, which is a little less intimidating, somehow. The basic gist of Shiny apps is that they are ways to explore your data interactively. That’s it. Some Shiny apps can get pretty complicated. For example, here’s a cool dashboard for visualizing tweets about the 2019 rstudio::conf. As you can see, it’s pretty fancy.

[Screenshot: dashboard visualizing tweets about rstudio::conf 2019]

But Shiny apps can also be pretty simple, like this one, which shows a simple histogram of the duration of eruptions of the Old Faithful geyser:

[Screenshot: simple Shiny app showing a histogram of Old Faithful eruption durations]

This app just lets the user change the number of histogram bins, include a density curve, and show the individual observations on the x axis. Pretty simple, but still so much better than a static visualization! I really have no excuse not to make something like this.
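For a sense of scale, an app at roughly that level of complexity fits in a couple dozen lines. This is just a sketch in the spirit of Shiny’s built-in Old Faithful demo, not the exact app pictured:

```r
library(shiny)

# One slider input, one histogram output
ui <- fluidPage(
  titlePanel("Old Faithful eruptions"),
  sliderInput("bins", "Number of bins:", min = 1, max = 50, value = 30),
  plotOutput("distPlot")
)

server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- faithful$eruptions  # eruption durations, in minutes
    hist(x, breaks = seq(min(x), max(x), length.out = input$bins + 1),
         xlab = "Duration (minutes)", main = "")
  })
}

# shinyApp(ui, server)  # uncomment to launch the app in a browser
```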

In particular, I can’t wait to build an app for the Yale Grammatical Diversity Project (YGDP). I’m currently working with the YGDP to help organize their database of linguistic survey data. I’ve already created some reports in RMarkdown to help visualize the data (you can see an example here). But wouldn’t this be so much better if it were interactive??

[Screenshot: graph from a YGDP RMarkdown report]
Note: I’m loving the viridis color palette in this graph. So much better than the ggplot2 default!

4. Bonus: fun with hex stickers

I finally got to experience the hype about hexagon stickers for R packages. They are so pretty and fun, and they fit together so nicely! I picked up a whole bunch:


And, funnily enough, I’ve been playing with hexagons as wall decorations for a while now, before I even knew about hex stickers…


…so obviously, the possibilities now are truly endless. Hexagons on hexagons? A hex collage on the wall and one on my computer? Wow.

That’s all for now! Soon (I hope), I’ll post highlights from talks at the conference, because, well, I guess those were cool too.

if ifelse() had more if’s, AND an else


The case_when() function in dplyr is great for dealing with multiple complex conditions (if’s). But how do you specify an “else” condition in case_when()?


Last month, I was super excited to discover the case_when() function in dplyr. But when I showed my blog post to a friend, he pointed out a problem: there seemed to be no way to specify a “background” case, like the “else” in ifelse(). In the previous post, I gave an example with three outcomes based on test results. The implication was that there would be roughly equal numbers of people in each group. But what if the vast majority of people failed both tests, and we really just wanted to filter out the ones who didn’t?

Today, I came across exactly this problem in my research. I’m analyzing morphometric data for about 500 tadpoles, and I made a PCA score plot that looked like this:

[Screenshot: PCA score plot of the tadpole morphometric data]

Before continuing my analysis, I wanted to take a closer look at those outlier points, to make sure they represent real measurements and not mistakes in the data. Specifically, I wanted to take a look at these ones:

[Screenshot: the same PCA score plot, with the outlier points highlighted]

To figure out which tadpoles to investigate, I’d have to pull out their names based on their scores on the PC1 and PC2 axes.


I decided to add a column called investigate to the PCA scores data frame, set to “investigate” or “ok” depending on whether the observation in question needed to be looked at.

scores <- scores %>% 
          mutate(investigate = case_when(PC1 > 0.2 ~ "investigate",
                                         PC2 > 0.15 ~ "investigate",
                                         PC1 < -0.1 & PC2 > 0.1 ~ "investigate",
                                         TRUE ~ "ok"))

What’s up with that weird TRUE ~ "ok" line at the end of the case_when() statement? Basically, that’s the equivalent of else. It translates, roughly, to “assign anything that’s left to ‘ok’.”

Why does TRUE act as the else here? The case_when() documentation doesn’t really explain it, but the logic works like this: case_when() checks the conditions in order and, for each element, returns the value attached to the first condition that evaluates to TRUE. A literal TRUE is always true, so placed last, it catches everything that hasn’t matched an earlier condition. I originally figured this out just by reading through the examples in the documentation and noticing that they all seemed to end with this TRUE ~ statement, so I tried it, and voilà.

One thing to note is that the order of arguments matters here. If we had started off with the TRUE ~ "ok" statement and then specified the other conditions, it wouldn’t have worked: everything would just get assigned to “ok.”
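A quick way to see the ordering in action (a toy example, not from my data):

```r
library(dplyr)

x <- c(5, 15, 25)

# TRUE first: it matches everything, so later conditions never fire
case_when(TRUE ~ "ok", x > 10 ~ "investigate")
#> [1] "ok" "ok" "ok"

# TRUE last: it only catches what's left over
case_when(x > 10 ~ "investigate", TRUE ~ "ok")
#> [1] "ok" "investigate" "investigate"
```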

I’m really glad I figured out how to add an else to case_when()! Before I started using dplyr, I would have attempted this problem like this:

scores$investigate <- "ok" # Create a whole column filled with "ok"
scores$investigate[scores$PC1 > 0.2] <- "investigate"
scores$investigate[scores$PC2 > 0.15] <- "investigate"
scores$investigate[scores$PC1 < -0.1 & scores$PC2 > 0.1] <- "investigate"

Or maybe I would have used some really long and complex boolean statement to get all those conditions in one line of code. Or nested ifelse‘s. But that’s annoying and hard to read. This is so much neater, and saves typing!


It turns out that if you read the documentation closely, case_when() is a fully-functioning version of ifelse() that allows for multiple if statements AND a background condition (else). The more I learn about the tidyverse, the more I love it.


Loading packages efficiently


Especially in a project with many different scripts, it can be challenging to keep track of all the packages you need to load. It’s also easy to lose track of whether you’ve incorporated package loading into the script itself, at least until you switch to a new computer or restart R and suddenly all your packages need to be loaded again.


When I was first starting out in R, I quickly learned to load packages all together at the top of a script, not along the way as I needed them. But it wasn’t until I started using R Projects that I decided to centralize package loading above the script level. I was sick of having to deal with loading the right packages at the right times, so I decided to just streamline the whole thing.


Make a separate R script, called “libraries.R” or “packages.R” or something. Keep it consistent. Mine is always called “libraries,” and I keep it in my project folder.


It looks something like this (individual libraries may vary, of course):
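Something along these lines (the specific packages here are just examples):

```r
# libraries.R
# One place to load every package this project uses
library(magrittr)   # pipes
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
```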


Then, at the top of each analysis script, I can simply source the libraries script, and all the libraries I need load automatically.
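So each analysis script starts with a single line (assuming libraries.R sits in the same project folder):

```r
# Top of any analysis script
source("libraries.R")  # loads all the project's packages at once
```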



I can easily load libraries in the context of a single R Project, keep track of which ones are loaded, and not have to worry about making my scripts look messy with a whole chunk of library() commands at the top of each one. It’s also straightforward to pop open the “libraries” script whenever I want to add a new library or delete one.


if ifelse() had more if’s


The ifelse() function only allows for one “if” statement and two cases. You could add nested “if” statements, but that’s just a pain, especially if the 3+ conditions you want to use are all on the same level, conceptually. Is there a way to specify multiple conditions at the same time?


I was recently given some survey data to clean up. It looked something like this (but obviously much larger):


I needed to classify people in this data set based on whether they had passed or failed certain tests.

I wanted to separate the people into three groups:

  • People who passed both tests: Group A
  • People who passed one test: Group B
  • People who passed neither test: Group C

I thought about using a nested ifelse statement, and I certainly could have done that. But that approach didn’t make sense to me. The tests are equivalent and not given in any order; I simply want to sort the people into three equal groups. Any nesting of “if” statements would seem to imply a hierarchy that doesn’t really exist in the data. Not to mention that I hate nesting functions. It’s confusing and hard to read. 
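The approach I eventually landed on (spoiler for the rest of the post) was dplyr’s case_when(), which lets the three conditions sit side by side instead of nested. A sketch with made-up column names, since the real survey data looked different:

```r
library(dplyr)

# Hypothetical survey data; the real column names were different
survey <- data.frame(test1 = c("pass", "pass", "fail", "fail"),
                     test2 = c("pass", "fail", "pass", "fail"))

survey <- survey %>%
  mutate(group = case_when(
    test1 == "pass" & test2 == "pass" ~ "A",  # passed both
    test1 == "pass" | test2 == "pass" ~ "B",  # passed exactly one
    test1 == "fail" & test2 == "fail" ~ "C"   # passed neither
  ))

survey$group
#> [1] "A" "B" "B" "C"
```

The three conditions read as equals, which matches the structure of the problem far better than a nested ifelse().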

Continue reading “if ifelse() had more if’s”

Initializing an empty list


How do I initialize an empty list for use in a for-loop or function?


Sometimes I’m writing a for-loop (I know, I know, I shouldn’t use for-loops in R, but sometimes it’s just easier. I’m a little less comfortable with apply functions than I’d like to be) and I know I’ll need to store the output in a list. Once in a while, the new list will be similar in form to an existing one, but more often, I just need to start from scratch, knowing only the number of elements I want to include.

This isn’t a totally alien thing to need to do; it’s pretty familiar if you’re used to initializing empty vectors before for-loops. There’s a whole other debate to be had about whether it’s acceptable to start with a truly empty vector and append to it on every iteration of the loop or whether you should always know the length beforehand, but I’ll just focus on the latter case for now.

Anyway, initializing a vector of a given length is easy enough; I usually do it like this:

> desired_length <- 10 # or whatever length you want
> empty_vec <- rep(NA, desired_length)

I couldn’t immediately figure out how to replicate this for a list, though. The solution turns out to be relatively simple, but it’s just different enough that I can never seem to remember the syntax. This post is more for my records than anything, then.
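For the record, then: the idiom is base R’s vector(), with mode = "list" (this is the solution the rest of the post arrives at):

```r
desired_length <- 10  # or whatever length you want
empty_list <- vector(mode = "list", length = desired_length)

length(empty_list)  # 10
empty_list[[1]]     # NULL -- each element starts out empty
```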

Continue reading “Initializing an empty list”

(Automatically Show Output)


It’s annoying to have to type the name of an object I just created in order to print its output in a script.


A certain lightsaber-wielding stats professor of mine liked to point out that R doesn’t go out of its way to be helpful. If you write a line of code that creates an object and then run that line of code, there’s no message to tell you that the object has been successfully created. R doesn’t say “Task complete! What’s next?” or otherwise give you any indication that anything has happened. To actually view the object you just created, you have to type its name or run some other command on it.

Once in a while, this lack of transparency can be frustrating. What if I want to save objects and also view them in real time as they are created? Say I’ve used the handy prop.table function to transform a frequency table into a proportion table. I’d like to be able to view prop, prop.1 and prop.2 without typing their names and adding extra lines of code.
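As the post title hints, one fix is to wrap the assignment in parentheses, which makes R assign and print in a single step. A sketch with a toy table:

```r
# A toy frequency table
tab <- table(c("a", "a", "b"))

# Plain assignment: silent
prop <- prop.table(tab)

# Wrapped in parentheses: assigns to prop AND prints the result
(prop <- prop.table(tab))
```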

Continue reading “(Automatically Show Output)”

prop.table()



How can I convert a frequency table into proportions?


This is a continuation of the data manipulation discussed in the with() post. I had just finished making a table:

# Load data from GitHub
> polygon <- read.csv("https://raw.githubusercontent.com/kaijagahm/general/master/polygon_sampling_data_UMR.csv") 

# Two-way table by pool and revetment 
> with(polygon, table(revetment, pool))

[Screenshot: two-way table of revetment by pool]

What if I want to see this table broken down by proportion of polygons, not counts?
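The answer, per the post title, is prop.table(), which converts a table of counts into proportions; its margin argument controls whether the proportions are of the grand total, of each row, or of each column. A sketch with a small stand-in table, since the real one loads from GitHub:

```r
# Small stand-in for the revetment-by-pool table above
tab <- table(revetment = c(0, 0, 1, 1, 1),
             pool      = c(4, 8, 4, 4, 8))

prop.table(tab)             # each cell as a share of the grand total
prop.table(tab, margin = 1) # each row sums to 1
prop.table(tab, margin = 2) # each column sums to 1
```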

Continue reading “prop.table()”

with( )


Making graphics with base R is annoying for many reasons, but a big one is having to type the name of the data frame over and over again to reference different columns.


Back to our Mississippi River fish data. I’ve aggregated my sampling points into polygons, and now I want to explore some of their characteristics. To do that, I’d like to make some tables and plots, and because these are just quick, exploratory plots, I don’t feel like dealing with ggplot.

Load in the data (accessible on GitHub).

# Load data from GitHub
> polygon <- read.csv("https://raw.githubusercontent.com/kaijagahm/general/master/polygon_sampling_data_UMR.csv")

# Look at what we're dealing with
> dim(polygon) # How big is the data set?
[1] 527  21

> head(polygon, 3) # Look at the first few rows
     poly_id propsnag n_points habitat_code
1 P04_CFL_13      0.8        5          CFL
2 P04_CFL_14      0.2        5          CFL
  pool      Area Perimeter max_depth
1    4 105288.80  2067.890      1.30
2    4  42668.28  1770.465      0.74
  avg_depth tot_vol shoreline_density_index
1 0.3625869   33955                1.797759
2 0.3291391    5953                2.417852
  pct_aqveg pct_terr pct_prm_wetf
1  19.13396 93.87983     79.67522
2  41.25270 94.76871     42.44244
  med_dist_to_land med_dist_to_forest
1         34.13379           34.13379
2         18.90166           32.64112
  med_current wingdam revetment tributary
1        0.02       0         1         0
2        0.02       0         0         0

First, I’d like to see how total volume tot_vol of the aquatic area scales with its Area.
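This is where with() earns its keep: it evaluates an expression “inside” the data frame, so columns can be named directly. A sketch using a two-row stand-in for polygon:

```r
# Two-row stand-in for the polygon data frame loaded above
polygon <- data.frame(Area    = c(105288.80, 42668.28),
                      tot_vol = c(33955, 5953))

# Base R without with(): the data frame name appears twice
plot(polygon$Area, polygon$tot_vol)

# With with(): name the data frame once, then the columns directly
with(polygon, plot(Area, tot_vol))
```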
Continue reading “with( )”

Changing individual column names


How do I change the name of just one column in a data frame?


This is a simple one that keeps coming up. Sometimes, whoever put together my data decided to capitalize the first letter of some column names and not others. Sometimes I’ve merged several data frames together and I need to distinguish the columns from each other.

Say my data frame is p8_0 and I’d like to change the column Area to area.

In the past, I’ve done this in one of two ways. Either I change all of the column names at once (if all of them need to be changed), or I use numerical column indexing. The latter makes a lot more sense if I have a lot of columns to deal with, but it means I have to know the number of the column whose name I have to change.

To find this out, I first have to look at all of the column names. Okay, no problem.

# See column names and numerical indices
> names(p8_0)
[1] "FID" "Join_Count" "TARGET_FID" 
 [4] "Field1" "barcode" "stratum" 
 [7] "lcode" "sdate" "utm_e" 
 [10] "utm_n" "snag" "OBJECTID" 
 [13] "uniq_id" "aa_num" "AQUA_CODE" 
 [16] "AQUA_DESC" "pool" "Area" 
 [19] "Perimeter" "bath_pct" "max_depth" 
 [22] "avg_depth" "sd_depth" "tot_vol" 
 [25] "area_gt50" "area_gt100" "area_gt200" 
 [28] "area_gt300" "avg_fetch" "shoreline_density_index"
 [31] "econ" "sill" "min_rm" 
 [34] "max_rm" "len_met" "len_prm_lotic" 
 [37] "pct_prm_lotic" "num_lotic_outl" "len_prm_lentic" 
 [40] "pct_prm_lentic" "num_lentic_outl" "pct_aqveg" 
 [43] "pct_opwat" "len_terr" "pct_terr" 
 [46] "pct_aq" "len_wetf" "pct_prm_wetf" 
 [49] "pct_terr_shore_wetf" "len_wd" "wdl_p_m2" 
 [52] "num_wd" "scour_wd" "psco_wd" 
 [55] "len_revln" "rev_p_m2" "num_rev" 
 [58] "pct_terr_shore_rev" "pct_prm_rev" "area_tpi1" 
 [61] "pct_tpi1" "area_tpi2" "pct_tpi2" 
 [64] "area_tpi3" "pct_tpi3" "area_tpi4" 
 [67] "pct_tpi4" "sinuosity" "year_phot" 
 ...
 [88] "year.p" "depth.p" "current.p" 
 [91] "gear.p" "stageht.p" "substrt.p" 
 [94] "wingdike.p" "riprap.p" "trib.p" 
 [97] "snagyn" "area_le50" "area_le100" 
[100] "area_le200" "area_le300" "pct_area_le100" 
[103] "pct_area_le50" "pct_area_le200" "pct_area_le300" 
[106] "stratum_name"

Okay, yes problem.

It’s not that hard to see that Area is the 18th column. But there are a bunch of columns that start with NEAR_TERR_ and NEAR_FOREST_ that would be easy to confuse. And what if I later modify my data cleaning script, insert new columns, and mess up the numerical indexing?
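One way around both problems is to index by name instead of by number, so the rename keeps working even if columns move around. A sketch on a tiny stand-in for p8_0:

```r
# Tiny stand-in for p8_0 (the real one has over a hundred columns)
p8_0 <- data.frame(FID = 1:2, Area = c(10.5, 42.1), pool = c(4, 8))

# Match the column by name, not by position
names(p8_0)[names(p8_0) == "Area"] <- "area"

names(p8_0)
#> [1] "FID"  "area" "pool"
```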

Continue reading “Changing individual column names”