Loading packages efficiently

Problem

In a project with many scripts, it can be hard to keep track of all the packages you need to load. It’s also easy to forget whether a script actually loads its own packages until you switch to a new computer or restart R and, all of a sudden, everything needs to be reloaded.

Context

When I was first starting out in R, I learned quickly to load packages all together at the top of a script, not along the way as I needed them. But it took a while, until I started using R Projects, before I decided to centralize package loading above the script level. I was sick of having to deal with loading the right packages at the right times, so I decided to just streamline the whole thing.

Solution

Make a separate R script, called “libraries.R” or “packages.R” or something. Keep it consistent. Mine is always called “libraries,” and I keep it in my project folder.

[Image: the libraries script in the project folder]

It looks something like this (individual libraries may vary, of course):

[Image: contents of the libraries script]

Then, at the top of each analysis script, I can simply source the libraries script, and all the libraries I need load automatically.

[Image: sourcing the libraries script at the top of an analysis script]
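
As a minimal sketch of this setup: the package names below are placeholders (they ship with R, so the sketch runs anywhere), and the temporary file is only a stand-in for a hand-written libraries.R kept in the project folder.

```r
# Stand-in for libraries.R. In a real project you would write this file by hand;
# the three packages here are placeholders, not a recommended list.
lib_script <- file.path(tempdir(), "libraries.R")
writeLines(c(
  "library(stats)",
  "library(utils)",
  "library(tools)"
), lib_script)

# At the top of each analysis script, one line loads everything.
# In a project this would be: source("libraries.R")
source(lib_script)

# Check that a package from the list is now attached
"package:tools" %in% search()
```

Adding or dropping a package then means editing one file rather than every script.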

Outcome

I can easily load libraries in the context of a single R Project, keep track of which ones are loaded, and not have to worry about making my scripts look messy with a whole chunk of library() commands at the top of each one. It’s also straightforward to pop open the “libraries” script whenever I want to add a new library or delete one.


if ifelse() had more if’s

Problem

The ifelse() function tests a single condition, so it can only distinguish two cases. You could nest ifelse() calls, but that’s a pain, especially if the three or more conditions you want to use are all on the same level conceptually. Is there a way to specify multiple conditions at the same time?

Context

I was recently given some survey data to clean up. It looked something like this (but obviously much larger):

[Image: sample of the survey data table]

I needed to classify people in this data set based on whether they had passed or failed certain tests.

I wanted to separate the people into three groups:

  • People who passed both tests: Group A
  • People who passed one test: Group B
  • People who passed neither test: Group C

I thought about using a nested ifelse statement, and I certainly could have done that. But that approach didn’t make sense to me. The tests are equivalent and not given in any order; I simply want to sort the people into three equal groups. Any nesting of “if” statements would seem to imply a hierarchy that doesn’t really exist in the data. Not to mention that I hate nesting functions. It’s confusing and hard to read. 
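
Here is a rough sketch of a flat, non-nested classification. The data frame and its column names (id, test1, test2) are made up for illustration, since the original table is not reproduced here. The first version counts passes in base R. The second uses dplyr::case_when(), which is one common way to list several conditions side by side; whether that is the approach the full post takes is an assumption.

```r
# Hypothetical stand-in for the survey data
tests <- data.frame(
  id    = 1:4,
  test1 = c("pass", "pass", "fail", "fail"),
  test2 = c("pass", "fail", "pass", "fail")
)

# Base R: count the passed tests (0, 1 or 2), then look up the group label
n_passed <- (tests$test1 == "pass") + (tests$test2 == "pass")
tests$group <- c("C", "B", "A")[n_passed + 1]

# Optional: the same logic with dplyr::case_when(), if dplyr is installed.
# Conditions are checked top to bottom and the first match wins.
if (requireNamespace("dplyr", quietly = TRUE)) {
  tests$group_cw <- dplyr::case_when(
    tests$test1 == "pass" & tests$test2 == "pass" ~ "A",
    tests$test1 == "pass" | tests$test2 == "pass" ~ "B",
    TRUE                                          ~ "C"
  )
}

tests
```

Neither version puts one test ahead of the other, which matches the idea that the two tests are equivalent.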


Initializing an empty list

Problem

How do I initialize an empty list for use in a for-loop or function?

Context

Sometimes I’m writing a for-loop (I know, I know, I shouldn’t use for-loops in R, but sometimes it’s just easier. I’m a little less comfortable with apply functions than I’d like to be) and I know I’ll need to store the output in a list. Once in a while, the new list will be similar in form to an existing one, but more often, I just need to start from scratch, knowing only the number of elements I want to include.

This isn’t a totally alien thing to need to do; it’s pretty familiar if you’re used to initializing empty vectors before for-loops. There’s a whole other debate to be had about whether or not it’s acceptable to start with a truly empty vector and append to it on every iteration of the loop or whether you should always know the length beforehand, but I’ll just focus on the latter case for now.

Anyway, initializing a vector of a given length is easy enough; I usually do it like this:

> desired_length <- 10 # or whatever length you want
> empty_vec <- rep(NA, desired_length)

I couldn’t immediately figure out how to replicate this for a list, though. The solution turns out to be relatively simple, but it’s just different enough that I can never seem to remember the syntax. This post is more for my records than anything, then.
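
The excerpt stops before the answer. One standard base R way to make a list of a known length is vector(mode = "list", length = n); whether this is the exact syntax the full post settles on is an assumption. A minimal sketch:

```r
desired_length <- 10 # or whatever length you want

# A list with desired_length elements, each initialized to NULL
empty_list <- vector(mode = "list", length = desired_length)
length(empty_list)

# Fill it inside a for-loop; [[i]] assigns to the i-th element
for (i in seq_along(empty_list)) {
  empty_list[[i]] <- i^2
}
```

Note the double brackets: empty_list[[i]] sets the element itself, while empty_list[i] refers to a sub-list of length one.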


(Automatically Show Output)

Problem

It’s annoying to have to type the name of an object I just created in order to print its output in a script.

Context

A certain lightsaber-wielding stats professor of mine liked to point out that R doesn’t go out of its way to be helpful. If you write a line of code that creates an object and then run that line of code, there’s no message to tell you that the object has been successfully created. R doesn’t say “Task complete! What’s next?” or otherwise give you any indication that anything has happened. To actually view the object you just created, you have to type its name or run some other command on it.

Once in a while, this lack of transparency can be frustrating. What if I want to save objects and also view them in real time as they are created? Say I’ve used the handy prop.table function to transform a frequency table into a proportion table. I’d like to be able to view prop, prop.1 and prop.2 without typing their names and adding extra lines of code.
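
As the post title hints, wrapping an assignment in parentheses makes R print the value as it is assigned. A small sketch follows, using the built-in mtcars data as a stand-in. The meaning given to prop.1 and prop.2 (row and column proportions) is an assumption, since the excerpt does not define them.

```r
# Frequency table from a built-in data set (stand-in for the real data)
freq <- table(mtcars$cyl, mtcars$gear)

# Parentheses around an assignment: the object is saved AND printed
(prop   <- prop.table(freq))             # proportions of the grand total
(prop.1 <- prop.table(freq, margin = 1)) # proportions within each row
(prop.2 <- prop.table(freq, margin = 2)) # proportions within each column
```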


prop.table()

Problem

How can I convert a frequency table into proportions?

Context

This is a continuation of the data manipulation discussed in the with() post. I had just finished making a table:

# Load data from GitHub
> polygon <- read.csv("https://raw.githubusercontent.com/kaijagahm/general/master/polygon_sampling_data_UMR.csv") 

# Two-way table by pool and revetment 
> with(polygon, table(revetment, pool))

[Image: output of the two-way table of revetment by pool]

What if I want to see this table broken down by proportion of polygons, not counts?
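
A minimal sketch of how prop.table() handles this. The six-row polygon_demo data frame is made up, so the sketch runs without downloading the real file. The margin argument controls what each count is divided by.

```r
# Made-up stand-in with the same two columns as the real data
polygon_demo <- data.frame(
  revetment = c(0, 0, 1, 1, 1, 0),
  pool      = c(4, 4, 4, 8, 8, 8)
)

counts <- with(polygon_demo, table(revetment, pool))

prop_all <- prop.table(counts)             # each cell / grand total
prop_row <- prop.table(counts, margin = 1) # each cell / its row total
prop_col <- prop.table(counts, margin = 2) # each cell / its column total

prop_col
```

With margin = 2, each pool’s column sums to 1, so the table reads as "what share of the polygons in this pool have revetment?"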


with( )

Problem

Making graphics with base R is annoying for many reasons, but a big one is having to type the name of the data frame over and over again to reference different columns.

Context

Back to our Mississippi River fish data. I’ve aggregated my sampling points into polygons, and now I want to explore some of their characteristics. To do that, I’d like to make some tables and plots, and because these are just quick, exploratory plots, I don’t feel like dealing with ggplot.

Load in the data (accessible on GitHub).

# Load data from GitHub
> polygon <- read.csv("https://raw.githubusercontent.com/kaijagahm/general/master/polygon_sampling_data_UMR.csv")

# Look at what we're dealing with
> dim(polygon) # How big is the data set?
[1] 527  21

> head(polygon, 2) # Look at the first two rows
     poly_id propsnag n_points habitat_code
1 P04_CFL_13      0.8        5          CFL
2 P04_CFL_14      0.2        5          CFL
  pool      Area Perimeter max_depth
1    4 105288.80  2067.890      1.30
2    4  42668.28  1770.465      0.74
  avg_depth tot_vol shoreline_density_index
1 0.3625869   33955                1.797759
2 0.3291391    5953                2.417852
  pct_aqveg pct_terr pct_prm_wetf
1  19.13396 93.87983     79.67522
2  41.25270 94.76871     42.44244
  med_dist_to_land med_dist_to_forest
1         34.13379           34.13379
2         18.90166           32.64112
  med_current wingdam revetment tributary
1        0.02       0         1         0
2        0.02       0         0         0
  pct_shallow_area
1        0.9278354
2        1.0000000

First, I’d like to see how the total volume (tot_vol) of each aquatic area scales with its Area.
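
A rough sketch of that comparison, with and without with(). The polygon_demo data frame holds only the two rows printed above, so the sketch runs without the download; the axis labels are my own wording.

```r
# Two rows copied from the head() output above, as a stand-in for polygon
polygon_demo <- data.frame(
  Area    = c(105288.80, 42668.28),
  tot_vol = c(33955, 5953)
)

# Without with(): the data frame name is repeated for every column
plot(polygon_demo$Area, polygon_demo$tot_vol)

# With with(): columns are referred to by name inside the call
with(polygon_demo, plot(Area, tot_vol, xlab = "Area", ylab = "Total volume"))

# with() works for any expression, not just plots
vol_per_area <- with(polygon_demo, tot_vol / Area)
```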

Changing individual column names

Problem

How do I change the name of just one column in a data frame?

Context

This is a simple one that keeps coming up. Sometimes, whoever put together my data decided to capitalize the first letter of some column names and not others. Sometimes I’ve merged several data frames together and I need to distinguish the columns from each other.

Say my data frame is p8_0 and I’d like to change the column Area to area.

In the past, I’ve done this in one of two ways. Either I change all of the column names at once (if all of them need to be changed), or I use numerical column indexing. The latter makes a lot more sense if I have a lot of columns to deal with, but it means I have to know the number of the column whose name I have to change.
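
The two older approaches look roughly like this. The three-column data frames are made-up stand-ins for p8_0, so Area sits in position 2 here rather than where it falls in the real data.

```r
# Made-up stand-ins for p8_0
p8_demo_a <- data.frame(pool = c(8, 8), Area = c(100, 200), Perimeter = c(40, 60))
p8_demo_b <- p8_demo_a

# Option 1: replace all of the column names at once
names(p8_demo_a) <- c("pool", "area", "perimeter")

# Option 2: numerical indexing, which requires knowing the column's position
names(p8_demo_b)[2] <- "area"

names(p8_demo_b)
```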

To find this out, I first have to look at all of the column names. Okay, no problem.

# See column names and numerical indices
> names(p8_0)
  [1] "FID" "Join_Count" "TARGET_FID"
  [4] "Field1" "barcode" "stratum"
  [7] "lcode" "sdate" "utm_e"
 [10] "utm_n" "snag" "OBJECTID" 
 [13] "uniq_id" "aa_num" "AQUA_CODE" 
 [16] "AQUA_DESC" "pool" "Area" 
 [19] "Perimeter" "bath_pct" "max_depth" 
 [22] "avg_depth" "sd_depth" "tot_vol" 
 [25] "area_gt50" "area_gt100" "area_gt200" 
 [28] "area_gt300" "avg_fetch" "shoreline_density_index"
 [31] "econ" "sill" "min_rm" 
 [34] "max_rm" "len_met" "len_prm_lotic" 
 [37] "pct_prm_lotic" "num_lotic_outl" "len_prm_lentic" 
 [40] "pct_prm_lentic" "num_lentic_outl" "pct_aqveg" 
 [43] "pct_opwat" "len_terr" "pct_terr" 
 [46] "pct_aq" "len_wetf" "pct_prm_wetf" 
 [49] "pct_terr_shore_wetf" "len_wd" "wdl_p_m2" 
 [52] "num_wd" "scour_wd" "psco_wd" 
 [55] "len_revln" "rev_p_m2" "num_rev" 
 [58] "pct_terr_shore_rev" "pct_prm_rev" "area_tpi1" 
 [61] "pct_tpi1" "area_tpi2" "pct_tpi2" 
 [64] "area_tpi3" "pct_tpi3" "area_tpi4" 
 [67] "pct_tpi4" "sinuosity" "year_phot" 
 [70] "NEAR_TERR_FID" "NEAR_TERR_DIST" "NEAR_TERR_CLASS_31" 
 [73] "NEAR_TERR_CLASS_15" "NEAR_TERR_CLASS_7" "NEAR_TERR_CLASS_31_N" 
 [76] "NEAR_TERR_CLASS_15_N" "NEAR_TERR_CLASS_7_N" "NEAR_TERR_HEIGHT_N" 
 [79] "NEAR_FOREST_FID" "NEAR_FOREST_DIST" "NEAR_FOREST_CLASS_31" 
 [82] "NEAR_FOREST_CLASS_15" "NEAR_FOREST_CLASS_7" "NEAR_FOREST_CLASS_31_N" 
 [85] "NEAR_FOREST_CLASS_15_N" "NEAR_FOREST_CLASS_7_N" "NEAR_FOREST_HEIGHT_N" 
 [88] "year.p" "depth.p" "current.p" 
 [91] "gear.p" "stageht.p" "substrt.p" 
 [94] "wingdike.p" "riprap.p" "trib.p" 
 [97] "snagyn" "area_le50" "area_le100" 
[100] "area_le200" "area_le300" "pct_area_le100" 
[103] "pct_area_le50" "pct_area_le200" "pct_area_le300" 
[106] "stratum_name"

Okay, yes problem.

It’s not that hard to see that Area is the 18th column. But there are a bunch of columns that start with NEAR_TERR_ and NEAR_FOREST_ that would be easy to confuse. And what if I later modify my data cleaning script, insert new columns, and mess up the numerical indexing?
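
One way around both problems is to select the column by its current name instead of its position. This is a sketch using a small made-up stand-in for p8_0; that the full post uses this exact idiom is an assumption.

```r
# Made-up stand-in for p8_0
p8_demo <- data.frame(pool = c(8, 8), Area = c(100, 200), Perimeter = c(40, 60))

# names(p8_demo) == "Area" is TRUE only at the matching position,
# so only that one name gets replaced
names(p8_demo)[names(p8_demo) == "Area"] <- "area"

names(p8_demo)
# [1] "pool"      "area"      "Perimeter"
```

Because the match is on the name, inserting or reordering columns earlier in the cleaning script does not change which column gets renamed.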
