Initializing an empty list

Problem

How do I initialize an empty list for use in a for-loop or function?

Context

Sometimes I’m writing a for-loop (I know, I know, don’t use for-loops, but sometimes it’s just easier, and I’m not as good with the apply functions as I’d like to be) and I know I’ll need to store the output in a list. Once in a while, the new list will be the same size as an existing one, but more often, I just need to start from scratch, knowing only the number of elements I want to include.

This isn’t a totally alien thing to need to do; it’s pretty familiar if you’re used to initializing empty vectors before for-loops. There’s a whole other debate to be had about whether it’s acceptable to start with a truly empty vector and append to it on every iteration of the loop, or whether you should always know the length beforehand, but I’ll just focus on the latter case for now.

Anyway, initializing a vector of a given length is easy enough; I usually do it like this:

> desired_length <- 10 # or whatever length you want
> empty_vec <- rep(NA, desired_length)

I couldn’t immediately figure out how to replicate this for a list, though. The solution turns out to be relatively simple, but it’s just different enough that I can never seem to remember the syntax. This post is more for my records than anything, then.
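
One common base R way to do this is `vector()` with `mode = "list"`. This is a minimal sketch rather than necessarily the syntax the full post settles on; the loop body (squaring the index) is just a placeholder for whatever output you want to store.

```r
desired_length <- 10 # or whatever length you want

# vector() with mode "list" creates a list of the given length;
# every element starts out as NULL
empty_list <- vector(mode = "list", length = desired_length)

# Fill the list inside a loop using double-bracket indexing
for (i in seq_len(desired_length)) {
  empty_list[[i]] <- i^2 # placeholder computation
}
```

Note the double brackets: `empty_list[[i]]` assigns to the element itself, whereas `empty_list[i]` refers to a sub-list of length one.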


(Automatically Show Output)

Problem

It’s annoying to have to type the name of an object I just created in order to print its output in a script.

Context

A certain lightsaber-wielding stats professor of mine liked to point out that R doesn’t go out of its way to be helpful. If you write a line of code that creates an object and then run that line of code, there’s no message to tell you that the object has been successfully created. R doesn’t say “Task complete! What’s next?” or otherwise give you any indication that anything has happened. To actually view the object you just created, you have to type its name or run some other command on it.

Once in a while, this lack of transparency can be frustrating. What if I want to save objects and also view them in real time as they are created? Say I’ve used the handy prop.table function to transform a frequency table into a proportion table. I’d like to be able to view prop, prop.1 and prop.2 without typing their names and adding extra lines of code.
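
The parentheses in the title hint at one base R approach: wrapping an assignment in parentheses creates the object and prints it in the same step. A small sketch, using an invented frequency table (the `counts` object is hypothetical, not from the fish data):

```r
# Hypothetical counts, for illustration only
counts <- table(c("a", "b", "a", "c", "a"))

# A plain assignment creates `prop` but prints nothing
prop <- prop.table(counts)

# Wrapping the assignment in parentheses assigns AND prints the result
(prop <- prop.table(counts))
```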


prop.table()

Problem

How can I convert a frequency table into proportions?

Context

This is a continuation of the data manipulation discussed in the `with()` post. I had just finished making a table:

# Load data from GitHub
> polygon <- read.csv("https://raw.githubusercontent.com/kaijagahm/general/master/polygon_sampling_data_UMR.csv") 

# Two-way table by pool and revetment 
> with(polygon, table(revetment, pool))

[Screenshot: output of the two-way table of revetment by pool]

What if I want to see this table broken down by proportion of polygons, not counts?
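
As a rough sketch of how `prop.table()` handles this: it takes a table and an optional `margin` argument. The `toy` data frame below is an invented stand-in for `polygon`, so the numbers mean nothing; the names `prop`, `prop.1` and `prop.2` are borrowed from the previous post.

```r
# Hypothetical stand-in for the polygon data (values invented for illustration)
toy <- data.frame(
  revetment = c(0, 0, 1, 1, 0, 1),
  pool      = c(4, 4, 4, 8, 8, 8)
)

# Two-way table of counts, as in the with() call above
counts <- with(toy, table(revetment, pool))

prop   <- prop.table(counts)             # each cell divided by the grand total
prop.1 <- prop.table(counts, margin = 1) # proportions within each row (revetment)
prop.2 <- prop.table(counts, margin = 2) # proportions within each column (pool)

prop.1
```

With `margin = 1` every row sums to 1, and with `margin = 2` every column does.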


with( )

Problem

Making graphics with base R is annoying for many reasons, but a big one is having to type the name of the data frame over and over again to reference different columns.

Context

Back to our Mississippi River fish data. I’ve aggregated my sampling points into polygons, and now I want to explore some of their characteristics. To do that, I’d like to make some tables and plots, and because these are just quick, exploratory plots, I don’t feel like dealing with ggplot.

Load in the data (accessible on GitHub).

# Load data from GitHub
> polygon <- read.csv("https://raw.githubusercontent.com/kaijagahm/general/master/polygon_sampling_data_UMR.csv")

# Look at what we're dealing with
> dim(polygon) # How big is the data set?
[1] 527  21

> head(polygon, 2) # Look at the first few rows
     poly_id propsnag n_points habitat_code
1 P04_CFL_13      0.8        5          CFL
2 P04_CFL_14      0.2        5          CFL
  pool      Area Perimeter max_depth
1    4 105288.80  2067.890      1.30
2    4  42668.28  1770.465      0.74
  avg_depth tot_vol shoreline_density_index
1 0.3625869   33955                1.797759
2 0.3291391    5953                2.417852
  pct_aqveg pct_terr pct_prm_wetf
1  19.13396 93.87983     79.67522
2  41.25270 94.76871     42.44244
  med_dist_to_land med_dist_to_forest
1         34.13379           34.13379
2         18.90166           32.64112
  med_current wingdam revetment tributary
1        0.02       0         1         0
2        0.02       0         0         0
  pct_shallow_area
1        0.9278354
2        1.0000000

First, I’d like to see how the total volume (tot_vol) of each aquatic area scales with its Area.
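
A minimal sketch of what `with()` buys you here, using an invented `toy` data frame in place of `polygon` (the values are made up; only the column names `Area` and `tot_vol` come from the real data):

```r
# Hypothetical stand-in for the polygon data (values invented for illustration)
toy <- data.frame(
  Area    = c(100, 250, 400, 800),
  tot_vol = c(30, 90, 150, 310)
)

# Without with(), the data frame name is repeated for every column:
#   plot(toy$Area, toy$tot_vol)

# With with(), column names are looked up inside the data frame
with(toy, plot(Area, tot_vol))

# The same works for any expression, not just plots
r <- with(toy, cor(Area, tot_vol))
```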

Changing individual column names

Problem

How do I change the name of just one column in a data frame?

Context

This is a simple one that keeps coming up. Sometimes, whoever put together my data decided to capitalize the first letter of some column names and not others. Sometimes I’ve merged several data frames together and I need to distinguish the columns from each other.

Say my data frame is p8_0 and I’d like to change the column Area to area.

In the past, I’ve done this in one of two ways. Either I change all of the column names at once (if all of them need to be changed), or I use numerical column indexing. The latter makes a lot more sense if I have a lot of columns to deal with, but it means I have to know the number of the column whose name I have to change.

To find this out, I first have to look at all of the column names. Okay, no problem.

# See column names and numerical indices
> names(p8_0)
  [1] "FID" "Join_Count" "TARGET_FID"
  [4] "Field1" "barcode" "stratum"
  [7] "lcode" "sdate" "utm_e"
 [10] "utm_n" "snag" "OBJECTID" 
 [13] "uniq_id" "aa_num" "AQUA_CODE" 
 [16] "AQUA_DESC" "pool" "Area" 
 [19] "Perimeter" "bath_pct" "max_depth" 
 [22] "avg_depth" "sd_depth" "tot_vol" 
 [25] "area_gt50" "area_gt100" "area_gt200" 
 [28] "area_gt300" "avg_fetch" "shoreline_density_index"
 [31] "econ" "sill" "min_rm" 
 [34] "max_rm" "len_met" "len_prm_lotic" 
 [37] "pct_prm_lotic" "num_lotic_outl" "len_prm_lentic" 
 [40] "pct_prm_lentic" "num_lentic_outl" "pct_aqveg" 
 [43] "pct_opwat" "len_terr" "pct_terr" 
 [46] "pct_aq" "len_wetf" "pct_prm_wetf" 
 [49] "pct_terr_shore_wetf" "len_wd" "wdl_p_m2" 
 [52] "num_wd" "scour_wd" "psco_wd" 
 [55] "len_revln" "rev_p_m2" "num_rev" 
 [58] "pct_terr_shore_rev" "pct_prm_rev" "area_tpi1" 
 [61] "pct_tpi1" "area_tpi2" "pct_tpi2" 
 [64] "area_tpi3" "pct_tpi3" "area_tpi4" 
 [67] "pct_tpi4" "sinuosity" "year_phot" 
 [70] "NEAR_TERR_FID" "NEAR_TERR_DIST" "NEAR_TERR_CLASS_31" 
 [73] "NEAR_TERR_CLASS_15" "NEAR_TERR_CLASS_7" "NEAR_TERR_CLASS_31_N" 
 [76] "NEAR_TERR_CLASS_15_N" "NEAR_TERR_CLASS_7_N" "NEAR_TERR_HEIGHT_N" 
 [79] "NEAR_FOREST_FID" "NEAR_FOREST_DIST" "NEAR_FOREST_CLASS_31" 
 [82] "NEAR_FOREST_CLASS_15" "NEAR_FOREST_CLASS_7" "NEAR_FOREST_CLASS_31_N" 
 [85] "NEAR_FOREST_CLASS_15_N" "NEAR_FOREST_CLASS_7_N" "NEAR_FOREST_HEIGHT_N" 
 [88] "year.p" "depth.p" "current.p" 
 [91] "gear.p" "stageht.p" "substrt.p" 
 [94] "wingdike.p" "riprap.p" "trib.p" 
 [97] "snagyn" "area_le50" "area_le100" 
[100] "area_le200" "area_le300" "pct_area_le100" 
[103] "pct_area_le50" "pct_area_le200" "pct_area_le300" 
[106] "stratum_name"

Okay, yes problem.

It’s not that hard to see that Area is the 18th column. But there are a bunch of columns that start with NEAR_TERR_ and NEAR_FOREST_ that would be easy to confuse. And what if I later modify my data cleaning script, insert new columns, and mess up the numerical indexing?
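
One way around counting columns is to index `names()` by a logical test on the name itself, so the column's position never matters. A small sketch on an invented three-column data frame (`toy` is hypothetical; the real `p8_0` has 106 columns):

```r
# Hypothetical miniature of p8_0 (only a few columns, for illustration)
toy <- data.frame(
  FID  = 1:2,
  pool = c(4, 8),
  Area = c(105288.80, 42668.28)
)

# Select the name by matching it, not by its position
names(toy)[names(toy) == "Area"] <- "area"

names(toy)
```

Because the match is by name, inserting or reordering columns earlier in a cleaning script does not break the rename.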


The %notin% operator

Problem

I keep forgetting how to select all elements of an object except a few, by name. I get the ! operator confused with the - operator and I find both of them less than intuitive to use. How can I negate the %in% operator?

Context

I have a data frame called electrofishing that contains observations from a fish sampling survey. One column, stratum, gives the aquatic habitat type of the sampling site. I’d like to exclude observations sampled in the “Tailwater Zone” or “Impounded-Offshore” areas.

My instinct would be to do this:

> electrofishing <- electrofishing[electrofishing$stratum !%in% c("Tailwater Zone", "Impounded-Offshore"),]

But that doesn’t work. You can’t negate the %in% operator directly. Instead, you have to wrap the %in% statement in parentheses and negate the entire statement, returning the opposite of the original boolean vector.

I’m not saying this doesn’t make sense, but I can never remember it. My English-speaking brain would much rather say “rows whose stratum is not included in c("Tailwater Zone", "Impounded-Offshore")” than “not rows whose stratum is included in c("Tailwater Zone", "Impounded-Offshore")”.
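
Both versions can be sketched in a few lines. The working base R form wraps the `%in%` test in parentheses and negates it; a custom `%notin%` operator can be built with base R's `Negate()`. The data frame `electrofishing_toy` and its values are invented for illustration, and `%notin%` is a user-defined name, not something base R provides.

```r
# Hypothetical miniature of the electrofishing data (values invented)
electrofishing_toy <- data.frame(
  stratum = c("Tailwater Zone", "Main Channel Border",
              "Impounded-Offshore", "Side Channel Border"),
  n_fish  = c(3, 7, 2, 5),
  stringsAsFactors = FALSE
)
excluded <- c("Tailwater Zone", "Impounded-Offshore")

# Base R: negate the whole %in% statement
kept_base <- electrofishing_toy[!(electrofishing_toy$stratum %in% excluded), ]

# User-defined operator: Negate() returns a function that flips the result
`%notin%` <- Negate(`%in%`)
kept_notin <- electrofishing_toy[electrofishing_toy$stratum %notin% excluded, ]
```

Both subsets keep the same two rows; the second simply reads closer to the English sentence.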


Where are my NA’s?

Problem

How can I (quickly and intuitively) figure out how many NA’s are in my dataset and which columns they’re in?

Context

When I tried to run PCA (Principal Components Analysis) on some USGS fish sampling data, I noticed that I had a bunch of missing values. PCA needs complete observations, so this was a problem.

One option would have been to remove any observations with missing values from my data set:

# Select only "complete" rows from the data frame `df`  
> noNAs <- df[complete.cases(df),]

The problem was, I had over 30 variables and who knows how many missing values. The data frame had only ~2000 observations. By using only complete cases, I might lose a lot of observations and reduce my sample size by a huge amount.

In fact, I pretty often find myself in this situation. It would be really nice to have a quick way to see where those NA’s are located so I can get a better sense of my dataset and figure out how to move forward.
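
A quick base R sketch of one way to do this: `is.na()` on a data frame returns a logical matrix, so `sum()` and `colSums()` give the total and per-column counts. The `toy` data frame is invented for illustration and is not the USGS data.

```r
# Hypothetical data with a few missing values (invented for illustration)
toy <- data.frame(
  depth   = c(1.2, NA, 0.8, NA),
  current = c(0.02, 0.05, NA, 0.01),
  pool    = c(4, 4, 8, 8)
)

# Total number of NA's in the whole data frame
n_missing <- sum(is.na(toy))

# Number of NA's in each column
na_by_col <- colSums(is.na(toy))

# How many rows would survive complete.cases()
n_complete <- sum(complete.cases(toy))

na_by_col
```

Comparing `n_complete` with `nrow(toy)` shows how much of the sample a complete-cases filter would throw away before you commit to it.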
