Some helpful R functions

What’s this about?

It’s been a couple of months since my last post, and I’ve been wondering what would make a good blog post. I decided to showcase a few R functions that I have found really helpful.

The structure and setup of any data analysis is much more important than the technology that is used to accomplish the individual steps involved. However, it helps to have handy functions readily available in your mental toolkit, ready to apply to new use cases that come up. You may only use a particular function sparingly, say, once every few months, but then discover that it’s indispensable and hard to find when you need it.

If you are like me, you may find yourself digging through your codebase, churning through all the R project folders that you have created in the past few months, racking your brain to figure out which script holds the best example of your use of this function or set of functions, so that you can “recycle” your code (or remember its syntax!). You may also find yourself returning to the same online article or blog post to grab the same code over and over again. (I have gone here repeatedly whenever I want to squash multiple rows per group into one. I have this tip below as well.)

I’m not sure there’s a better solution to this issue of digging through code than documenting these functions and packages in a personal central repository as you use them. That’s why I feel inspired to write this blog. It is an attempt to gather some of the ones I use into a short tutorial. Hopefully by sharing them, other folks can learn too!

Finally, forgive me if the code looks a little washed out. I haven’t taken the time to customize this part of my website’s theme yet.

Reading one or multiple files into one dataframe

I frequently run into situations where there are dozens of CSVs sitting somewhere that need to be read in and analyzed together. Assuming your data is local or on a shared drive (and not in Hadoop), combining fs::dir_ls, purrr::map_dfr, and readr::read_csv is a nifty, quick way to get your data loaded in.

The fs 📦 has a ton of cool file and folder manipulation features. dir_ls lists the files in a directory, and dir_info gives a ton of extra information about all of the files in a tibble (including modified dates!). The glob argument below allows you to specify a "wildcard aka globbing pattern (e.g. *.csv) passed on to grep() to filter paths." In this case, I know any file(s) sitting in the folder have VIN in the name, so I’m grabbing a list of file paths that match and passing each one to read_csv sequentially with map_dfr, which returns a single dataframe. (The data are also in the "static/data/" folder.)
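A minimal, self-contained sketch of that pattern. The temp folder, file names, and column types here are made up for illustration; in the real case the CSVs would already be sitting on a shared drive:

```r
library(fs)
library(purrr)
library(readr)

# For a runnable sketch, write two small CSVs to a temp folder first
data_dir <- path(tempdir(), "vin_csvs")
dir_create(data_dir)
write_csv(tibble::tibble(VIN = "JM1FD331P", odometer = 144L),
          path(data_dir, "VIN_batch_1.csv"))
write_csv(tibble::tibble(VIN = "1C4RJFDJF", odometer = 40L),
          path(data_dir, "VIN_batch_2.csv"))

# List every CSV whose path contains "VIN", read each one,
# and row-bind the results into a single tibble
vin_data <- dir_ls(data_dir, glob = "*VIN*.csv") %>%
  map_dfr(read_csv, col_types = "ci")
```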

## Parsed with column specification:
## cols(
##   VIN = col_character(),
##   date = col_date(format = ""),
##   `Odometer Reading` = col_integer()
## )
## Parsed with column specification:
## cols(
##   VIN = col_character(),
##   date = col_date(format = ""),
##   `Odometer Reading` = col_integer()
## )

Cleaning up names

When data comes in with spaces in the column names, it’s more difficult to work with, and specifying variable names with backticks is annoying. janitor::clean_names is your best friend! It cleans pesky names by replacing spaces with underscores and making everything lowercase:
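For example, with a small made-up dataframe whose columns mirror the ones above:

```r
library(janitor)

# check.names = FALSE preserves the awkward space in the column name
messy <- data.frame(VIN = "JM1FD331P", `Odometer Reading` = 144L,
                    check.names = FALSE)

# clean_names() lowercases everything and swaps spaces for underscores
names(clean_names(messy))
```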

## [1] "vin"              "date"             "odometer_reading"

Squishing VINs

This one is specific to working with vehicle data, since VIN-level data is proprietary. Vehicles have a 17-character VIN (Vehicle Identification Number), and a subset of those characters forms a VIN select pattern, or "squished VIN," which decodes only to certain information about the vehicle, like year, make, and model. This is a helpful function for replacing full VINs with that squished VIN. (It also allows for joining to certain CARFAX tables for easy VIN decoding!)
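A sketch of what such a helper can look like. The exact rule used here (keep characters 1 through 8 plus the 10th model-year character, dropping the 9th check digit) is an assumption, so verify it against whatever decoding tables you actually join to; the VIN below is made up:

```r
# Hypothetical "squish" helper: keep positions 1-8 and 10 of a 17-character VIN
squish_vin <- function(vin) {
  paste0(substr(vin, 1, 8), substr(vin, 10, 10))
}

squish_vin("JM1FD3319P0123456")  # made-up VIN, returns "JM1FD331P"
```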

## # A tibble: 6 x 3
##   vin       date       odometer_reading
##   <chr>     <date>                <int>
## 1 JM1FD331P 1994-01-27              144
## 2 JM1FD331P 1995-01-17             6326
## 3 JM1FD331P 1997-01-08            15839
## 4 JM1FD331P 1999-01-12            33707
## 5 JM1FD331P 2001-01-15            43117
## 6 JM1FD331P 2001-10-08            45358

Lag & lead

Sometimes it is helpful to be able to filter rows of data based on their comparison to other rows on some variable. For instance, I was recently working with some vehicle history data and needed to find the amount by which some vehicles’ odometers had been rolled back. The easiest way to do this is to group the data by VIN, arrange it by VIN & date, and use the lag function to calculate the difference between the odometer reading on the current record and the reading on the row above:
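A self-contained sketch of that pipeline, using a few made-up readings for one VIN:

```r
library(dplyr)

readings <- tibble(
  vin  = "1C4RJFDJF",
  date = as.Date(c("2014-11-12", "2015-05-21", "2016-07-18", "2017-02-01")),
  odometer_reading = c(40L, 10206L, 24250L, 22988L)
)

# lag() pulls the previous row's value within each VIN group
odometer_increases <- readings %>%
  group_by(vin) %>%
  arrange(vin, date) %>%
  mutate(odometer_increase = odometer_reading - lag(odometer_reading)) %>%
  ungroup()
```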

## # A tibble: 6 x 4
##   vin       date       odometer_reading odometer_increase
##   <chr>     <date>                <int>             <int>
## 1 1C4RJFDJF 2014-11-12               40                NA
## 2 1C4RJFDJF 2015-05-21            10206             10166
## 3 1C4RJFDJF 2016-07-18            24250             14044
## 4 1C4RJFDJF 2016-07-25            24260                10
## 5 1C4RJFDJF 2017-02-01            22988             -1272
## 6 1C4RJFDJF 2017-03-15            22989                 1

Now, we could filter on the negative values to get the date, odometer reading, and (negative) increase for every record where an odometer rollback was identified. The lead function works the same way, but instead of selecting the value from the previous row, it selects the value from the next row.
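Both ideas in one small sketch, on a made-up table shaped like the one above:

```r
library(dplyr)

increases <- tibble(
  vin = "1C4RJFDJF",
  odometer_reading  = c(40L, 24250L, 22988L),
  odometer_increase = c(NA, 24210L, -1262L)
)

# Keep only the rows where the odometer went down
rollbacks <- increases %>% filter(odometer_increase < 0)

# lead() grabs the value from the next row instead of the previous one
with_next <- increases %>% mutate(next_reading = lead(odometer_reading))
```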

Collapse and separate rows

Suppose that you need to group the odometer readings onto one line, in order to have one row per VIN, and one column to contain all the readings for a VIN. We can easily do this with paste and collapse.
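One way to do it, sketched on a couple of made-up VINs, is to group by vin and paste the readings together inside summarise:

```r
library(dplyr)

readings <- tibble(
  vin = c("1C4RJFDJF", "1C4RJFDJF", "1FTYR10D4"),
  odometer_reading = c(40L, 10206L, 10L)
)

# One row per VIN, with all readings pasted into a single "|"-separated string
collapsed <- readings %>%
  group_by(vin) %>%
  summarise(odometer_readings = paste(odometer_reading, collapse = "|"))
```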

## # A tibble: 6 x 2
##   vin       odometer_readings                                             
##   <chr>     <chr>                                                         
## 1 1C4RJFDJF 40|10206|24250|24260|22988|22989|22999|23906|24020|29340      
## 2 1FTYR10D4 10|3293|9649|13356|26248|27939|109802|110940|111167|89111|111~
## 3 2FTHF25GM 45000|69157|73871|87756|107898|157500|24272|24284|24308|44728 
## 4 4T3BA3BBC 10|18440|23487|30015|30288|37736|57734|59630|54729|57568|5778~
## 5 5FPYK1F4B 4|10|47|12550|23591|24416|32961|36415|36416|54689|58857|58861~
## 6 JA4MW31R1 1120|4110|8550|18320|72801|76390|80000|95762|96990|109780|378~

Now, suppose we needed to do the opposite. Someone has given us a column with a variable number of odometer readings in it, and we need them all to be separate records! A nifty tidyr function separate_rows does this with ease!
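A sketch of the reverse direction, starting from a collapsed table like the one above. Note that the separator is a regex, so the pipe has to be escaped:

```r
library(tidyr)
library(tibble)

collapsed <- tibble(
  vin = c("1C4RJFDJF", "1FTYR10D4"),
  odometer_readings = c("40|10206", "10")
)

# Split the "|"-separated string back into one row per reading
expanded <- separate_rows(collapsed, odometer_readings, sep = "\\|")
```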

## # A tibble: 6 x 2
##   vin       odometer_readings
##   <chr>     <chr>            
## 1 1C4RJFDJF 40               
## 2 1C4RJFDJF 10206            
## 3 1C4RJFDJF 24250            
## 4 1C4RJFDJF 24260            
## 5 1C4RJFDJF 22988            
## 6 1C4RJFDJF 22989

Mutate and case_when

I’m always creating new variables to filter with, and dplyr::case_when is a simple way to do it. For instance, if we wanted to place odometer_readings into buckets, we can create a bucket variable. The general syntax is case_when(condition ~ value, condition ~ value, TRUE ~ value), where TRUE is the catchall "everything else" condition. Conditions are evaluated in order, so a row is only tested against later conditions if it doesn’t match an earlier one. mutate will create our column, and we fill it with values from a case_when statement:

In our case, if the odometer_reading is under 25000, then we want case_when to assign 'Under 25K' to bucket; otherwise, continue checking down the list. Finally, assign 'Above 100K' to everything else.
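A sketch of that mutate + case_when combination. The bounds of the middle bucket are an assumption based on the labels:

```r
library(dplyr)

readings <- tibble(odometer_reading = c(40L, 30000L, 120000L))

# Conditions are checked top to bottom; TRUE catches everything left over
bucketed <- readings %>%
  mutate(bucket = case_when(
    odometer_reading < 25000   ~ "Under 25K",
    odometer_reading <= 100000 ~ "25K-100K",
    TRUE                       ~ "Above 100K"
  ))
```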

## # A tibble: 6 x 3
##   vin       odometer_readings bucket   
##   <chr>                 <int> <chr>    
## 1 1C4RJFDJF                40 Under 25K
## 2 1C4RJFDJF             10206 Under 25K
## 3 1C4RJFDJF             24250 Under 25K
## 4 1C4RJFDJF             24260 Under 25K
## 5 1C4RJFDJF             22988 Under 25K
## 6 1C4RJFDJF             22989 Under 25K

Min date

There are many cases where I need to filter a dataframe by the earliest date by variable. For instance, returning to our odometer_increases table:

## # A tibble: 6 x 4
##   vin       date       odometer_reading odometer_increase
##   <chr>     <date>                <int>             <int>
## 1 1C4RJFDJF 2014-11-12               40                NA
## 2 1C4RJFDJF 2015-05-21            10206             10166
## 3 1C4RJFDJF 2016-07-18            24250             14044
## 4 1C4RJFDJF 2016-07-25            24260                10
## 5 1C4RJFDJF 2017-02-01            22988             -1272
## 6 1C4RJFDJF 2017-03-15            22989                 1

If we want to select the earliest odometer reading by date, there are a few ways to do it. I am not convinced that any one is better than the others for speed or efficiency, but I will include them here since they all work:
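Three variations that each keep the earliest row per VIN, sketched on a made-up table (the original presumably used similar calls, since the three outputs below are identical):

```r
library(dplyr)

odometer_increases <- tibble(
  vin  = c("1C4RJFDJF", "1C4RJFDJF", "1FTYR10D4"),
  date = as.Date(c("2015-05-21", "2014-11-12", "2003-12-08")),
  odometer_reading = c(10206L, 40L, 10L)
)

# 1. Filter to the group-wise minimum date
first_filter <- odometer_increases %>%
  group_by(vin) %>%
  filter(date == min(date))

# 2. Slice out the row with the smallest date in each group
first_slice <- odometer_increases %>%
  group_by(vin) %>%
  slice(which.min(date))

# 3. Arrange by date, then take the first row per group
first_arrange <- odometer_increases %>%
  group_by(vin) %>%
  arrange(date) %>%
  slice(1)
```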

## # A tibble: 6 x 4
## # Groups:   vin [6]
##   vin       date       odometer_reading odometer_increase
##   <chr>     <date>                <int>             <int>
## 1 1C4RJFDJF 2014-11-12               40                NA
## 2 1FTYR10D4 2003-12-08               10                NA
## 3 2FTHF25GM 1994-04-06            45000                NA
## 4 4T3BA3BBC 2012-06-22               10                NA
## 5 5FPYK1F4B 2010-12-23                4                NA
## 6 JA4MW31R1 2000-10-27             1120                NA
## # A tibble: 6 x 4
## # Groups:   vin [6]
##   vin       date       odometer_reading odometer_increase
##   <chr>     <date>                <int>             <int>
## 1 1C4RJFDJF 2014-11-12               40                NA
## 2 1FTYR10D4 2003-12-08               10                NA
## 3 2FTHF25GM 1994-04-06            45000                NA
## 4 4T3BA3BBC 2012-06-22               10                NA
## 5 5FPYK1F4B 2010-12-23                4                NA
## 6 JA4MW31R1 2000-10-27             1120                NA
## # A tibble: 6 x 4
## # Groups:   vin [6]
##   vin       date       odometer_reading odometer_increase
##   <chr>     <date>                <int>             <int>
## 1 1C4RJFDJF 2014-11-12               40                NA
## 2 1FTYR10D4 2003-12-08               10                NA
## 3 2FTHF25GM 1994-04-06            45000                NA
## 4 4T3BA3BBC 2012-06-22               10                NA
## 5 5FPYK1F4B 2010-12-23                4                NA
## 6 JA4MW31R1 2000-10-27             1120                NA

Easily assemble dataframes

Sometimes, I create various statistics throughout my analysis and then I want to collect them into a dataframe so I can easily write it out to a CSV or a table for an email. tribble is amazing for this. You preface column names with a tilde (~) and add a new line for each row of data:
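A sketch of what the tribble call can look like; the values are rounded from the example output below:

```r
library(tibble)

# Each ~name is a column; each subsequent line is one row
stats <- tribble(
  ~Statistic,                          ~Number,
  "Average Minimum Odometer Reading",  19416,
  "Average Odometer Increase",         7283,
  "Average Odometer Rollback",         -54395
)
```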

## # A tibble: 3 x 2
##   Statistic                         Number
##   <chr>                              <dbl>
## 1 Average Minimum Odometer Reading  19416 
## 2 Average Odometer Increase          7283.
## 3 Average Odometer Rollback        -54395.

That’s all for now! I hope these tips were helpful!

I plan to return for future posts on other topics including:

  • How to work with the future and furrr packages to multithread graphql calls
  • How to connect to and join data between HDFS & Oracle using sparklyr and RStudio Connect
  • How to send monthly reports with graphs and data tables via mailR::send.mail