Day 30 - Benchmarks

Spring 2023

Smith College

Overview

Timeline

  • Pull Request Review
  • What is Benchmarking
  • Benchmark Tools
  • Macro Profiling

Goal

Learn some tools to measure and compare code segments.

Pull Request Review

Common Pull Request Flow



  1. Work on your branch
  2. Push your changes on that branch
  3. Go to GitHub and open a pull request to merge that branch into main
  4. Describe the work you have done
  5. Request review
  6. Act on feedback
  7. Confirm pull request

What is Benchmarking

Benchmarks in a Nutshell

A benchmark is a controlled test to see how well your code runs.


You are trying to quantify the efficiency of your code, so you can compare it against other methods.


You are also often looking for bottlenecks, or parts of your code that slow down the whole process.

Previous Uses

You may recall we’ve actually done some simple benchmarks before!


When we were talking about parallelizing code, we used tictoc to see how long it took code to run.


That’s pretty helpful on its own!

library(tictoc)

num_vec = c(1, 2, 3, 4, 5, 6, 7, 8)

tic()

sapply(num_vec, FUN = function(num){
  
  # Say number
  print(paste0("The number is ", num,
          "! I'll wait that many seconds."))
  
  # wait that many seconds
  Sys.sleep(num)
  
  # return number
  return(num)
  
})

toc()

Running sequentially, this takes about 37 seconds on my desktop.

Not Just Time

We want efficient code, not just fast.

CPU

RAM

Hard drive
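A minimal sketch of looking at more than elapsed time, using only base R. The vectors and the temporary file are made up for illustration; `system.time()` separates CPU time from wall-clock time, `object.size()` estimates memory use, and `file.size()` reports space on disk.

```r
# Two vectors with the same length but different storage types
int_vec = integer(1e6)   # integers use 4 bytes per element
dbl_vec = numeric(1e6)   # doubles use 8 bytes per element

# RAM: estimate how much memory each object occupies
int_size = object.size(int_vec)
dbl_size = object.size(dbl_vec)
print(int_size, units = "MB")
print(dbl_size, units = "MB")

# CPU: "user" and "system" are CPU time, "elapsed" is wall-clock time
timing = system.time(sort(runif(1e6)))
print(timing)

# Hard drive: write an object to a temporary file and check its size in bytes
tmp_file = tempfile(fileext = ".rds")
saveRDS(dbl_vec, tmp_file)
disk_bytes = file.size(tmp_file)
print(disk_bytes)
```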

Benchmark Tools

The bench Package

The bench package provides a simple function, mark(), to track how long code takes to run.


It will run the contained code several times and summarize the timings (for example, the minimum and median).


It will also tell you how much memory the code uses.

num_vec = c(1, 2, 3, 4, 5, 6, 7, 8)

bench::mark({

  sapply(num_vec, FUN = function(num){

    # Say number
    print(paste0("The number is ", num,
                 "! I'll wait that many seconds."))

    # wait that many seconds
    Sys.sleep(num)

    # return number
    return(num)

  })

})

Each run still takes about 37 seconds on my desktop, but bench::mark() can repeat the code and summarize the timings.
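Because the example above spends most of its time sleeping, a faster comparison may be easier to experiment with. The sketch below assumes the bench package is installed; `loop_sum()` is a hypothetical helper written only for this illustration.

```r
x = as.numeric(1:1000)

# Hypothetical helper for illustration: sum with an explicit loop
loop_sum = function(values){
  total = 0
  for(value in values){
    total = total + value
  }
  return(total)
}

# Each named argument is one expression to benchmark.
# By default mark() also checks that the expressions return equal results.
sum_bench = bench::mark(
  loop = loop_sum(x),
  builtin = sum(x),
  iterations = 100
)

# Keep the columns used in these slides
sum_bench[, c("expression", "min", "median", "itr/sec", "mem_alloc")]
```

Passing several expressions to one mark() call puts them in the same results table, which makes side-by-side comparison simpler.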

Who will win?

Apply

apply_bench = bench::mark({
  apply_list = lapply(survey[, grepl("_days", colnames(survey))], comma_split,
                      possible_columns = c("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"))
}, iterations = 200)

For Loop

loop_bench = bench::mark({
  loop_list = list()
  days_df = survey[, grepl("_days", colnames(survey))]
  
  for(i in 1:ncol(days_df)){
    loop_list[[i]] = comma_split(days_df[, i],
                                 possible_columns = c("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"))
  }
}, iterations = 200)

Tidy

tidy_bench = bench::mark({
  purrr_list = purrr::map(survey[, grepl("_days", colnames(survey))], comma_split,
           possible_columns = c("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"))
}, iterations = 200)

Results




Type       min       median    itr/sec     mem_alloc
Apply      4.84ms    4.98ms    196.80240   121.9KB
For Loop   14.43ms   14.85ms   65.95779    96.1KB
purrr      4.92ms    5.06ms    193.52338   393.6KB
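One way to read these results is to express each row relative to the best value in its column. The sketch below copies the median and mem_alloc values from the table above into a data frame by hand; the column names are chosen here for illustration.

```r
# Values copied from the results table
results = data.frame(
  type = c("Apply", "For Loop", "purrr"),
  median_ms = c(4.98, 14.85, 5.06),
  mem_alloc_kb = c(121.9, 96.1, 393.6)
)

# Express each row relative to the fastest / smallest entry
results$relative_time = round(results$median_ms / min(results$median_ms), 2)
results$relative_mem = round(results$mem_alloc_kb / min(results$mem_alloc_kb), 2)

results
```

On these numbers the for loop's median time is roughly three times that of lapply(), while the purrr version allocates about four times as much memory as the for loop.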

Macro Profiling

Benchmarks vs Profile

The difference is a semantic one.


Broadly, benchmarks look at individual functions, while profiling looks at whole code flows.


Think of it as the difference between tracking a sprint and a marathon.

Benchmark == Sprint

Profile == Marathon

Profile Example

Here is the profile of a section of my research code on asset forfeiture.


The top part of the diagram shows how long each line of code takes, as well as memory usage.


The bottom plots how many functions are being run nested inside each other.

Profiling in R

There are several tools to profile with in R. I will recommend profvis.


It creates interactive profile plots that give you an at-a-glance view of code sequences.


I’ve included a small example here:

profvis::profvis({
  # read in class survey
  survey = readRDS(url("https://github.com/Adv-R-Programming/Adv-R-Reader/raw/main/class_survey.rds"))
  # wait a moment
  Sys.sleep(1)
  # use comma_split
  comma_split(survey$pets,
              possible_columns = c("dog", "cat", "fish", "bird", "reptile", "rock", "none"))
})
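If profvis is not available, base R's sampling profiler gives a text-only view of similar information. This is a rough sketch rather than a replacement for the interactive plot; `slow_sort()` is a hypothetical workload made up for the example.

```r
# Hypothetical workload for illustration: repeatedly sort random numbers
slow_sort = function(n_rounds){
  for(i in seq_len(n_rounds)){
    sort(runif(5e5))
  }
  invisible(NULL)
}

# Write profiling samples to a temporary file
prof_file = tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
slow_sort(20)
Rprof(NULL)  # stop profiling

# Summarize which functions the samples landed in
prof_summary = summaryRprof(prof_file)
head(prof_summary$by.self)
```

The profiler works by sampling, so very fast code may produce too few samples to summarize; the workload here is sized to run long enough to be seen.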

Hints for Efficient Code

A few rules of thumb for more efficient code:

  • Don’t repeat yourself!
  • Subset data to only what you will use before running analyses (no need to analyse data that will be tossed out)
  • Pre-specify anything you can (e.g. manually set levels in factors, column names when reading in large data, what method you want for generic functions)
  • Pre-allocate for outputs (never append to lists/dataframes)
  • Use vectorized operations whenever possible
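The last two rules of thumb can be sketched with base R alone. The example below builds the same vector of squares three ways; the task and variable names are made up for illustration, and the exact timings will vary by machine.

```r
n = 1e4

# Growing a vector: R may copy the vector each time it gets longer
grow_time = system.time({
  grown = c()
  for(i in seq_len(n)){
    grown = c(grown, i^2)
  }
})

# Pre-allocating: the output already has its final length
prealloc_time = system.time({
  prealloc = numeric(n)
  for(i in seq_len(n)){
    prealloc[i] = i^2
  }
})

# Vectorized: one call operates on the whole vector
vector_time = system.time({
  vectorized = seq_len(n)^2
})

# All three approaches give the same answer
all.equal(grown, prealloc)
all.equal(prealloc, vectorized)

# Compare elapsed seconds
c(grow = grow_time[["elapsed"]],
  prealloc = prealloc_time[["elapsed"]],
  vectorized = vector_time[["elapsed"]])
```

For more stable numbers than a single system.time() call, the same three expressions could be passed to bench::mark().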

Code-Along

For Next Time

Topic

Lab 8

To-Do

  • Finish Worksheet