Day 14 - Parallel

Spring 2023

Smith College

Overview

Timeline

  • Parallelization Overview
  • Parallelization in R
  • No Free Lunch

Goal

To understand the pros and cons of parallelizing code.

Parallelization Overview

What is Parallelization

Why Parallelize?

Credit: Matt Jones

The Costs of Parallelization

Credit: Daniels220 @ WikiCommons

Use Cases of Parallel

I needed parallelization when doing network simulations. 10,000 simulations would have taken weeks otherwise.

Parallelization in R

A Primer on Computers

Every hardware component in a computer has a maximum speed.

CPU

RAM

Hard Drive

Approaches to Parallelization

Processes

Works everywhere, slightly less efficient

Cores

Does not work on Windows
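
The two approaches can be sketched with the built-in parallel package. This is a minimal sketch, assuming that "Cores" here refers to forking. `slow_square` is a hypothetical stand-in for real work, and the worker count of 2 is an arbitrary choice.

```r
library(parallel)

# Hypothetical stand-in for a slow computation
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

inputs <- 1:4

# Approach 1: separate R processes (works on every OS).
# Each worker is a fresh R session, so inputs are copied over to it.
cl <- makeCluster(2)
res_processes <- parLapply(cl, inputs, slow_square)
stopCluster(cl)

# Approach 2: forking (not available on Windows).
# Forked workers start as copies of the current session.
if (.Platform$OS.type != "windows") {
  res_forked <- mclapply(inputs, slow_square, mc.cores = 2)
}
```

Both calls return a list of results in the same order as `inputs`.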

Parallel Packages in R

Several packages let you run code in parallel


parallel
Built-in with R, but slightly clunky. Everything else builds off this.


foreach
Slightly easier, lets you parallelize for loops (kinda)


future
Easier to use with hot-swap parallel methods

future Package

The future package lets you write parallelized code now, and decide how it will be parallelized later.


This also means we can all write the same parallelized code, regardless of operating system.


In the future, you can also take the same code, and say “go run this on some other machine”!
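
The "write now, decide later" idea might look like the sketch below. `slow_double` is a hypothetical stand-in for real work, and `workers = 2` is an assumed value. Only the `plan()` line changes between the two runs.

```r
library(future.apply)  # also attaches the future package

# Hypothetical stand-in for real work
slow_double <- function(x) {
  Sys.sleep(0.1)
  x * 2
}

# Decide how to run: sequentially for now
plan(sequential)
res_seq <- future_sapply(1:4, slow_double)

# Same code, different plan: two background R sessions (assumed count)
plan(multisession, workers = 2)
res_par <- future_sapply(1:4, slow_double)

# Return to sequential so the background sessions are shut down
plan(sequential)

# The results should not depend on the plan
identical(res_seq, res_par)
```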

future Plans

Name           OSes                      Description
synchronous:                             non-parallel:
sequential     all                       sequentially and in the current R process
asynchronous:                            parallel:
multisession   all                       background R sessions (on current machine)
multicore      not Windows/not RStudio   forked R processes (on current machine)
cluster        all                       external R sessions on current, local, and/or remote machines
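
One way to choose between the plans above is sketched here, assuming the parallelly package (installed alongside future) is available. The worker count of 2 is an arbitrary assumption.

```r
library(future)

# Forking is only an option where the table says so (not Windows, not RStudio)
if (parallelly::supportsMulticore()) {
  plan(multicore, workers = 2)
} else {
  plan(multisession, workers = 2)
}

# How many workers the current plan provides
n_workers <- nbrOfWorkers()

# Back to the non-parallel default
plan(sequential)
```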

No Free Lunch

Toy Example

How long will each of these take?

Regular R Code

library(tictoc)

num_vec = c(1, 2, 3, 4, 5, 6, 7, 8)

tic()

sapply(num_vec, FUN = function(num){
  
  # Say number
  print(paste0("The number is ", num,
          "! I'll wait that many seconds."))
  
  # wait that many seconds
  Sys.sleep(num)
  
  # return number
  return(num)
  
})

toc()

Parallelized R Code

library(future.apply)
plan(multisession)

num_vec = c(1, 2, 3, 4, 5, 6, 7, 8)

tic()

future_sapply(num_vec, FUN = function(num){
  
  # Say number
  print(paste0("The number is ", num,
          "! I'll wait that many seconds."))
  
  # wait that many seconds
  Sys.sleep(num)
  
  # return number
  return(num)
  
})

toc()
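
A back-of-the-envelope estimate for the two versions, ignoring start-up and communication overhead. The sequential run waits for every element in turn. A parallel run with one worker per element (an assumption; the real worker count depends on your machine) waits only as long as the slowest element.

```r
num_vec <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Sequential: the waits add up
sequential_estimate <- sum(num_vec)

# Parallel with 8 workers (assumed): the longest single wait dominates
parallel_estimate <- max(num_vec)

c(sequential = sequential_estimate, parallel = parallel_estimate)
```

That is roughly 36 seconds against 8 seconds. With fewer than 8 workers, the parallel time falls somewhere in between.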

Where it Falls Short

When running multiple streams of code at the same time, all streams need all the data. With background R sessions, that means copying the data to each one, which costs time and memory. This is where our first big slow-down appears.
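
To get a feel for that cost, the sketch below measures how big an object is and how long it takes to serialize, which is roughly what happens before data is shipped to a background session. `big` is a hypothetical stand-in, far smaller than real "dead weight", and the worker count of 4 is assumed.

```r
# Hypothetical stand-in for a large object: one million random numbers
big <- rnorm(1e6)

# Size of one copy, in megabytes
size_mb <- as.numeric(object.size(big)) / 1024^2

# Every worker gets its own copy (assumed: 4 workers)
n_workers <- 4
total_mb <- size_mb * n_workers

# Time to turn the object into bytes that can be sent to another process
elapsed <- system.time(
  payload <- serialize(big, connection = NULL)
)[["elapsed"]]

c(size_mb = size_mb, total_mb = total_mb, seconds_to_serialize = elapsed)
```

Both the memory and the time grow with the size of the object and the number of workers.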

Meaty Example

How long will each of these take?

Regular R Code

num_vec = c(1, 2, 3, 4, 5, 6, 7, 8)
large_data = c(1:999999999)

tic()

sapply(num_vec, FUN = function(num, dead_weight){
  
  # Say number
  print(paste0("The number is ", num,
          "! I'll wait that many seconds."))
  
  # wait that many seconds
  Sys.sleep(num)
  
  # return number
  return(num)
  
}, dead_weight = large_data)

toc()

Parallelized R Code

tic()

future_sapply(num_vec, FUN = function(num, dead_weight){
  
  # Say number
  print(paste0("The number is ", num,
          "! I'll wait that many seconds."))
  
  # wait that many seconds
  Sys.sleep(num)
  
  # return number
  return(num)
  
}, dead_weight = large_data)

toc()

Result: 818.9 seconds (13.65 minutes), and it froze my desktop the entire time.

Actual Warning

Parallelization can be really handy in some situations, but it is not a silver bullet.


It is also one of the few areas of coding where there is a real danger of causing harm.


If you parallelize your code poorly, you can soak up all the resources in shared environments.


At worst you can crash your own machine, or a shared machine.
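
One way to be a good neighbor on a shared machine is to cap the number of workers instead of taking everything. This is a sketch, not a rule: the cap of 4 and the "leave one core free" choice are assumptions, and it uses `availableCores()` from the parallelly package that is installed alongside future.

```r
library(future)

# Cores this session is allowed to use
cores <- parallelly::availableCores()

# Leave one core free and never ask for more than 4 (arbitrary cap)
n_workers <- max(1, min(4, cores - 1))

plan(multisession, workers = n_workers)

# ... parallel work goes here ...

# Release the background sessions when finished
plan(sequential)
```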

Code-Along

For Next Time