Spring 2023
Smith College
To understand the pros and cons of parallelizing code.
I needed parallelization when doing network simulations. 10,000 simulations would have taken weeks otherwise.
Every hardware component in a computer has a maximum speed.
Background R sessions (sockets): work everywhere, slightly less efficient
Forked R processes: do not work on Windows
Several packages let you run code in parallel
The future package lets you write parallelized code now and decide how it will be parallelized later.
This also means we can all write the same parallelized code, regardless of operating system.
In the future, you can also take the same code, and say “go run this on some other machine”!
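To make the "write now, decide later" idea concrete, here is a minimal sketch, assuming the future package is installed. The work inside the braces is a made-up placeholder; only plan(), future(), and value() come from the package.

```r
library(future)

# Decide *how* futures are run. sequential means "in the current R process".
plan(sequential)

# Write the work once as a future (placeholder computation for illustration)
f <- future({
  sum(1:10)
})

# Collect the result when it is needed
result <- value(f)
print(result)
```

Swapping plan(sequential) for another plan should leave the rest of the script unchanged, which is the point of separating the two decisions.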
Plans
Name | OSes | Description |
synchronous: | non-parallel: | |
sequential | all | sequentially and in the current R process |
asynchronous: | parallel: | |
multisession | all | background R sessions (on current machine) |
multicore | not Windows/not RStudio | forked R processes (on current machine) |
cluster | all | external R sessions on current, local, and/or remote machines |
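The plans in the table can be switched without touching the code that does the work. The sketch below assumes the future and future.apply packages are installed; square() is a hypothetical helper and workers = 2 is an arbitrary choice for illustration.

```r
library(future)
library(future.apply)

# Hypothetical helper used only for this illustration
square <- function(x) x^2

# Synchronous: runs in the current R process
plan(sequential)
seq_result <- future_sapply(1:4, square)

# Asynchronous: runs in background R sessions on this machine
plan(multisession, workers = 2)
par_result <- future_sapply(1:4, square)

# Switch back so the background sessions are shut down
plan(sequential)

# The same code should give the same answer under either plan
print(identical(seq_result, par_result))
```

Only the plan() calls differ between the two runs; the future_sapply() call is identical.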
How long will each of these take?
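One way to answer this kind of question is to time the same work under two plans. This is a sketch only, assuming future and future.apply are installed; slow_identity() is a hypothetical stand-in for real work, and the timings will vary by machine, so none are quoted here.

```r
library(future)
library(future.apply)

# Hypothetical task: wait half a second, then return the input
slow_identity <- function(x) {
  Sys.sleep(0.5)
  x
}

# Four tasks one after another: roughly 4 x 0.5 seconds
plan(sequential)
seq_time <- system.time(seq_out <- future_sapply(1:4, slow_identity))

# Four tasks split over two background sessions, plus start-up overhead
plan(multisession, workers = 2)
par_time <- system.time(par_out <- future_sapply(1:4, slow_identity))

plan(sequential)

print(seq_time[["elapsed"]])
print(par_time[["elapsed"]])
```

The parallel run is not guaranteed to be faster: starting background sessions and sending data to them has a cost of its own.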
When running multiple streams of code at the same time, all streams need all the data. This is where our first big slow-down appears.
How long will each of these take?
num_vec = c(1, 2, 3, 4, 5, 6, 7, 8)
# Large vector that is passed along but never used inside the function
large_data = c(1:999999999)
sapply(num_vec, FUN = function(num, dead_weight){
  # Say number
  print(paste0("The number is ", num,
               "! I'll wait that many seconds."))
  # wait that many seconds
  Sys.sleep(num)
  # return number
  return(num)
}, dead_weight = large_data)
818.9 seconds (13.65 minutes), and it froze my desktop the entire time.
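Since every stream needs its own copy of the data, it can help to check how large an object is before handing it to parallel code. The sketch below uses only base R; big_data and needed are hypothetical objects, and the assumption that the workers only need a summary is made up for illustration.

```r
# Hypothetical large object (about one million doubles)
big_data <- rnorm(1e6)

# How much memory would each stream have to receive?
print(object.size(big_data), units = "MB")

# Assumption for illustration: the workers only need a summary of the data,
# so compute it once up front and pass that instead
needed <- mean(big_data)
print(object.size(needed), units = "B")
```

Passing the small object rather than the large one keeps the copying cost down, whichever plan is used.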
Parallelization can be really handy in some situations, but it is not a silver bullet.
It is also one of the few areas of coding where there is a real danger of causing harm.
If you parallelize your code poorly, you can soak up all the resources in shared environments.
At worst you can crash your own machine, or a shared machine.
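One common precaution is to cap the number of workers instead of using every core. This is a sketch under assumptions, not a rule: it assumes the future package is installed, and leaving one core free is an arbitrary choice made for illustration.

```r
library(future)

# Cores available to this R session (may be fewer than the machine has
# on shared systems)
n_cores <- unname(availableCores())

# Assumption for illustration: leave one core free, but use at least one worker
n_workers <- max(1, n_cores - 1)

plan(multisession, workers = n_workers)
print(nbrOfWorkers())

# Return to sequential processing when finished
plan(sequential)
```

On a shared machine, check local policy before choosing a worker count; the right number depends on what other people are running.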
SDS 270: Advanced Programming for Data Science