Day 9 - Lists and Apply

Spring 2023

Smith College

Overview

Timeline

  • Iteration Review
  • Loops vs Apply
  • The Apply Family
  • Use Case

Goal

To learn the differences and use cases for lists and the apply family of functions.

Iteration Review

In R, iterating on something is working through a vector one element at a time.

Vector = c(2, 4, 6, 8, 10)

  • Iteration 1: c(2, 4, 6, 8, 10), working on 2
  • Iteration 2: c(2, 4, 6, 8, 10), working on 4
  • Iteration 3: c(2, 4, 6, 8, 10), working on 6
  • Iteration 4: c(2, 4, 6, 8, 10), working on 8
  • Iteration 5: c(2, 4, 6, 8, 10), working on 10

for(X in Y) { Do Z }


Useful when:

  • We want to repeat the same operation several times
  • There is dependence on the outcome of previous operations
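
As a minimal sketch of the second point, the loop below builds a running total, where each iteration depends on the result of the previous one. The names `values` and `running_total` are illustrative and do not come from the slides.

```{r}
# the vector from the iteration example
values = c(2, 4, 6, 8, 10)

# pre-allocate a vector to hold the running total
running_total = numeric(length(values))

for(i in seq_along(values)){
  if(i == 1){
    # the first iteration has no earlier result to build on
    running_total[i] = values[i]
  } else {
    # every later iteration depends on the outcome of the previous one
    running_total[i] = running_total[i - 1] + values[i]
  }
}

running_total
```

The final vector is 2, 6, 12, 20, 30. Each entry can only be computed after the one before it, which is the situation where a loop is the natural tool.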

Loops vs Apply

Logic of Apply

The apply family of functions takes every element of a sequence and does the same thing to each one.

Anatomy of an Apply Function



apply(X, FUN = function)

Apply does the same thing to each element (roughly) all at once.

Apply FUN to element 1 in X.

Apply FUN to element 2 in X.

Apply FUN to element 3 in X.

Apply FUN to element 4 in X.

Apply FUN to element 5 in X.

Apply FUN to element 6 in X.

Apply FUN to element 7 in X.
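
A minimal sketch of this pattern, using lapply() from base R on an illustrative seven-element vector (the names `X` and `squared` are assumptions made for this example):

```{r}
# a sequence with seven elements
X = c(1, 2, 3, 4, 5, 6, 7)

# apply the same function to every element of X
squared = lapply(X = X, FUN = function(element){
  element^2
})

# one result per element, in the same order as X
unlist(squared)
```

The result holds 1, 4, 9, 16, 25, 36, 49: FUN was called once per element, and no call needed the result of another.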

Loops vs Apply

Loops

Loops iterate through every element of a sequence one element at a time.

This allows dependence: each iteration can use the results of earlier iterations.

  • Iteration 1: c(2, 4, 6, 8, 10), working on 2
  • Iteration 2: c(2, 4, 6, 8, 10), working on 4
  • Iteration 3: c(2, 4, 6, 8, 10), working on 6
  • Iteration 4: c(2, 4, 6, 8, 10), working on 8
  • Iteration 5: c(2, 4, 6, 8, 10), working on 10

Apply

Apply functions apply the given function to every element (roughly) at the same time.

This does not allow dependence: each call sees only its own element, not the results for the others.

c( 2, 4, 6, 8, 10 )

Apply Environments

for Loops

Apply

Apply only works on the content of the iterated object. No attributes!
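
One way to read the two points on this slide, sketched with lapply(): variables created in a for loop stay in the calling environment, variables created inside FUN do not, and FUN receives only the content of each element rather than the name it has in the list. The object names (`scores`, `tripled`, `doubled`, `seen_names`) are illustrative.

```{r}
scores = list(math = 90, art = 85)

# a for loop runs in the calling environment, so its variables stay around
for(score in scores){
  tripled = score * 3
}
exists("tripled")   # TRUE

# variables created inside FUN are local to each call
invisible(lapply(X = scores, FUN = function(score){
  doubled = score * 2
  doubled
}))
exists("doubled")   # FALSE in a fresh session

# FUN sees the content of each element, not its name in the list
seen_names = lapply(X = scores, FUN = function(score) names(score))
seen_names$math     # NULL, the name "math" is not passed along
```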

Apply Family

lapply

lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

For every column in mtcars, apply the mean() function. The first six rows of mtcars are shown for reference.

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```{r}
lapply(X = mtcars, FUN = mean)
```
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725

$qsec
[1] 17.84875

$vs
[1] 0.4375

$am
[1] 0.40625

$gear
[1] 3.6875

$carb
[1] 2.8125
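
As a small sketch of how to work with that list afterwards (the name `column_means` is illustrative):

```{r}
column_means = lapply(X = mtcars, FUN = mean)

# the result is a list with one element per column of mtcars
is.list(column_means)
length(column_means) == ncol(mtcars)

# elements keep the column names, so they can be pulled out by name
column_means$mpg

# unlist() flattens the list into a named numeric vector
unlist(column_means)
```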

sapply

sapply is similar to lapply, but it returns a vector if it can. Be careful, as its results can surprise you!

For every column in mtcars, apply the mean() function.

```{r}
sapply(X = mtcars, FUN = mean, simplify = TRUE)
```
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 
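
To make the warning concrete, here is a hedged sketch of three calls that differ only in FUN yet return three different kinds of object. The names `col_means`, `col_ranges`, and `above_30` are illustrative.

```{r}
# one value per column: simplified to a named numeric vector
col_means = sapply(X = mtcars, FUN = mean)
class(col_means)    # "numeric"

# two values per column: simplified to a matrix, not a vector
col_ranges = sapply(X = mtcars, FUN = range)
dim(col_ranges)     # 2 11

# results of different lengths cannot be simplified, so a list comes back
above_30 = sapply(X = mtcars, FUN = function(column){
  column[column > 30]
})
class(above_30)     # "list"
```

Code that expects a vector back from sapply can therefore break when the data or FUN changes, which is why checking the class of the result is a reasonable habit.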

apply

apply is used for matrices or dataframes (a dataframe is converted to a matrix first). Supply the MARGIN argument to choose the direction: MARGIN = 1 works over rows and MARGIN = 2 works over columns.

For every column and then every row in mtcars, apply the mean() function.

Columns

```{r}
apply(X = mtcars, MARGIN = 2, FUN = mean)
```
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

Rows

```{r}
apply(X = head(mtcars), MARGIN = 1, FUN = mean)
```
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
         29.90727          29.98136          23.59818          38.73955 
Hornet Sportabout           Valiant 
         53.66455          35.04909 
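
The effect of MARGIN is easier to check by hand on a small matrix. The sketch below uses made-up quiz grades; the matrix `grades` and its row and column names are hypothetical and not part of the course data.

```{r}
# a small 2 x 3 matrix: rows are students, columns are quizzes
grades = matrix(c(80, 90, 70, 100, 60, 75), nrow = 2,
                dimnames = list(c("student_1", "student_2"),
                                c("quiz_1", "quiz_2", "quiz_3")))

# MARGIN = 1 works over rows: one mean per student
student_means = apply(X = grades, MARGIN = 1, FUN = mean)
student_means

# MARGIN = 2 works over columns: one mean per quiz
quiz_means = apply(X = grades, MARGIN = 2, FUN = mean)
quiz_means
```

The length of the result follows the margin: two values for the two rows, three values for the three columns.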

You can write FUN!

You can pass any function to FUN, including one you write!


This means you can run any computation you can write as a function over every element of a large collection of data.

```{r}
lapply(X = mtcars, FUN = function(column){
  
  # get the largest value in the column
  largest = max(column)
  
  # get the smallest value in the column
  smallest = min(column)
  
  # get the difference
  result = largest - smallest
  
  # return the difference
  return(result)
})
```
$mpg
[1] 23.5

$cyl
[1] 4

$disp
[1] 400.9

$hp
[1] 283

$drat
[1] 2.17

$wt
[1] 3.911

$qsec
[1] 8.4

$vs
[1] 1

$am
[1] 1

$gear
[1] 2

$carb
[1] 7
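
FUN also accepts a function that was defined earlier and is passed by name, which keeps the call short when the same function is reused. A minimal sketch, where `get_spread` and `spreads` are hypothetical names and sapply() is swapped in only so the result prints as a vector:

```{r}
# a helper written by us: largest value minus smallest value
get_spread = function(column){
  max(column) - min(column)
}

# pass the function by name instead of writing it inline
spreads = sapply(X = mtcars, FUN = get_spread)
spreads["mpg"]
```

The value for mpg is 23.5, the same as the `$mpg` element returned by the inline version.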

Use Case

Benchmarks

```{r}
penguins = palmerpenguins::penguins

rbenchmark::benchmark(
"dplyr" = {
  penguins |>
    dplyr::group_by(species) |>
    dplyr::group_map(~lm(bill_length_mm ~ bill_depth_mm, data = .x))
},
"purrr" = {
  penguins |>
    dplyr::group_split(species) |>
    purrr::map(lm, formula = bill_length_mm ~ bill_depth_mm)
},
"apply" = {
  by(penguins, INDICES = factor(penguins$species), FUN = lm,
     formula = bill_length_mm ~ bill_depth_mm)
},
"for" = {
  for(species in unique(penguins$species)){
    lm(bill_length_mm ~ bill_depth_mm, data = penguins[penguins$species == species, ])
  }
}, replications = 100)
```
   test replications elapsed relative user.self sys.self user.child sys.child
3 apply          100   0.261    1.000     0.261    0.000          0         0
1 dplyr          100   1.134    4.345     1.115    0.020          0         0
4   for          100   0.519    1.989     0.519    0.000          0         0
2 purrr          100   0.777    2.977     0.773    0.004          0         0

Parallel

Flexibility

Because lapply() treats its input as a list, it can be used on vectors, dataframes, and lists alike.

  • Vectors: one element at a time
  • Dataframes: dataframes are lists, so lapply() works one column at a time
  • Lists: one element at a time. Can be a list of dataframes!


Output of lapply() is always a list.
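
A short sketch of the three cases, using base R and the built-in mtcars data. The object names (`from_vector`, `from_dataframe`, `by_cyl`, `from_list`) are illustrative.

```{r}
# a vector: one element at a time
from_vector = lapply(X = c(2, 4, 6), FUN = function(x) x * 10)

# a dataframe: one column at a time
from_dataframe = lapply(X = mtcars[, c("mpg", "hp")], FUN = max)

# a list of dataframes: one dataframe at a time
by_cyl = split(mtcars, mtcars$cyl)
from_list = lapply(X = by_cyl, FUN = nrow)

# whatever the input was, the output is a list
is.list(from_vector)
is.list(from_dataframe)
is.list(from_list)
```

Here split() produces one dataframe per value of cyl, so `from_list` holds the number of cars with 4, 6, and 8 cylinders.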

Code-Along

For Next Time

Topic

Lab 3 & Quiz 1

To-Do

  • Finish Worksheet