Iteration allows us to tell R to work on whole sets of things at once: multiple files, multiple columns, multiple whatever. This can save quite a bit of time. It also lets us work on problems with dependence, where the decision at each step depends on the result of the previous step.
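To make the idea of dependence concrete, here is a minimal sketch. It is not part of the worksheet data; the deposits vector and the variable names are made up for illustration. Each pass through the loop needs the balance left over from the pass before it.

```r
# Hypothetical example: a running balance, where each step
# depends on the result of the previous step.
deposits = c(10, -4, 7, -2)
balance = numeric(length(deposits))

for (i in seq_along(deposits)) {
  # the first step starts from zero; later steps start from the previous balance
  previous = if (i == 1) 0 else balance[i - 1]
  balance[i] = previous + deposits[i]
}

balance
```

Because every element of balance is built from the one before it, the steps have to run in order, which is exactly what a for() loop gives us.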
For our worksheet today, we are going to be solving some annoyances of
the past. I am going to walk you through modifying our pet_split()
function from last week’s functions worksheet to make it even more
generalizable.
We are going to be using class survey data for lab today. Please load it in using the following:
survey = readRDS(url("https://github.com/Adv-R-Programming/Adv-R-Reader/raw/main/class_survey.rds"))
pet_split() Function
Recall from last week that our pet_split() function looked at the pets column in our survey dataframe and tidied up the column so that, instead of having a single character string with multiple pets in it, we had a dataframe with TRUE and FALSE for each pet type, along with “other.” You can see the finished function below:
pet_split = function(pet_vector) {
  # make new dataframe for output
  pet_output = data.frame(
    "id" = 1:length(pet_vector),
    "dog" = NA,
    "cat" = NA,
    "fish" = NA,
    "bird" = NA,
    "reptile" = NA,
    "rock" = NA,
    "none" = NA,
    "other" = NA)
  # get a binary for each known pet type
  pet_output$dog = grepl(pattern = "dog", x = pet_vector, ignore.case = TRUE)
  pet_output$cat = grepl(pattern = "cat", x = pet_vector, ignore.case = TRUE)
  pet_output$fish = grepl(pattern = "fish", x = pet_vector, ignore.case = TRUE)
  pet_output$bird = grepl(pattern = "bird", x = pet_vector, ignore.case = TRUE)
  pet_output$reptile = grepl(pattern = "reptile", x = pet_vector, ignore.case = TRUE)
  pet_output$rock = grepl(pattern = "rock", x = pet_vector, ignore.case = TRUE)
  pet_output$none = grepl(pattern = "none", x = pet_vector, ignore.case = TRUE)
  # remove all known pets and clean remaining text
  pet_vector = gsub(pattern = "dog", pet_vector, replacement = "", ignore.case = TRUE)
  pet_vector = gsub(pattern = "cat", pet_vector, replacement = "", ignore.case = TRUE)
  pet_vector = gsub(pattern = "fish", pet_vector, replacement = "", ignore.case = TRUE)
  pet_vector = gsub(pattern = "bird", pet_vector, replacement = "", ignore.case = TRUE)
  pet_vector = gsub(pattern = "reptile", pet_vector, replacement = "", ignore.case = TRUE)
  pet_vector = gsub(pattern = "rock", pet_vector, replacement = "", ignore.case = TRUE)
  pet_vector = gsub(pattern = "none", pet_vector, replacement = "", ignore.case = TRUE)
  pet_vector = gsub(pattern = ",", pet_vector, replacement = "", ignore.case = TRUE)
  pet_vector = trimws(pet_vector)
  # Fill in "other"
  pet_output$other = pet_vector
  # Turn blanks into NAs
  pet_output[pet_output$other == "", "other"] = NA
  # return
  return(pet_output)
}
pet_split(pet_vector = survey$pets)
id dog cat fish bird reptile rock none other
1 1 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
2 2 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
3 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
4 4 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
5 5 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
6 6 TRUE TRUE FALSE FALSE FALSE TRUE FALSE <NA>
7 7 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
8 8 TRUE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
9 9 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
10 10 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
11 11 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
12 12 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
13 13 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
14 14 TRUE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
15 15 TRUE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
16 16 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
17 17 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
18 18 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
19 19 FALSE FALSE FALSE FALSE TRUE FALSE FALSE <NA>
20 20 FALSE FALSE FALSE FALSE FALSE TRUE TRUE <NA>
21 21 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
22 22 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
23 23 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
24 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE Robot Vacuum named Tobie
Now, that is cool, but we can make it better. Specifically, if we look at our survey dataframe, we have the exact same problem in our <DRINK>_days columns and the recreation column. By the end of this worksheet, our pet_split() function will work on any column with comma-separated values.
Our first step is going to be writing the code to accomplish what we want, then we can package it as a function. I’ve gutted our pet_split() function below. We will be starting from that base and working to make it so that we never call for anything specific to pets in our code. For example, instead of coding all of the possibilities of pet (dog, cat, fish, bird, reptile, rock, none) inside the function itself, we want to write our code such that it can accept any list of possibilities as an argument and work from that.
# set up a pseudo-argument
pet_vector = survey$pets
# make new dataframe for output
pet_output = data.frame(
  "id" = 1:length(pet_vector),
  "dog" = NA,
  "cat" = NA,
  "fish" = NA,
  "bird" = NA,
  "reptile" = NA,
  "rock" = NA,
  "none" = NA,
  "other" = NA)
# get a binary for each known pet type
pet_output$dog = grepl(pattern = "dog", x = pet_vector, ignore.case = TRUE)
pet_output$cat = grepl(pattern = "cat", x = pet_vector, ignore.case = TRUE)
pet_output$fish = grepl(pattern = "fish", x = pet_vector, ignore.case = TRUE)
pet_output$bird = grepl(pattern = "bird", x = pet_vector, ignore.case = TRUE)
pet_output$reptile = grepl(pattern = "reptile", x = pet_vector, ignore.case = TRUE)
pet_output$rock = grepl(pattern = "rock", x = pet_vector, ignore.case = TRUE)
pet_output$none = grepl(pattern = "none", x = pet_vector, ignore.case = TRUE)
# remove all known pets and clean remaining text
pet_vector = gsub(pattern = "dog", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "cat", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "fish", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "bird", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "reptile", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "rock", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "none", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = ",", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = trimws(pet_vector)
# Fill in "other"
pet_output$other = pet_vector
# Turn blanks into NAs
pet_output[pet_output$other == "", "other"] = NA
The first step of our current code is to create a dataframe for our outputs. We still want to do that, but without defining each possibility ourselves inside the function. Instead, we will provide a vector of possibilities, and have R iterate through those to make our columns. We can use a for() loop for that.
First, we’ll create a vector of our possibilities, in this case our pets:
possible_columns = c("dog", "cat", "fish", "bird", "reptile", "rock", "none")
Next, we need code to iterate through those possibilities and create a dataframe from them. We’ll start with what we know: a column for IDs which has as many rows as our intended input, pet_vector from above. Next, we will iterate through all possible options and make a new column for each. Here I iterate through our possible_columns vector, and for each element (option in the loop) I create a column of NAs.
# make a base dataframe with rows for each of our cases.
pet_output = data.frame(
  "id" = 1:length(pet_vector)
)
# iterate through all options and create a column with NAs for it
for(option in possible_columns){
  # make a new column with a character version of each possible option.
  pet_output[, as.character(option)] = NA
}
If we look at our output now, it is exactly the same as if we made each column ourselves, but now it is done by providing a vector of options. And we can change those options to whatever we want. This will come in handy later.
pet_output
id dog cat fish bird reptile rock none
1 1 NA NA NA NA NA NA NA
2 2 NA NA NA NA NA NA NA
3 3 NA NA NA NA NA NA NA
4 4 NA NA NA NA NA NA NA
5 5 NA NA NA NA NA NA NA
6 6 NA NA NA NA NA NA NA
7 7 NA NA NA NA NA NA NA
8 8 NA NA NA NA NA NA NA
9 9 NA NA NA NA NA NA NA
10 10 NA NA NA NA NA NA NA
11 11 NA NA NA NA NA NA NA
12 12 NA NA NA NA NA NA NA
13 13 NA NA NA NA NA NA NA
14 14 NA NA NA NA NA NA NA
15 15 NA NA NA NA NA NA NA
16 16 NA NA NA NA NA NA NA
17 17 NA NA NA NA NA NA NA
18 18 NA NA NA NA NA NA NA
19 19 NA NA NA NA NA NA NA
20 20 NA NA NA NA NA NA NA
21 21 NA NA NA NA NA NA NA
22 22 NA NA NA NA NA NA NA
23 23 NA NA NA NA NA NA NA
24 24 NA NA NA NA NA NA NA
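As a quick check that the options really can be changed to whatever we want, here is a small sketch that runs the same loop with days of the week instead of pets. The names day_columns, n_cases, and day_output are made up for this illustration, and the three rows stand in for real survey responses.

```r
# Hypothetical options: days of the week instead of pets
day_columns = c("monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday")

# pretend we have three cases (an assumption for this sketch)
n_cases = 3
day_output = data.frame("id" = 1:n_cases)

# same loop as above, only the vector of options has changed
for (option in day_columns) {
  day_output[, option] = NA
}

names(day_output)
```

The loop body never mentions a specific pet or day, so only the vector of options decides which columns get created.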
Our next step is to test for each possible option (in this case types of pets) and fill in the respective columns. We will use iteration here as well.
Using the same principle as above, iterate over each option in possible_columns and use grepl() to test if the pet appeared in that case. Fill the respective columns.
The following code will iterate through possible_columns and replace the pattern grepl() is looking for with each option. It will test for that option, and save the results in the corresponding column.
for(option in possible_columns){
  # fill dataframe iteratively.
  pet_output[ , option] = grepl(option, pet_vector, ignore.case = TRUE)
}
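If you want to see this loop work without the survey data, the sketch below runs it on a tiny made-up vector. The names toy_vector, toy_options, and toy_output are hypothetical. The last entry is included on purpose: grepl() looks for the pattern anywhere in the text, so "cat" also matches inside "Caterpillar".

```r
# Hypothetical responses, not from the class survey
toy_vector = c("Dog, Cat", "None", "Fish, Parrot", "Caterpillar")
toy_options = c("dog", "cat", "fish", "none")

toy_output = data.frame("id" = seq_along(toy_vector))

for (option in toy_options) {
  # TRUE wherever the option appears anywhere in the response
  toy_output[, option] = grepl(option, toy_vector, ignore.case = TRUE)
}

toy_output
```

Keep that substring behavior in mind when choosing options: short options can match inside longer words.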
Once we have our “knowns” taken care of, we can work on the others. The process is nearly identical, just swap grepl() with gsub() and apply it to pet_vector like before.
Iterate over each option in possible_columns and use gsub() to remove all of our known possibilities (and commas) from pet_vector. You can then use trimws() to remove the extra spaces. Assign the remaining values to the “other” column of pet_output.
The following will remove all known possibilities, clean the remainder, and assign it to the ‘other’ column.
for(option in possible_columns){
  # remove all known options
  pet_vector = gsub(pattern = option, pet_vector, replacement = '', ignore.case = TRUE)
}
# clear commas and whitespace
pet_vector = gsub(pattern = ',', pet_vector, replacement = '', ignore.case = TRUE)
pet_vector = trimws(pet_vector)
# Fill in 'other'
pet_output$other = pet_vector
# Turn blanks into NAs (the !is.na() check skips values that are already missing)
pet_output[pet_output$other == '' & !is.na(pet_output$other), 'other'] = NA
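The same cleanup can be traced on a small made-up vector. This is only a sketch: toy_vector, toy_options, leftover, and is_blank are hypothetical names, and the NA entry is an assumption that a respondent might have skipped the question. Comparing NA to an empty string returns NA rather than TRUE or FALSE, so the sketch checks for missing values before assigning.

```r
# Hypothetical responses, including one missing answer
toy_vector = c("Dog, Cat", "None", "Fish, Parrot", NA)
toy_options = c("dog", "cat", "fish", "none")

leftover = toy_vector
for (option in toy_options) {
  # strip each known option out of the text
  leftover = gsub(pattern = option, replacement = "", x = leftover, ignore.case = TRUE)
}

# clear commas and surrounding whitespace
leftover = gsub(pattern = ",", replacement = "", x = leftover)
leftover = trimws(leftover)

# blank strings become NA; values that are already NA are left alone
is_blank = leftover == "" & !is.na(leftover)
leftover[is_blank] = NA

leftover
```

Only "Parrot" survives, because it was the one value not covered by toy_options.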
If we look at our code all together now, it looks like the following. If we run it, it will return the exact same thing as our old pet_split() function, but instead of each option being hand-coded by us, it knows how to work with any given vector of options and create our desired output.
# make dummy argument
pet_vector = survey$pets
# set all known options
possible_columns = c("dog", "cat", "fish", "bird", "reptile", "rock", "none")
# make a base dataframe with rows for each of our cases.
pet_output = data.frame(
  "id" = 1:length(pet_vector)
)
# iterate through all options and create a column with NAs for it
for(option in possible_columns){
  # make a new column with a character version of each possible option.
  pet_output[, as.character(option)] = NA
}
# fill output df
for(option in possible_columns){
  # fill dataframe iteratively.
  pet_output[ , option] = grepl(option, pet_vector, ignore.case = TRUE)
}
# clear all known options
for(option in possible_columns){
  # remove all known options
  pet_vector = gsub(pattern = option, pet_vector, replacement = '', ignore.case = TRUE)
}
# clear commas and whitespace
pet_vector = gsub(pattern = ',', pet_vector, replacement = '', ignore.case = TRUE)
pet_vector = trimws(pet_vector)
# Fill in 'other'
pet_output$other = pet_vector
# Turn blanks into NAs
pet_output[pet_output$other == '' & !is.na(pet_output$other), 'other'] = NA
Convert our code back into a function, and call the function comma_split().
comma_split = function(vector_to_split, possible_columns){
  # make a base dataframe with rows for each of our cases.
  output = data.frame(
    "id" = 1:length(vector_to_split)
  )
  # iterate through all options and create a column with NAs for it
  for(option in possible_columns){
    # make a new column with a character version of each possible option.
    output[, as.character(option)] = NA
  }
  # fill output df
  for(option in possible_columns){
    # fill dataframe iteratively.
    output[ , option] = grepl(option, vector_to_split, ignore.case = TRUE)
  }
  # clear all known options
  for(option in possible_columns){
    # remove all known options
    vector_to_split = gsub(pattern = option, vector_to_split, replacement = "", ignore.case = TRUE)
  }
  # clear commas and whitespace
  vector_to_split = gsub(pattern = ",", vector_to_split, replacement = "", ignore.case = TRUE)
  vector_to_split = trimws(vector_to_split)
  # Fill in "other"
  output$other = vector_to_split
  # Turn blanks into NAs
  output[output$other == "" & !is.na(output$other), "other"] = NA
  # return output
  return(output)
}
comma_split(vector_to_split = survey$pets,
            possible_columns = c("dog", "cat", "fish", "bird", "reptile", "rock", "none"))
Once you have the function created, try it on another column! Your output should match mine.
comma_split(vector_to_split = survey$tea_days,
possible_columns = c("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"))
id monday tuesday wednesday thursday friday saturday sunday other
1 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
2 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
3 3 TRUE TRUE TRUE FALSE FALSE TRUE TRUE <NA>
4 4 TRUE TRUE FALSE FALSE FALSE FALSE TRUE <NA>
5 5 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
6 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
7 7 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
8 8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
9 9 TRUE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
10 10 FALSE FALSE FALSE TRUE TRUE FALSE FALSE <NA>
11 11 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
12 12 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
13 13 TRUE TRUE TRUE TRUE TRUE FALSE TRUE <NA>
14 14 TRUE FALSE FALSE TRUE FALSE FALSE TRUE <NA>
15 15 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
16 16 FALSE FALSE FALSE FALSE TRUE TRUE FALSE <NA>
17 17 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
18 18 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
19 19 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
20 20 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
21 21 TRUE TRUE TRUE FALSE TRUE TRUE TRUE <NA>
22 22 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
23 23 FALSE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
24 24 TRUE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
While it now has a bit of an odd name, our function can now work on any column of comma-separated values! It is hard to express how big of a deal that is. We now have a single general tool that can adapt itself to several situations. The input is arbitrary: as long as it is formatted the same way (values separated by commas) and we supply the vector of known options, we can put anything through this function and get a nice tidy dataframe back. A whole new universe of possibilities just opened.
Try our function on some other columns in the survey dataframe!
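For a column where you do not already know the options, one possible starting point is to list the distinct values that appear between the commas. The helper below is a sketch, not part of the worksheet: find_options() is a hypothetical name, and toy_recreation is made-up data standing in for a real column.

```r
# Hypothetical helper: list the distinct comma-separated values in a column
find_options = function(vector_to_split) {
  # split every response on commas and flatten into one character vector
  pieces = unlist(strsplit(vector_to_split, split = ","))
  # standardize case and drop surrounding whitespace
  pieces = tolower(trimws(pieces))
  # drop missing and empty pieces
  pieces = pieces[!is.na(pieces) & pieces != ""]
  sort(unique(pieces))
}

# made-up responses standing in for a survey column
toy_recreation = c("Hiking, Reading", "reading", NA, "Gaming, hiking")

find_options(toy_recreation)
```

You could look over the result, keep the values you care about, and pass them as possible_columns; anything you leave out will land in the "other" column.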