Functions are the backbone of how R does things; we have refereed to
them previously as “verbs” in the language of R. We have used several
functions so far, either from base R, or from packages we have loaded in
using library()
. However, those are not all the functions available to
us. The true power in R comes from our ability to make our own
functions to do anything we want. That’s what we’ll be practicing
today.
We are going to be using class survey data for lab today. Please load it in using the following:
survey = readRDS(url("https://github.com/Adv-R-Programming/Adv-R-Reader/raw/main/class_survey.rds"))
We can peek behind the curtain on any of the functions we have used so
far an see how they tick. All you have to do is enter the function into
the console, and remove the ()
where the arguments go. Let’s look at
our friend table()
by typing table
into the console and hitting
enter. You should see the following (click the green “Code” fold button
to see):
function (..., exclude = if (useNA == "no") c(NA, NaN), useNA = c("no",
"ifany", "always"), dnn = list.names(...), deparse.level = 1)
{
list.names <- function(...) {
l <- as.list(substitute(list(...)))[-1L]
if (length(l) == 1L && is.list(..1) && !is.null(nm <- names(..1)))
return(nm)
nm <- names(l)
fixup <- if (is.null(nm))
seq_along(l)
else nm == ""
dep <- vapply(l[fixup], function(x) switch(deparse.level +
1, "", if (is.symbol(x)) as.character(x) else "",
deparse(x, nlines = 1)[1L]), "")
if (is.null(nm))
dep
else {
nm[fixup] <- dep
nm
}
}
miss.use <- missing(useNA)
miss.exc <- missing(exclude)
useNA <- if (miss.use && !miss.exc && !match(NA, exclude,
nomatch = 0L))
"ifany"
else match.arg(useNA)
doNA <- useNA != "no"
if (!miss.use && !miss.exc && doNA && match(NA, exclude,
nomatch = 0L))
warning("'exclude' containing NA and 'useNA' != \"no\"' are a bit contradicting")
args <- list(...)
if (length(args) == 1L && is.list(args[[1L]])) {
args <- args[[1L]]
if (length(dnn) != length(args))
dnn <- paste(dnn[1L], seq_along(args), sep = ".")
}
if (!length(args))
stop("nothing to tabulate")
bin <- 0L
lens <- NULL
dims <- integer()
pd <- 1L
dn <- NULL
for (a in args) {
if (is.null(lens))
lens <- length(a)
else if (length(a) != lens)
stop("all arguments must have the same length")
fact.a <- is.factor(a)
if (doNA)
aNA <- anyNA(a)
if (!fact.a) {
a0 <- a
op <- options(warn = 2)
a <- factor(a, exclude = exclude)
options(op)
}
add.na <- doNA
if (add.na) {
ifany <- (useNA == "ifany")
anNAc <- anyNA(a)
add.na <- if (!ifany || anNAc) {
ll <- levels(a)
if (add.ll <- !anyNA(ll)) {
ll <- c(ll, NA)
TRUE
}
else if (!ifany && !anNAc)
FALSE
else TRUE
}
else FALSE
}
if (add.na)
a <- factor(a, levels = ll, exclude = NULL)
else ll <- levels(a)
a <- as.integer(a)
if (fact.a && !miss.exc) {
ll <- ll[keep <- which(match(ll, exclude, nomatch = 0L) ==
0L)]
a <- match(a, keep)
}
else if (!fact.a && add.na) {
if (ifany && !aNA && add.ll) {
ll <- ll[!is.na(ll)]
is.na(a) <- match(a0, c(exclude, NA), nomatch = 0L) >
0L
}
else {
is.na(a) <- match(a0, exclude, nomatch = 0L) >
0L
}
}
nl <- length(ll)
dims <- c(dims, nl)
if (prod(dims) > .Machine$integer.max)
stop("attempt to make a table with >= 2^31 elements")
dn <- c(dn, list(ll))
bin <- bin + pd * (a - 1L)
pd <- pd * nl
}
names(dn) <- dnn
bin <- bin[!is.na(bin)]
if (length(bin))
bin <- bin + 1L
y <- array(tabulate(bin, pd), dims, dimnames = dn)
class(y) <- "table"
y
}
That’s pretty complex looking; and it is. However, the point is that
this is the same code that runs when we call table
. We could, it we
wanted, type this out ourselves and create our own table
function
using the function()
function.
Copy the code from the table()
function I showed above, and create a
new function called my_table()
using that code. Run both table()
and
my_table()
on the mint_choc
column from our survey data. Does the
output make sense?
my_table = table
table(survey$mint_choc) my_table(survey$mint_choc)
Not all functions are as transparent. Try looking at the code behind the
sum()
function and you may be disappointed. The most basic functions
in R are “primitives,” and are actually calling a lower level
programming language (C) than we can access through R. Looking at sum
we see:
function (..., na.rm = FALSE) .Primitive("sum")
You may also be surprised to learn that pretty much everything that
isn’t data in R is a function. For example, we can look at the +
sign
as a function by warping it in ticks like so:
`+`
function (e1, e2) .Primitive("+")
Thus, we could also call:
`+`(1, 5)
[1] 6
Neat.
It’s cool that we can make a copy of table, but why would we? We wouldn’t really. What we would do is create our own functions. While there a ton of packages out there to do all sorts of things, sometimes you will just need to make something yourself. This is often the case when working with your own unique data sets.
Let’s take a look at our survey data, specifically the pets
column. As
you can see below, we have some issues that violate tidy data
principles. Specifically, in our one column of pets, we have values on
multiple types of pets. Ideally, we would like a column for each type of
pet, and then a TRUE
or FALSE
if people had that kind of pet.
survey$pets
[1] "Cat" "None"
[3] "Dog" "None"
[5] "None" "Dog, Cat, Rock"
[7] "Bird" "Dog"
[9] "None" "Cat"
[11] "None" "Bird"
[13] "None" "Dog, Cat"
[15] "Dog, Bird" "Bird"
[17] "None" "None"
[19] "Reptile" "Rock, None"
[21] "None" "Cat"
[23] "None" "Robot Vacuum named Tobie"
Here’s where a custom function can help us. We know that each of the
entries in pets
is separated by a comma, and we can exploit that
structure. We’re going to write a function that will split those values
for us by commas, and create a new pets dataframe that we can then
attach to our current survey
dataframe.
The first step of creating a good function is clearly defining the key
components: the input, output, and arguments. In our case,
the input will be our pets column, the output will be a dataframe with
one column per pet type, and a value of TRUE
or FALSE
depending on
if the case had that pet. We won’t need any arguments aside from the
input for now.
When I am creating a function, I typically write the code to do what I want first, and then convert it into a function. Let’s write some code to create our output dataframe, and then we can work to fill it. I want to stress that this is one way to go about solving this problem. There are plenty of other valid (and easier, once we learn some more skills) options.
First, we’ll create a new dataframe with a column for each of our
possible pets, and a row for all 15 of our cases. I’ll fill the
dataframe with NA
s for now.
pet_output = data.frame(
"id" = 1:nrow(survey),
"dog" = NA,
"cat" = NA,
"fish" = NA,
"bird" = NA,
"reptile" = NA,
"rock" = NA,
"none" = NA,
"other" = NA)
Now, we need to figure out a way to test if the person listed one of our
options in their response. We’ll use a new function called grepl()
to
test this. The name “grepl” is a product of very old programmer speak,
but practically it will search through a character vector and give us a
TRUE
or FALSE
if it finds a match for the pattern we give it. Thus,
we can run the following on our pets
column for each of our possible
pets to test if they are included. We’re going to tell it to ignore
case, so that capital and lower case letters don’t matter.
Run the following code and explain it’s output.
grepl(pattern = "dog", x = survey$pets, ignore.case = TRUE)
It says TRUE
if the value included ‘dog’ and FALSE
if it did not.
We can use the same function to test for each of our possible pets, and assign the results to our new dataframe.
pet_output$dog = grepl(pattern = "dog", x = survey$pets, ignore.case = TRUE)
pet_output$cat = grepl(pattern = "cat", x = survey$pets, ignore.case = TRUE)
pet_output$fish = grepl(pattern = "fish", x = survey$pets, ignore.case = TRUE)
pet_output$bird = grepl(pattern = "bird", x = survey$pets, ignore.case = TRUE)
pet_output$reptile = grepl(pattern = "reptile", x = survey$pets, ignore.case = TRUE)
pet_output$rock = grepl(pattern = "rock", x = survey$pets, ignore.case = TRUE)
pet_output$none = grepl(pattern = "none", x = survey$pets, ignore.case = TRUE)
pet_output
id dog cat fish bird reptile rock none other
1 1 FALSE TRUE FALSE FALSE FALSE FALSE FALSE NA
2 2 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
3 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE NA
4 4 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
5 5 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
6 6 TRUE TRUE FALSE FALSE FALSE TRUE FALSE NA
7 7 FALSE FALSE FALSE TRUE FALSE FALSE FALSE NA
8 8 TRUE FALSE FALSE FALSE FALSE FALSE FALSE NA
9 9 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
10 10 FALSE TRUE FALSE FALSE FALSE FALSE FALSE NA
11 11 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
12 12 FALSE FALSE FALSE TRUE FALSE FALSE FALSE NA
13 13 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
14 14 TRUE TRUE FALSE FALSE FALSE FALSE FALSE NA
15 15 TRUE FALSE FALSE TRUE FALSE FALSE FALSE NA
16 16 FALSE FALSE FALSE TRUE FALSE FALSE FALSE NA
17 17 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
18 18 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
19 19 FALSE FALSE FALSE FALSE TRUE FALSE FALSE NA
20 20 FALSE FALSE FALSE FALSE FALSE TRUE TRUE NA
21 21 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
22 22 FALSE TRUE FALSE FALSE FALSE FALSE FALSE NA
23 23 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
24 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE NA
Now, that works for everything except “other.”
Why would we need a different strategy for the “other” cases?
We don’t know what will be in ‘other’ so we can’t test explicitly for it.
To resolve our “other” issue, we will need to take another approach.
First, we’ll make a copy of our pets
column we can work with, and make
sure we don’t alter our original data.
work_pets = survey$pets
Next, we will look through that copy, and remove everything that fits
into one of our other categories using gsub()
, or “general
substitution.” gsub()
is similar to grepl()
in that it asks for a
pattern to look for and where to look for it, but also asks what to
substitute for that pattern. Let’s look at the following example. Here I
ask gsub()
to look in our work_pets
vector, find all the times “dog”
appears, and to replace it with ““, or nothing.
gsub(pattern = "dog", work_pets, replacement = "", ignore.case = TRUE)
[1] "Cat" "None"
[3] "" "None"
[5] "None" ", Cat, Rock"
[7] "Bird" ""
[9] "None" "Cat"
[11] "None" "Bird"
[13] "None" ", Cat"
[15] ", Bird" "Bird"
[17] "None" "None"
[19] "Reptile" "Rock, None"
[21] "None" "Cat"
[23] "None" "Robot Vacuum named Tobie"
We can see in the output that now all the instances of “dog” have been
removed. We will repeat this process for each of our known categories,
each time saving our results back to work_pets
. We will also remove
commas, and use trimws()
, or “trim white space,” to delete extra
spaces from the start and end of our characters.
work_pets = gsub(pattern = "dog", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "cat", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "fish", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "bird", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "reptile", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "rock", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "none", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = ",", work_pets, replacement = "", ignore.case = TRUE)
work_pets = trimws(work_pets)
work_pets
[1] "" ""
[3] "" ""
[5] "" ""
[7] "" ""
[9] "" ""
[11] "" ""
[13] "" ""
[15] "" ""
[17] "" ""
[19] "" ""
[21] "" ""
[23] "" "Robot Vacuum named Tobie"
If we look at work_pets
now, all that is left are things not in our
pets categories. We can now either turn this into a logical, or assign
these values to our new other
column in pet_output
. I’ll do the
latter. I’ll also convert our blanks into proper NA
s.
pet_output$other = work_pets
pet_output[pet_output$other == "", "other"] = NA
pet_output
id dog cat fish bird reptile rock none other
1 1 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
2 2 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
3 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
4 4 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
5 5 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
6 6 TRUE TRUE FALSE FALSE FALSE TRUE FALSE <NA>
7 7 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
8 8 TRUE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
9 9 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
10 10 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
11 11 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
12 12 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
13 13 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
14 14 TRUE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
15 15 TRUE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
16 16 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
17 17 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
18 18 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
19 19 FALSE FALSE FALSE FALSE TRUE FALSE FALSE <NA>
20 20 FALSE FALSE FALSE FALSE FALSE TRUE TRUE <NA>
21 21 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
22 22 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
23 23 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
24 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE Robot Vacuum named Tobie
If we look at out pet_output
dataframe now, we can see we have a
column for each pet type, a TRUE
or FALSE
for each known pet, and
the text for other pets. All in all, the code to do this looks like:
# create an empty dataframe for our intended output
pet_output = data.frame(
"id" = 1:nrow(survey),
"dog" = NA,
"cat" = NA,
"fish" = NA,
"bird" = NA,
"reptile" = NA,
"rock" = NA,
"none" = NA,
"other" = NA)
# get a binary for each known pet type
pet_output$dog = grepl(pattern = "dog", x = survey$pets, ignore.case = TRUE)
pet_output$cat = grepl(pattern = "cat", x = survey$pets, ignore.case = TRUE)
pet_output$fish = grepl(pattern = "fish", x = survey$pets, ignore.case = TRUE)
pet_output$bird = grepl(pattern = "bird", x = survey$pets, ignore.case = TRUE)
pet_output$reptile = grepl(pattern = "reptile", x = survey$pets, ignore.case = TRUE)
pet_output$rock = grepl(pattern = "rock", x = survey$pets, ignore.case = TRUE)
pet_output$none = grepl(pattern = "none", x = survey$pets, ignore.case = TRUE)
# make a copy of the pets vector to work on
work_pets = survey$pets
# remove all known pets and clean remaining text
work_pets = gsub(pattern = "dog", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "cat", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "fish", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "bird", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "reptile", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "rock", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = "none", work_pets, replacement = "", ignore.case = TRUE)
work_pets = gsub(pattern = ",", work_pets, replacement = "", ignore.case = TRUE)
work_pets = trimws(work_pets)
# Fill in "other"
pet_output$other = work_pets
# Turn blanks into NAs
pet_output[pet_output$other == "", "other"] = NA
Now that we know what to do, let’s get to work converting this into a
function. The key advantage of doing so is that we can make out code
more generalizable. Rather than hard-coding each element, we can
instead use variables for stand-ins that can be swapped later on. For
example, instead of performing all of these checks on survey$pets
, we
can write the function to look at any pets
vector we pass as an
argument. This means we can re-use the function later!
To start this process, let’s write a skeleton for our function. Below
I’ve included a skeleton for a new function I am calling pet_split
. In
this function, I have an argument called pet_vector
. Now, whenever I
use pet_vector
inside the body of the function, I will be telling R to
use whatever is given to the function in the pet_vector
argument.
pet_vector
is just like an ordinary object in your R environment,
except it only exists within the function. You can think about it
like R opening a little mini-R, running all the code in the function top
to bottom, giving you the result, then closing it and deleting
everything else inside.
I’ll say it again because it is so important to understand.You can think about functions as if they were opening a little mini-R universe, running all the code in the function top to bottom with the arguments you provided as objects, giving you the result, then destroying that universe and deleting everything else inside.
pet_split = function(pet_vector) {
}
Now let’s add some substance to this function. I’ll start by adding in
the code to create a dataframe for our output, and a return()
at the
end. Whatever I put inside return()
will be the result of the function
when it is run, and everything else will be deleted when the mini-R
inside the function is closed. I also changed the code of our “id”
column a bit. Instead of hard-coding 15 IDs, I made it more
generalizable by asking R to make IDs 1 through the number of elements,
or the length, of our pet_vector
argument. That means the dataframe
will always have the same number of rows as the pet_vector
input, no
matter how many elements it has.
The only thing that will come out of your function
is whatever you put in the return()
function. Messages (like
print()
) are exceptions.
pet_split = function(pet_vector) {
# make new dataframe for output
pet_output = data.frame(
"id" = 1:length(pet_vector),
"dog" = NA,
"cat" = NA,
"fish" = NA,
"bird" = NA,
"reptile" = NA,
"rock" = NA,
"none" = NA,
"other" = NA)
# return
return(pet_output)
}
pet_split(pet_vector = survey$pets)
id dog cat fish bird reptile rock none other
1 1 NA NA NA NA NA NA NA NA
2 2 NA NA NA NA NA NA NA NA
3 3 NA NA NA NA NA NA NA NA
4 4 NA NA NA NA NA NA NA NA
5 5 NA NA NA NA NA NA NA NA
6 6 NA NA NA NA NA NA NA NA
7 7 NA NA NA NA NA NA NA NA
8 8 NA NA NA NA NA NA NA NA
9 9 NA NA NA NA NA NA NA NA
10 10 NA NA NA NA NA NA NA NA
11 11 NA NA NA NA NA NA NA NA
12 12 NA NA NA NA NA NA NA NA
13 13 NA NA NA NA NA NA NA NA
14 14 NA NA NA NA NA NA NA NA
15 15 NA NA NA NA NA NA NA NA
16 16 NA NA NA NA NA NA NA NA
17 17 NA NA NA NA NA NA NA NA
18 18 NA NA NA NA NA NA NA NA
19 19 NA NA NA NA NA NA NA NA
20 20 NA NA NA NA NA NA NA NA
21 21 NA NA NA NA NA NA NA NA
22 22 NA NA NA NA NA NA NA NA
23 23 NA NA NA NA NA NA NA NA
24 24 NA NA NA NA NA NA NA NA
Next, let’s get this function to output something we actually want. I’ll
copy more of our code from above into the function. You’ll notice that I
replace any calls for survey$pets
with our generic argument
pet_vector
. If we run this version, we start to see our desired
output!
pet_split = function(pet_vector) {
# make new dataframe for output
pet_output = data.frame(
"id" = 1:length(pet_vector),
"dog" = NA,
"cat" = NA,
"fish" = NA,
"bird" = NA,
"reptile" = NA,
"rock" = NA,
"none" = NA,
"other" = NA)
# get a binary for each known pet type
pet_output$dog = grepl(pattern = "dog", x = pet_vector, ignore.case = TRUE)
pet_output$cat = grepl(pattern = "cat", x = pet_vector, ignore.case = TRUE)
pet_output$fish = grepl(pattern = "fish", x = pet_vector, ignore.case = TRUE)
pet_output$bird = grepl(pattern = "bird", x = pet_vector, ignore.case = TRUE)
pet_output$reptile = grepl(pattern = "reptile", x = pet_vector, ignore.case = TRUE)
pet_output$rock = grepl(pattern = "rock", x = pet_vector, ignore.case = TRUE)
pet_output$none = grepl(pattern = "none", x = pet_vector, ignore.case = TRUE)
# return
return(pet_output)
}
pet_split(pet_vector = survey$pets)
id dog cat fish bird reptile rock none other
1 1 FALSE TRUE FALSE FALSE FALSE FALSE FALSE NA
2 2 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
3 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE NA
4 4 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
5 5 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
6 6 TRUE TRUE FALSE FALSE FALSE TRUE FALSE NA
7 7 FALSE FALSE FALSE TRUE FALSE FALSE FALSE NA
8 8 TRUE FALSE FALSE FALSE FALSE FALSE FALSE NA
9 9 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
10 10 FALSE TRUE FALSE FALSE FALSE FALSE FALSE NA
11 11 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
12 12 FALSE FALSE FALSE TRUE FALSE FALSE FALSE NA
13 13 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
14 14 TRUE TRUE FALSE FALSE FALSE FALSE FALSE NA
15 15 TRUE FALSE FALSE TRUE FALSE FALSE FALSE NA
16 16 FALSE FALSE FALSE TRUE FALSE FALSE FALSE NA
17 17 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
18 18 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
19 19 FALSE FALSE FALSE FALSE TRUE FALSE FALSE NA
20 20 FALSE FALSE FALSE FALSE FALSE TRUE TRUE NA
21 21 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
22 22 FALSE TRUE FALSE FALSE FALSE FALSE FALSE NA
23 23 FALSE FALSE FALSE FALSE FALSE FALSE TRUE NA
24 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE NA
To finish it up, let’s copy the rest of our code into this function.
You’ll notice that I don’t need to make a new vector work_pets
as
before. That’s because I’m working on the pet_vector
object directly.
pet_split = function(pet_vector) {
# make new dataframe for output
pet_output = data.frame(
"id" = 1:length(pet_vector),
"dog" = NA,
"cat" = NA,
"fish" = NA,
"bird" = NA,
"reptile" = NA,
"rock" = NA,
"none" = NA,
"other" = NA)
# get a binary for each known pet type
pet_output$dog = grepl(pattern = "dog", x = pet_vector, ignore.case = TRUE)
pet_output$cat = grepl(pattern = "cat", x = pet_vector, ignore.case = TRUE)
pet_output$fish = grepl(pattern = "fish", x = pet_vector, ignore.case = TRUE)
pet_output$bird = grepl(pattern = "bird", x = pet_vector, ignore.case = TRUE)
pet_output$reptile = grepl(pattern = "reptile", x = pet_vector, ignore.case = TRUE)
pet_output$rock = grepl(pattern = "rock", x = pet_vector, ignore.case = TRUE)
pet_output$none = grepl(pattern = "none", x = pet_vector, ignore.case = TRUE)
# remove all known pets and clean remaining text
pet_vector = gsub(pattern = "dog", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "cat", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "fish", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "bird", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "reptile", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "rock", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = "none", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = gsub(pattern = ",", pet_vector, replacement = "", ignore.case = TRUE)
pet_vector = trimws(pet_vector)
# Fill in "other"
pet_output$other = pet_vector
# Turn blanks into NAs
pet_output[pet_output$other == "", "other"] = NA
# return
return(pet_output)
}
pet_split(pet_vector = survey$pets)
id dog cat fish bird reptile rock none other
1 1 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
2 2 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
3 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
4 4 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
5 5 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
6 6 TRUE TRUE FALSE FALSE FALSE TRUE FALSE <NA>
7 7 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
8 8 TRUE FALSE FALSE FALSE FALSE FALSE FALSE <NA>
9 9 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
10 10 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
11 11 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
12 12 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
13 13 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
14 14 TRUE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
15 15 TRUE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
16 16 FALSE FALSE FALSE TRUE FALSE FALSE FALSE <NA>
17 17 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
18 18 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
19 19 FALSE FALSE FALSE FALSE TRUE FALSE FALSE <NA>
20 20 FALSE FALSE FALSE FALSE FALSE TRUE TRUE <NA>
21 21 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
22 22 FALSE TRUE FALSE FALSE FALSE FALSE FALSE <NA>
23 23 FALSE FALSE FALSE FALSE FALSE FALSE TRUE <NA>
24 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE Robot Vacuum named Tobie
Our new pet_split
function works! It will create a new dataframe we
can combine with our survey
data to get tidy data about pets. Now we
could save this function somewhere, and re-use it on every survey with a
question about pets without having to re-code all of those steps each
time.
Take some time to study the above function. Make sure you understand all of the steps it is taking, and how it produces the output that it does.
Now, we could make this even more general. In reality here we are solving the problem of splitting values by commas; several of our columns have that problem! We will go over how to make this function even better next week when we learn about iteration.
Let’s try creating a function on your own now.
Create a function that will intake a ROW from the survey
dataframe, and output a number showing how many total NA
s there were
in there row. The is.na()
function will come in handy here. An example
input and output should look like:
total_na(survey_row = survey[1,])
OUTPUT: 4
total_na = function(survey_row) {
# get the sum of NAs output = sum(is.na(survey_row))
# output return(output) }
total_na(survey_row = survey[1,])