Day 12 - Regular Expressions (RegEx)

Spring 2023

Dr. Jared Joseph

Smith College

Overview

Timeline

RegEx Overview
RegEx Uses
RegEx Tools in R
Case Studies

Goal

To understand what RegEx is and some of its applications.

RegEx Overview

What are Regular Expressions (RegEx)?

Regular expressions are like the search function on steroids.

They allow you to request very specific patterns in text.

Regex is extremely powerful, but it is also complex and fragile, so explore simpler options before turning to it.

Uses of RegEx

Regex is like the ultimate multi-tool for text

Give me all the sentences that start with “Hello”: ^Hello.+\.
Give me all the numbers in a paragraph: \d+
Select every run of whitespace, including multiple spaces in a row: \s+
Give me only the numbers that appear before the letters “PM”: \d+(?=PM)

You can use regex in R, Python, Word, the terminal, and more.
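As a minimal sketch, the four patterns above can be tried in R with stringr. The sample strings are invented for illustration. Each backslash is doubled because an R string literal needs `\\` to pass a single backslash to the regex engine.

```r
library(stringr)

# Hypothetical sample data, invented for illustration.
greetings <- c("Hello world.", "Well, Hello.")
note <- "Meet at 5PM or 7AM,   room 12."

# Strings that start with "Hello" and contain a later period.
starts_hello <- str_detect(greetings, "^Hello.+\\.")

# Every run of digits.
all_numbers <- str_extract_all(note, "\\d+")[[1]]

# Collapse each run of whitespace into a single space.
tidy_note <- str_replace_all(note, "\\s+", " ")

# Digits that sit immediately before "PM" (a lookahead, so "PM"
# itself is not part of the match).
pm_hours <- str_extract_all(note, "\\d+(?=PM)")[[1]]
```

Here `starts_hello` is TRUE only for the first string, and `pm_hours` holds only "5" because the 7 is followed by "AM".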

RegEx Uses

Detecting Words

The most basic function of RegEx works just like any other search.

Here I search in the text of our class survey to see who included the word “favorite” in their answer regarding their favorite work of art.

library(stringr)

str_detect(survey$fav_art, "favorite")

 [1]    NA FALSE  TRUE  TRUE FALSE  TRUE  TRUE    NA    NA  TRUE  TRUE FALSE
[13] FALSE  TRUE    NA  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE    NA

Survey Response #3:

One of my favorite works of art, is Claude Monet's Water Lilies paintings. There are so many variations of the subject, yet each one is distinct enough to stand on its own. I enjoy how simply beautiful each piece is, and its ability to evoke pleasant feelings from the viewer.
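The NA values in the output above come from missing answers. The sketch below uses a small made-up vector as a stand-in for survey$fav_art (the real survey data is not reproduced here) to show how str_detect() treats NA and one way to summarize the result.

```r
library(stringr)

# Hypothetical stand-in for survey$fav_art; NA marks a skipped question.
fav_art <- c(NA,
             "I love Starry Night",
             "One of my favorite works is Water Lilies")

# Missing answers stay NA rather than becoming FALSE.
hits <- str_detect(fav_art, "favorite")

# Count and locate the matches while ignoring the NAs.
n_hits <- sum(hits, na.rm = TRUE)
which_hits <- which(hits)
```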

Single Characters

Often, your searches will need to be more complex.

The building blocks of RegEx searches are individual characters.

This lets you do handy things like search for all numbers in a blob of text and remove them.

Select every line number (digits followed by a period and a space) in the text blob.

regexplain::view_regex(sherlock, "\\d+\\. ")

1. To Sherlock Holmes she is always _the_ woman. I have seldom heard him 2. mention her under any other name. In his eyes she eclipses and 3. predominates the whole of her sex. It was not that he felt any emotion 4. akin to love for Irene Adler. All emotions, and that one particularly, 5. were abhorrent to his cold, precise but admirably balanced mind. He ...
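To go from highlighting to removing, the same pattern can be handed to stringr::str_remove_all(). This is a sketch on a shortened, made-up version of the text, not the `sherlock` object used above.

```r
library(stringr)

# Hypothetical numbered text, loosely modeled on the excerpt above.
numbered <- "1. To Sherlock Holmes she is always the woman. 2. I have seldom heard him"

# One or more digits, a literal period, then a space.
unnumbered <- str_remove_all(numbered, "\\d+\\. ")

# By contrast, "\\d" on its own matches one digit at a time.
single_digits <- str_extract_all("Room 221, est. 1887", "\\d")[[1]]
```

The sentence-ending period after "woman" survives because it is not preceded by a digit.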

Specify Parameters

You can set constraints on the text you want to match, such as its length.

Here I select all 4-digit numbers, which I assume are years.

You can perform a similar trick for phone numbers or other structured text.

Select all 4 digit numbers.

regexplain::view_regex(zelda, "\\d{4}")

Since the original Legend of Zelda was released in 1986, the series has expanded to include 19 entries on all of Nintendo's major game consoles, as well as a number of spin-offs. An American animated TV series based on the games aired in 1989 and individual manga adaptations commissioned by Nintendo have been produced in Japan since 1997. The Legend of Zelda ...
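One caveat worth sketching: \d{4} matches any four digits in a row, including the first four digits of a longer number. Adding word boundaries (\b) is one way to keep only standalone 4-digit numbers. The text, ID, and phone number below are made up.

```r
library(stringr)

# Hypothetical text; the catalog ID is invented.
txt <- "Released in 1986, adapted in 1997, catalog ID 123456."

# Any four consecutive digits, even inside a longer number.
four_any <- str_extract_all(txt, "\\d{4}")[[1]]

# Exactly four digits with a word boundary on each side.
four_whole <- str_extract_all(txt, "\\b\\d{4}\\b")[[1]]

# The same idea for a US-style phone number (fictional number).
phone <- str_extract("Call 413-555-0199 today", "\\d{3}-\\d{3}-\\d{4}")
```

`four_any` picks up "1234" from the catalog ID, while `four_whole` returns only the two years.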

Wildcards

Sometimes you won’t know exactly what you want to find.

In this case, you can use “wildcards,” which let you say things like: “give me everything after this word until the end of a sentence.”

You always have to be careful though, otherwise it might just select everything!

Use some wildcards to find any email address in the text.

regexplain::view_regex(email,
  "([[:alnum:]_.-]+)@([[:alnum:].-]+)\\.([[:alpha:].]{2,6})")

Thank you for considering my recommendation. Please do not hesitate to contact me at jjoseph34@smith.edu if you would like to hear more.
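The warning about selecting everything can be made concrete. In this sketch (sample sentences invented for illustration), a greedy wildcard runs to the last period, while the lazy form stops at the first one. The email pattern from above is then applied with stringr::str_extract() to pull out the address.

```r
library(stringr)

# Hypothetical text, invented for illustration.
txt <- "Contact me soon. Thanks for reading. Bye."

# Greedy: ".*" runs as far as it can, so the match ends at the LAST period.
greedy <- str_extract(txt, "Thanks.*\\.")

# Lazy: ".*?" stops at the first period that completes the match.
lazy <- str_extract(txt, "Thanks.*?\\.")

# The email pattern shown above, used to extract rather than highlight.
msg <- "Please contact me at jjoseph34@smith.edu if you would like more."
email_found <- str_extract(
  msg,
  "([[:alnum:]_.-]+)@([[:alnum:].-]+)\\.([[:alpha:].]{2,6})"
)
```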

Groups

Sometimes you will want to get data out of large blobs of unstructured text.

Using the building blocks we’ve covered so far, you can specify “groups” you want to capture.

Take this blob of text from the spring schedule at Smith.

[1] "Thursday, January 26\nClasses begin at 8 a.m.\nWednesday, February 1\nLast day to add a course online\nWednesday, February 8\nLast day to drop a course online; last day to add a Five College course\nWednesday, February 15\nLast day to add a Smith course\nThursday, February 23: Rally Day\nRally Day is highlighted by an all-college convocation at which distinguished alumnae are awarded the Smith College Medal. Afternoon classes are canceled.\nSaturday, March 11–Sunday, March 19\nSpring recess. Houses close at 10 a.m. on Saturday, March 11, and reopen at 1 p.m. on Sunday, March 19.\nMonday, April 3–Friday, April 14\nAdvising and course registration for the fall 2023 semester"

We can use RegEx to capture and return important dates.

regexplain::view_regex(spring,
  "([A-Z][a-z]+, [A-Z][a-z]+ \\d{1,2})")

Thursday, January 26 Classes begin at 8 a.m. Wednesday, February 1 Last day to add a course online Wednesday, February 8 Last day to drop a course online; last day to add a Five College course Wednesday, February 15 Last day to add a Smith course Thursday, February 23: Rally Day Rally Day is highlighted by an all-college convocation at which distinguished alumnae are awarded the Smith College Medal. Afternoon classes are canceled. Saturday, March 11–Sunday, March 19 Spring recess. Houses close at 10 a.m. on Saturday, March 11, and reopen at 1 p.m. on Sunday, March 19. Monday, April 3–Friday, April 14 Advising and course registration for the fall 2023 semester
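The pattern above wraps the whole date in a single group. As a sketch of what groups make possible, the variant below splits the date into three groups (weekday, month, day) and uses stringr::str_match_all() to return each piece separately. The three-group pattern and the `dates` data frame are illustrative assumptions, not part of the original example.

```r
library(stringr)

# A short piece of the schedule text shown above.
spring_part <- "Thursday, January 26\nClasses begin at 8 a.m.\nWednesday, February 1\nLast day to add a course online"

# Three capture groups: weekday, month, day of month.
pattern <- "([A-Z][a-z]+), ([A-Z][a-z]+) (\\d{1,2})"

# str_match_all() returns one matrix per input string:
# column 1 is the full match, columns 2 to 4 are the groups.
m <- str_match_all(spring_part, pattern)[[1]]

# Turn the captured pieces into a tidy table.
dates <- data.frame(
  weekday = m[, 2],
  month   = m[, 3],
  day     = as.integer(m[, 4])
)
```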

RegEx Tools in R

Detect

stringr::str_detect() will give you a TRUE or FALSE if any part of a string matches the pattern you provide.

It will work on a vector, returning a single TRUE or FALSE per element.

Notice how it still said TRUE for “don’t like.”

Code is dumb!

test_vec = c("I like turtles",
             "I like cats",
             "I don't like raptors")

stringr::str_detect(test_vec, "like")

[1] TRUE TRUE TRUE

Case matters! (by default)

stringr::str_detect(test_vec, "Like")

[1] FALSE FALSE FALSE
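Two follow-ups, sketched with the same test vector: stringr::regex() with ignore_case = TRUE turns off case sensitivity, and combining two str_detect() calls is one rough way to exclude the negated sentence. The second approach is only an illustration; it handles "don't like" and nothing else.

```r
library(stringr)

test_vec <- c("I like turtles",
              "I like cats",
              "I don't like raptors")

# regex(..., ignore_case = TRUE) lets "Like" match "like".
any_case <- str_detect(test_vec, regex("Like", ignore_case = TRUE))

# Require "like" but reject strings that contain "don't like".
likes_only <- str_detect(test_vec, "like") &
  !str_detect(test_vec, "don't like")
```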

Count

stringr::str_count() will return a number for each element in a vector, telling you how many times a match was found.

This could be useful for counting dates, key words, or anything else you can imagine.

For example, consider searching through a record of newspapers for how often a specific politician appears.

test_vec2 = c("I really like turtles",
             "I like cats",
             "I really, really, don't like raptors")

stringr::str_count(test_vec2, "really")

[1] 1 0 2
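The newspaper idea might look like the sketch below. The headlines and the politician's name are invented, and the word boundaries (\b) are an added assumption to keep the count to the whole word.

```r
library(stringr)

# Hypothetical headlines; "Rivera" is an invented politician.
headlines <- c("Mayor Rivera opens park as Rivera allies cheer",
               "Council votes on budget",
               "Rivera, Rivera, Rivera: a week in review")

# Count whole-word mentions in each headline.
mentions <- str_count(headlines, "\\bRivera\\b")

# Total mentions across the whole collection.
total_mentions <- sum(mentions)
```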

Extract

stringr::str_extract_all() will extract your matches from the vector elements.

The result will be a list of length X, where X is the number of elements in the vector you gave it.

The elements of that list will be vectors of length Y containing all the matches.

test_vec3 = "Thursday, January 26
Classes begin at 8 a.m.
Wednesday, February 1
Last day to add a course online
Wednesday, February 8
Last day to drop a course online;
last day to add a Five College course"

stringr::str_extract_all(test_vec3,
    "([A-Z][a-z]+, [A-Z][a-z]+ \\d{1,2})")

[[1]]
[1] "Thursday, January 26"  "Wednesday, February 1" "Wednesday, February 8"
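The list structure is easier to see with more than one input string. In this sketch (input strings made up for illustration), the result has one list element per input, and elements with no match come back as empty character vectors.

```r
library(stringr)

# Hypothetical inputs: one date, no dates, two dates.
lines_in <- c("Thursday, January 26",
              "No dates here",
              "Monday, April 3 to Friday, April 14")

found <- str_extract_all(lines_in,
                         "([A-Z][a-z]+, [A-Z][a-z]+ \\d{1,2})")

# One list element per input string.
n_inputs <- length(found)

# Number of matches inside each element.
n_matches <- lengths(found)

# Flatten to a single character vector when the grouping is not needed.
all_dates <- unlist(found)
```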

Case Studies

Cleaning Messy Report Data

My research involves working with a lot of messy government reports.

Here is some code I used to standardize some common abbreviations for data consistency.

For example, the last line looks for the letters “bev” as a whole word (a word boundary on each side), followed by an optional period. If it finds a match, it replaces “bev” or “bev.” with the word “beverage.” The optional period comes after the second word boundary because there is no word boundary between a period and a following space. Each result is assigned back to the column so the replacements build on each other.

report_years$name = gsub("\\bpd\\b", "police department", report_years$name)
report_years$name = gsub("\\bco so\\b", "sheriff department", report_years$name)
report_years$name = gsub("\\bco da\\b", "district attorney", report_years$name)
report_years$name = gsub("^chp-?", "california highway patrol ", report_years$name)
report_years$name = gsub("^bne-?", "bureau of narcotic enforcement ", report_years$name)
report_years$name = gsub("ca doj-bne ", " bureau of narcotic enforcement ", report_years$name)
report_years$name = gsub("\\buc\\b", "university of california", report_years$name)
report_years$name = gsub("\\bbev\\b\\.?", "beverage", x = report_years$name)
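To see why the word boundaries matter, here is a small sketch using base R's gsub() on made-up agency names (the real report_years data is not reproduced here). Without \b, the letters "pd" are also replaced inside an unrelated word.

```r
# Hypothetical agency names, invented for illustration.
agency <- c("oakland pd", "updated roster", "ucla pd records")

# Without boundaries, "pd" also matches inside "updated".
loose <- gsub("pd", "police department", agency)

# With \\b on both sides, only the standalone word is replaced.
strict <- gsub("\\bpd\\b", "police department", agency)
```

In `loose`, the second name becomes "upolice departmentated roster"; in `strict` it is left alone.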

Remove Unwanted URLs

While doing a project that involved a large collection of online news media, I needed to remove all of the URLs in the text so that they did not muddy my text analysis.

The following RegEx looks for the pattern of a URL and removes every match.

parsed$clean_text = str_remove_all(parsed$clean_text,
  "(http:\\/\\/www\\.|https:\\/\\/www\\.|http:\\/\\/|https:\\/\\/)?[a-z0-9]+([\\-\\.]{1}[a-z0-9]+)*\\.[a-z]{2,5}(:[0-9]{1,5})?(\\/.*)?")