Spring 2023
Smith College
To understand what RegEx is and some of its applications.
Regular expressions are like the search function on steroids.
They allow you to request very specific patterns in text.
Although regex is extremely powerful, you should explore other options before turning to it, because regex patterns are complex and fragile.
Regex is like the ultimate multi-tool for text.
^Hello.+\.
(\d)
\s*
\d+(?=PM)
You can use regex in R, Python, Word, the terminal, and more.
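In R, for instance, the four patterns above could be tried with stringr roughly as follows. This is only a sketch: the sample strings are made up for illustration, and note that inside an R string each backslash has to be doubled.

```r
library(stringr)

# ^Hello.+\. : starts with "Hello" and contains a period somewhere after it
starts_hello <- str_detect(c("Hello there.", "Oh, Hello."), "^Hello.+\\.")

# (\d) : capture a single digit; column 2 of the match matrix holds the group
first_digit <- str_match("Room 7B", "(\\d)")[, 2]

# \s* : zero or more whitespace characters (here, the leading spaces)
leading_space <- str_extract("  indented", "\\s*")

# \d+(?=PM) : one or more digits, but only when directly followed by "PM"
hour <- str_extract("Doors open at 7PM", "\\d+(?=PM)")
```

The last pattern uses a lookahead, so "PM" must be present but is not part of the returned match.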
At its most basic, RegEx works just like any other search.
Here I search in the text of our class survey to see who included the word “favorite” in their answer regarding their favorite work of art.
[1] NA FALSE TRUE TRUE FALSE TRUE TRUE NA NA TRUE TRUE FALSE
[13] FALSE TRUE NA TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE NA
Survey Response #3:
One of my favorite works of art, is Claude Monet's Water Lilies paintings. There are so many variations of the subject, yet each one is distinct enough to stand on its own. I enjoy how simply beautiful each piece is, and its ability to evoke pleasant feelings from the viewer.
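A search like this one might look roughly like the sketch below. The `answers` vector is a hypothetical stand-in for the real survey column, which is not shown here; `NA` marks a skipped question.

```r
library(stringr)

# Hypothetical stand-ins for the survey answers; NA marks a skipped question
answers <- c(
  NA,
  "I really like The Starry Night.",
  "One of my favorite works of art is Claude Monet's Water Lilies."
)

# One result per answer: TRUE, FALSE, or NA when the answer itself is missing
has_favorite <- str_detect(answers, "favorite")

# Positions of the answers that matched
matching_rows <- which(has_favorite)
```

Missing answers stay `NA` instead of turning into `FALSE`, which is why the output above contains `NA` values.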
Often, your searches will need to be more complex.
The building blocks of RegEx searches are individual characters.
This lets you do handy things like search for all numbers in a blob of text and remove them.
Select every digit in the text blob.
1. To Sherlock Holmes she is always _the_ woman. I have seldom heard him 2. mention her under any other name. In his eyes she eclipses and 3. predominates the whole of her sex. It was not that he felt any emotion 4. akin to love for Irene Adler. All emotions, and that one particularly, 5. were abhorrent to his cold, precise but admirably balanced mind. He ...
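As a rough sketch, selecting and then removing the digits could be done with stringr as follows. Only the opening of the blob is used, and the variable names are chosen for illustration.

```r
library(stringr)

# The opening of the text blob, including its stray line numbers
blob <- "1. To Sherlock Holmes she is always _the_ woman. I have seldom heard him 2. mention her under any other name."

# \d matches any single digit
digits <- str_extract_all(blob, "\\d")[[1]]

# Remove every digit, leaving the rest of the text untouched
no_digits <- str_remove_all(blob, "\\d")
```

The periods that followed each line number are left behind, since `\d` only matches the digits themselves.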
You can set specific constraints on the text you search for, such as its length.
Here I select all 4-digit numbers, which I assume are years.
You can perform a similar trick for phone numbers or other structured text.
Select all 4-digit numbers.
Since the original Legend of Zelda was released in 1986, the series has expanded to include 19 entries on all of Nintendo's major game consoles, as well as a number of spin-offs. An American animated TV series based on the games aired in 1989 and individual manga adaptations commissioned by Nintendo have been produced in Japan since 1997. The Legend of Zelda ...
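One way to write the 4-digit search is sketched below on a shortened excerpt of the passage. The phone number example is invented, and its pattern assumes a simple 3-3-4 layout separated by hyphens.

```r
library(stringr)

# Shortened excerpt of the passage
zelda <- "Since the original Legend of Zelda was released in 1986, the series has expanded to include 19 entries. An American animated TV series aired in 1989 and manga adaptations have been produced in Japan since 1997."

# \d{4} means exactly four digits; \b on each side keeps the match from
# being a slice of a longer number
years <- str_extract_all(zelda, "\\b\\d{4}\\b")[[1]]

# The same idea for a made-up phone number: 3 digits, 3 digits, 4 digits
contact <- "Call the front desk at 413-555-0100 before noon."
phone <- str_extract(contact, "\\d{3}-\\d{3}-\\d{4}")
```

The two-digit "19" is skipped because it does not satisfy the length requirement.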
Sometimes you won’t know exactly what you want to find.
In this case, you can use “wildcards,” which let you say things like: “give me everything after this word until the end of a sentence.”
Be careful, though: a wildcard that is too broad might just select everything!
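The difference between a careful and a careless wildcard can be sketched as follows. The sample text is made up, and the patterns are just one way to express "everything after this word until the end of the sentence."

```r
library(stringr)

text <- "Regex is powerful. However, it can be fragile. Use it with care."

# Lazy wildcard: .*? takes as little as possible, so it stops at the first period
one_sentence <- str_extract(text, "However.*?\\.")

# Greedy wildcard: .* takes as much as possible, so it runs to the last period
too_much <- str_extract(text, "However.*\\.")

# An unanchored .* on its own selects the whole string
everything <- str_extract(text, ".*")
```

Here `one_sentence` holds only "However, it can be fragile." while `too_much` also swallows the sentence that follows.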
Sometimes you will want to get data out of large blobs of unstructured text.
Using the building blocks we’ve covered so far, you can specify “groups” you want to capture.
Take this blob of text from the spring schedule at Smith.
[1] "Thursday, January 26\nClasses begin at 8 a.m.\nWednesday, February 1\nLast day to add a course online\nWednesday, February 8\nLast day to drop a course online; last day to add a Five College course\nWednesday, February 15\nLast day to add a Smith course\nThursday, February 23: Rally Day\nRally Day is highlighted by an all-college convocation at which distinguished alumnae are awarded the Smith College Medal. Afternoon classes are canceled.\nSaturday, March 11–Sunday, March 19\nSpring recess. Houses close at 10 a.m. on Saturday, March 11, and reopen at 1 p.m. on Sunday, March 19.\nMonday, April 3–Friday, April 14\nAdvising and course registration for the fall 2023 semester"
We can use RegEx to capture and return important dates.
Thursday, January 26 Classes begin at 8 a.m. Wednesday, February 1 Last day to add a course online Wednesday, February 8 Last day to drop a course online; last day to add a Five College course Wednesday, February 15 Last day to add a Smith course Thursday, February 23: Rally Day Rally Day is highlighted by an all-college convocation at which distinguished alumnae are awarded the Smith College Medal. Afternoon classes are canceled. Saturday, March 11–Sunday, March 19 Spring recess. Houses close at 10 a.m. on Saturday, March 11, and reopen at 1 p.m. on Sunday, March 19. Monday, April 3–Friday, April 14 Advising and course registration for the fall 2023 semester
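A minimal sketch of capturing groups from this kind of text, assuming every date is written as "Weekday, Month Day". It uses a shortened copy of the schedule, and the variable names are illustrative.

```r
library(stringr)

# A shortened copy of the schedule text
schedule <- "Thursday, January 26\nClasses begin at 8 a.m.\nWednesday, February 1\nLast day to add a course online"

# Three groups in parentheses: weekday, month, and day of the month
date_pattern <- "([A-Z][a-z]+), ([A-Z][a-z]+) (\\d{1,2})"

# str_match_all() returns one matrix per input string:
# column 1 is the whole match, columns 2 to 4 are the captured groups
dates <- str_match_all(schedule, date_pattern)[[1]]

weekday_names <- dates[, 2]
month_names <- dates[, 3]
day_numbers <- as.integer(dates[, 4])
```

Each row of `dates` is one matched date, so the groups can be dropped straight into the columns of a data frame.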
stringr::str_detect() will give you a TRUE or FALSE if any part of a string matches the pattern you provide.
It will work on a vector, returning a single TRUE or FALSE per element.
Notice how it still said TRUE for “don’t like.”
Code is dumb!
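The "don't like" problem can be reproduced with a small sketch. The `opinions` vector is made up for illustration, and the second step is only one rough way to filter out the negated case.

```r
library(stringr)

# Made-up responses for illustration
opinions <- c("I like regex", "I don't like regex", "No opinion yet")

# "like" appears inside "don't like", so the second element is TRUE as well
mentions_like <- str_detect(opinions, "like")

# One rough fix: require "like" but rule out the phrase "don't like"
likes_it <- mentions_like & !str_detect(opinions, "don't like")
```

The pattern only checks for the characters it was given; it knows nothing about what the sentence means.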
stringr::str_count() will return a number for each element in a vector, telling you how many times a match was found.
This could be useful for counting dates, key words, or anything else you can imagine.
For example, consider searching through a record of newspapers for how often a specific politician appears.
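A sketch of the newspaper idea, using invented headlines and an invented politician named "Rivera":

```r
library(stringr)

# Hypothetical headlines; "Rivera" is an invented politician
headlines <- c(
  "Rivera meets voters as Rivera campaign grows",
  "City council approves new budget",
  "Poll shows Rivera ahead"
)

# One count per headline; \b keeps the match to the whole word
mentions <- str_count(headlines, "\\bRivera\\b")

# Total number of mentions across all headlines
total_mentions <- sum(mentions)
```

Because the result has one number per element, it can be stored as a new column next to the text it describes.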
stringr::str_extract_all() will extract your matches from the vector elements.
The result will be a list of length X, where X is the number of elements in the vector you gave it.
Each element of that list will be a vector of length Y containing all Y matches found in the corresponding input element.
test_vec3 = "Thursday, January 26
Classes begin at 8 a.m.
Wednesday, February 1
Last day to add a course online
Wednesday, February 8
Last day to drop a course online;
last day to add a Five College course"
stringr::str_extract_all(test_vec3,
"([A-Z][a-z]+, [A-Z][a-z]+ \\d{1,2})")
[[1]]
[1] "Thursday, January 26" "Wednesday, February 1" "Wednesday, February 8"
My research involves working with a lot of messy government reports.
Here is some code I used to standardize some common abbreviations for data consistency.
For example, the last line looks for the letters “bev” with a word boundary on each side and then optionally matches a trailing period. If it finds a match, it replaces “bev” or “bev.” with the word “beverage.”
gsub("\\bpd\\b", "police department", report_years$name)
gsub("\\bco so\\b", "sheriff department", report_years$name)
gsub("\\bco da\\b", "district attorney", report_years$name)
gsub("^chp-?", "california highway patrol ", report_years$name)
gsub("^bne-?", "bureau of narcotic enforcement ", report_years$name)
gsub("ca doj-bne ", " bureau of narcotic enforcement ", report_years$name)
gsub("\\buc\\b", "university of california", report_years$name)
gsub("\\bbev\\b\\.?", "beverage", x = report_years$name)
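To see a few of these replacements run end to end, here is a small sketch on a made-up vector of agency names (the `agency` vector is hypothetical, not the real `report_years$name` column). It assumes the replacements should build on one another, so each result is assigned back.

```r
# Made-up agency names for illustration
agency <- c("oakland pd", "alameda co so", "chp-golden gate")

# gsub() returns a new vector and leaves its input unchanged, so assign each
# result back when the replacements should stack
agency <- gsub("\\bpd\\b", "police department", agency)
agency <- gsub("\\bco so\\b", "sheriff department", agency)
agency <- gsub("^chp-?", "california highway patrol ", agency)
```

After these three steps, `agency` holds "oakland police department", "alameda sheriff department", and "california highway patrol golden gate".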
While doing a project that involved a large collection of online news media, I needed to remove all of the URLs in the text so that they did not muddy my text analysis.
The following RegEx looks for the pattern of a URL and removes every match.
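A minimal sketch of this kind of cleanup is shown below. The pattern is an assumption chosen for illustration, not necessarily the one used in the project: it treats anything that starts with "http://" or "https://" and runs up to the next whitespace as a URL.

```r
article <- "Read the full story at https://example.com/news?id=42 and share your thoughts."

# Assumed pattern: "http" or "https", then "://", then everything up to the next whitespace
url_pattern <- "https?://\\S+"

# Drop the URLs, then collapse the doubled space each removal leaves behind
no_urls <- gsub(url_pattern, "", article)
no_urls <- gsub("\\s+", " ", no_urls)
```

A pattern this simple also removes any punctuation that touches the end of a URL, so a sentence that ends with a link would lose its final period.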
Lab 4
SDS 270: Advanced Programming for Data Science