Regex in R with stringr

Background on stringr

stringr is a wrapper, and implementation that lies over top of the stringi R package. In addition to these packages you can also use the base regular expression functions (regexpr(), gsub() and others). It’s critical to point out that there is a difference in the implementation of some of these expression engines, in particular the native regexpr implementation and the stringi/stringr implementations. The stringi implementation is much closer to the native regular expression implementation as seen in other programming languages.

Some critical resources

Regular expressions

String matching is critical for making the internet work. Password validation, checking URLs, email & credit card checks. . . in fact, regular expressions are associated with the earliest development of modern computers, with a standardized language developing in the mid 1950s. The term grep comes from Global Regular Expression search and Print, which is exactly what R’s grep() function does, and what grep does on almost any Linux based system.

For this tutorial I also wrote a small Shiny app to test out some of the functions. You can access the app from my Shiny Apps repo, and, as we continue working, you can clone and make pull requests from the RegularExpressionR GitHub repository. In particular, the “short story” I wrote is intended to throw some curveballs, so if you feel that you can think of some tricks, please go ahead and add them in.

Important Matches

Along with matching specific strings (like Goring, the default in the Shiny app), it’s possible to match classes of characters, for example all upper case characters, all lower case characters, or all numbers.

Expression Explanation
. Matches anything (you can use \. to match a period).
\s Matches a space.
\d Matches a digit.
\w Match characters and numbers.
[:alpha:] Matches alphabetical characters.
[:upper:] Matches all capital letters.
[:lower:] Matches all lower case characters.

Matching Multiples

Expression Explanation
* Any number of repeats.
{n} n repeats where n is a number.
{n,} n or more repeats.
{n,m} n or more repeats, but m or less.

Matching Locations

Expression Explanation
^ At the start.
$ At the end.
? The lazy question mark! Optional. . .

Using stringr

So, given these options, lets see what we can do. stringr has a number of key functions that we’ll explore here. There is a much more extensive tutorial

  • str_detect()
  • str_extract()
  • str_locate()
  • str_match()
  • str_replace()

There are a few exercises that I think we should try out. One thing that will be helpful as you’re testing is using the htmltools::browsable() function at the end of your call. It will post your text output to the browser window. You can further fix things up by adding tags or <br> elements for carriage returns.

Detecting:

  1. I made a mistake adding in personal details. When I showed this to my mom she got really mad at me and asked me to remove all mention of her. Let’s find all the lines that mention mom and then cut them. You can read the text file using the readr package’s read_lines() function and pulling from the same data file as the Shiny app:

file <- 'https://raw.githubusercontent.com/SimonGoring/RegularExpressionR/master/data/raw_file.txt'

text_file <- readr::read_lines(file) %>% ?????

Extraction

  1. Extract all the dollar amounts quoted in the story and plot the values. Did you get everything?

file <- 'https://raw.githubusercontent.com/SimonGoring/RegularExpressionR/master/data/raw_file.txt'

text_file <- readr::read_lines(file) %>% ?????
  1. Extract all the years that the papers were published. Be careful to craft the statement correctly so you don’t get all four character values!


file <- 'https://raw.githubusercontent.com/SimonGoring/RegularExpressionR/master/data/raw_file.txt'

text_file <- readr::read_lines(file) %>% ?????

Replacing

  1. In the Shiny app I do string matching, and then replace the text with the matched string, but also add an HTML tag (<span>) that includes the highlighting tag (background). Along with my mom’s personal information, I accidentally added a bunch of phone numbers. Go in and replace all of them with XXX-XXX-XXXX so that they’re masked. Can you get them all?

  2. Bonus Question: Can you pull everything out of quotes?

Conclusion

This is just an introduction to some of the tools you can use with regular expressions. The book “Mastering Regular Expressions” is a 500+ page tome, and there are bigger volumes. I also want to point you to a really cool post called “The Greatest Regex Trick Ever”. It’s pretty great at walking you through something that is, on the face of it, pretty simple, but, ultimately is actually quite elegant and complicated.