Regexp in Ruby (Advent of Code, Day 5)

December 28, 2015

Completing the New York Times crossword puzzle is a morning ritual for thousands of Americans. While I personally never partook in this ritual (the less time-intensive Jumble fit into my morning better), it was something I could definitely see myself doing if I could justify the time required. So, when I found about Eric Wastl’s Advent of Code, which offered a new programming challenge daily for 25 days in December, it quickly became part of my morning (or sometimes late night) routine.

There was a lot to like about Advent of Code. First, most of the problems were of just the right complexity: 20-40 minutes on average to come up with a working solution, but engaging enough to really require one to think. Second, there was a very nice variety of problem types, some math-based, some that were best solved using advanced data structures, some that allowed a Rubyist to delve deeply into infrequently used features of the language, and some that lent themselves to object-oriented programming. Third, the Advent of Code reddit community was a great resource for seeing how other programmers approached the problem and implemented their solutions in a range of different languages.

As Advent of Code occupied a lot of my mental energy in December (deepening my knowledge of writing tests for Rails was the other thing I focused on this month), I am going to use a few of my favorite solutions as jumping off points for a series of short technical blogs. (The rest of my Advent of Code solutions can be found in a GitHub repo.) My first post covers regular expressions in Ruby, and is directed toward those without a lot of prior experience using Regex.

Intro to Regexp in Ruby

Regular expressions, commonly called “Regex” in programming vernacular (or sometimes “Regexp,” as it exists in the Ruby language), provide a convenient way to locate matching patterns within a string. Need to validate that something is an email address or a phone number? Regex is your friend.

Surprisingly (but perhaps not so surprisingly to those who know Ruby fairly well), a regular expression in Ruby is an object that belongs to the Regexp class. An instance of this class can be created by placing the pattern to be matched between two forward slashes. For example, the regular expression /aa/ would generate matches in the strings "aardvark" and "naan" (but very few other English words).

To look for a pattern match, you obviously need two things—a pattern and a string. Once you have these two things, there are three Ruby “find pattern” methods that can be called:

  • #match returns a special MatchData object that behaves similarly to an array if a match is found; nil if there is no match
    "devushka".match(/dev/) # <MatchData "dev">
    /dev/.match("chlopak")  # nil
  • =~ returns the string index of the starting point of the first match if one is found; nil if there is no match
    /ipp/ =~ "Mississippi"  # 7
    "rhinoceros" =~ /ipp/   # nil
  • #scan returns an array containing all found matches or an empty array if none are found
    "kolokola".scan(/ko/)   # ["ko", "ko"]
    "Karelian".scan(/ko/)   # []

Note that the order of the string and Regexp is unimportant for #match and =~, and also that both these methods return a “truthy” value when a match is found and a “falsy” value if one is not (and thus can be conveniently used in conditionals). However, #scan must be called on a string (i.e. /a/.scan("banana") does not work) and will return a “truthy” value whether or not a match is found. Also, both #match and =~ will find only the first match, whereas #scan hoovers up all the matches.

Those are the basics of Regexp in Ruby (though perhaps the “replace” methods #sub and #gsub should be added to the three “find” methods). But, in order to fully unlock Regexp’s power, some additional techniques are required.

Wildcards and Ranges and Captures (Oh My!)

Of course, when we look for patterns, maybe we don’t want to look for a set combination of letters, but something more abstract like a group of letters and numbers followed by the @ symbol. This is where wildcards come into play. The broadest wildcard is ., which can stand in for any character except a new line, but the wildcards \d (any digit), \w (any “word character”), and \s (any whitespace character) see a lot of use as well.

To give two brief examples, if a linguist wanted to analyze the frequency of letters that precede “n” in a certain language, all the matches within a huge text corpus could be retrieved with .scan(/.n/) and then tallied. Or, if one wanted to find all the words that end with “ing” in a document, .scan(/\w+ing\b/) would do the trick. (\w+ finds sequences of “word characters” one character or longer, while \b identifies word boundaries. For interactive confirmation that this works, try out the excellent site Rubular, a great resource for working with Regexp in Ruby.)

More precision can be achieved by using ranges, which are placed inside brackets. Whereas the \w wildcard picks up letters, numbers, and underscores, using [a-z] restricts matches only to (lowercase) letters. Thus we have .scan(/[a-z]+ing\b/). Similarly, if the linguist in the above paragraph wanted to study only vowels preceding “n”, .scan(/[aeiouy]n/) would provide a succinct solution to the problem.

Finally, capture groups allow one to access specific parts of a successful match. They are designated by including the desired capture within parentheses, and this can be done multiple times within a single Regexp. An example will make this clearer. Say that you’re trying to isolate the different parts of a properly formatted Social Security Number:

ssn = "333-22-4444"
match = ssn.match(/\d{3}-\d{2}-\d{4}/)
# <MatchData "333-22-4444">

This might look a little confusing at first, but it’s not too difficult: The Regexp pattern given looks for a sequence of exactly 3 digits (\d is a wildcard that means any digit and {3} means exactly three in a row) then a dash then exactly 2 digits then a dash then exactly 4 digits. Now, to add the three sets of parentheses for the three desired capture groups:

ssn = "333-22-4444"
matched_data = ssn.match(/(\d{3})-(\d{2})-(\d{4})/)
# <MatchData "333-22-4444" 1:"333" 2:"22" 3:"4444">

Each of the digit sequences has been enclosed in parentheses, and these subpatterns are now included in the MatchData object. If we want to access the third capture group (that is, the group of 4 digits), Ruby gives us three different ways to do this:

$3                 # "4444"
matched_data[3]           # "4444"
matched_data.captures[2]  # "4444"

Upon their creation, Ruby stores capture groups in numbered global variables, within the MatchData object, and within something that can be conceptualized as a captures subarray within the MatchData object. Access is possible from any one of these three locations as demonstrated in the code above.

Finally, capture groups can also be referenced from within a regular expression as will be seen below.

Advent of Code, Day 5 – The Problem

In keeping with the Christmas theme, the Day 5 problem asked the user to develop a program that would help Santa quickly identify “nice” and “naughty” strings within a huge text file. For Part One of the problem, nice strings were defined as those that had 3 vowels, a double letter, and did not have adjacent combinations of certain consecutive letters (ab, cd, pq, and xy). For Part Two, nice strings were those that had “a pair of any two letters that appears at least twice in the string without overlapping, like xyxy (xy)” and “at least one letter which repeats with exactly one letter between them, like xyx, abcdefeghi (efe), or even aaa.”

Now, this problem could be solved without using regular expressions (for example, by setting up a while or for loop that steps through the string indices while checking for certain conditions), but such a solution would be much less elegant and arguably harder for a human to read. Therefore, it’s time to use what we’ve reviewed above to write some nice reusable one-line methods that leverage the power of Regexp.

Advent of Code, Day 5 – A Solution

Let’s get right to it. Here is my code for Part One:

def vowels_count(str)

def doubles_okay?(str)

def forbiddens_okay?(str)

def nice_string?(str)
  vowels_count(str) >= 3 && doubles_okay?(str) && forbiddens_okay?(str)

The first method, vowels_count, uses #scan to shove every individual vowel into an array and then determines the size of this array. This should seem pretty straightforward: "khachapuri".scan(/[aeiou]/) results in the output ["a", "a", "u", "i"].

The second method, doubles_okay?, is not nearly as straightforward. Let’s focus on the Regexp part for now. By using (.), where ., as you’ll recall, is a wildcard that can stand in for any character, I am in essence inspecting by turn each and every individual character in the string and assigning it to a capture group. Then, I reference this provisional first capture group using \1 in order to determine whether the character following the wildcard is the exact same character. If this occurs somewhere in the string, then the pattern /(.)\1/ is matched and a MatchData object is returned. However, if there is no double letter in the string, then there will not be a match and nil will be returned instead.

In theory, it is sufficient to send either the MatchData object (a truthy value) or nil (a falsy value) to the nice_string? method—the answer we would get back would be the same. However, it is cleaner to coerce the result of #doubles_okay? into a Boolean using !!. If a match is found, the first ! changes the MatchData object into false and the second ! flips this false into true. Analogously, if a double letter isn’t present, nil becomes true and then ultimately false.

The third method, forbiddens_okay?, will result in MatchData if any one of the four forbidden sequences (split by the | symbol in the Regexp) are found in the string. We want such a case to send false to the nice_string? method (because forbiddens are not “okay”), so the ! operator is used to switch the result into the appropriate Boolean value.

Finally, now that all three of the Regexp building blocks are in place, using the && operator in the nice_string? method will cause the method to return true should all three of the other methods return true and false otherwise.

My approach to Part Two was similar:

def has_pairs?(str)

def has_mini_palindrome?(str)

def nice_string2?(str)
  has_pairs?(str) && has_mini_palindrome?(str)

The method has_pairs? sequentially looks at each pair of side-by-side characters in the string and provisionally makes it a capture group. The .* signifies that zero or more characters can separate the initial pair and the second occurrence of the pair, which (as in the doubles_okay? method above) is represented by \1, a reference to the first capture group. After the #match method returns a MatchData object or nil, this result is then coerced into a Boolean using !!.

The next method, has_mini_palindrome?, is very similar to doubles_okay? with the only substantive difference being that there has to be exactly one character between the pair of letters. Therefore, we insert the all-inclusive wildcard . between /(.)\1/ to get /(.).\1/. If necessary, you can convince yourself that this works as expected by playing around with Rubular. The final step, though not strictly necessary because we already have truthy and falsy values, is to transform this result into a Boolean with !!.

Lastly, the brilliantly named method nice_string2? will return true if and only if both subconditions are true.

Final Remarks

I hope it is clear from this example how using Regexp can succinctly solve a variety of pattern matching problems. By way of contrast, imagine having to write a has_pairs? method without being able to leverage Regexp. It might end up looking something like this:

def has_pairs?(str)
  i = 0
  while i < str.length - 1
    pair = str[i..i+1]
    j = i + 2
    while j < str.length - 1
      return true if pair == str[j..j+1]
      j += 1
    i += 1

While admittedly str.match(/(..).*\1/) might be hard for a Regexp neophyte to read, I still think it looks much cleaner than the nested while loops above. As such, Regexp is well worth the effort to master for anyone who regularly deals with parsing or validating strings in their programming work.