Regular expressions are a powerful tool for searching and filtering text. This first article will give you some background information on regular expressions, so that we can move on to using them to dig through your server’s logs.
RegEx is about finding stuff
Regular expressions (often shortened to ‘RegEx’, which is what I’ll call them from here on) are, at their core, a system for finding and matching pieces of text. We’re all used to searching the text of a document, but have you ever noticed that text search… kinda stinks? We’re also used to searching with Google, and that can work OK, especially if you play around and get fancy with your queries, e.g., include this string but not that string. That type of logic is what RegEx can do, but it can also do a lot more.
Let’s say you’re writing an essay and realize you need to change every instance of “cherry” to “blueberry.” You do a find/replace, and suddenly your story reads “Marian sliced each blueberry and poured the cherries into the pie”.
The search we’re used to only finds exact matches, but our minds can come up with all kinds of searches that are more complex. RegEx can do all that.
“Find every word that starts with ‘cherr’ and ends with either ‘ies’ or ‘y’” is an easy search to do with a RegEx.
A word on language: the term for a particular regular expression is a ‘pattern’.
Here is all the RegEx knowledge we’ll need
In part 2 of this article I will show you how to search the logs of your web server for IP addresses. To do this we’ll need about half of the capabilities of RegEx. This is not a comprehensive tutorial but it does cover what we’ll need. Notably if you master everything here you can do a lot more later!
You can try out RegEx in the browser
Sites like https://regexr.com let you write and test RegEx in the browser. It takes a standard pattern (wrapped with /s) and lets you set a few options for the execution of your patterns. Let’s write the simplest kind of regex, a literal that says ‘find these exact letters in this exact order’: the pattern to find the word ‘cat’ is /cat/.
A pattern can be followed with a few options:
- Case sensitive or no? I always leave mine case sensitive, which forces me to explicitly say [a-zA-Z] within my RegEx if I want to capture lower and upper case. Whatever you do, just remember this setting exists.
- Substring match or whole line? If you search the text:
A cat loves cream
Like a mouse loves milk
And a crow the dawn
With the pattern /cat/, substring mode will give you only the part of the line that exactly matched what you searched for (‘cat’), while a whole line search will give you everything from the line that has a match (‘A cat loves cream’).
Note: there are a bunch of other options available that I never use, so I won’t go over them here (remember, I said this is just what we need).
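To make the two options concrete, here is a minimal sketch using Python's built-in `re` module (an assumption on my part; the article itself only uses regexr). Python writes patterns as raw strings such as `r"cat"` rather than wrapping them in slashes, and the variable names are made up for illustration.

```python
import re

poem = """A cat loves cream
Like a mouse loves milk
And a crow the dawn"""

pattern = re.compile(r"cat")  # the literal pattern /cat/, case sensitive by default

substring_hits = []   # only the text that matched
whole_line_hits = []  # every line that contains a match
for line in poem.splitlines():
    match = pattern.search(line)
    if match:
        substring_hits.append(match.group())
        whole_line_hits.append(line)

print(substring_hits)   # ['cat']
print(whole_line_hits)  # ['A cat loves cream']

# The case sensitivity option, switched off for one search
print(re.findall(r"cat", "Cat, cat, CAT", flags=re.IGNORECASE))  # ['Cat', 'cat', 'CAT']
```

Substring mode corresponds to `match.group()`, while whole line mode corresponds to keeping the `line` that produced the match.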
The wildcard
The dot . matches any character (in most RegEx flavors, any character except a line break). If you want to match the dot as a character, escape it like this: \.
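A quick sketch of the wildcard versus the escaped dot, again assuming Python's `re` module; the sample strings are invented for the example.

```python
import re

# The dot stands in for any one character
wildcard_hits = re.findall(r"c.t", "cat cot c-t ct")
print(wildcard_hits)  # ['cat', 'cot', 'c-t']

# Unescaped, the dot also matches the '5' in '3514'
loose_hits = re.findall(r"3.14", "3.14 3514")
print(loose_hits)  # ['3.14', '3514']

# Escaped, it only matches a real dot
strict_hits = re.findall(r"3\.14", "3.14 3514")
print(strict_hits)  # ['3.14']
```

Note that ‘ct’ is not matched by the first pattern: the dot requires a character to be there.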
“Make sure this character is there, or don’t. Whatever”
A question mark ? means that the preceding character is optional. If you want to match an actual question mark, escape it: \?
(In general: any time you want to search for a literal character that could also be used in a RegEx pattern, like ?, [, ., or even \, you can put \ in front of it to say ‘search for this character’.)
The question mark is the first way we get matches of variable length: a pattern like /car?t/ will match either ‘cat’ or ‘cart’.
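Here is a small sketch of the optional character and the escaped question mark, assuming Python's `re` module and some made-up sample text.

```python
import re

# 'r' is optional, so both 'cat' and 'cart' match; 'carrt' has one 'r' too many
optional_hits = re.findall(r"car?t", "cat cart carrt")
print(optional_hits)  # ['cat', 'cart']

# Escaping the question mark searches for the character itself
literal_hits = re.findall(r"why\?", "why? why")
print(literal_hits)  # ['why?']
```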
One of several possibilities
A set is one or more characters enclosed in brackets, like [aqt]. It matches exactly one of those characters: in this example only a, q or t.
Ranges
You can also specify a range inside the brackets, like [0-9] or [a-z], which will match any one character in the range.
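Sets and ranges can be sketched the same way. This assumes Python's `re` module, and the sample strings are only illustrations.

```python
import re

# A set matches one character at a time, and only the listed ones
set_hits = re.findall(r"[aqt]", "quart")
print(set_hits)  # ['q', 'a', 't']

# A range is shorthand for a set of consecutive characters
digit_hits = re.findall(r"[0-9]", "room 42b")
print(digit_hits)  # ['4', '2']

# Sets combine with literals: one of 'b' or 'c', then 'at'
word_hits = re.findall(r"[bc]at", "bat cat rat")
print(word_hits)  # ['bat', 'cat']
```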
Find that same character this number of times
We’re almost done, and I want to teach one thing which we don’t technically need but is super useful in making shorter patterns. When searching for a year value we might look for four digits in a row. Rather than writing [0-9] four times in a row, we can just use /[0-9]{4}/ to match the preceding pattern four times in a row.
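As a sketch of the year search, assuming Python's `re` module and an invented sentence, the long and short forms of the pattern find the same things:

```python
import re

text = "Founded in 1998, moved in 2007."

# Writing the set four times...
long_form = re.findall(r"[0-9][0-9][0-9][0-9]", text)
# ...is the same as asking for it {4} times
short_form = re.findall(r"[0-9]{4}", text)

print(long_form)   # ['1998', '2007']
print(short_form)  # ['1998', '2007']

# Caveat: the pattern only asks for four digits, so it also
# matches the first four digits of a longer number
print(re.findall(r"[0-9]{4}", "12345"))  # ['1234']
```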
Find this pattern any number of times
Instead of saying that we want to see a character an exact number of times, we can say “match this pattern any number of times” with *.
If what you’re matching is a single character long, you can just add an asterisk at the end, e.g. [a-n]* would mean ‘match any letter a-n any number of times’ (including zero times; note this doesn’t require the character be the same, so this will match ‘aaaaa’ or ‘bbbb’ just fine but also ‘banana’).
If you want to match a repeated sequence, use parentheses, e.g. ([a-b][m-q][a-z])*
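The sketch below checks whole words against both star patterns. It assumes Python's `re` module; `fully_matches` is a hypothetical helper written for this example, and it uses `re.fullmatch` so that the entire word has to fit the pattern.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None

# Any letters from a to n, in any mix, any number of times (even zero)
for word in ["aaaaa", "bbbb", "banana", "zebra", ""]:
    print(repr(word), fully_matches(r"[a-n]*", word))

# A repeated three-character sequence: [a-b], then [m-q], then [a-z]
group = r"([a-b][m-q][a-z])*"
print(fully_matches(group, "anybox"))  # True: 'any' followed by 'box'
print(fully_matches(group, "anyb"))    # False: the last group is incomplete
```

Only ‘zebra’ fails the first check, because z, e and r fall outside a-n.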
Aren’t you going to explain special characters? What about whitespace? What about that regular expressions were invented by Marisol Regular and Constance Expression one magical afternoon in the Versailles of 1782-
Again, this is just the RegEx we need to find IP addresses in a log file. It’s the things I think a beginner should know. As your knowledge expands you’ll be able to write a pattern that performs the same search in many fewer characters, or is much more careful and selective about its results. To go just a bit deeper, read the docs on regexr and keep exploring!
RegEx is hard to read
Regular expressions are extremely compact and as such they’re very hard to read for the novice. The pattern for the cherry search I described above looks pretty straightforward:
/cherr(y|(ies))/
(This uses the ‘or’ pipe character |, which we’re not going to use otherwise.)
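To see the cherry pattern at work on the sentence from the find/replace story, here is a minimal sketch assuming Python's `re` module. The `to_blueberry` helper is hypothetical, written only for this example.

```python
import re

sentence = "Marian sliced each cherry and poured the cherries into the pie"
pattern = re.compile(r"cherr(y|(ies))")

found = [m.group() for m in pattern.finditer(sentence)]
print(found)  # ['cherry', 'cherries']

def to_blueberry(match):
    # Group 1 holds whichever ending matched: 'y' or 'ies'
    return "blueberry" if match.group(1) == "y" else "blueberries"

fixed = pattern.sub(to_blueberry, sentence)
print(fixed)  # Marian sliced each blueberry and poured the blueberries into the pie
```

Unlike the plain find/replace, both the singular and the plural are caught in one pass.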
But a pattern to find phone numbers can start to look incomprehensible:
/\(?[0-9]{3}\)?.?[0-9]{3}[\-\w]?[0-9]{4}/
I mention this not to discourage you but rather to establish that it’s okay if RegEx patterns look confusing at first. Additionally, you can create patterns quite successfully without using every technique, so don’t be surprised if after gaining some RegEx experience, you see examples or Stack Overflow answers that make less sense!
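One way to make a dense pattern less intimidating is to run it against a few inputs and see what it accepts. The sketch below tries the phone number pattern from above using Python's `re` module (an assumption); the sample numbers are made up.

```python
import re

phone = re.compile(r"\(?[0-9]{3}\)?.?[0-9]{3}[\-\w]?[0-9]{4}")

samples = ["(555) 123-4567", "555-123-4567", "5551234567", "call me maybe"]

# Record whether the pattern is found anywhere in each sample
results = {sample: phone.search(sample) is not None for sample in samples}
for sample, found in results.items():
    print(f"{sample!r}: {found}")
```

The three phone-like strings are found and the last one is not. Every piece of the pattern is something covered above (escapes, optional characters, sets, ranges and counts), just packed tightly together.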
RegEx works character by character
At its most basic, RegEx works character by character. A pattern like /[aeiou]/ is saying ‘this single character can be one of these five letters’, while a series of bare letters (literals), e.g. /cat/, is really saying ‘the first character must be c, the next one a, and the last one t’.
Most of the time we can think of the system following our RegEx pattern as a little robot that is reading lines of text like a strip of paper; at any time it’s only looking at one character.
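The little robot can be sketched by hand for the simplest case, a literal like /cat/. This is only an illustration of the idea in plain Python, not how any real RegEx engine is implemented; `find_literal` is a hypothetical name.

```python
def find_literal(text, literal):
    """Return the index where `literal` first appears in `text`, or -1."""
    for start in range(len(text)):
        # From this starting point, compare one character at a time
        for offset, wanted in enumerate(literal):
            position = start + offset
            if position >= len(text) or text[position] != wanted:
                break  # mismatch: slide the strip along and try again
        else:
            return start  # every character lined up
    return -1

print(find_literal("A cat loves cream", "cat"))      # 2
print(find_literal("And a crow the dawn", "cat"))    # -1
```

At no point does the function look at more than one character of the text at once, which is the point of the robot picture.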
Why is this important to remember? It explains why you can’t do math-based tests with a RegEx. You can’t easily write a RegEx to find every prime number or every perfect square, because those aren’t things you can identify by looking at one character at a time.
Puzzle for the reader: it is possible, even easy, to write a RegEx that finds only odd or only even numbers. I’ll leave it to you to think about how.
Now, while this is all true in practice, a technique I haven’t talked about (though I used it in the ‘cherry/cherries’ example) is subgroup matching, which _does_ allow more complex logic around groups of characters. I won’t use it since it’s unnecessary for simple tasks, and also because even a short RegEx using subgroups can become very computationally expensive. This brings us to...
RegEx can be dangerous!
Recently a lot of the internet stopped working for a bit, and a regular expression was (part of) the reason. Regular expressions can require computers to do things that take so much time their CPU can get maxed out, as happened to Cloudflare this month.
While it was a regular expression that caused the problem, a more accurate summary of “why did Cloudflare go down” would be that they pushed an update to all their servers at once, which isn’t a great deployment strategy.
So you can write a RegEx that’s very hard for a computer to execute and uses a lot of CPU time. This isn’t really that different from any computer code you write, but it _does_ explain why sites like Google don’t let you write RegEx to search with.
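Here is a small, hedged demonstration of how a short pattern can get expensive. It assumes a backtracking engine such as Python's `re` module, and the pattern `(a+)+b` is a common textbook illustration, not the expression involved in the Cloudflare incident. The input sizes are kept small on purpose; each extra ‘a’ roughly doubles the work, so larger values can take a very long time.

```python
import re
import time

# Nested repetition: the engine can split a run of 'a's in many different ways
risky = re.compile(r"(a+)+b")

def time_match(n):
    """Try to match n 'a's with no 'b'; return (matched, seconds taken)."""
    text = "a" * n
    start = time.perf_counter()
    matched = risky.match(text) is not None
    return matched, time.perf_counter() - start

for n in (10, 14, 18):
    matched, seconds = time_match(n)
    print(f"n={n} matched={matched} took {seconds:.4f}s")
```

The match always fails because there is no ‘b’, but before giving up the engine tries every way of dividing the ‘a’s between the inner and outer repetition. Exact timings will vary from machine to machine.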
Now, go practice your skills
The best way to learn RegEx is not with practical exercises, but with a few fun games:
- RegEx Golf (one of the higher levels of this game took me two weeks to do, so, quit when you’re ready)
- RegEx Crossword
- Interactive Exercises
- The Regular Expression Game