Lesson 6

Sets, Ranges and Alternation

Introduction

Everything we've learnt so far requires you to know the exact characters that you want to match (or at least the type of character).

Regular expressions are more powerful than that, though.

Sets, ranges, and alternation allow you to perform more complex logic in your expression.

Let's take a look...

Sets

Enclosing one-or-more characters inside square brackets means 'match any of these characters'. This is called a 'set'.

For example, the set [abc] matches either a, b, or c.

This expression selects either f or d followed by ish:

Try it out!

I wish I had chosen the fish dish.

Expression: [fd]ish

Notice how wish is not selected, because it does not begin with f or d.
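
If you'd like to check this outside the page, here is a minimal sketch using Python's re module. The pattern [fd]ish is inferred from the description above, and the variable names are just for illustration.

```python
import re

text = "I wish I had chosen the fish dish."

# [fd] matches exactly one character, either f or d; "ish" must follow it.
matches = re.findall(r"[fd]ish", text)
print(matches)  # ['fish', 'dish']
```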

Excluding characters from sets

You can also exclude characters by prefixing the set with a ^.

The expression [^abc] means everything except for the characters a, b, or c.

Here's the opposite of the previous expression. It matches any character except for f or d, followed by the text ish.

Try it out!

I wish I had chosen the fish dish.

Expression: [^fd]ish

Notice how this only matches wish now, and no longer matches fish or dish.
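
Here's a rough Python sketch of the negated set, assuming the pattern [^fd]ish (inferred from the description above). It also shows two things that are easy to miss: a negated set still has to match a character, and 'any character' includes punctuation.

```python
import re

text = "I wish I had chosen the fish dish."

# [^fd] matches any single character that is not f or d.
negated_matches = re.findall(r"[^fd]ish", text)
print(negated_matches)  # ['wish']

# A negated set still has to consume a character, so a bare "ish" does not match.
bare = re.findall(r"[^fd]ish", "ish")
print(bare)  # []

# "Any character" includes punctuation, not just letters.
mixed = re.findall(r"[^fd]ish", "Irish-ish")
print(mixed)  # ['rish', '-ish']
```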

Ranges

Specifying individual characters is fine when you only need to match a couple of them, but it's not so good when you have something more complex.

For example, to match any digit from 0 to 9 using that method, you'd have to write [0123456789]. That's not ideal.

That's where 'ranges' come in, letting you specify a range of characters to match.

Using ranges

Ranges are also enclosed in square brackets.

Instead of listing individual characters, though, you specify the start and end of the range, separated by a hyphen.

This expression matches any (lowercase) character from a-z, followed by the literal text ish:

Try it out!

I wish I had chosen the fish dish.

Expression: [a-z]ish
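
As a quick check in Python (a sketch only, with the pattern [a-z]ish inferred from the description above), the range picks up all three words. The second test string is made up to show that the range really is lowercase-only.

```python
import re

# Every lowercase letter is in the range, so all three words match.
range_matches = re.findall(r"[a-z]ish", "I wish I had chosen the fish dish.")
print(range_matches)  # ['wish', 'fish', 'dish']

# An uppercase F is outside a-z, so only the second word matches here.
capital = re.findall(r"[a-z]ish", "Fish and fish")
print(capital)  # ['fish']
```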

Excluding ranges

The ^ character can also be used to 'negate' a range, the same way as we did with individual characters above.

This expression selects any character except for the ones in the range a-f, followed by ish. See how it no longer selects 'fish' or 'dish', because each starts with a character in this range:

Try it out!

I wish I had chosen the fish dish.

Expression: [^a-f]ish
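
A small Python sketch of the negated range, assuming the pattern [^a-f]ish (inferred from the description above). It also compares the range against the equivalent hand-written set, to show they behave the same way.

```python
import re

text = "I wish I had chosen the fish dish."

# f and d both fall inside a-f, so only "wish" survives.
negated_range = re.findall(r"[^a-f]ish", text)
print(negated_range)  # ['wish']

# The range is just shorthand for listing every character in it.
spelled_out = re.findall(r"[^abcdef]ish", text)
print(spelled_out == negated_range)  # True
```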

Combining ranges in sets

Just as you can combine multiple characters in a set ([abc] matches the individual characters a, b or c), you can also specify multiple ranges in a set.

The expression below matches either a character from a-d or a character from p-z.

Try it out!

I wish I had chosen the fish dish.

Expression: [a-dp-z]

You'll commonly see this used in the set [A-Za-z], which matches any uppercase or lowercase letter (when the /i modifier isn't being used to specify case-insensitivity, of course).
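
To see combined ranges in action, here is a hedged Python sketch. The pattern [a-dp-z] is inferred from the description above, and the + quantifier on [A-Za-z] is an addition of mine so that whole words come back rather than single letters.

```python
import re

text = "I wish I had chosen the fish dish."

# One set, two ranges: each match is a single character from a-d or p-z.
picked = "".join(re.findall(r"[a-dp-z]", text))
print(picked)  # wsadcstsds

# [A-Za-z] covers both cases without needing a case-insensitive flag.
words = re.findall(r"[A-Za-z]+", text)
print(words)  # ['I', 'wish', 'I', 'had', 'chosen', 'the', 'fish', 'dish']
```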

Tricks with ASCII

When you specify a character in a range, you're actually referring to the numeric code of that character in the character set you're using - usually ASCII.

That means that [a-z] actually means 'ASCII character code 97 (a) to character code 122 (z)'.

Because of this, we can use ranges to do some useful things!

For example, the first printable character in the ASCII character table is the space. The characters before this are 'un-printable' characters, such as the tab and the carriage return.

The last printable character is the ~.

That means that the range [ -~] will match all printable characters in ASCII.

If you see this in a regular expression, now you know what it means!
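
You can confirm the character codes yourself with a short Python sketch. Python strings use Unicode code points, which agree with ASCII for these characters. The helper name is_printable_ascii is made up for this example.

```python
import re

# ord() returns a character's numeric code.
print(ord("a"), ord("z"))  # 97 122
print(ord(" "), ord("~"))  # 32 126

PRINTABLE = re.compile(r"[ -~]+")

def is_printable_ascii(s: str) -> bool:
    """True if s is non-empty and every character lies between space and ~."""
    return PRINTABLE.fullmatch(s) is not None

print(is_printable_ascii("Hello, world! ~"))  # True
print(is_printable_ascii("tab\there"))        # False (tab is code 9)

# Negating the range strips everything that is not printable ASCII.
cleaned = re.sub(r"[^ -~]", "", "caf\u00e9\t!")
print(cleaned)  # caf!
```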

Using Quantifiers

You can also use quantifiers with sets and ranges. They are added after the square brackets.

Remember the + quantifier from Lesson 5? It means 'one-or-more times'.

Here we use it to match any digit (0-9) one or more times (+), followed by a % symbol:

Try it out!

Genius is 1% inspiration, 99% perspiration

- Thomas Edison

Expression: [0-9]+%
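
Here's a minimal Python sketch of the quantifier at work, assuming the pattern [0-9]+% (inferred from the description above). The second search drops the + to show what the quantifier is buying you.

```python
import re

quote = "Genius is 1% inspiration, 99% perspiration"

# With +, the set can repeat, so "99" is matched as a whole.
with_plus = re.findall(r"[0-9]+%", quote)
print(with_plus)  # ['1%', '99%']

# Without it, the set matches exactly one digit, so "99%" is cut down to "9%".
without_plus = re.findall(r"[0-9]%", quote)
print(without_plus)  # ['1%', '9%']
```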

Alternation

The | symbol in a regular expression acts like an 'or'.

This expression finds the word 'creativity' or the word 'intelligence':

Try it out!

Creativity is intelligence having fun.

- Albert Einstein

Expression: creativity|intelligence

This is called 'alternation'.
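
A rough Python sketch of alternation, assuming the pattern creativity|intelligence. Note that the quote spells 'Creativity' with a capital C. Whether the original expression used a case-insensitive modifier isn't stated here, so the sketch shows both behaviours.

```python
import re

quote = "Creativity is intelligence having fun."

# Case-sensitive: the capital C stops the first option from matching.
strict = re.findall(r"creativity|intelligence", quote)
print(strict)  # ['intelligence']

# re.IGNORECASE is Python's equivalent of the /i modifier.
relaxed = re.findall(r"creativity|intelligence", quote, re.IGNORECASE)
print(relaxed)  # ['Creativity', 'intelligence']
```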

You can also use alternation as part of a bigger expression, by enclosing the options in round brackets (parentheses). You can have as many options as you like, as long as you separate them with the | symbol.

This expression finds either the word Tell, Teach, or Involve, followed by the word me:

Try it out!

Tell me and I forget. Teach me and I remember. Involve me and I learn.

- Benjamin Franklin

Expression: (Tell|Teach|Involve) me

These round brackets are called a 'capturing group', and they're really useful for other reasons too.

We'll look at capturing groups in more detail in the next lesson.
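
To see why the round brackets matter, here is a hedged Python sketch comparing the grouped pattern (Tell|Teach|Involve) me, inferred from the description above, with an ungrouped version. One Python-specific detail: re.findall returns only the group's contents when a pattern contains a group, so the sketch uses re.finditer to get each whole match.

```python
import re

quote = "Tell me and I forget. Teach me and I remember. Involve me and I learn."

# Grouped: " me" must follow whichever option matched.
grouped = [m.group(0) for m in re.finditer(r"(Tell|Teach|Involve) me", quote)]
print(grouped)  # ['Tell me', 'Teach me', 'Involve me']

# Ungrouped: | splits the whole pattern, so " me" belongs only to the last option.
ungrouped = re.findall(r"Tell|Teach|Involve me", quote)
print(ungrouped)  # ['Tell', 'Teach', 'Involve me']

# The capturing group itself holds just the option that matched.
captured = re.findall(r"(Tell|Teach|Involve) me", quote)
print(captured)  # ['Tell', 'Teach', 'Involve']
```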

Mini-Game

The combination of sets, ranges and alternation allows us to do some powerful things.

You're going to need all of them for this game, I'm afraid. It's a tricky one!

Let's build a tool that detects dates in an email and automatically overlays them on a calendar.

The email below contains four dates. Each date you match will be added to the calendar.

Match them all (without accidentally matching the times), to win!

To: you@youraddress.com

From: boss@yourcompany.com

Hi,

Let's meet on the 1st, 2nd, 5th or the 23rd. I can do 9am or 1pm.

Thanks.

Your expression:

  • The dates contain either one or two digits
  • They are then followed by one of the strings st, nd, rd or th
  • You'll need to use 'alternation', and there's a quantifier in this expression somewhere too!
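
If you want to check your answer away from the page, here is one possible solution sketched in Python. It is only an assumption about what the game accepts. It follows the hints above, except that + allows any number of digits rather than strictly one or two.

```python
import re

email = "Let's meet on the 1st, 2nd, 5th or the 23rd. I can do 9am or 1pm."

# One or more digits, then one of the four suffixes.
# The times are skipped because "am" and "pm" are not among the options.
date_pattern = re.compile(r"[0-9]+(st|nd|rd|th)")

dates = [m.group(0) for m in date_pattern.finditer(email)]
print(dates)  # ['1st', '2nd', '5th', '23rd']
```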