Lesson 8

Groups

Introduction

So far, every regular expression we've learned either matches or it doesn't, and that's all it tells us.

Regular expressions can do so much more though.

In this lesson, we'll learn how to return data from a regular expression, so we can use it later.

This is a long lesson, but it will really unlock the power of regular expressions, so strap in and let's get started!

Capturing Groups and match()

In the last lesson, we learnt about the match() function, and briefly mentioned matchAll().

These functions return the matches for the regular expression, but also the 'capturing groups'.

A 'capturing group' is a way to return parts of a regular expression match.

Let's work through an example that shows the power of capturing groups ...

Without capturing groups

Suppose we're trying to grab a person's height from a string like this:

"Bob is 183cm tall"

Using what we know already, we could write a regular expression to find the height:

// '[0-9]' Find the numbers 0-9 .. 
// '+'     one-or-more times ..    
// 'cm'    followed by the text 'cm'
"Bob is 183cm tall".match(/[0-9]+cm/)

// [ "183cm" ]

That's great, but we would need to write code that removes the pesky cm from the end. We can do better.
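To see what that clean-up looks like, here is a minimal sketch of the manual approach. The variable names (found, heightText) are just for illustration, and it assumes the unit is always the two characters 'cm':

```javascript
// Without a capturing group, match() gives us the whole match, units included
let found = "Bob is 183cm tall".match(/[0-9]+cm/)

// Manually drop the trailing "cm" (2 characters) to leave just the digits
let heightText = found[0].slice(0, -2)

console.log(heightText)
// "183"
```

It works, but the clean-up code has to know how long the unit is, which is exactly the kind of detail we'd rather leave to the expression.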

Capturing groups let us return specific parts of the expression match - like the number value in this case. You can recognise them because they are (enclosed in brackets).

What capturing groups look like

Let's re-write the expression to include a capturing group:

let regex = /([0-9]+)cm/

This is the same expression, but we've enclosed the section containing the number ([0-9]+) inside the capturing group.

Crucially, the cm text is outside of the group.

This expression will still match our height string, but it will also extract the number value.

Let's see how we'd use it...

Capturing groups with match()

You'll remember from the last lesson that the match() function returns an array.

If we use the /g flag in the expression, that array contains all of our matches.

But if we don't use the /g modifier - which indicates that we only want the first match - then the behaviour is slightly different.

We'll still get an array back, but this time the first element will be the whole match, and the remaining items in the array will be the individual 'capture groups'.

Here's the result of using match() with our expression above:

"Bob is 183cm tall".match(/([0-9]+)cm/)

// [ "183cm", "183" ]

The first item in the resulting array is the match for the whole expression. In our case that's the height including the 'cm' text.. no surprises there.

The next item in the array is the interesting bit.

We have one capture group, so there is one extra item in the array. That's the result of our capture group - "183".
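To make the difference between the two modes concrete, here is a small side-by-side sketch. The sample sentence and the variable names are made up for this example:

```javascript
let text = "Bob is 183cm tall and Alice is 170cm tall"

// No /g flag: first match only, followed by its capture groups
let firstOnly = text.match(/([0-9]+)cm/)
console.log(firstOnly[0]) // "183cm"
console.log(firstOnly[1]) // "183"

// With /g: every whole match, but no capture groups
let everyMatch = text.match(/([0-9]+)cm/g)
console.log(everyMatch) // [ "183cm", "170cm" ]
```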

This is even more powerful when we use more than one capture group in an expression.

This example extracts the height and width of an image from a string:

Try it out!

let sourceString = "Image size: 800x600px"

// ([0-9]+)  Group 1: The numbers 0-9 one-or-more times
// x         followed by the text 'x'
// ([0-9]+)  Group 2: The numbers 0-9 one-or-more times
// px        followed by the text 'px'
let regex = /([0-9]+)x([0-9]+)px/

sourceString.match(regex)
// [ "800x600px", "800", "600" ]
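One thing to watch out for: match() returns null when nothing matches, so reading an index straight off the result can fail. Here is one possible way to guard against that. parseSize is a hypothetical helper written for this example, not something built in:

```javascript
// Hypothetical helper: returns the two numbers, or null if the text has no size
function parseSize(text) {
    let result = text.match(/([0-9]+)x([0-9]+)px/)

    // match() returns null when nothing matches, so check before reading groups
    if (result === null) {
        return null
    }

    // Skip the whole match at index 0, then convert each captured string to a number
    let [, first, second] = result
    return [Number(first), Number(second)]
}

console.log(parseSize("Image size: 800x600px")) // [ 800, 600 ]
console.log(parseSize("Image size: unknown"))   // null
```

Capture groups always come back as strings, which is why the sketch converts them with Number() before returning.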

Named Groups

This is already a really powerful tool, but we can do even better.

It's possible to name your capture groups, so we can reference them directly later. This avoids having to grab values out of the array, and makes your code much more explicit.

You 'name' a capture group by adding a ?<name> at the start of the group.

To name our first group 'height', ([0-9]+) becomes (?<height>[0-9]+),

and to name our second group 'width', ([0-9]+) becomes (?<width>[0-9]+)

.. and our whole expression becomes:

let regex = /(?<height>[0-9]+)x(?<width>[0-9]+)px/

The match() function will now return a .groups object containing our named capture groups.. so we can do cool stuff like this:

let regex = /(?<height>[0-9]+)x(?<width>[0-9]+)px/
let result = "Image size: 800x600px".match(regex)

console.log(result.groups.height)
// "800"

console.log(result.groups.width)
// "600"
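Because .groups is a plain object, we can also unpack it with destructuring. The sketch below reuses the expression from this section and shows that the same values are still available by position:

```javascript
let regex = /(?<height>[0-9]+)x(?<width>[0-9]+)px/
let result = "Image size: 800x600px".match(regex)

// Pull both named groups out in one step
let { height, width } = result.groups

// Captured values are always strings, so convert before doing arithmetic
console.log(Number(height) * Number(width)) // 480000

// Named groups are still available by position too
console.log(result[1]) // "800"
```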

Using matchAll()

This is all fine, but what if we do want a global match with the /g modifier?

For instance, what if we wanted to grab the heights and widths from a string containing multiple image sizes:

let images = `Image 1: 800x600px
Image 2: 600x400px
Image 3: 1024x768px`

Because we want all of the matches, we have to use the /g modifier. But, we know that the match() function, when combined with the global modifier, returns an array containing just the matches - no capturing groups.

So, how do we get multiple matches, and capturing groups?

We use matchAll(), that's how!

This function returns an 'iterator'. That means we can loop over the results, and get the full text and capturing groups for each match:

let images = `Image 1: 800x600px
Image 2: 600x400px
Image 3: 1024x768px`

let results = images.matchAll(/(?<height>[0-9]+)x(?<width>[0-9]+)px/g)

for (const match of results) {
    console.log(`Height is ${match.groups.height}, width is ${match.groups.width}`)
}

// "Height is 800, width is 600"
// "Height is 600, width is 400"
// "Height is 1024, width is 768"

(If you're not familiar with what's going on in the console.log, check out the Template Literals section here.)

matchAll() only works with global expressions. If you pass a regular expression without the /g modifier, it will throw a TypeError.
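Here is a short sketch of both points. It spreads the iterator into an ordinary array so we can use array methods like map(), then shows what happens without the /g modifier. The variable names (matches, heights, errorName) are just for illustration:

```javascript
let images = `Image 1: 800x600px
Image 2: 600x400px
Image 3: 1024x768px`

// Spread the iterator into an array so we can use map()
let matches = [...images.matchAll(/(?<height>[0-9]+)x(?<width>[0-9]+)px/g)]
let heights = matches.map(match => Number(match.groups.height))
console.log(heights) // [ 800, 600, 1024 ]

// Without the /g modifier, matchAll() throws a TypeError
let errorName = null
try {
    images.matchAll(/([0-9]+)x([0-9]+)px/)
} catch (error) {
    errorName = error.name
}
console.log(errorName) // "TypeError"
```

Copying the results into an array is also handy because an iterator can only be looped over once.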

Non-capturing groups

The ability to group things in a regular expression is very useful. Sometimes we want to use groups, but we don't care about getting the result of the group back.

Let's say we wanted to get the color from either of these strings:

"The cat is orange"
"The fish is red"

We could use a group to select either 'cat' or 'fish' (you'll remember from Lesson 6 that | means 'or' in regular expressions).

// "The "        the string 'The '
// (cat|fish)    Group 1: either 'cat' OR 'fish'
// " is "        then the string ' is '
// (\w+)         Group 2: any word character, one-or-more times
let regex = /The (cat|fish) is (\w+)/

"The cat is orange".match(regex)

// [ "The cat is orange", "cat", "orange" ]

That's great, but we don't care about the animal, only the color.

Capturing groups also have a cost: they are a little bit more work for the regular expression engine.

We can prevent a group from being 'captured' - or returned in the result - by adding ?: at the beginning of it.

This is called a 'non-capturing group':

// Note the ?: in the first group
let regex = /The (?:cat|fish) is (\w+)/

"The cat is orange".match(regex)

// The animal is no longer returned:
// [ "The cat is orange", "orange" ]
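Non-capturing groups mix happily with named groups. As a sketch, here is the same idea with the color captured under a name (the group name 'color' and the sentences array are our own choices for this example):

```javascript
// (?:cat|fish) groups the alternatives without capturing them
// (?<color>\w+) captures the color under a name
let regex = /The (?:cat|fish) is (?<color>\w+)/

let sentences = ["The cat is orange", "The fish is red"]

for (const sentence of sentences) {
    let result = sentence.match(regex)
    console.log(result.groups.color)
}

// "orange"
// "red"
```

We still need the brackets around cat|fish, even though we don't want the value. Without them, the | would split the whole expression in two, giving 'The cat' OR 'fish is ...'.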

Mini Game

Extract the file names from the terminal

We need an expression that finds just the file name (not the extension) of each item in the terminal window.

The code is fine, but the expression isn't working correctly. Can you fix it? 👇

\git\myrepo\readme.md
\git\myrepo\favicon.icon
\docs\regex_help.png
\folder.1\proposal_v3.docx
\folder.2\package.json

Expression:

let results = files.matchAll(expression)

for (const match of results) {
    console.log(match.groups.filename)
}


    Ok, so the difficulty is really ramping up now!

    • The first group seems about right. It's matching any word character (\w) one-or-more-times (+)
    • The second group is matching any character (.), one-or-more-times (+)
      That might be why it's accidentally matching those folders that contain a .
      Can we change that group to only match word characters instead?

    • We might also want to think about anchoring the expression to the end of each line (remember, it's a multi-line string!)