Lesson 9

More Assertions

Introduction

That last lesson was heavy! Don't worry, this one is a lot shorter.

It covers a really easy way of 'looking around' the input string to find your match.

Lookahead

A 'lookahead' lets us match something only when it is immediately followed by something else in the string.

Crucially, the lookahead does not capture the text that follows; it only checks that the text is there.

A lookahead is written in brackets like a group, but it starts with a ?=, e.g. (?=hello).

This expression matches one-or-more word characters (\w+) that are immediately followed by the text "ing" ((?=ing)):

Try it out!

My hobbies are eating, drinking and scrolling Twitter

Expression: \w+(?=ing)

Notice how the 'ing' text is not captured; it's only used to match the characters before it. The matches are "eat", "drink" and "scroll".
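If you want to try the same lookahead away from this page, here is a minimal sketch using Python's built-in re module. Python is an assumption on our part (the lesson is not tied to any language), and the variable names are just for illustration.

```python
import re

text = "My hobbies are eating, drinking and scrolling Twitter"

# \w+ grabs word characters, then backtracks until "ing" comes next.
# The lookahead (?=ing) checks for "ing" without adding it to the match.
stems = re.findall(r"\w+(?=ing)", text)
print(stems)  # ['eat', 'drink', 'scroll']
```

The "ing" is still in the string after each match; the lookahead only peeked at it.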

Negative lookahead

Starting the lookahead with ?! makes it a 'negative lookahead'. That means it will only match when NOT immediately followed by the lookahead value.

This expression looks for two-or-more digits ([0-9]{2,}) that are NOT followed by the text "kg" ((?!kg)):

Try it out!

I am 183cm tall and weigh 75kg

Expression: [0-9]{2,}(?!kg)

Notice that the "75" from "75kg" is not matched.
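The same check can be sketched in Python's re module (again an assumption, with illustrative variable names). The second pattern is not from the lesson; it is included only to show why the "two-or-more" part matters.

```python
import re

text = "I am 183cm tall and weigh 75kg"

# Two-or-more digits that are NOT immediately followed by "kg".
not_kg = re.findall(r"[0-9]{2,}(?!kg)", text)
print(not_kg)  # ['183']

# With [0-9]+ instead, the engine backtracks to "7", which is followed
# by "5kg" rather than "kg", so part of the weight slips through.
loose = re.findall(r"[0-9]+(?!kg)", text)
print(loose)  # ['183', '7']
```

A negative lookahead only tests the characters directly after the match, so it is worth checking what a shorter match would be followed by.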

Lookbehind

As you have probably guessed, this is the opposite of a 'lookahead'.

A 'lookbehind' matches something only when it is immediately preceded by something else in the string.

You can recognise a lookbehind by the ?<= at the start (the arrow is pointing backwards, indicating a lookbehind).

This expression matches one-or-more digits ([0-9]+), but only when they are immediately preceded by a $ sign ((?<=\$)).

(Note that we need to escape the $ sign with a \. The $ is 'reserved' in regular expressions as the end-of-input anchor, so a literal dollar sign is written \$.)

Try it out!

It's 100 years old, and is worth between $40 and $80

Expression: (?<=\$)[0-9]+

The number "100" is not matched, because it doesn't have a dollar sign before it.
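Here is a small sketch of the lookbehind in Python's re module (our choice of language, not the lesson's). One thing to be aware of: Python's re only accepts lookbehinds of a fixed width, which a single \$ satisfies.

```python
import re

text = "It's 100 years old, and is worth between $40 and $80"

# (?<=\$) requires a literal "$" directly before the digits,
# but the "$" itself is not part of the match.
prices = re.findall(r"(?<=\$)[0-9]+", text)
print(prices)  # ['40', '80']
```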

Negative lookbehind

Lookbehinds also have a 'negative' version, indicated by a ?<! at the start.

This expression looks for two-or-more digits ([0-9]{2,}), but only when they are not preceded by the $ sign ((?<!\$)):

Try it out!

It's 100 years old, and is worth between $40 and $80

Expression: (?<!\$)[0-9]{2,}
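A quick sketch of the negative lookbehind in Python's re module (assumed language, illustrative names). The second pattern is our own addition, there only to show what the "two-or-more" requirement protects against.

```python
import re

text = "It's 100 years old, and is worth between $40 and $80"

# Two-or-more digits that are NOT directly preceded by "$".
no_dollar = re.findall(r"(?<!\$)[0-9]{2,}", text)
print(no_dollar)  # ['100']

# With [0-9]+ instead, the "0" in "$40" and "$80" also matches,
# because it is preceded by a digit rather than by "$".
loose = re.findall(r"(?<!\$)[0-9]+", text)
print(loose)  # ['100', '0', '0']
```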

Mini Game

Your goal is to extract the domain names from URLs ending in 'robots.txt'.

We only want the domain names though, and only when the URL uses HTTPS.

Your results are on the left. The correct answer is on the right. Just make them match - simple! 😬

Your matches:

https://www.microsoft.com/home

https://www.apple.com/robots.txt

http://www.yahoo.com/robots.txt

https://www.bbc.co.uk/robots.txt

https://www.msn.com/robots.txt

https://www.dev.to/robots.png

https://www.bbc.co.uk/news

Target:

https://www.microsoft.com/home

https://www.apple.com/robots.txt

http://www.yahoo.com/robots.txt

https://www.bbc.co.uk/robots.txt

https://www.msn.com/robots.txt

https://www.dev.to/robots.png

https://www.bbc.co.uk/news


Your expression:

  • Don't be alarmed by all of the \/ stuff.
    '/' characters are reserved, and have to be escaped with a '\'.
    So, \/ simply matches a / character in the string.
  • The results must end with /robots.txt, but not return it. Sounds like we need a lookahead.
  • They must also start with https://, but we don't want to return that. Perfect for a lookbehind!
  • Finally, get rid of that pesky wildcard at the end, so we only match .txt files.
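
Spoiler warning! The hints above can be combined into one expression. The sketch below is only one possible answer, written with Python's re module (an assumption; the game's own starting expression is not reproduced here). The [^\/]+ part and the trailing $ anchor are our choices, not necessarily what the game expects.

```python
import re

urls = [
    "https://www.microsoft.com/home",
    "https://www.apple.com/robots.txt",
    "http://www.yahoo.com/robots.txt",
    "https://www.bbc.co.uk/robots.txt",
    "https://www.msn.com/robots.txt",
    "https://www.dev.to/robots.png",
    "https://www.bbc.co.uk/news",
]

# (?<=https:\/\/)     lookbehind: must come right after "https://"
# [^\/]+              the domain: everything up to the next "/"
# (?=\/robots\.txt$)  lookahead: must be followed by "/robots.txt" at the end
pattern = re.compile(r"(?<=https:\/\/)[^\/]+(?=\/robots\.txt$)")

domains = []
for url in urls:
    found = pattern.search(url)
    if found:
        domains.append(found.group())

print(domains)  # ['www.apple.com', 'www.bbc.co.uk', 'www.msn.com']
```

Neither the "https://" prefix nor the "/robots.txt" suffix appears in the results, because both sit inside lookaround assertions.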