Lesson 4

Setting Boundaries

Introduction

In Lesson 3 we learnt about 'metacharacters': characters that have a special meaning in the expression.

'Boundaries' and 'Anchors' are metacharacters that match a position in the string, such as the start or end of a word or line, rather than a character.

They are incredibly useful and you'll find yourself using them a lot.

Let's take a look...

Word Boundaries

'Word boundaries' occur whenever the text changes from a non-word character to a word character, or vice-versa. Typically, you would use this to find the start or end of words.

These are represented in a regular expression with the \b metacharacter.

The expression below finds the letter a, but only when it is the first character in the word. Remove the word boundary and see what happens:

Try it out!

Life is either a daring adventure or nothing at all

- Helen Keller

Expression: \ba

In the expression above, only the a is selected. The 'word boundary' metacharacter itself doesn't actually select any characters; it just tells the regular expression engine where to look for matches.
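For readers working outside the interactive editor, the same experiment can be sketched in Python. This is only an illustration: it assumes Python's built-in re module behaves like the lesson's engine for \b, and the variable names are made up for this example.

```python
import re

text = "Life is either a daring adventure or nothing at all"

# \ba: a word boundary followed by the letter a
with_boundary = [m.start() for m in re.finditer(r"\ba", text)]

# Plain a: every letter a, wherever it sits in a word
without_boundary = [m.start() for m in re.finditer(r"a", text)]

# The boundary is zero-width, so each match is just the single character a
matched_text = [m.group() for m in re.finditer(r"\ba", text)]

print(with_boundary)     # positions of a at the start of a word
print(without_boundary)  # also includes the a inside 'daring'
print(matched_text)
```

Removing the boundary adds exactly one extra match here: the a in the middle of 'daring'.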

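Word boundaries can be confusing though, as they are also found inside a word that contains special characters. Only letters, digits and the underscore count as word characters, so an apostrophe or hyphen in the middle of a word creates a boundary on either side of it.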

The word boundaries in the text below are marked in yellow. Try changing the string, adding special characters etc., to get a feel for what a 'word boundary' actually is:

Try it out!

Text:>
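To explore this without the highlighter, a small script can list every position where a word boundary is found. This is a rough sketch using Python's re module; the helper name boundary_positions and the sample strings are invented for illustration and are not part of the lesson.

```python
import re

def boundary_positions(text):
    # Return every index at which the zero-width \b metacharacter matches
    return [m.start() for m in re.finditer(r"\b", text)]

# A plain word has a boundary at each end only
plain = boundary_positions("dont")

# The apostrophe is a non-word character, so it adds a boundary on each side
with_apostrophe = boundary_positions("don't")

print(plain)
print(with_apostrophe)
```

Each number is a position between characters, counted from 0 at the very start of the string.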

Anchors

We use anchors to 'pin' an expression to the start or end of a line.

Start of a line

The ^ character represents the start of a line.

This expression matches the letter o, only when it occurs at the start of the line (^).

Try removing the anchor character to see what we mean:

Try it out!

One, two, three, four, five

Once I caught a fish alive

Expression: ^o (with the i modifier)

We can describe this expression as being 'anchored' to the start of the line.

This expression uses the i modifier from Lesson 2, which makes the expression case-insensitive: it will match both upper- and lower-case o characters.
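The effect of the anchor can also be checked in a short Python sketch. Two assumptions are made here: the two lines of text are joined with a newline, and Python's re.MULTILINE flag is switched on so that ^ matches at the start of every line (by default Python's ^ only matches at the start of the whole string). re.IGNORECASE plays the role of the i modifier.

```python
import re

text = "One, two, three, four, five\nOnce I caught a fish alive"

# Anchored: the letter o (any case) only at the start of a line
anchored = re.findall(r"^o", text, flags=re.IGNORECASE | re.MULTILINE)

# Unanchored: every letter o (any case) anywhere in the text
unanchored = re.findall(r"o", text, flags=re.IGNORECASE)

print(anchored)
print(unanchored)
```

With the anchor, only the O of 'One' and the O of 'Once' are matched; without it, the o characters in 'two' and 'four' are matched as well.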

End of a line

The $ character anchors the expression to the end of the line.

This expression matches any character (.) that is immediately followed by the end of the line ($):

Try it out!

One, two, three, four, five

Once I caught a fish alive

Expression: .$

We say that this expression is 'anchored' to the end of the line.
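As a hedged illustration in Python, the end-of-line anchor can be tried on the same two lines. The sketch assumes the lines are separated by a newline and uses re.MULTILINE so that $ matches at the end of every line; without that flag, Python's $ only matches at the end of the whole string.

```python
import re

text = "One, two, three, four, five\nOnce I caught a fish alive"

# .$ with MULTILINE: the last character of each line
per_line = re.findall(r".$", text, flags=re.MULTILINE)

# .$ without MULTILINE: only the last character of the whole text
whole_text = re.findall(r".$", text)

print(per_line)
print(whole_text)
```

Both lines happen to end in the letter e, so the first result contains two matches and the second contains one.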

Mini-Game

Sentiment Analysis!

Select ONLY the tweets that contain the word 'bad'. Matching any single word in a tweet selects the whole tweet.

Select these:

Trolly McTrollFace

@troll3545

Hey @baseclass, how can you be this BAD at stuff!?

Grumpy Customer

@grumpy1654

The @baseclass app just crashed on me, this app is so bad it hurts!

But DON'T select these:

Average Joe

@joe7978

Just earnt the 'super contributor' badge in the @baseclass app!!

Happy Customer

@notabot56

I can berely contain my excitement about the @baseclass app.


Your expression:

  • We need to capture both upper and lower case 'bad's. You'll need a modifier from a previous lesson for this.
  • To avoid accidentally matching tweets with words that contain the text 'bad' (e.g. 'badge'), think about adding word boundaries.
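If you want to check an answer outside the game, the sketch below tests one possible expression that follows both hints (spoiler ahead). It is written in Python as an assumption about how you might run it: re.IGNORECASE stands in for the case-insensitive modifier, and the list names are invented for this example.

```python
import re

# One possible answer: the whole word 'bad', in any letter case
pattern = re.compile(r"\bbad\b", re.IGNORECASE)

should_select = [
    "Hey @baseclass, how can you be this BAD at stuff!?",
    "The @baseclass app just crashed on me, this app is so bad it hurts!",
]
should_not_select = [
    "Just earnt the 'super contributor' badge in the @baseclass app!!",
    "I can berely contain my excitement about the @baseclass app.",
]

# A tweet is selected when the pattern matches anywhere inside it
selected = [t for t in should_select + should_not_select if pattern.search(t)]

# For comparison: without word boundaries, 'badge' is matched too
no_boundaries = re.compile(r"bad", re.IGNORECASE)
too_many = [t for t in should_select + should_not_select if no_boundaries.search(t)]

print(len(selected))
print(len(too_many))
```

The version without word boundaries picks up the 'badge' tweet as well, which is exactly the mistake the second hint warns about.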