Created: 10/12/2022

Understanding string.replace in JavaScript

Finding and replacing with capture groups is super useful. I do it all the time in vim using sed syntax. Here's an example!

Scenario

Oh no, we're migrating to a new email provider, and they don't support the +custom thing (more on the email + trick)

Input

[email protected]
[email protected]

Vim / Sed Regex; horribly cryptic, as usual:

%s/\(\w\+\)\(+\w\+\)\?@\(.*\)/\[email protected]

Output:

[email protected]
[email protected]

This is all well and good. Capture groups let us capture the base email separately from the +custom bit, then leave that bit out of the replace expression, so the whole transformation happens in one go.

How can you embrace this magic in JavaScript, though? It seems like something that would be useful for string manipulation in our applications!

Using Capture Groups for Substitution in JavaScript

The gist of the API in JavaScript is that you can use $1, $2, etc. to refer to captured groups, just like you can use \1, \2, etc. in vim or sed.

Although this makes sense by the end, I was really thrown for a loop by this behavior.

'input string'.replace(/(input)/, '$1') // 'input string'

JavaScript, what is happening???

I expected this usage to cause some change to the output, but the output was unaffected.
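One way to convince yourself that a replacement really did happen is to decorate the match instead of echoing it back. This is only a sketch: the square brackets and the second pattern, /in(put)/, are my own additions for illustration. $& is the standard replacement token for the entire matched text.

```javascript
// Replacing the match with itself changes nothing...
const unchanged = 'input string'.replace(/(input)/, '$1');

// ...but wrapping it in brackets shows which region was replaced.
const bracketed = 'input string'.replace(/(input)/, '[$1]');

// With a smaller group, $1 is only part of the match; $& is all of it.
const onlyGroup = 'input string'.replace(/in(put)/, '$1');
const wholeMatch = 'input string'.replace(/in(put)/, '[$&]');

console.log(unchanged);  // 'input string'
console.log(bracketed);  // '[input] string'
console.log(onlyGroup);  // 'put string'
console.log(wholeMatch); // '[input] string'
```

In every case ' string' survives untouched, because the pattern never matched it.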

Partial Matches Partially Replace

It turns out that with a partial match, find-and-replace only affects the text that matched. So we cannot touch the ' string' substring at all with the pattern shown above, since the regular expression never matched it. In practice, that means the pattern has to match everything we want to transform before we can do anything useful.

'input string'.replace(/(input) (string)/, '$1') // 'input'

We were able to remove ' string' from the output because the pattern now matches the whole input, and the replacement keeps only $1 (input), dropping everything else that matched.

We can use this pattern to do something sort of practical, actually.

const names = ["John Smith", "Mary Jane", "Tim Peters"]
names.map((name) => name.replace(/(\w+) (\w+)/, '$2, $1'))
// => [ 'Smith, John', 'Jane, Mary', 'Peters, Tim' ]

The key here is that the whole string matches the whole regex. If the string doesn't match the regex at all, it doesn't matter what we put in the replace expression: the original input is passed through unaffected.

"can't touch this".replace(/no match/, '$1')  
// => "can't touch this"

"can't touch this".replace(/no match/, '$2')
// => "can't touch this"

"can't touch this".replace(/no match/, 'hello??')  
// => you guessed it; still "can't touch this"
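Because a failed match is silent, it can be handy to check for one explicitly. Here's a minimal sketch, assuming you'd rather get null than the untouched input. replaceOrNull is a hypothetical helper of mine, not a built-in; it leans on String.prototype.search, which returns -1 when nothing matches.

```javascript
// Hypothetical helper: returns null instead of silently passing the input through.
function replaceOrNull(input, pattern, replacement) {
  // search() gives the index of the first match, or -1 if there is none.
  if (input.search(pattern) === -1) {
    return null;
  }
  return input.replace(pattern, replacement);
}

const swapped = replaceOrNull('John Smith', /(\w+) (\w+)/, '$2, $1');
const missed = replaceOrNull("can't touch this", /no match/, 'hello??');

console.log(swapped); // 'Smith, John'
console.log(missed);  // null
```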

The weird thing is that for a partial match, the bit that doesn't match gets passed through, while you can transform the part that does match. The next example highlights that weirdness. First, let's break down the regex I'm going to use:

/(\w+)@(\w+)\.(\w{3})/

In plain language, this is:

  • A group of one or more word characters: letters, digits, or underscores ($1)
  • An "@"
  • A group of one or more word characters ($2)
  • A "."
  • A group of exactly 3 word characters ($3)

As you might have started to recognize, this is an email regex. But there's a problem: it supports top-level domains with three characters (".com"), but not ones with two characters (".co"), which do exist! Therefore, we see the following:

// this is the same regex as before; don't bother reading it
const reg = /(\w+)@(\w+)\.(\w{3})/

reg.test('[email protected]') // true
reg.test('[email protected]')  // false
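If you did want the pattern to accept shorter top-level domains, one possible tweak is to loosen the last quantifier. This is a sketch rather than a real email validator: looserReg and the example.* addresses are placeholders I made up, and {2,} simply means "two or more".

```javascript
// Assumed variation on the pattern above: {2,} accepts two or more word characters.
const looserReg = /(\w+)@(\w+)\.(\w{2,})/;

// Placeholder addresses, not the ones from the article.
const addresses = ['user@example.com', 'user@example.co', 'user@example.c'];
const results = addresses.map((address) => looserReg.test(address));

console.log(results); // [ true, true, false ]
```

The rest of this post sticks with the stricter \w{3} version.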

Bringing It All Together

With that in mind, let's look at examples that combine everything we've learned!

const reg = /(\w+)@(\w+)\.(\w{3})/g // same pattern, now with the g (global) flag

// this is expected behavior; the regex matches exactly
'[email protected]'.replace(reg, '$1') // 'email' (as we'd expect)
'[email protected]'.replace(reg, '$2') // 'domain' (as we'd expect)
'[email protected]'.replace(reg, '$3') // 'com' (as we'd expect)

// when the regex doesn't match, the input is passed through! Just like the
// "can't touch this" example from before
'[email protected]'.replace(reg, '$1')     // '[email protected]'
'[email protected]'.replace(reg, '$2')     // '[email protected]'
'[email protected]'.replace(reg, 'ahhhh')  // '[email protected]'

Let's look at some partial matches now!

const reg = /(\w+)@(\w+)\.(\w{3})/g // same pattern, with the g (global) flag

const falseInformation = 'My email domain is [email protected]'
// uh, no, the domain is just "domain.com"; let's fix that with our regex!

const fixed = falseInformation.replace(reg, '$2.$3')
fixed === 'My email domain is domain.com' // true

let challenge = `
  For each of these emails, change the top-level domain from ".com" to ".gov"

  - [email protected]
  - [email protected]
  - [email protected]
  - [email protected]
`;

// reg has the g (global) flag, so a single call rewrites every match in the
// string; without g, only the first match would be replaced
challenge = challenge.replace(reg, '$1@$2.gov')
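The g flag on reg decides how many matches a single replace call touches. Here's a small side-by-side sketch; the list string and its addresses are placeholders of mine, not the ones from the challenge.

```javascript
// Placeholder input with three matches.
const list = 'a@one.com, b@two.com, c@three.com';

// Without g: only the first match is replaced.
const firstOnly = list.replace(/(\w+)@(\w+)\.(\w{3})/, '$1@$2.gov');

// With g: every match is replaced in one call.
const all = list.replace(/(\w+)@(\w+)\.(\w{3})/g, '$1@$2.gov');

console.log(firstOnly); // 'a@one.gov, b@two.com, c@three.com'
console.log(all);       // 'a@one.gov, b@two.gov, c@three.gov'
```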

Summary

So, basically, if we have an exact match, the behavior is pretty unsurprising. If we don't have a match, the whole string just gets passed through. If we have a partial match, JavaScript performs the transformation on the bit that matched, but doesn't touch the rest.
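If you ever want to see exactly which bit matched, replace also accepts a function in place of the replacement string. It receives the whole match, then each captured group, then the offset of the match in the input. The sketch below uses a made-up sentence and address, and the seen array is only there for inspection.

```javascript
const seen = [];

const result = 'My email is user@example.com, ok?'.replace(
  /(\w+)@(\w+)\.(\w{3})/,
  // Arguments: whole match, group 1, group 2, group 3, offset of the match.
  (match, name, domain, tld, offset) => {
    seen.push({ match, offset });
    return `${domain}.${tld}`;
  }
);

console.log(seen);   // [ { match: 'user@example.com', offset: 12 } ]
console.log(result); // 'My email is example.com, ok?'
```

Everything before offset 12 and everything after the match is passed through unchanged.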

Extra Examples

// halloween substitution 👻
'hello world'.replace(/(hell)(o)/, '$1') // 'hell world'

And $2 captures the o, while ' world' is still passed through because it was never matched:

'hello world'.replace(/(hell)(o)/, '$2') // 'o world'
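To close the loop on the opening scenario, here is roughly how the vim / sed substitution might look in JavaScript. Treat it as a sketch under assumptions: the replacement part of the original command is cut off in the text above, so '$1@$3' is my guess at "keep the base name and the domain, drop the +custom group". stripPlus and the example.com addresses are hypothetical.

```javascript
// Group 1: base name, group 2: optional "+custom" suffix, group 3: everything after "@".
const stripPlus = (address) =>
  address.replace(/(\w+)(\+\w+)?@(.*)/, '$1@$3');

const withTag = stripPlus('user+custom@example.com');
const withoutTag = stripPlus('user@example.com');

console.log(withTag);    // 'user@example.com'
console.log(withoutTag); // 'user@example.com'
```

Unlike the sed version, the + has to be escaped here, since an unescaped + is a quantifier in JavaScript regex syntax.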

Other Resources

See Also: Modifier Flags

Another important topic for understanding regex in JavaScript is regex modifier flags, which I've mostly glossed over to keep it simple here (apart from using g above), but MDN documents them very well and I encourage you to experiment with them!