Possibly the most malicious Regular Expression

This is the next of my episode on regular expressions. Today, we look at the worst regex you can possibly come up with, although it looks innocent and simple. You will learn about this backtracking trap that let’s you easily wait for 10^30 steps, as an example of an errant email regex will illustrate. One possible solution we investigate is the use of possessive quantifiers.

Well, some of you might think: All regular expressions are malicious.

If you are one of them: You have no idea how bad they can become!

Greedily Looking for ‘aa‘

Have a look at this short example:

public static void main (String[] args) {
final String pattern = "(aa|aab?)*";
// two 'a'
final String a002 = "aa";
// three 'a'
final String a003 = "aaa";
// 102 'a'
final String a102 = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
// 103 'a'
final String a103 = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
System.out.println(a002.matches(pattern)) ;
System.out.println(a003.matches(pattern)) ;
System.out.println(a102.matches(pattern)) ;
System.out.println(a103.matches(pattern)) ;

What does it print?

  1. true false true false
  2. Throws an exception
  3. Does not compile
  4. None of the above

The regular expression “(aa|aab?)*” matches all strings that are made up of “aa” or “aab“. I.e., it matches for even numbers of ‘a’ and does not match for odd number of ‘a’. (I know. The same can be done with a simple “(aab?)*” but just follow me.)

So in theory it should print “true false true false” (answer 1 above). But actually, it prints “true false true ” (answer 4). The final “false” will come some billion years later if you are patient and your computer does not want you to reboot in the meantime.

It’s not a bug, It’s a feature

We got caught in a backtracking trap. Let’s have a look at a shorter example: “aaaaa“. The following table shows what the java regular engine does to see if it matches to our pattern.

Step Group 1 Group 2 Group 3 Result
1 (aa) (aa) (aa) fail
2 (aa) (aa) (aab?) fail
3 (aa) (aab?) (aa) fail
4 (aa) (aab?) (aab?) fail
5 (aab?) (aa) (aa) fail
6 (aab?) (aa) (aab?) fail
7 (aab?) (aab?) (aa) fail
8 (aab?) (aab?) (aab?) fail


  • Step 1: The engine finds two “aa“, tries to find a third “aa” and fails.
  • Step 2: The engine finds two “aa“, tries to find a “aab?” next and fails.
  • Step 3: The engine starts backtracking and replaces the second “aa” with “aab” and fails again for both “aa” …
  • Step 4: and “aab?“.
  • And so on.

The result is: “aaaaa” does not match “(aa|aab?)+“.

In our example with 103 ‘a’ the engine will execute 2^100 = 10^30 steps, that is 1000 billion billion billion. The regex “(aa|aab?)*” looks so innocent and simple but is probably the worst regex I ever saw.

But backtracking is useful. If you match “<a><b></b></a>” to the regex “<(.*)>(.*)<1>” you will find a match of opening and closing tags. I think there is no regular expression for a regex engine without backtracking that can do this match. Please comment on this!

Switching to Possessive Quantifiers

What will change if we use a slightly different regex? We change the quantifier from a greedy “*” to a possessive “*+“.

[code language=”java” light=”true”] final String pattern = “(aa|aab?)*+”;

What does it print?

  1. true false true false
  2. Throws an exception
  3. Does not compile
  4. None of the above

Since we use a possessive quantifier the regular expression engine will not do backtracking. So the process stops after step 2 and the “aa” in ‘Group 2’ won’t be replaced for “aab?“. The first result is printed at once.

A Case Study with Email Addresses

Let’s take the example of email address. Let’s look at two similar regular expressions for it:



Do you spot the difference? The first expression is as evil as “(aa|aab?)*” and gets stuck if you try to match e.g. "aaaaaaaaaaaaaaaaaaaaa!" while the second is fine. I.e. if you use the first regex to validate email addresses in your Java code your thread gets stuck for a few million years and your system is vulnerable to a “regular expression denial of service” attack. In principle all regular expression engines that allow backtracking will have the same problem. Imagine what will happen to your database system if there are a few thousand invalid email addresses to be stored and you use an malicious regex in the DB!

There is no simple way to recognize regular expressions which fall into the evil category. But here are some hints:

  • Be careful with options that imply each other: “aa” vs. “aab?“.
  • Try to avoid cascading quantifiers like in “(a+)+“.
  • Try to be possessive like in “(aa|aab?)*+“.

Have a look at the Wikipedia article on Redos for more information and additional examples for malicious regular expressions.

I’m curious to hear your “war” stories: Did you actually have this problem in your systems on production? What did you do to find it and fix it?

Series Navigation<< Consequences when using Mutable Fields in hashCode()