Also available at my website http://tosh.me/ and on Twitter @toshafanasiev

Tuesday, 28 June 2011

Anatomy of a Password Strength Regex

So you want to enforce a certain level of complexity for your users' passwords? You could do this in code, with lots of length and character-presence tests chained together with logical ands, or you could express the desired complexity with a single succinct, declarative regular expression.

Ok, let's say you want to ensure that passwords are between 8 and 20 characters long and contain a mix of lower case, upper case, numeric and non-alphanumeric characters. (Normally you'd not want to restrict the length of the password as longer == stronger, but it makes for a more comprehensive example.)
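For comparison, here is a rough sketch of the chained-tests approach mentioned above, written in Python. The function name `is_strong_chained` is made up for this illustration, and "non-alphanumeric" is interpreted here as "not `str.isalnum()`", which is an assumption rather than part of the requirements as stated.

```python
def is_strong_chained(password: str) -> bool:
    """Hypothetical helper: every rule is a separate test, joined with 'and'."""
    return (
        8 <= len(password) <= 20                         # length between 8 and 20
        and any(c.islower() for c in password)           # at least one lower case letter
        and any(c.isupper() for c in password)           # at least one upper case letter
        and any(c.isdigit() for c in password)           # at least one digit
        and any(not c.isalnum() for c in password)       # at least one non-alphanumeric
    )


chained_good = is_strong_chained("Passw0rd!")   # satisfies every rule
chained_bad = is_strong_chained("password")     # no upper case, digit or special
```

It works, but every new rule adds another clause; the rest of this post folds the same rules into one pattern.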

Starting with string length

.{8,20}

This matches a run of between 8 and 20 characters (any characters), but as it stands the maximum is not enforced: given a longer string, it would simply match the *first* 20 characters. For the maximum to work you need to specify that the whole string fits within this length, as follows:

^.{8,20}$

Note: if there is no maximum then the following would work:

^.{8,}$
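The effect of the anchors is easy to check for yourself. Below is a small sketch using Python's `re` module; the variable names are just for illustration, and other regex flavours should behave the same way for these simple patterns.

```python
import re

UNANCHORED = re.compile(r".{8,20}")
ANCHORED = re.compile(r"^.{8,20}$")
NO_MAXIMUM = re.compile(r"^.{8,}$")

too_long = "x" * 25  # five characters over the intended maximum

# Without anchors the pattern happily matches the first 20 characters.
unanchored_length = len(UNANCHORED.search(too_long).group())

# With ^ and $ the whole string has to fit, so the 25-character string is rejected.
anchored_ok = ANCHORED.search(too_long) is not None

# With no upper bound, only the minimum is enforced.
no_max_ok = NO_MAXIMUM.search(too_long) is not None
too_short_ok = NO_MAXIMUM.search("short") is not None
```

Here `unanchored_length` is 20, `anchored_ok` is False, `no_max_ok` is True and `too_short_ok` is False.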

The character requirements need care to prevent the expression from becoming unwieldy; they need to be reworded so that they can be easily expressed in a regex. For instance, 'the password must contain a digit' can be expressed as 'either the first character is a digit, or some character after the first character is a digit' or, closer to the regex still, 'there is a digit which is preceded by zero or more characters'. This might seem a little odd, but the last version is an exact semantic match for a regex construct called a Positive Lookahead.

A Positive Lookahead is a zero-width assertion that qualifies the expression it immediately follows, only allowing that expression to match if the expression contained in the assertion also matches from that point. The importance of its being zero-width is that it qualifies the expression it follows without itself consuming any characters. The best way to see this is by example.

The expression

^(?=.*\d)

will match the start of the string (^), but only if somewhere after it (?=) there is a digit (\d) preceded by zero or more characters (.*). What's important to note is that the match itself consumes only the start of the string, not any of the following characters, even though they are examined during the match. To extend this so that the match also consumes the characters containing the digit, simply add an expression for those characters after the lookahead-qualified start-of-string pattern:

^(?=.*\d).*

This would match "gimme 5", but would not match "gimme five".
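To see the zero-width behaviour in action, here is a short sketch in Python's `re` module (the variable names are illustrative only). It compares the lookahead on its own with the version that goes on to consume the characters.

```python
import re

LOOKAHEAD_ONLY = re.compile(r"^(?=.*\d)")
LOOKAHEAD_THEN_CONSUME = re.compile(r"^(?=.*\d).*")

# The lookahead on its own succeeds but consumes nothing:
# the match is the empty string at position 0.
m = LOOKAHEAD_ONLY.search("gimme 5")
zero_width_span = m.span()
zero_width_text = m.group()

# Adding .* after the lookahead consumes the whole string.
full_text = LOOKAHEAD_THEN_CONSUME.search("gimme 5").group()

# With no digit anywhere, the lookahead fails and so does the whole match.
no_digit = LOOKAHEAD_THEN_CONSUME.search("gimme five")
```

The first match spans (0, 0) and its text is empty, the second returns the whole of "gimme 5", and `no_digit` is None.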

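The lookahead expressions for character case ([a-z] for lower, [A-Z] for upper) and non-alphanumeric or 'special' characters (\W) follow a similar form. Note that \W means 'not a word character', so an underscore does not count as special, while whitespace does. Put together, they give the following completed expression: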

^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\W).{8,20}$

Or, if no maximum is required:

^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\W).{8,}$

This may look daunting if encountered in a dark alley, but once the anatomy has been explored, the logic is clear.
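As a sanity check, the completed expression can be wrapped in a small helper and run against a few sample passwords. This is a minimal sketch in Python; the function name `is_strong` and the sample passwords are invented for this example.

```python
import re

PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\W).{8,20}$")


def is_strong(password: str) -> bool:
    """Hypothetical helper: True if the password satisfies every rule."""
    return PASSWORD.search(password) is not None


results = {
    "Passw0rd!": is_strong("Passw0rd!"),      # all four classes, 9 characters
    "password1!": is_strong("password1!"),    # no upper case letter
    "PASSWORD1!": is_strong("PASSWORD1!"),    # no lower case letter
    "Password!!": is_strong("Password!!"),    # no digit
    "Passw0rd1": is_strong("Passw0rd1"),      # no special character
    "Passw0rd_": is_strong("Passw0rd_"),      # underscore is a word character, so \W fails
    "Pw0!": is_strong("Pw0!"),                # too short
    "long": is_strong("Passw0rd!" + "x" * 20),  # 29 characters, over the maximum
}
```

Only the first entry comes out True; each of the others trips exactly one of the rules.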
