
Demystifying The Regular Expressions – I

 

It doesn’t matter whether you are a ninja programmer or a novice: at some point you have to master regular expressions. So the sooner, the better.

So, let’s get started.

Regular expressions are also called regex or regexp; all three names mean the same thing. A regex is a pattern made of rules that matches a certain set of strings. For example, the three names (Regex, Regexp and Regular expression) can themselves be written as a regex like this:

 

/^Reg(exp?|ular expression)$/
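As a quick check, the pattern above can be tried with Python's built-in re module. This is only a sketch: the list of sample names is made up for illustration, and the surrounding slashes are dropped because Python has no regex literal syntax.

```python
import re

# Same pattern as above, without the /.../ delimiters.
NAME_PATTERN = re.compile(r"^Reg(exp?|ular expression)$")

# Hypothetical sample inputs, chosen for illustration only.
candidates = ["Regex", "Regexp", "Regular expression", "Regexes", "regex"]

# Keep only the names the pattern accepts.
matched = [name for name in candidates if NAME_PATTERN.search(name)]
print(matched)
```

"Regexes" is rejected because of the trailing "es", and "regex" because the pattern is case-sensitive.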

 

A more abstract definition would be: “a regex is a way to describe a set of strings”. The more pressing question is why regexes are useful, and whether you should use them at all.

 

 

We can check whether a string matches a certain format; in HTML5, for example, a regex can drive form validation through the pattern attribute. We can also extract certain parts of a string or replace them, depending on the need. Regex support is built into many programming languages, IDEs, command-line tools such as grep and sed, and databases. Regexes are also used in search engines.

 

One limitation is that regexes are not fully portable: a regex written for Ruby might not work in PHP. This is because different environments use different regex engines (PCRE, JRegex, XRegExp and others), each with its own syntax and feature set.

 

Now let’s dive into the regex basics:

 

Common Matching Symbols

Dot (.)

It matches any single character except a newline.
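A small sketch of the dot in Python's re module follows. The pattern c.t and the sample strings are assumptions made up for this illustration; re.DOTALL is the standard flag that lets the dot match a newline as well.

```python
import re

# "c.t" means: a "c", then any one character except a newline, then a "t".
pattern = re.compile(r"c.t")

# Hypothetical inputs, chosen for illustration only.
samples = ["cat", "c t", "c9t", "ct", "c\nt"]
results = {s: pattern.fullmatch(s) is not None for s in samples}
print(results)

# With re.DOTALL the dot also matches a newline.
dotall_match = re.fullmatch(r"c.t", "c\nt", flags=re.DOTALL) is not None
print(dotall_match)
```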

 

{n}, {n,m}, {n,}

To match the same symbol several times in a row, put the number of repetitions in curly braces after it. These constructs are called quantifiers. In most engines the braces must be written without spaces inside.

 

For example, the regex ca{10}t will match c, 10 consecutive a's and a t.

{n} = matches exactly ‘n’ times.

{n,} = matches at least ‘n’ times, with no upper limit. It takes as many repetitions as it can, which is known as greediness.

{n,m} = matches at least ‘n’ times, but no more than ‘m’ times; ‘m’ is the upper bound.
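The three forms can be compared side by side in Python. This is a minimal sketch: the helper name matches and the test strings are hypothetical, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

exactly = re.compile(r"ca{10}t")    # c, exactly 10 a's, t
at_least = re.compile(r"ca{2,}t")   # c, 2 or more a's, t
bounded = re.compile(r"ca{2,4}t")   # c, between 2 and 4 a's, t


def matches(pattern, text):
    """Return True when the whole text fits the pattern."""
    return pattern.fullmatch(text) is not None


print(matches(exactly, "c" + "a" * 10 + "t"))   # 10 a's
print(matches(exactly, "c" + "a" * 9 + "t"))    # 9 a's
print(matches(at_least, "c" + "a" * 50 + "t"))  # 50 a's
print(matches(bounded, "c" + "a" * 5 + "t"))    # 5 a's
```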

 

There are shortcuts for the most common quantifiers.

 

* = {0,}

* matches zero or more times, so it also succeeds with an empty match when there is nothing to repeat.

+ = {1,}

+ matches one or more times, so it never produces an empty match.

? = {0,1}

? matches zero or one time, so it also succeeds with an empty match when the symbol is absent.
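The difference between the three shortcuts is easiest to see on the same input. In this sketch the input string is made up for illustration; each pattern looks for a b preceded by some number of a's.

```python
import re

text = "b ab aab"  # hypothetical input

star = re.findall(r"a*b", text)      # zero or more a's before b
plus = re.findall(r"a+b", text)      # one or more a's before b
optional = re.findall(r"a?b", text)  # at most one a before b

print(star, plus, optional)

# Only * and ? can succeed on an empty string.
empty_star = re.fullmatch(r"a*", "") is not None
empty_plus = re.fullmatch(r"a+", "") is not None
empty_optional = re.fullmatch(r"a?", "") is not None
```

Note that a?b finds only "ab" inside "aab", because ? allows at most one a.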

 

Question: Match only the tag (for example <title>), not the whole line, using quantifiers.

                <title>aaaabbbbbccccc</title>

 

Solution: A poor regex would be <.+>, because it does not give the correct answer. Why?

Let’s analyse it.

Step 1:

String: <title>aaaabbbbbccccc</title>

Regex: <.+>

The regex starts with <, so the engine first matches the opening angle bracket.

 

Step 2:

String: <title>aaaabbbbbccccc</title>

Regex: <.+>

The dot (.) matches a single character, and the + quantifier repeats it one or more times. Because quantifiers are greedy, .+ consumes everything up to the end of the string.

 

Step 3:

String: <title>aaaabbbbbccccc</title>

Regex: <.+>

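Now the regex engine needs to match the closing angle bracket. To do so it backtracks from the end of the string, giving back one character at a time, until it finds a >. The first one it finds this way is the last > in the string, so the match spans the whole line.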

Result: <title>aaaabbbbbccccc</title>

 

Technically, the regex did what it was supposed to do. But this isn’t what we wanted.

The main culprit here is greediness. To stop it, add ? after the quantifier, which makes it lazy.

 

Regex: <.+?>

String: <title>aaaabbbbbccccc</title>

Now it matches as little as possible, so the first match is just <title>.
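The greedy and lazy versions can be compared directly in Python. This is a minimal sketch using re.findall, which returns every non-overlapping match in the string.

```python
import re

html = "<title>aaaabbbbbccccc</title>"

# Greedy: .+ runs to the end, then backtracks to the last ">".
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" it can reach.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```

The lazy pattern finds the opening and the closing tag as two separate matches, while the greedy one returns the whole line as a single match.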

 

Brackets [ ]

[ ] define a set of alternatives for a single character: individual characters, ranges, or both.

 

Question: Match using [ ]

Match can
Match man
Match fan
Match jan
Skip ran
Skip tan
Skip pan

Solution: [cmfj]an
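The solution can be checked against the word lists from the question. This sketch uses Python's re.fullmatch so that each word has to fit the pattern completely; the variable names are made up for illustration.

```python
import re

pattern = re.compile(r"[cmfj]an")

should_match = ["can", "man", "fan", "jan"]
should_skip = ["ran", "tan", "pan"]

# True for every word the pattern accepts in full.
accepted = [w for w in should_match + should_skip if pattern.fullmatch(w)]
print(accepted)
```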

 

To avoid listing every character inside [ ], you can write ranges: [a-z] matches any lowercase English letter.

You can also combine ranges: [a-zA-Z0-9] matches any uppercase letter, lowercase letter, or digit. [ ] by itself matches exactly one character; add a quantifier to match several.

 

Match: aabbccddeeaabbzztt

Regex: [a-z]+

 

Instead of writing long ranges of letters and digits inside [ ], you can use shorthand character classes.

 

[\w] = [a-zA-Z0-9_]

It contains plain English letters, digits and the underscore. In ASCII mode it will not match symbols such as ϕ; Unicode-aware engines may treat such letters as word characters.

 

[\d] = [0-9]

It contains the digits from 0 to 9.

 

[\s] ≈ [ \t\n\r\f\v]

This class contains a space, tab, newline, carriage return, form feed and vertical tab.
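The shorthand classes can be tried out in Python as follows. This is a sketch under two assumptions: the sample text is made up, and the re.ASCII flag is passed so that \w, \d and \s are limited to the ASCII sets listed above (without it, Python's str patterns are Unicode-aware).

```python
import re

text = "user_42 paid\t$19.99\n"  # hypothetical input

words = re.findall(r"\w+", text, flags=re.ASCII)   # runs of [a-zA-Z0-9_]
digits = re.findall(r"\d+", text, flags=re.ASCII)  # runs of [0-9]
spaces = re.findall(r"\s", text, flags=re.ASCII)   # single whitespace chars

print(words, digits, spaces)

# In ASCII mode a Greek letter is not a word character.
phi_is_word = re.fullmatch(r"\w", "ϕ", flags=re.ASCII) is not None
```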

 

Question: Match hex-color

Match #abc
Match #f55
Skip #XTQ67A
Match #C0FFEE
Skip #!AB
Skip #!f004
Skip #f5Y

Solution: Hex colors are either 3 or 6 characters long (after the #) and contain only the letters ‘a’ to ‘f’ or ‘A’ to ‘F’ and digits. So #f00 is a valid hex color code, while #K00 and #f002 are invalid. Therefore, we need to filter out hex colors that have wrong letters or the wrong length.

 

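Many people will write the regex #[a-fA-F\d]{3,6}. This is wrong, because it also accepts invalid lengths such as 4 or 5. The correct solution is #([a-fA-F\d]{3}){1,2}, applied to the whole string (for example, anchored as ^#([a-fA-F\d]{3}){1,2}$) so that a valid prefix of a longer code is not accepted.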

 

Analysis: Here we have used parentheses to group part of the regex. Inside the group, the quantifier {3} matches any letter a-f or A-F or any digit exactly 3 times. The group must then repeat once (length 3) or twice (length 6), so a second quantifier, {1,2}, is added outside the parentheses.
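Both patterns can be compared in a short Python sketch. The function name is_hex_color is hypothetical, re.fullmatch plays the role of the ^ and $ anchors, and re.ASCII is an added assumption that keeps \d limited to 0-9.

```python
import re

# Grouped version: 3 hex digits, repeated once or twice.
HEX_COLOR = re.compile(r"#([a-fA-F\d]{3}){1,2}", flags=re.ASCII)

# The common but wrong version: any length from 3 to 6.
LOOSE = re.compile(r"#[a-fA-F\d]{3,6}", flags=re.ASCII)


def is_hex_color(value):
    """Return True when the whole string is a 3- or 6-digit hex color."""
    return HEX_COLOR.fullmatch(value) is not None


for sample in ["#abc", "#C0FFEE", "#XTQ67A", "#!AB", "#f5Y", "#f002"]:
    print(sample, is_hex_color(sample))

# The loose pattern wrongly accepts a 4-digit code.
loose_accepts_4 = LOOSE.fullmatch("#f002") is not None
```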

 

More on brackets and other topics of regex in the next part.

Thanks for your time.

Stay tuned.

 
