Whether you are a ninja programmer or a novice, at some point in life you have got to master regular expressions. So the sooner, the better.
So, let’s get started.
Many people refer to regular expressions as Regex, Regexp, or simply Regular expression. They all mean the same thing. A regex is a pattern made up of rules that matches a certain set of strings. For example, we can write those three names in regex form like this:
/^Reg(exp?|ular expression)$/
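To see that this one pattern really covers all three names, here is a minimal sketch in Python. Python is only an assumption for illustration (the article does not name a language), and the helper name is made up. Python writes the pattern as a plain string, without the /.../ delimiters.

```python
import re

# The pattern from the text, without the surrounding slashes.
NAME_PATTERN = re.compile(r"^Reg(exp?|ular expression)$")

def is_regex_name(text: str) -> bool:
    """Return True if text is one of the names the pattern describes."""
    return NAME_PATTERN.search(text) is not None

for candidate in ["Regex", "Regexp", "Regular expression", "Regexes"]:
    print(candidate, is_regex_name(candidate))
```

The first three candidates match and "Regexes" does not, because the $ at the end leaves no room for the extra letters.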
A more abstract definition would be: “a regex is a way to describe a set of strings”. Now the more pressing question is why regexes are useful, and whether you should use them at all.
- Regexes are extremely fast, provided you write a good one and not a sloppy one.
- Regexes can help you write shorter code.
- Regexes save a lot of time.
- Regexes can match just about anything.
We can check whether a string matches a certain format; in HTML5, for example, we can use a regex for form validation. We can also use it to extract or replace certain parts of a string, depending on your need. Regex support is built into many programming languages, IDEs, command-line tools such as grep and sed, and databases. Search engines use regexes as well.
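The three uses mentioned above (checking a format, extracting a part, replacing a part) can be sketched roughly like this in Python. The sample log line, the date format, and the function names are all made up for illustration, and the patterns use brackets and quantifiers that are explained later in this article.

```python
import re

# Hypothetical sample data, invented for this example.
log_line = "2024-01-15 ERROR disk full on /dev/sda1"

DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"

def looks_like_date(text: str) -> bool:
    """Check a format: the whole string must look like YYYY-MM-DD."""
    return re.fullmatch(DATE, text) is not None

def extract_date(text: str):
    """Extract a part: return the first date found, or None."""
    found = re.search(DATE, text)
    return found.group(0) if found else None

def mask_digits(text: str) -> str:
    """Replace: swap every digit for a '#'."""
    return re.sub(r"[0-9]", "#", text)

print(looks_like_date("2024-01-15"))
print(extract_date(log_line))
print(mask_digits("room 42"))
```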
One limitation of regexes is that they are not fully portable: a regex written for Ruby might not work with PHP. This is due to the different regex engines (PCRE, JRegex, XRegExp).
Now let’s dive into the regex basics:
- The most basic regex contains letters, numbers, and other symbols that literally match themselves. For example, ‘a’ is a regex that matches the symbol ‘a’.
- By default, regexes are case sensitive.
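A small sketch of both points, assuming Python's re module. The helper name is hypothetical, and re.IGNORECASE is Python's way of switching case sensitivity off; other engines spell that option differently.

```python
import re

def contains(pattern: str, text: str, ignore_case: bool = False) -> bool:
    """Return True if the pattern is found anywhere in text."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, text, flags) is not None

print(contains("a", "cat"))                    # the literal 'a' is in "cat"
print(contains("a", "CAT"))                    # 'a' is not 'A' by default
print(contains("a", "CAT", ignore_case=True))  # unless we ask to ignore case
```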
Common Matching Symbols
Dot (.)
It matches any single character except a newline.
- .at is a regex that can match strings such as cat, bat, 1at, #at, (at, and .at.
- To change this behaviour of the dot (.), escape it with a backslash: \.at will match only .at
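Here is a quick way to try the dot and the escaped dot side by side, sketched in Python. The word list is the one from the bullets plus two extra entries ("at" and a newline followed by "at") added for illustration.

```python
import re

words = ["cat", "bat", "1at", "#at", "(at", ".at", "at", "\nat"]

def full_matches(pattern: str, candidates: list) -> list:
    """Keep only the candidates that the pattern matches from start to end."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

any_char = full_matches(r".at", words)      # dot = any one character but newline
literal_dot = full_matches(r"\.at", words)  # escaped dot = a real '.'

print(any_char)
print(literal_dot)
```

The plain "at" is rejected because the dot needs exactly one character, and the newline version is rejected because the dot does not match a newline by default.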
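{n}, {n,m}, {n,}
To match the same symbol several times in a row, we can use curly braces to define how many times the preceding character(s) should repeat. These are called quantifiers. Write them without spaces inside the braces, because many engines treat a form like {n, m} as literal text.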
- string: caaaaaaaaaat
- regex: ca{10}t
The above regex will match a c, 10 consecutive a’s, and a t.
{n} = matches exactly ‘n’ times.
{n,} = matches at least ‘n’ times, and can match more. Quantifiers take as many repetitions as they can, which is known as greediness.
{n,m} = matches at least ‘n’ times but no more than ‘m’ times; ‘m’ is just the upper bound here.
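The three forms can be checked with a short Python sketch. The helper names are made up, the test strings are built with a function so the number of a's is easy to see, and the quantifiers are written without spaces inside the braces, which is the form Python's re module accepts.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

def c_a_t(count: int) -> str:
    """Build 'c', then `count` copies of 'a', then 't'."""
    return "c" + "a" * count + "t"

print(whole(r"ca{10}t", c_a_t(10)))   # exactly 10
print(whole(r"ca{10}t", c_a_t(9)))    # 9 is too few
print(whole(r"ca{2,}t", c_a_t(1)))    # at least 2, so 1 fails
print(whole(r"ca{2,}t", c_a_t(10)))   # no upper bound
print(whole(r"ca{2,4}t", c_a_t(4)))   # between 2 and 4
print(whole(r"ca{2,4}t", c_a_t(5)))   # 5 is over the upper bound
```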
There are certain shortcuts for defining quantifiers.
* = {0,}
* matches 0 or more times; it also matches the empty string when there are zero occurrences.
+ = {1,}
+ matches 1 or more times; it does not match the empty string.
? = {0,1}
? matches either 0 or 1 time; it also matches the empty string when there are zero occurrences.
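A rough Python sketch of the three shortcuts, including the empty-string behaviour described above. The helper name is hypothetical.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# * allows zero or more a's
star = [whole(r"ca*t", s) for s in ["ct", "cat", "caaat"]]
# + needs at least one a
plus = [whole(r"ca+t", s) for s in ["ct", "cat", "caaat"]]
# ? allows zero or one a
optional = [whole(r"ca?t", s) for s in ["ct", "cat", "caaat"]]

# Only * and ? are satisfied by an empty string.
empty = {p: whole(p, "") for p in [r"a*", r"a+", r"a?"]}

print(star, plus, optional, empty)
```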
Question: Match only the <title> tag in the text below using quantifiers.
<title>aaaabbbbbccccc</title>
Solution: A poor regex would be <.+>, because it won’t give you the correct answer. Why?
Let’s analyse it.
Step 1:
String: <title>aaaabbbbbccccc</title>
Regex: <.+>
The regex starts with <, so it first matches just the opening angle bracket.
Step 2:
String: <title>aaaabbbbbccccc</title>
Regex: <.+>
The dot (.) matches only one character, and the + makes it match one or more characters. Because quantifiers are greedy, .+ consumes everything up to the end of the string.
Step 3:
String: <title>aaaabbbbbccccc</title>
Regex: <.+>
Now the regex engine needs to find the closing angle bracket. To do that it backtracks from the end of the string until it reaches a >, which here is the very last character.
Result: <title>aaaabbbbbccccc</title>
Technically, the regex did what it was supposed to do. But this isn’t what we wanted.
So the main culprit here is greediness. To stop it, add ? after the quantifier in the regex.
Regex: <.+?>
String: <title>aaaabbbbbccccc</title>
Now it matches as little as possible, so the match stops at the first > and gives <title>.
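The difference between the greedy and the lazy version is easy to see by running both on the same string. This is a small sketch in Python, with a made-up helper name; re.search returns the first match and re.findall returns every non-overlapping match.

```python
import re

html = "<title>aaaabbbbbccccc</title>"

def first_match(pattern: str, text: str) -> str:
    """Return the first match of pattern in text, or '' if there is none."""
    found = re.search(pattern, text)
    return found.group(0) if found else ""

greedy = first_match(r"<.+>", html)      # grabs as much as it can
lazy = first_match(r"<.+?>", html)       # stops at the first '>'
every_tag = re.findall(r"<.+?>", html)   # the lazy pattern, applied repeatedly

print(greedy)     # <title>aaaabbbbbccccc</title>
print(lazy)       # <title>
print(every_tag)  # ['<title>', '</title>']
```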
Brackets [ ]
[ ] are used to set up alternation between characters, ranges, or both.
- [abc] means match either a or b or c, but only one of them at a time.
Question: Match using [ ]
- Match: can, man, fan, jan
- Skip: ran, tan, pan
Solution: [cmfj]an
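The solution can be checked against the whole word list with a few lines of Python (a sketch; the variable names are made up).

```python
import re

words = ["can", "man", "fan", "jan", "ran", "tan", "pan"]

# One of c, m, f, j followed by the literal letters "an".
pattern = re.compile(r"[cmfj]an")

matched = [w for w in words if pattern.fullmatch(w)]
skipped = [w for w in words if not pattern.fullmatch(w)]

print(matched)  # ['can', 'man', 'fan', 'jan']
print(skipped)  # ['ran', 'tan', 'pan']
```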
To avoid writing every character inside [ ], you can write ranges: [a-z] will match all lowercase English letters.
You can also combine different ranges, like [a-zA-Z0-9], which will match all uppercase letters, lowercase letters, and digits. Using a quantifier with [ ] lets it match multiple characters.
Match: aabbccddeeaabbzztt
Regex: [a-z]+
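A short Python sketch of ranges combined with a quantifier. The second and third sample strings are invented here to show where each range stops matching.

```python
import re

# The example from the text: one long run of lowercase letters.
whole_run = re.fullmatch(r"[a-z]+", "aabbccddeeaabbzztt") is not None

# A lowercase-only range stops at digits, capitals, and spaces.
lower_runs = re.findall(r"[a-z]+", "abc123DEF ghi")

# Combined ranges keep going through digits and capitals,
# but still stop at the space and the underscore.
mixed_runs = re.findall(r"[a-zA-Z0-9]+", "abc123DEF ghi_")

print(whole_run)   # True
print(lower_runs)  # ['abc', 'ghi']
print(mixed_runs)  # ['abc123DEF', 'ghi']
```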
Instead of writing long ranges of letters and numbers inside [ ], you can use shorthand character classes.
[\w] = [a-zA-Z0-9_]
It contains plain old English letters, digits, and the underscore. In engines where \w is ASCII-only it will not match symbols like Ω or ϕ, though some engines extend it to Unicode letters.
[\d] = [0-9]
It contains the digits from 0 to 9.
[\s] ≈ [\n\t\r ]
This class contains the tab, newline, vertical tab, form feed, carriage return, and space.
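Here is a sketch of the three classes in Python. One thing to be aware of: Python's \w and \d are Unicode-aware by default for text strings, so the re.ASCII flag is used below to get the [a-zA-Z0-9_] and [0-9] behaviour described above. The sample strings are made up for illustration.

```python
import re

sample = "snake_case \u03a9 42"  # \u03a9 is the Greek capital letter Omega

# With re.ASCII, \w is exactly [a-zA-Z0-9_], so Omega is skipped.
ascii_words = re.findall(r"\w+", sample, re.ASCII)

# Without the flag, Python also counts Omega as a word character.
unicode_words = re.findall(r"\w+", sample)

# \d picks out runs of digits.
digits = re.findall(r"\d+", "call 555 0199", re.ASCII)

# \s matches spaces, tabs, and newlines, so it works well for splitting.
fields = re.split(r"\s+", "a \t b\nc")

print(ascii_words, unicode_words, digits, fields)
```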
Question: Match hex-color
- Match: #abc, #f55, #C0FFEE
- Skip: #XTQ67A, #!AB, #!f004, #f5Y
Solution: Hex colors are either 3 or 6 characters long and contain the letters ‘a’ to ‘f’ and ‘A’ to ‘F’ plus digits, which means #f00 is a valid hex color code and #K00 and #f002 are invalid. Therefore, we need to filter out hex colors that contain the wrong letters or are of the wrong length.
Many people will write the regex #[a-fA-F\d]{3,6}. This is wrong, because it also accepts lengths of 4 and 5. The correct solution is #([a-fA-F\d]{3}){1,2}, applied to the whole string (for example, anchored as ^#([a-fA-F\d]{3}){1,2}$); without that, it would still find #f00 inside #f002.
Analysis: Here we have used parentheses to group the regex. Inside the group, the quantifier {3} matches any letter a-f or A-F or any digit 3 times. This group needs to be repeated once (length 3) or twice (length 6), so another quantifier, {1,2}, is added outside the parentheses.
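Both patterns can be compared with a short Python sketch. Two assumptions to note: fullmatch is used so that the entire string has to fit the pattern, and re.ASCII keeps \d limited to 0-9. The function names are made up for illustration.

```python
import re

NAIVE = re.compile(r"#[a-fA-F\d]{3,6}", re.ASCII)
STRICT = re.compile(r"#([a-fA-F\d]{3}){1,2}", re.ASCII)

def is_hex_color(text: str) -> bool:
    """True if the whole text is '#' plus 3 or 6 hex digits."""
    return STRICT.fullmatch(text) is not None

def naive_is_hex_color(text: str) -> bool:
    """The {3,6} version, which also lets lengths 4 and 5 through."""
    return NAIVE.fullmatch(text) is not None

for color in ["#abc", "#f55", "#C0FFEE", "#XTQ67A", "#!AB", "#f5Y", "#f002"]:
    print(color, is_hex_color(color), naive_is_hex_color(color))

# Without requiring a full match, the strict pattern still finds
# the first three digits inside the invalid "#f002".
partial = STRICT.search("#f002").group(0)
print(partial)  # #f00
```

The last two lines show why the pattern should be applied to the whole string: a plain search is happy with any valid-looking prefix.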
More on brackets and other topics of regex in the next part.
Thanks for your time.
Stay tuned.