What is a regular expression?
A regular expression (also called regex or regexp for short) is a sequence of characters that describes a pattern in text.
Some examples of regular expressions:
p..t => A dot stands for a single character. This regular expression will match words that start with a ‘p’, end with a ‘t’ and have any two characters in between them (since there are two dots within).
So, it will match words like ‘part’, ‘post’, ‘pact’, but not ‘cost’, ‘pat’, ‘pull’.
p*t = A * matches zero or more occurrences of the previous subexpression. This will match pt, ppt, pppt, and so on, but not ptt, pct.
See the page on Metacharacters Supported in Regular Expressions to get the full set of metacharacters.
What role do regular expressions play in Oracle?
Regular expressions are a powerful tool for selecting text of a certain pattern. Sure we have LIKE in SQL queries, but that’s hardly adequate. LIKE can get you all phone numbers containing the string “445”, but can it get you all phone numbers of the format “(DDD) DDD-DDDD”?
LIKE cannot. At least not as simply and intuitively as regular expressions can. Let’s see how.
Case Study: Contact details validation
Consider a table that stores emails and phone numbers of superheroes (we can but dream).
SQL> select * from t_superheros; ID NAME EMAIL PHONE --- ------------- ------------------ --------------- 1 Superman superman@kmail.com (655) 981-4455 2 Batman batmankmail.com 6a5-981-4455 3 Spiderman spiderman@com 657-8154455 4 Flash Gordon flashgordon@com (99) 981-4455 5 Phantom phantom@kmail.com (999) 981-4455
Now, the expected phone number format is “(DDD) DDD-DDDD” but only Superman and Phantom have their phone numbers in that format. Batman’s has a letter instead of a digit, Spiderman’s is missing brackets and -, Flash Gordon’s is a number short within the brackets.
How can such invalid formats be tracked? Let’s construct the regexp for it.
To match a digit, we use \d. To match a fixed count of digits, we add the count in braces immediately after. So, DDD will be represented by \d {3}.
To match ( ), we need to use the escape character “\” before ( and ), to indicate that the bracket is used literally and not as a grouping expression. So, to match (DDD) we need the regexp \(\d{3}\).
Finally, to say that the entire string has to be the phone number in this format and nothing else, we put the expression within the markers ^ (start of line) and $ (end of line).
Putting it all together, the regular expression for “(DDD) DDD-DDDD” turns out to be ‘^\(\d{3}\) \d{3}-\d{4}$’. Use this in the query and – bingo!
SQL> select * 2 from t_superheros 3 where regexp_like 4 (phone 5 ,'^\(\d{3}\) \d{3}-\d{4}$'); ID NAME EMAIL PHONE --- ------------- ------------------ --------------- 1 Superman superman@kmail.com (655) 981-4455 5 Phantom phantom@kmail.com (999) 981-4455
Exercise: In the table above, the email ids are not all in correct format either. How will you validate emailid with a regular expression?
For further reading
- REGEX to split a comma-separated string into rows
- Flatten hierarchical data using a single SQL
- How to validate email address using regular expression
Photo by cackhanded
{ 4 comments… read them below or add one }
Very Useful thing….with good explanation…Thanks
It’s a plseuare to find someone who can identify the issues so clearly
Good Explanation. Thx u so much
I really liked the way you explain the whole concept. It is Very lucid and helpful.
thank you.