
Monday, January 21, 2013

Regex Tutorial 3: Phone Nums


w7r.blogspot.com
Tutorial #2 Answers
  1. "a{3}b{4}c{5}"
  2. "kne{2}"
  3. "cab{2}age"
  4. "2{2}nd stre{2}t"



Typical Regex

Often, we do not know in advance exactly what text we want a regex to capture. For example, you may want to capture all of the phone numbers on a website without knowing which numbers you will actually find.

Searching for the literal text "333-333-3333" would not locate all of the phone numbers on a website. It would only capture results with that exact phone number.
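
As a quick illustration, here is a minimal Python sketch of that limitation. The page text and the second phone number are made up for this example; only "333-333-3333" comes from the tutorial.

```python
import re

# Hypothetical page text, invented for this example.
page_text = "Call 333-333-3333 or 555-867-5309 for details."

# re.escape makes sure every character is treated literally.
literal = re.escape("333-333-3333")

# Only the number we already knew about is found.
literal_hits = re.findall(literal, page_text)
print(literal_hits)  # ['333-333-3333']
```

The second number on the page is missed entirely, which is why we need a pattern instead of a literal.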

So, we need to describe a pattern that follows: 3 digits + hyphen + 3 digits + hyphen + 4 digits.



The Digit Character

"\d" is the text that describes 1 digit. A digit character matches one of: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. (In some engines, such as Python 3 with text patterns, "\d" can also match digits from other writing systems.)
So, "\d\d\d" could match the first three digits of a phone number; however, we can simplify this expression to "\d{3}". The curly brackets surround the exact number of times the preceding item (here, a digit) must repeat.
So far, "\d{3}" matches the 1st of the 5 parts in the phone number pattern we would like to describe. The part following the three digits is a hyphen "-". Here the hyphen is treated literally: its only special purpose in regex is specifying ranges inside square-bracket character classes (for example, "[0-9]"), and outside of those brackets it simply matches itself.
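
To check this, here is a small Python sketch (Python's re module is my choice for these examples; the tutorial itself is not tied to a particular engine). It assumes the sample number from earlier in the post.

```python
import re

sample = "333-333-3333"

# "\d\d\d" and "\d{3}" describe the same thing: exactly three digits in a row.
long_form = re.match(r"\d\d\d", sample)
short_form = re.match(r"\d{3}", sample)
print(long_form.group(), short_form.group())  # 333 333

# Adding the literal hyphen covers the first two parts of the pattern.
with_hyphen = re.match(r"\d{3}-", sample)
print(with_hyphen.group())  # 333-
```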



Congratulations! Our Work is Nearly Finished!

It so happens that the pattern "###-" (DIGIT + DIGIT + DIGIT + HYPHEN) occurs twice in a row, with no characters in between, in a phone number (###-###-####), so we can simply repeat the first 2 parts of our regex.

Updated Regex: "(\d{3}-){2}"



An image of the regex (\d{3}-){2} in a search and replace module

What Do Parentheses Do in Regular Expressions?

As shown in the 'Updated Regex' snippet above, parentheses combine sub-expressions into a bigger expression. Combining "\d{3}" and the hyphen "-" yields the group "(\d{3}-)", and we can specify how many times this group should occur using a repetition operator, just as we did with "\d".
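
A short Python sketch shows why the grouping matters. The ungrouped pattern below is my own contrast case, not something from the tutorial.

```python
import re

sample = "333-333-3333"

# With parentheses, {2} repeats the whole group "\d{3}-".
grouped = re.match(r"(\d{3}-){2}", sample)
print(grouped.group())  # 333-333-

# Without parentheses, {2} applies only to the hyphen,
# so this pattern expects three digits followed by "--".
ungrouped = re.match(r"\d{3}-{2}", sample)
print(ungrouped)  # None

double_hyphen = re.match(r"\d{3}-{2}", "333--")
print(double_hyphen.group())  # 333--
```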
Parentheses can do more than group a series of sub-expressions together; however, their other purpose is outside the scope of this tutorial. Keep working hard and the answer will soon be unveiled! Or spoil the answer with the link below.
In-Depth Look at Brackets in Regex



Ta Duh!

Updated Regex: "(\d{3}-){2}\d{4}"

By adding "\d{4}" to the end of our regular expression, we have finished the pattern. However, our regex is easily tricked by text such as "333-333-55555": it still matches, because nothing excludes results that have additional digits on the end. The same goes for extra digits before the first set of 3 digits.
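
Here is a minimal Python sketch of the finished pattern being tricked. The second test string, "4444-333-4444", is the extra-leading-digit case used later in this post.

```python
import re

phone = re.compile(r"(\d{3}-){2}\d{4}")

# A well-formed number matches in full.
ok = phone.search("333-333-3333")
print(ok.group())  # 333-333-3333

# An extra trailing digit is silently dropped from the match.
tricked_end = phone.search("333-333-55555")
print(tricked_end.group())  # 333-333-5555

# An extra leading digit is skipped over in the same way.
tricked_start = phone.search("4444-333-4444")
print(tricked_start.group())  # 444-333-4444
```

Note that search() with group() is used here rather than findall(), because findall() returns only the contents of the parenthesized group when a pattern contains one.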

Although fixing this brings new syntax into these tutorials, I want to make sure you get a strong feeling for the amount of thought necessary to construct a successful regex.

Keep this expression somewhere because it may come in handy someday! Regular expressions are like recipes: they require ingredients (characters, quantifiers, ...) in a specific order to construct a meal (solution).


B.S. Capital Dee

BS = Backslash. A backslash followed by a capital D is an appropriate finish to this expression: "\D" matches any single character that is not a digit.

Therefore, the cases "333-333-55555" and "4444-333-4444" are ignored, as they should be! We wanted only phone numbers in the format ###-###-####, so this updated expression checks the character immediately before and the character immediately after the number, which quickly determines whether the text is a valid phone number (in terms of the format ###-###-####).

My challenge for you: post a comment with text data that would break the expression below!

Updated Regex: "\D(\d{3}-){2}\d{4}\D"
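
If you want to experiment before commenting, here is a minimal Python sketch to start from. The surrounding sentences are invented for this example; only the three phone-number strings come from the tutorial.

```python
import re

phone = re.compile(r"\D(\d{3}-){2}\d{4}\D")

# A well-formed number inside a sentence matches. Note that the match
# includes the non-digit characters on either side of the number.
good = phone.search("Call 333-333-3333 today.")
print(repr(good.group()))  # ' 333-333-3333 '

# The two trick cases from above no longer match.
too_long = phone.search("Call 333-333-55555 today.")
too_wide = phone.search("Call 4444-333-4444 today.")
print(too_long, too_wide)  # None None
```

Try swapping in your own strings to look for input that breaks it.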
