Using Regular Expressions in PHP

When I first started programming in PHP, I found regular expressions very difficult. They were complicated, looked ugly, were hard to figure out, and there seemed to be a real lack of documentation in this area. This article explains what they are, how they are useful, and how to apply them.

What are Regular Expressions?

Regular expressions were popularized by early Unix tools such as the ed editor and grep. They were designed to make it easier to find, replace and work with strings, and since their invention they’ve been in wide use in many different parts of Unix-based operating systems. They became a staple of Perl, and have since been implemented in PHP.

What could I use them for?

There are a few common uses for regular expressions. Perhaps the most useful is form validation. For example, you could use regular expressions to check that an email address entered into a form uses the correct syntax. We’ll consider this specific example later on in this article.

You could also use them to perform complex search-and-replace operations on a body of text that would not be possible with PHP’s standard str_replace function, which only matches fixed strings. Yes, the possibilities are endless!
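As a rough illustration of a replacement that str_replace cannot express, the sketch below collapses any run of whitespace into a single space. It uses preg_replace from PHP's Perl-compatible family; the helper name collapse_whitespace and the sample text are made up for this example.

```php
<?php
// Hypothetical helper: squeeze every run of whitespace into one space.
// str_replace() cannot do this because the length of each run is not known in advance.
function collapse_whitespace($text)
{
    // \s+ matches one or more whitespace characters (spaces, tabs, newlines).
    return preg_replace('/\s+/', ' ', $text);
}

echo collapse_whitespace("Regular   expressions\t are \n handy"), "\n";
// prints: Regular expressions are handy
```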

How do I use them?

Let’s look at how we might use a regular expression to check the syntax of an email address entered into a form that’s submitted to a PHP script.

There are two types of regular expression functions included in PHP:

  • the ereg functions, which use the POSIX extended regular expression syntax (deprecated in PHP 5.3 and removed in PHP 7.0)

  • the preg functions, which use a Perl-compatible regular expression syntax

    For this article we’ll use the eregi function, which tests whether a string matches a particular regular expression. The ‘i’ in the function name means ‘case insensitive’; you can use ereg instead if you want the match to be case sensitive.

    The eregi function is documented in the PHP Manual.

    Now, as you know, email addresses are always in a particular format:
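Because the ereg family is unavailable from PHP 7.0 onward, the same case-sensitive versus case-insensitive distinction can be tried out with preg_match, where a trailing i modifier plays the role of the i in eregi. This is only a sketch; the pattern and sample string are invented for the illustration.

```php
<?php
// Case-sensitive match, roughly comparable to ereg('^php$', 'PHP').
$sensitive = preg_match('/^php$/', 'PHP');    // 0, because the case differs

// Case-insensitive match, roughly comparable to eregi('^php$', 'PHP').
// The "i" after the closing delimiter switches off case sensitivity.
$insensitive = preg_match('/^php$/i', 'PHP'); // 1

var_dump($sensitive, $insensitive);
```

Note that preg patterns need delimiters (the slashes here), which ereg patterns do not.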

    username @ domain . extension

    That makes them an ideal candidate to be tested with a regular expression. So let’s take a look at an expression I wrote to check the validity of an email address. We’ll look at each section of the expression individually, and then I’ll include a syntax reference at the end of the article. But first, here’s the expression itself:

    eregi('^[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z]{2,4}$', $email)

    If you’re anything like I was when I first used regular expressions, that example probably looks very confusing! Let’s split it into sections and make sense of each part individually:

    ^[a-zA-Z0-9._-]+@

    This part of the expression validates the ‘username’ section of the email address. The hat sign (^) at the beginning of the expression represents the start of the string. If we didn’t include this, then someone could key in anything they wanted before the email address and it would still validate.

    Contained in the square brackets are the characters we want to allow in this part of the address. Here, we are allowing the letters a-z, A-Z, the numbers 0-9, and the symbols underscore (_), period (.), and dash (-). As you’ve probably noticed, I’ve included letters both in capitals and lower case. In this instance, this isn’t strictly necessary, as we’re using the eregi (case insensitive) function. But I’ve included them here for completeness, and to show you how the functions work. The order of the character pairs within the brackets doesn’t matter.

    The plus (+) sign after the square brackets indicates ‘one or more of the contents of the previous brackets’. So, in this case, we require one or more of any of the characters in the square brackets to be included in the address in order for it to validate. Finally, there is the ‘@’ sign, which means that we require the presence of one @ sign immediately following the username.
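To see the start anchor, the character class, and the plus sign working together, here is a small sketch that tests only this first section of the expression. It uses preg_match rather than eregi so that it runs on current PHP versions; the helper name starts_with_username is hypothetical.

```php
<?php
// Hypothetical helper: does $s begin with a valid "username@" prefix?
// Only the first section of the article's expression is tested here.
function starts_with_username($s)
{
    return preg_match('/^[a-zA-Z0-9._-]+@/', $s) === 1;
}

var_dump(starts_with_username('john.smith@example.com')); // bool(true)
var_dump(starts_with_username('@example.com'));           // bool(false): + needs at least one character
var_dump(starts_with_username('!!john@example.com'));     // bool(false): ^ pins the match to the start
```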

    [a-zA-Z0-9._-]+\.

    This part of the expression is very similar to the section we just looked at. It validates the domain name in the email address. As before, we have a series of characters in square brackets that we’ll allow in this part of the address, followed by a plus (+) sign, requiring one or more of those characters.

    At the end of this section, there is a backslash, then a period sign. This tells the expression that a period is required at this point in the expression (i.e., between the domain and extension). However, the backslash is slightly more complicated. In a regular expression, a period actually means ‘any character’. In order to make this expression take the period’s literal value rather than use it as a wildcard for any character, we need to ‘escape’ it, which is done by preceding the period with a backslash. You may have come across this before if you use databases such as MySQL, as escaping characters is very important there too.
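The difference between an escaped and an unescaped period is easy to check for yourself. The following sketch uses preg_match (swapped in for eregi, which no longer exists in PHP 7 and later); the sample strings are invented for the illustration.

```php
<?php
// Unescaped: "." matches any single character, so the "x" is accepted.
$unescaped = preg_match('/^example.com$/', 'examplexcom');  // 1

// Escaped: "\." matches only a literal period.
$escaped = preg_match('/^example\.com$/', 'examplexcom');   // 0
$literal = preg_match('/^example\.com$/', 'example.com');   // 1

var_dump($unescaped, $escaped, $literal);
```

Inside a single-quoted PHP string the backslash before the period is passed through to the regular expression unchanged.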

    [a-zA-Z]{2,4}$

    This is the final part of the expression. At the beginning is another set of characters enclosed in square brackets. This time, I have simply allowed the letters a-z and A-Z, because numbers and other characters are not valid in domain extensions.

    Instead of the + sign we used before, here we have ‘{2,4}’ immediately following the square brackets. This means that we require between 2 and 4 of the characters from the square brackets at this point in the email address. So com, net, org, uk, au, etc. are all valid, but longer extensions such as museum will not be accepted.

    Finally, the $ sign at the end of the expression signifies the end of the string. If we didn’t include this, then a user could type anything after the end of the email address and it would still validate.
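The quantifier and the end anchor can be tried in isolation. This sketch checks just the extension section, again with preg_match standing in for eregi; the helper name is_valid_extension is hypothetical, and a leading ^ is added so that the whole test string has to be the extension.

```php
<?php
// Hypothetical helper: checks only the extension section, [a-zA-Z]{2,4}$,
// with ^ added so that nothing may come before the extension either.
function is_valid_extension($ext)
{
    return preg_match('/^[a-zA-Z]{2,4}$/', $ext) === 1;
}

var_dump(is_valid_extension('uk'));     // bool(true)
var_dump(is_valid_extension('info'));   // bool(true)
var_dump(is_valid_extension('c'));      // bool(false): fewer than 2 letters
var_dump(is_valid_extension('museum')); // bool(false): more than 4 letters
var_dump(is_valid_extension('com!'));   // bool(false): $ rejects the trailing "!"
```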

    Here’s the source code of a script you can use to test this regular expression — and any others you want to play with:

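    <?php
    // Note: eregi() was removed in PHP 7.0, so this script needs PHP 5.x.
    $action = isset($_REQUEST['action']) ? $_REQUEST['action'] : '';

    if ($action == '') {
        // Nothing submitted yet: show the form.
    ?>
    <form action='<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>' method='POST'>
    Email Address: <input type='text' name='email'>
    <input type='hidden' name='action' value='validate'>
    <p>
    <input type='submit' value='Submit'>
    </p>
    </form>
    <?php
    }

    if ($action == 'validate') {
        $email = isset($_REQUEST['email']) ? $_REQUEST['email'] : '';
        // The backslash escapes the period so that it matches a literal dot.
        if (eregi('^[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.([a-zA-Z]{2,4})$', $email)) {
            echo 'Valid';
        } else {
            echo 'Invalid';
        }
    }
    ?>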

    Feel free to use the regular expression we made above on your own site to validate email addresses, or modify it for your own purposes.
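If you are on PHP 7 or later, where the ereg functions are gone, one possible port of the same check uses preg_match. The sketch below keeps the expression from this article and only adds the slash delimiters and the i modifier that the preg functions expect; the function name is_valid_email and the sample addresses are assumptions made for the example.

```php
<?php
// Hypothetical port of the article's check to preg_match().
// The expression is unchanged; "/" delimiters and the "i" modifier are added.
function is_valid_email($email)
{
    $pattern = '/^[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.([a-zA-Z]{2,4})$/i';
    return preg_match($pattern, $email) === 1;
}

foreach (array('jane@example.com', 'jane@example', 'jane@@example.com') as $candidate) {
    echo $candidate, ': ', is_valid_email($candidate) ? 'Valid' : 'Invalid', "\n";
}
// prints:
// jane@example.com: Valid
// jane@example: Invalid
// jane@@example.com: Invalid
```

PHP also ships a built-in check, filter_var($email, FILTER_VALIDATE_EMAIL), which may be a better fit when you do not need to control the pattern yourself.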

    Syntax Reference

    This is a quick reference to some of the basic syntax. We’ve already seen much of it earlier on, but there are a few new things here that you may find useful.

    ^         start of string
    $         end of string
    [a-z]     letters a-z inclusive in lower case
    [A-Z]     letters A-Z inclusive in upper case
    [0-9]     numbers 0-9 inclusive
    [^0-9]    any character that is not a number 0-9
    ?         zero or one of the preceding character or group
    *         zero or more of the preceding character or group
    +         one or more of the preceding character or group
    {2}       exactly 2 of the preceding character or group
    {2,}      2 or more of the preceding character or group
    {2,4}     2 to 4 of the preceding character or group
    .         any single character
    (a|b)     a OR b
    \s        a whitespace character (Perl-compatible syntax; the POSIX equivalent is [[:space:]])
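A few of the entries above that the email example did not use can be exercised with a short script. This is a sketch using preg_match; every pattern and test string in it is invented for the illustration.

```php
<?php
// Each entry: pattern, subject, and whether a match is expected.
$checks = array(
    array('/^colou?r$/',   'color', true),  // ?      : the "u" is optional
    array('/^ab*c$/',      'ac',    true),  // *      : zero or more "b"
    array('/^[0-9]{2,}$/', '7',     false), // {2,}   : at least two digits required
    array('/^(cat|dog)$/', 'dog',   true),  // (a|b)  : either alternative
    array('/^[^0-9]+$/',   'abc1',  false), // [^0-9] : digits are not allowed
    array('/^a\sb$/',      'a b',   true),  // \s     : a whitespace character
);

$results = array();
foreach ($checks as $check) {
    list($pattern, $subject, $expected) = $check;
    $matched = preg_match($pattern, $subject) === 1;
    // Record whether the outcome agrees with the expectation in the table.
    $results[] = ($matched === $expected);
    echo $pattern, ' vs "', $subject, '": ', $matched ? 'match' : 'no match', "\n";
}
```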

