URL routing and regular expressions

Well, I know this may not be the best subforum for a regular expression question, but I don't see anywhere else it would fit, so I'm posting it here. I want to write several regular expressions to match the following 5 possible types of URL routes on my site:

/{area}
/{area}/{controller}
/{area}/{controller}/{action}
/{area}/{controller}/{action}/{params}
/{area}/{controller}/{action}/{params}/page-{number}

Here the area can be the main site, the admin control panel, or the clan/group control panel. The controller, action, params and page number should be self-explanatory if you are familiar with MVC and pretty URLs. The page number pattern should match whenever it sees the keyword 'page-', in which case the URL always matches the last route type.

The question is, how do I write regular expressions that will match exactly these URL routes? I am still a beginner with regular expressions, and the syntax confuses me a lot. Thanks.

Do you have some real world URL examples you can give us?

Do you want to accept any value for area, or only certain words? Same question for controller, action and params.

Are you writing your own router? Is it one that comes as part of a framework? I can recommend FastRoute to save you a lot of trouble, and you won't build one faster. Or is this just an exercise in RegExp?

Why not adopt the PHP framework approach and rewrite everything to index.php?

.htaccess
<IfModule mod_rewrite.c>
  RewriteEngine On
  # !IMPORTANT! Set your RewriteBase here and don't forget trailing and leading
  #  slashes.
  # If your page resides at
  #  http://www.example.com/mypage/test1
  # then use
  # RewriteBase /mypage/test1/
  RewriteBase /
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_FILENAME} !-d
  RewriteRule ^(.*)$ index.php?/$1 [L]
</IfModule>

<IfModule !mod_rewrite.c>
  # If we don't have mod_rewrite installed, all 404's
  # can be sent to index.php, and everything works as normal.
  # Submitted by: ElliotHaughin

  ErrorDocument 404 /index.php
</IfModule>

Yes, I can give you an example for the five possible routes:

  1. /site, /admin, / (in this last case no area is provided, so it defaults to /site)
  2. /site/account, /admin/user
  3. /site/pm/create, /admin/user/create
  4. /site/pm/read/1, /admin/user/edit/2, /site/vm/view/1/2 (the last case has two parameters)
  5. /site/pm/page-5, /admin/user/page-6

In these cases, site and admin are 'areas'; account, user, pm and vm are controllers; create, read and edit are actions; the numbers 1 and 2 are parameters (note that the last example in the fourth route has 2 parameters; it's not common, but it can happen); page-5 and page-6 are page numbers (identified by the keyword or prefix page-).

The routes are listed in order of increasing priority. If the page- keyword is found, the URL matches the fifth route. Otherwise, it is tested against the fourth, third, second and first routes, in that order.

Areas can only be certain words like 'site', 'admin', 'mod', 'clan' and 'install'. Controllers and actions must be alphanumeric strings, while parameters and page numbers must be integers. Those are the only restrictions.
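
One way to picture the priority order and restrictions above is an ordered list of patterns, tried from the most specific route down to the least specific. This is only a sketch under assumptions: the function name matchRoute, the route labels and the shape of the returned array are made up for illustration, and the page route treats action and params as optional so that /site/pm/page-5 still matches.

```php
<?php
// Hypothetical sketch: names and result shape are illustrative assumptions.
const AREAS = 'site|admin|mod|clan|install';

function matchRoute(string $path): ?array
{
    // No area given: default to the main site, as described above.
    if ($path === '' || $path === '/') {
        $path = '/site';
    }

    $area = '(?<area>' . AREAS . ')';
    $name = '[A-Za-z0-9]+';

    // Highest priority first: page route, then params, action, controller, area.
    $patterns = [
        'page'       => "#^/{$area}/(?<controller>{$name})(?:/(?<action>{$name}))?(?<params>(?:/\d+)*)/page-(?<page>\d+)$#",
        'params'     => "#^/{$area}/(?<controller>{$name})/(?<action>{$name})(?<params>(?:/\d+)+)$#",
        'action'     => "#^/{$area}/(?<controller>{$name})/(?<action>{$name})$#",
        'controller' => "#^/{$area}/(?<controller>{$name})$#",
        'area'       => "#^/{$area}$#",
    ];

    foreach ($patterns as $type => $pattern) {
        if (preg_match($pattern, $path, $m)) {
            // The params group captures "/1/2"; split it into integers.
            $params = trim($m['params'] ?? '', '/');
            return [
                'route'      => $type,
                'area'       => $m['area'],
                'controller' => $m['controller'] ?? null,
                'action'     => ($m['action'] ?? '') !== '' ? $m['action'] : null,
                'params'     => $params === '' ? [] : array_map('intval', explode('/', $params)),
                'page'       => isset($m['page']) ? (int) $m['page'] : null,
            ];
        }
    }

    return null; // Nothing matched, e.g. an unknown area.
}
```

With this sketch, matchRoute('/site/vm/view/1/2') falls through the page pattern and matches the params pattern, while matchRoute('/admin/user/page-6') is caught by the page pattern first.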

Yes, I am writing my own router, and I am doing this for a reason. The objective is not to build a site quickly; I am creating my own framework as practice. So I won't be using third-party libraries, but I may look into the FastRoute library to see what advice or tips I can get from it. Thanks.

It seems that you don't understand what I am doing at all; this is not about URL rewriting. Of course everything is being redirected to index.php; I have a .htaccess file that does this. I have a front controller inside the index.php file, which handles all requests and routes them to different app/page controllers and actions based on the URL provided. I have already accomplished this, and now I am trying to build routers and routes that will match browser URLs. You are talking about what happens before routing, not routing itself.

You want to read this then :)

http://nikic.github.io/2014/02/18/Fast-request-routing-using-regular-expressions

I see, thanks for the article, Antnee. Can someone please help me by giving an example of what the regular expression should look like? Maybe the 4th or 5th route? I think I can come up with the others if I have 1-2 examples, thanks.

I'm not sure this will directly answer your question @Hall_of_Famer, but this article from Hugo Giraudel went up yesterday. Regex isn't something I do anything with myself, but it looks like a good primer on the subject.

If you know exactly what each segment of the URL should mean, wouldn't it be easier to just split them?
Something like this (sample code, haven't tested):

function getRoute($url){

    // Drop leading/trailing slashes, then split the path into segments
    $url = trim($url, '/');
    $urlSegments = ($url === '') ? [] : explode('/', $url);

    // Meaning of each position; everything from 'params' onward is a parameter
    $scheme = ['area', 'controller', 'action', 'params'];
    $route = [];

    foreach ($urlSegments as $index => $segment){
        if ($scheme[$index] === 'params'){
            $route['params'] = array_slice($urlSegments, $index);
            break;
        } else {
            $route[$scheme[$index]] = $segment;
        }
    }

    return $route;

}

then if you call

getRoute('/site/vm/view/1/2');

it should return

[
    'area' => 'site',
    'controller' => 'vm',
    'action' => 'view',
    'params' => ['1', '2']
]

And I wouldn't treat the page number as something standalone. Basically, it can be passed as one of the regular params.


Well, the issue is that the URL can be more flexible: sometimes actions and params do not exist, sometimes there are two params, and sometimes there are page numbers that need to be identified for pagination. There is a reason why I am using regular expressions for this.

I'm not a regex ninja, but how's this for you?
/\/(?<area>[\w\d\-_]+)?(\/(?<controller>[\w\d\-_]+))?(\/(?<action>[\w\d\-_]+))?(\/(?<param1>[\w\d\-_]+))?(\/(?<param2>[\w\d\-_]+))?/

I tried it against all of these examples and it looks like it's working to me:

  • /
  • /site
  • /admin
  • /site/account
  • /admin/user
  • /site/pm/create
  • /admin/user/create
  • /site/pm/read/1
  • /admin/user/edit/2
  • /site/vm/view/1/2
  • /site/pm/page-5
  • /admin/user/page-6

If you do preg_match($pattern, $route, $matches); you should find that you have a $matches array with named groups, so you can check $matches['area'], $matches['controller'], $matches['action'], $matches['param1'] and $matches['param2'].
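
To make the shape of $matches concrete, here is a small sketch built on the same idea as the pattern above. Two things are changed and are my assumptions rather than part of the original pattern: ^ and $ anchors are added so that extra trailing segments cause a failed match instead of being silently ignored, and [\w-] replaces [\w\d\-_] because \w already covers digits and the underscore. The helper name namedMatches is hypothetical.

```php
<?php
// Anchored variant of the pattern above (the anchors are an added assumption).
$pattern = '#^/(?<area>[\w-]+)?(?:/(?<controller>[\w-]+))?(?:/(?<action>[\w-]+))?'
         . '(?:/(?<param1>[\w-]+))?(?:/(?<param2>[\w-]+))?$#';

// Hypothetical helper: returns only the named groups that actually matched.
function namedMatches(string $pattern, string $route): ?array
{
    if (!preg_match($pattern, $route, $matches)) {
        return null;
    }
    // preg_match stores each named group twice (by name and by number);
    // keep the string keys only, then drop groups that matched nothing.
    $named = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
    return array_filter($named, fn ($value) => $value !== '');
}

print_r(namedMatches($pattern, '/site/vm/view/1/2'));
print_r(namedMatches($pattern, '/site/pm/page-5'));
```

Note that for /site/pm/page-5 the segment page-5 ends up in the action group, so a separate check for the page- prefix is still needed with this kind of pattern.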

My function handles that as well.

I think regular expressions would give a very slight speed gain, but any changes would be far more complicated to make. The script is only called once, and far more time would be spent debugging.

@megazoid's approach is sleek and not only effective but also caters for the complete range of URIs. (I wrote a script which was far more verbose and also rigid.)



So how is a script meant to identify which is which? Whether you use Regex or not, you need rules that define your structure.

Let me take a stab, and see if you agree.

  1. The first element is always the area (site, admin and so on).
  2. If there are 2 or more elements, element 2 is always the controller.
  3. If there are 3 or more elements, element 3 is always the action.
  4. If there are 4 or more elements, every element from the fourth onward except the last is a parameter; the last element is also a parameter unless it begins with the word 'page' followed by a hyphen.

Note: Assuming this ruleset is correct, then megazoid's code is perfectly functional except he needs to add a check for the 'page' option.

if(!empty($route['params']) && substr(end($route['params']),0,5) === "page-") {
   $route['page'] = (int) substr(array_pop($route['params']),5);
}
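
Put together, the ruleset and the page- check could look roughly like the sketch below. It is self-contained rather than a patch to the function posted earlier; the name parseRoute and the returned array shape are assumptions for illustration. It also deviates slightly from rule 4: the page- segment is peeled off wherever it appears last, so that /site/pm/page-5 from the examples is handled too.

```php
<?php
// Hypothetical helper: splits the path, then peels off a trailing page-N segment.
function parseRoute(string $url): array
{
    // Split on slashes and drop empty segments (leading, trailing or doubled slashes).
    $segments = array_values(array_filter(explode('/', $url), 'strlen'));

    // A trailing "page-<digits>" segment is the page number, not a parameter.
    $page = null;
    $last = end($segments); // false when there are no segments at all
    if ($last !== false && preg_match('/^page-(\d+)$/', $last, $m)) {
        $page = (int) $m[1];
        array_pop($segments);
    }

    return [
        'area'       => $segments[0] ?? 'site', // assumed default when no area is given
        'controller' => $segments[1] ?? null,
        'action'     => $segments[2] ?? null,
        'params'     => array_slice($segments, 3), // left as strings here
        'page'       => $page,
    ];
}
```

Because the page segment is removed before positions are assigned, the controller and action never see it, and the params list can be any length.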

I think allowing the page number to be a standalone part of the URI scheme doesn't make sense at all.
There are params already for all action options, including page numbers.

Rather than having each page that is going to do page-numbers individually run the logic for finding the page number, I would put it in the generalized router.

OK, maybe it depends on how exactly actions will be implemented. I assumed that an action is a function with input arguments, where each argument represents a parameter from the URL, for example:

URL: /site/shop/category/25/5

//controller
class shop { 
    //action
    function category($id, $page){
        //$id = 25, $page = 5
    }
}

In that case there is no need to have "logic for finding the page number"; it's just passed like a regular parameter.
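
As a rough sketch of that idea, the router only has to hand the params array to the action as positional arguments. Everything here is illustrative: the dispatch function, the controller map and the return value are assumptions, not part of anyone's actual framework in this thread.

```php
<?php
// Hypothetical controller, mirroring the example above.
class Shop
{
    public function category(int $id, int $page = 1): string
    {
        return "category $id, page $page";
    }
}

// Minimal dispatch sketch. $route is assumed to look like
// ['controller' => 'shop', 'action' => 'category', 'params' => ['25', '5']].
function dispatch(array $controllers, array $route)
{
    // Only controllers listed in the map can be reached from a URL.
    $class = $controllers[$route['controller']] ?? null;
    if ($class === null || !method_exists($class, $route['action'])) {
        throw new RuntimeException('No such route');
    }

    $controller = new $class();
    // Spread the URL parameters into the action's arguments, cast to int.
    return $controller->{$route['action']}(...array_map('intval', $route['params']));
}

echo dispatch(
    ['shop' => Shop::class],
    ['controller' => 'shop', 'action' => 'category', 'params' => ['25', '5']]
), PHP_EOL;
// prints "category 25, page 5"
```

The trade-off is that the page number has a fixed position in each action's signature instead of being recognised by its page- prefix.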

Check out my response. The regex is a mess, but it works for all of the examples.

Personally, I'd use parse_url() to get info about the full URL, and then just explode() the path and get the segments out, because it's just easier to follow what's happening. But as an exercise in regex, my solution will work.
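
For reference, the parse_url() plus explode() route might look something like this. The helper name pathSegments and the example URL (host and query string) are made up for illustration.

```php
<?php
// Hypothetical helper: extracts the path from a full URL and splits it into segments.
function pathSegments(string $url): array
{
    // PHP_URL_PATH gives only the path; the result is null when the URL has
    // no path and false when the URL is seriously malformed.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return [];
    }
    // Drop the empty strings produced by leading, trailing or doubled slashes.
    return array_values(array_filter(explode('/', $path), 'strlen'));
}

print_r(pathSegments('http://www.example.com/site/vm/view/1/2?sort=asc'));
```

The query string never reaches the segment list, so the router only has to reason about the path itself.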


I see what you mean.

Except in his definition of his URLs, he specified:

As two separate classes of URLs, which is why I believe it's actually handled separately. If it isn't, then absolutely just keep it as is.

Your solution will work, as long as there are exactly 0, 1, or 2 parameters. As that wasn't a specification of the declared URL scheme…