Processing non alphanumeric characters in urls

Hey, I’m encountering a problem which I can’t imagine hasn’t been fixed before by someone else.
For my website I am using htaccess to rewrite the urls, transforming for instance:

index.php?section=therapists&item=Therapist
to
therapists/therapist.html

Now I’ve been putting off handling non-alphanumeric characters by simply using urlencode (so “therapist name” would become “therapist+name”), but now I want to handle them properly.

What I would like is to replace every character that is not alphanumeric with a dash, or at least all the spaces. My problem is that if I have a name like “Just a-test”, it would be rewritten to just-a-test.html. When I take this variable from $_GET, str_replace the dashes back to spaces and look for “just a test” in the database, I won’t find it.

Of course I could replace the spaces with underscores instead, but then I would have the same problem with names that contain underscores, and so on.
How do you (or most people) handle this? I could put the item’s id in the url as well but I really don’t want to (I only do that if the name occurs multiple times).

The term for a URI-friendly token like this is a “slug”, which I came across in WP many years ago.

So you have a title “My Title” and you slugify that to “my-title”.

I came across this “double encoding dash” problem too and you have a couple of ways around it that I recall.

Don’t use dashes in your titles. Ban them.

Don’t use dashes in your slugs, use underscores eg “my_title” and ban underscores from your titles.

First replace dashes with 2 dashes and create more complexity in your slugify/unslugify functions.

Weigh up which method contains the least pain, most gain in your particular application.
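For what it's worth, the third option (doubling the dashes) could be sketched roughly like this. The function names slugifyReversible and unslugifyReversible are made up for illustration, and the sketch assumes titles contain only letters, digits, spaces and dashes, with no dash directly next to a space (otherwise “a -b” and “a- b” both become “a---b” and the round trip breaks). Case is left alone here so the title can be recovered exactly.

```php
<?php
// Hypothetical helpers, not from any library.
// A literal dash is escaped as "--" so a single "-" can stand for a space.
function slugifyReversible(string $title): string
{
    $escaped = str_replace('-', '--', $title);  // literal dash -> "--"
    return str_replace(' ', '-', $escaped);     // space -> "-"
}

function unslugifyReversible(string $slug): string
{
    // Scan left to right: "--" is a literal dash, a lone "-" is a space.
    return preg_replace_callback(
        '/--|-/',
        function (array $m): string {
            return $m[0] === '--' ? '-' : ' ';
        },
        $slug
    );
}

$slug = slugifyReversible('Just a-test');   // "Just-a--test"
$title = unslugifyReversible($slug);        // "Just a-test"
```

That is the extra complexity mentioned above: every rule you add to the title format needs a matching rule in both directions.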

FWIW Here is what I ended up doing, leave the title alone and just extract a slug from it.


class SlugMaker {

    /**
     * Cleans up a string such as a page title
     * so it becomes a readable, valid URL slug.
     *
     * @param string $str a string, e.g. a page title
     * @return string a URL-friendly slug
     */
    public function slugifyAlnum( $str ){

        $str = preg_replace('#[^0-9a-z ]#i', '', $str );    // allow letters, numbers + spaces only
        $str = preg_replace('#( ){2,}#', ' ', $str );       // rm adjacent spaces
        $str = trim( $str ) ;

        return strtolower( str_replace( ' ', '-', $str ) ); // slugify
    }

    /**
     * Same as slugifyAlnum(), with the current month and year appended
     * for month-specific uniqueness, e.g. "my-title-jan-2024".
     *
     * @param string $str a string, e.g. a page title
     * @return string a URL-friendly slug ending in "-mon-yyyy"
     */
    public function slugifyAlnumAppendMonth( $str ){

        $val = $this->slugifyAlnum( $str );

        return $val . '-' . strtolower( date( "M" ) ) . '-' . date( "Y" ) ;
    }

}

I did not need to unslugify anything because I finally figured out (on my own, and with a bit of help from kind users on here) that the slug can actually be a very effective PK for your articles table. That way you can leave your user's title alone, and it can contain dashes, underscores, ampersands, spaces and anything else users want to add.

NOT

id, title, article

1, “My Title”, blah de blah

BUT
slug, title, article

me-my-title, “Me & My Title”, blah de blah

You have to enforce PK slug uniqueness of course, hence I added a month-specific uniqueness for entries into a calendar system - but that could be day-unique if you really had to do it.

The other edge case to bear in mind is if a user edits a title, then the slug could be misleading - so you need to alert them of that possibility.

The main problem is that everything will be input by the client. I was also thinking of maybe just replacing everything that is non-alphanumeric with a dash and then using wildcards in my SQL, like so:


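$title = 'some--string-with-stuff';
$pattern = str_replace('-', '_', $title);   // "_" matches any single character in LIKE
// $pdo is assumed to be an existing PDO connection (mysql_query no longer exists in current PHP)
$stmt = $pdo->prepare('SELECT * FROM `table` WHERE `title` LIKE ?');
$stmt->execute([$pattern]);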

So it would put in a wildcard for every special character.

I am just wondering, does this slow down sql a lot or is there anything I am missing? I know this could cause problems if there’s someone with for instance “name-” and someone with “name&”, but I can do a check if I find multiple entries, and then pre- or suffix the entry id in the url.

No, I would not contemplate doing that.

Using a LIKE clause risks returning multiple records, as well as being slow - as you imagined.

Simplest might be to just create a new field “slug” and on record creation you bake the slug from the title - use the method I provided previously.

You still have to countenance uniqueness in that slug though - which many do by appending a date - but if you go that far then you might as well get on and make it the PK.

Be aware that you have built a massive dependency upon Apache into your application though.

I will probably be going with baking the slug field, the reason I didn’t like this option is because it requires building this functionality into my CMS, but I guess that makes it reusable.

I don’t think there’s a problem with having a dependency upon Apache.

If you don’t want to affect your cms too much, make a lookup table.


article_slugs
=============
id | slug
=============
23 | 'my-title'

But it comes down to answering the question - “why do I need an id number as a PK when I am enforcing a unique slug in any case?”

I felt very much the same as you about not wanting to fiddle with my CMS, but if you really tie Apache's (or IIS's or Nginx's) rewrite functionality into your CMS then it makes so much sense to go directly from URLs like:

/Articles/my-title

via mod_rewrite

TO

index.php?type=articles&slug=my-title

TO

"select * from articles where slug = 'my-title'";

without having to work out that slug = my-title means id 23, and then fetch article #23.

Of course, I am glossing over all the referential tables you might then need to go on and alter, because your entire setup is probably crammed with tables such as:

table_menus

page_id | cat_id
23      | 2

But think how much more useful it would be if those tables contained slugs instead:

table_menus

page       | cat_id
'my-title' | 2

OR even

table_menus

page       | cat
'my-title' | 'carnivores'

Hope some of this helps…

PS: if you are really bored, read this old rambling thread of mine, “Continuing normalization questions”, where you can truly appreciate how slowly my brain works. :wink:

Since I wrote the CMS from start to finish I don’t mind adding extra functionality, I applaud it :wink: I was just hoping there was a universal solution or function which I just missed.

I like that I now have a term to call it (slug), and will probably build a function in my cms that derives the slug from a given field, does the unique check in there (and add a 2 or something if it exists), and then just add it to the site.
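A rough sketch of that “add a 2 or something” check, in case it helps anyone else. The function name uniqueSlug and the $slugExists callback are invented for this example; in a real CMS the callback would presumably run a query against the slug field, and a unique index on that field would still be needed to catch two records being saved at the same moment.

```php
<?php
// Hypothetical helper: returns $base if it is free, otherwise the first
// free candidate among "$base-2", "$base-3", and so on.
function uniqueSlug(string $base, callable $slugExists): string
{
    $slug = $base;
    $n = 2;
    while ($slugExists($slug)) {
        $slug = $base . '-' . $n;
        $n++;
    }
    return $slug;
}

// Stand-in for a database lookup: the slugs that are already taken.
$taken = ['my-title', 'my-title-2'];
$exists = function (string $slug) use ($taken): bool {
    return in_array($slug, $taken, true);
};

echo uniqueSlug('my-title', $exists), "\n";   // my-title-3
echo uniqueSlug('other', $exists), "\n";      // other
```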

The main feature of my cms is customizability (as a programmer, it’s for in-house projects) so it’s not like I’ll have to destroy entire systems just to add this functionality.

Cheers for your thinking, I know what I’m gonna do now ^^.

The thread you linked seems interesting, I’ll probably give it a read once I’m out of the office :slight_smile:

Yes, giving things a name does help, it gives your brain a nice handle to pass around - I just hope I gave you the right name!

Off Topic:

Writing and maintaining your own CMS, sounds as though we have a lot in common then. Last big thing I did with my CMS was to get rid of images, all of em - put them all on Flickr, let the users manage them from there. Use the API to associate images to articles and so on. Got a local group going to contribute photos and we have permission to use on our site. Images on Flickr provide link-bait back to site. Absolutely brilliant!

Well, there are two ways to work around this. The simpler, easier and more common technique is to embed a number somewhere in your URL. You don’t have to look far for an example, just look at the URL of this post:

/forums/php-34/processing-non-alphanumeric-characters-urls-751757.html

Notice 751757? This is the thread id. Interestingly, there is another number, 34 which (probably) represents topic id or something similar.

Other variations of this technique include using numbers inside path names, not file names. For example:

/forum/34/php/751757/processing-non-alphanumeric-characters-urls.html
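With the first style of URL, the script only needs the number at the end and can ignore the words in front of it. A minimal sketch, assuming the id is always the last run of digits before “.html” (the function name extractTrailingId is made up for this example):

```php
<?php
// Hypothetical helper: pulls the numeric id from the end of a
// "some-words-12345.html" style path, or returns null if there is none.
function extractTrailingId(string $path): ?int
{
    if (preg_match('/-(\d+)\.html$/', $path, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

$id = extractTrailingId('/forums/php-34/processing-non-alphanumeric-characters-urls-751757.html');
// $id is 751757; the words before it are only there for readers and search engines
```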

The trickier way is to generate a “permanent URL” once and store it in the database inside an indexed (and ideally, unique) field. So for an article titled:

Just a-test

Your permalink would be:

just-a-test

From this point forward you would query the database, looking inside the permalink field, NOT the title field.
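A small sketch of that lookup, using an in-memory SQLite database through PDO so it runs on its own. The table name articles, the column names and the function findByPermalink are assumptions for the example, not anything from a particular CMS; the same idea applies to MySQL with a UNIQUE index on the permalink column.

```php
<?php
// In-memory SQLite database, used only to keep the sketch self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// UNIQUE both enforces uniqueness and gives the column an index.
$pdo->exec('CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    permalink TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL
)');

// The permalink is generated once, when the record is created.
$insert = $pdo->prepare('INSERT INTO articles (permalink, title) VALUES (?, ?)');
$insert->execute(['just-a-test', 'Just a-test']);

// Look the article up by permalink, never by title.
function findByPermalink(PDO $pdo, string $permalink): ?array
{
    $stmt = $pdo->prepare('SELECT id, permalink, title FROM articles WHERE permalink = ?');
    $stmt->execute([$permalink]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}

$article = findByPermalink($pdo, 'just-a-test');
echo $article['title'], "\n";   // Just a-test
```

Because the title is stored untouched next to the permalink, there is nothing to convert back.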