.htaccess RewriteRule clean URL (remove punctuation, replace spaces with hyphens, etc.)

Hello,

I am trying to create a clean URL using a RewriteRule in an .htaccess file (using Apache 2.2).

Using a hypothetical example, I would like this:
Ripley’s Believe It Or Not – Piccadilly Circus (London, England)

To appear like this:
attraction/ripleys-believe-it-or-not-piccadilly-circus-london-england

i.e. Remove all punctuation, replace spaces with hyphens, and make upper case letters lower case. The number of spaces will vary from entry to entry and could be even more than the eight here, so I expect the [N] suffix may well be required.

I am currently using the ‘id’ (below) rather than the ‘attraction_name’, which is obviously far simpler, but does not create a very useful or attractive URL:

Options +FollowSymLinks
RewriteEngine on
RewriteRule ^attraction/([0-9]*)$ attraction/?id=$1 [L,NC,QSA]

I have also used a PHP custom function (‘GenerateUrl’) to generate the URL I need from within the link (below), but with this method (found at this site) the variable is not passed to the next page in its original state and therefore cannot then be used to select corresponding data.

<a href = "/attraction/<?php echo GenerateUrl($attraction['id']); ?>"><?php echo html($attraction['attraction_name']); ?></a>

I don’t want to use the ‘RewriteMap myquery’ method as once my site goes live I don’t expect I’ll have access to the server’s httpd.conf or virtualhost configuration files, which that would require.

I’ve considered using the custom function to create a URL that can be saved in the ‘attraction’ table and therefore be used to select corresponding data thereafter, but would rather not given I’m pretty sure it’s avoidable.

I just can’t figure out what the RewriteRule should be – can anybody help me out?

DK or ScallioXTX, seeing as you each seem to have expertise on RewriteRules, I hope that you might be able to help.

Thanks in advance,

Andy

And it should allow for numbers in case there is such an attraction to be listed, e.g. ‘Cafe 1001’ (attraction/cafe-1001).

I’ve decided that the best way is to create a field in which to store URLs for their respective entries.
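To make the stored-URL idea concrete, here is a minimal sketch of selecting an entry by its saved URL. It uses an in-memory array as a stand-in for the ‘attraction’ table; the column name attraction_url and the helper find_by_url are assumptions for illustration, and a real page would run a database query instead.

```php
<?php
// Stand-in for the 'attraction' table; a real site would query the database.
// The 'attraction_url' column name is an assumption for this sketch.
$attractions = array(
    array('id' => 1, 'attraction_name' => "St Paul's Cathedral (London)", 'attraction_url' => 'st-pauls-cathedral-london'),
    array('id' => 2, 'attraction_name' => 'Cafe 1001', 'attraction_url' => 'cafe-1001'),
);

// Hypothetical helper: return the row whose stored URL matches, or null.
function find_by_url($rows, $url) {
    foreach ($rows as $row) {
        if ($row['attraction_url'] === $url) {
            return $row;
        }
    }
    return null;
}

$row = find_by_url($attractions, 'cafe-1001');
echo $row['attraction_name'], "\n";
```

Because the original name is kept in its own field, nothing has to be reversed from the URL: the clean URL is only ever used as a lookup key.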

I am using the GenerateUrl function (found here):-

function GenerateUrl ($s) {
//Convert accented characters, and remove parentheses and apostrophes
$from = explode (',', "ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,e,i,ø,u,(,),[,],'");
$to = explode (',', 'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,e,i,o,u,,,,,,');
//Do the replacements, and convert all other non-alphanumeric characters to hyphens
$s = preg_replace ('~[^\\w\\d]+~', '-', str_replace ($from, $to, trim ($s)));
//Remove a - at the beginning or end and make lowercase
return strtolower (preg_replace ('/^-/', '', preg_replace ('/-$/', '', $s)));
}

It works great for the most part, although I am having problems with apostrophes.

Used as quotation marks (i.e. only touching another character on one side) they work fine:-
‘Eiffel Tower (Paris)’ becomes eiffel-tower-paris

But used as actual apostrophes (being sandwiched between two characters), not so well:-
St Paul’s Cathedral (London) becomes st-paul-s-cathedral-london

I’m using PHP 5.4.3 and have code to undo the modifications of magic quotes (should this be the cause of the problem).

Thanks,

Andy

I should mention that the desired URL would be:

st-pauls-cathedral-london

I’ve just figured out the original function code I gave DOES work. However, it only seems to work if I apply the function to the name live on the page, i.e.

<?php echo generateurl($attraction['attraction_name']); ?>

But what I am currently doing is applying the function within the index.php file when data is entered into the website. I suspect the problem is coming from the fact that I am applying the function to a value which has already had the below function applied to it (to deal with magic quotes):-

$attraction_name = mysqli_real_escape_string($link, $_POST['attraction_name']);
$attraction_url = generateurl($attraction_name);

I reckon I’ve got to shift some coding around to generate the URL from the attraction_name before it is affected by mysqli_real_escape_string. I’ll let you know how I get on… (nobody else has yet joined this discussion but I figure if I solve it then it could prove useful to somebody in the future).

Yes, it turns out that mysqli_real_escape_string was the cause of the problem. A bit of reordering of the code seems to have sorted it:-

$attraction_url = generateurl($_POST['attraction_name']);
$attraction_name = mysqli_real_escape_string($link, $_POST['attraction_name']);
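For anyone wondering why the order matters, here is a minimal sketch of the effect. It assumes a cut-down stand-in for GenerateUrl (called slug_demo here, a hypothetical name) and uses addslashes in place of mysqli_real_escape_string so it runs without a database connection; both put a backslash in front of the apostrophe. The apostrophe is stripped, but the leftover backslash is a non-alphanumeric character and so becomes a hyphen.

```php
<?php
// Simplified stand-in for the thread's GenerateUrl: strip apostrophes and
// brackets, then turn every other run of non-word characters into one hyphen.
function slug_demo($s) {
    $s = str_replace(array('(', ')', '[', ']', "'"), '', trim($s));
    $s = preg_replace('~[^\w\d]+~', '-', $s);
    return strtolower(trim($s, '-'));
}

$raw = "St Paul's Cathedral (London)";
// addslashes is used here only as a stand-in for mysqli_real_escape_string;
// it turns ' into \' in the same way.
$escaped = addslashes($raw);

echo slug_demo($raw), "\n";     // st-pauls-cathedral-london
echo slug_demo($escaped), "\n"; // st-paul-s-cathedral-london
```

Generating the URL from the raw $_POST value first, and escaping only the values that go into the query, avoids the stray hyphen.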

Thanks!

Andy

The easiest way to get st-pauls-cathedral-london is to replace 's with just an s before you do anything else.


<?php
$str = str_replace("'s", "s", $str);

Hi,

The original function code did work; the problem I was having was caused by the ‘mysqli_real_escape_string’ function (see above).

I now want to remove the word ‘The’ (or ‘the’) from the start of any URLs that I create (and ‘A’ and ‘An’ as well once I’ve got that working). Surely it should be a simple change made to the bottom line of the function (see below), but it does not appear to be working - any ideas?

return strtolower (preg_replace ('/^-/', '', preg_replace ('/-$/', '', preg_replace ('/^the /i', '', $s))));
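One likely reason the line above has no effect: by the time it runs, the earlier preg_replace in the function has already turned spaces into hyphens, so the string starts with "The-" rather than "The " and the pattern never matches. A minimal sketch, assuming $s holds a value as it would look at that point in the function:

```php
<?php
// $s as it looks when it reaches the return line: spaces are already hyphens.
$s = 'The-Eiffel-Tower-Paris';

// Pattern from the post: expects a space after "the", so nothing is removed.
$with_space = strtolower(preg_replace('/^the /i', '', $s));

// Same idea, but matching the hyphen that is actually there.
$with_hyphen = strtolower(preg_replace('/^the-/i', '', $s));

echo $with_space, "\n";  // the-eiffel-tower-paris
echo $with_hyphen, "\n"; // eiffel-tower-paris
```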

Thanks,

Andy

you could do

$urlstring = strtolower(preg_replace('/[^A-Za-z0-9_-]/', '-', $urlstring));

that should replace everything that's not A to Z or a to z or 0 to 9 or - or _ with a dash (-) and then make it all lowercase

you could then check if that's the current url and, if not, use a header() call to automatically 301 to the correct place.

that's what I do for all my projects
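A rough sketch of that redirect check, in case it helps. The function name canonical_redirect_target and the /attraction/ path prefix are assumptions for illustration; the header() call is left commented out so the snippet can be run on its own.

```php
<?php
// Hypothetical helper: returns the path to redirect to when the requested
// URL differs from the canonical one, or null when no redirect is needed.
function canonical_redirect_target($requested, $canonical) {
    if ($requested === $canonical) {
        return null;
    }
    return '/attraction/' . $canonical;
}

$target = canonical_redirect_target('St-Pauls-Cathedral-London', 'st-pauls-cathedral-london');

if ($target !== null) {
    // On a real page this has to run before any output is sent:
    // header('Location: ' . $target, true, 301);
    // exit;
    echo '301 -> ', $target, "\n";
}
```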

This is the solution I’ve used to remove ‘the’, ‘a’, and ‘an’ from the start of any URL:-

Replace the bottom line of the above GenerateUrl function code with the below (obviously this line still also converts everything to lower case letters and removes any opening or closing hyphens):-

return strtolower (preg_replace ('/^-/', '', preg_replace ('/-$/', '', preg_replace ('/\\b(^the|^a|^an)\\b/i', '', $s))));

Some useful advice on this subject from Stack Overflow.

And a good article on using \b for word boundaries in regular expressions from Regex Tutorial.

Don’t forget to check the ALLOWED/RESERVED/PROHIBITED lists of characters at http://www.ietf.org/rfc/rfc2396.txt before you get into all this changing of URI characters.
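As a quick sanity check along those lines, a generated URL can be tested against a deliberately narrow whitelist. Lowercase letters, digits and hyphens all fall within the unreserved characters of RFC 2396, so anything that passes needs no escaping. The function name is_safe_slug is a hypothetical one for this sketch.

```php
<?php
// Hypothetical check: accept only lowercase letters and digits, in groups
// separated by single hyphens (no leading, trailing or doubled hyphens).
function is_safe_slug($slug) {
    return preg_match('/^[a-z0-9]+(-[a-z0-9]+)*$/', $slug) === 1;
}

var_dump(is_safe_slug('st-pauls-cathedral-london')); // bool(true)
var_dump(is_safe_slug("st-paul's cathedral"));       // bool(false)
```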

I’ve got a client set-up to do exactly this BUT have prohibited any offending characters from his article titles. PM me if you would like to see the code.

Regards

DK