Modify rewrite rule that strips .html from urls to add a trailing slash

Hi All,

I got this little gem from here a while ago to strip .html from the end of and inside all my urls:

RewriteRule ^([^.]+).html([^.]*) $1$2 [R=301,L]

Which turns these:

http://www.mysite.com/a-example-of-a-page-or-past.html
http://www.mysite.com/a-page-with.html/pagination/2

Into these:

http://www.mysite.com/a-example-of-a-page-or-past
http://www.mysite.com/a-page-with/pagination/2

But I would like to have a trailing slash in my url structure. I use WordPress, and when trailing slashes are enabled WP by default redirects the non-slash version to the slash version. So that means I would get 2 redirects using the above rewrite rule (please correct me if I’m wrong): .html > non-slash > slash

I would like to cut that down by adding a trailing slash to the rule, and I’ve done so by just adding it after the $1$2 like this:

RewriteRule ^([^.]+).html([^.]*) $1$2/ [R=301,L]

Is that the correct way? It seems to work fine, but I just want to double check first.
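
A sketch of a slightly stricter variant, for comparison only. It assumes the rule lives in the site's .htaccess and that RewriteBase is already set there, so the relative target resolves correctly. Two things differ from the rule above: the dot in .html is escaped so it only matches a literal dot, and an optional /? absorbs a slash the old url may already end with, so the target does not end in //.

```apache
RewriteEngine On

# $1 = everything before ".html" (no dots allowed, as in the original rule)
# $2 = zero or more "/segment" parts after ".html", without a final slash
# "/?" swallows an existing trailing slash before the single "/" is appended
RewriteRule ^([^.]+)\.html((?:/[^/]+)*)/?$ $1$2/ [R=301,L]
```

With this pattern, a-page-with.html/pagination/2 and a-page-with.html/pagination/2/ should both end up at a-page-with/pagination/2/, whereas simply appending / to the original rule would turn the second one into a url ending in //.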

And any other critique, input and/or comments regarding this, the rule and the method are more than welcome!

Thanks and all the best!

/A


kk,

IMHO, telling the server that the files it is supposed to serve should be renamed as if they were directories (then fetched as directories) is simply crazy. How do you expect Apache to know which file to serve with /a-page-with/pagination/2 as the URI? What type of file is “2”?

Okay, okay, that’s the old “you’ve got the wrong (re)direction” or an abuse of the Options MultiViews feature.

Please have a read of my mod_rewrite tutorial (http://dk.co.nz/seo) which should help you with an explanation of “direction” as well as providing example code. Since it seems you want to redirect to a request format which Apache cannot serve, there is the “Loopy Redirection” case, too.

Regards,

DK


Hey DK,

You’re actually the one that gave me the mentioned re-write rule in the first place, in this thread.

You had a bit of a go at me over there too, for different reasons though… :smirk: …but never questioned how Apache would know what file type “2” was, probably because that thread has a lot more background info.

But when I asked back then, I did so without the wish for the trailing slash.

So my question now is, will this:

RewriteRule ^([^.]+).html([^.]*) $1$2/ [R=301,L]

(with the / after the $1$2 as my only addition from the original)

permanently redirect these:

http://mysite.com/a-example-of-a-page-or-past.html
http://mysite.com/a-page-with.html/pagination/2

to these:

http://mysite.com/a-example-of-a-page-or-past/
http://mysite.com/a-page-with/pagination/2/

It seems to work fine on my test server, but I still wanna double check with people that know, as this modification is mostly a guess from my end. And who better to ask than you, as you are the one that made it for me in the first place.

And thanks again for your reply and time!

Best regards,
A


Why not adopt the PHP framework approach that redirects all URLs to a single index.php?

Edit
All URLs are validated and routed according to the url, with or without an .html extension, and can be rendered by looking up a numeric, indexed table Id column.

Each Url can render the document with a relevant http response code of 200 or 301


Thanks for your input, not sure I’m following you though…

I have 700+ indexed urls, all ending with .html. I’m changing the url structure of my site now, and this will remove the .html from all urls. So I need a 301 that redirects everything ending with .html so I can keep my indexed urls, ranking and PR.


I am on a tablet at the moment and it is not easy to give lengthy replies.

The .htaccess can redirect all urls to an index.php which checks the input, removes any .html extensions, either includes the relevant web-page or renders output from your database.

The index.php will also insert into the output the correct Canonical link.

Edit
Try this url, change case, add a slash, add any extension such as .htm, .html, .php, .fred

View the source and check the canonical reference.


kk,

Gee, that was a really old thread! In it, you were asking merely how to remove .html from the {REQUEST_URI} string (NOT a good thing to do, IMHO, as you’re forcing MultiViews (use of a script file in the middle of the requested path) which I loathe (a personal problem of mine)).

Additionally, I’m sure you’ve seen me pan the addition of a superfluous trailing slash (except to identify the request as a DIRECTORY request, i.e., serve the DirectoryIndex from the requested directory).

Again, though, as long as you have other rules in your .htaccess to enable Apache to serve a file (which can handle your path’s additional information), then my personal problems are of no concern for your question but should serve as a warning to other members…

As for your question, yes. In fact, I’m not sure why I recommended ([^.]) where (/.+)$ is what I’d say might work better (after .html, you have /{anything, not just non-dot characters} and you need to capture up to the end anchor).

John’s suggestion is to use a file handler to handle all your requests. WordPress uses index.php to do EVERYTHING as it can examine the request and use modules to generate the appropriate response. If your “a-page-with.html” handles all your requests, that is what John was suggesting. That would work fine for you so long as you only have one .html file … but I fear that’s not the case.

Finally, my apologies if you felt picked upon back in '12. I do get very pedantic and have tried to give responses for ALL members, not just the OP so I’m often on my soapbox about what I perceive to be the best things to do (or not to do) AND provide justification for my biases. There is never anything personal (except when giving kudos where they’re deserved) in my posts.

Regards,

DK


Yeah, I know. I decided to stick with the .html ending after all with the last re-design as I kept the same url structure. But now it is time for another re-design, and I need/want to change the structure this time and would like for it to go.

“All” I wanna do is make sure to redirect, permanently, incoming links and indexed urls that have an ending or embedded .html to the same url but without it.

So then in my case the trailing slash is superfluous? As these are urls to WordPress pages and posts, not directories…or am I thinking wrong here?

Did you mean ([^.]*) as there is no ([^.]) ? If so I would end up with this:

RewriteRule ^([^.]+)\.html(/.+)$ $1$2 [R=301,L]

And this if I want a trailing slash:

RewriteRule ^([^.]+)\.html(/.+)$ $1$2/ [R=301,L]

Both seem to work fine on my test server, is this correct?

No need for apologies, no offence taken, neither then nor now. Just happy you’re taking the time!

Best regards,
A


kk/A,

Aha! A specification.

Don’t worry about my “personal problem” with believing that webmasters should know (and honor) the difference between file and directory requests. If your script can handle it, then there is no problem (other than mine).

I specifically chose (/.+)$ because it appeared to me that .html (in your format) must be followed by a slash then one or more characters (I’d guess [a-z/]+ but took the lazy route and captured EVERYTHING up to the end anchor).

I did not mean not dot because there was no reason for me to omit the dot character. As for the metacharacter for the atom, the * (in your question) was for zero or more characters (I believe that you need at least one after the /), my + was for one or more characters after the / and your ? was for zero or one “not dot” which would stop any further characters. Therefore, I would simplify your code to:

RewriteRule ^(.+)\.html(/.+)$ $1$2 (R=301,L]

Use the trailing / if you HAVE to but remember that, without the 301 redirection, your trailing / would be altering the directory level for relative links within your script (that is the reason that some members’ optional trailing slash is self-defeating).

Thanks for easing my mind (about the smirk). When I am tired (it’s almost 20 to 1AM), I know that I get terse as well as pedantic (trying to be correct [best practice?] for ALL members) which may leave someone feeling that (s)he has been subject to a personal attack … which is NOT the case!

Regards,

DK


DK!

Back on this after a hectic week, and thanks for the explanation, much appreciated.

But it seems that your last RewriteRule is causing my test server to give a 500 server error, any idea why?

And yet again, thanks for taking the time.

Best regards,
A

Sorry, that was me being a little fast with the copy’n pasting. Found it: a ( instead of a [

But this rule seems to deal only with the embedded .html, leaving the .html at the end?

kungknas,

I’m glad that you found the error yourself. 500 errors (especially after a change of the .htaccess file) are indicative of a syntax error - which you found and corrected.

As for other problems, they’re almost always caused by the order of your mod_rewrite blocks. If you’ll post your .htaccess code, I’m sure I’ll be able to spot the ordering problem quickly … if you can’t find it first! :smirk:

Regards,

DK

Not much in my .htaccess at the moment, everything besides your rule is what wordpress puts there:

RewriteEngine On
RewriteBase /site.com/

RewriteRule ^(.+)\.html(/.+)$ $1$2 [R=301,L]   

RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /site.com/index.php [L]

This only strips the embedded .html, but if I use your original one instead:

 RewriteRule ^([^.]+).html([^.]*) $1$2 [R=301,L]

It strips both embedded and ending .html, so I’m not sure it is the order?

Best regards,
A

kungknas,

Oh, my!

  1. I am not a fan of using RewriteBase as it will often change the directory inadvertently. If you put your .htaccess in the domain’s DocumentRoot, then all the code should be relative to that.

  2. The first RewriteRule does nothing but strip the .html from the URI.

  3. The second RewriteRule does nothing and can be deleted (if it exists as a file, it will not be impacted by the third RewriteRule (because of RewriteCond #1)).

  4. The third RewriteRule redirects all 404 requests to /site.com/index.php. This defeats the first RewriteRule because the extensionless URIs will NOT be a file and will NOT be a directory.

In short, the first and third RewriteRules are incompatible.

Okay, okay, if you do have .html files, you can reverse the order of the first and third (don’t forget to delete the second) BUT the result will still be an extensionless filename which WILL be redirected to index.php. (ARGH, please get rid of the site.com/ - it should be registered as a Virtual Host’s DocumentRoot so it’s not needed … at all).

If your intention was to strip the .html and have index.php serve the {whatever}.html, then your current code is fine (but will conceal the fact that index.php is the request handler for your .html files).

Back to square one, what was your original intent?

Regards,

DK

DK,

I was expecting a bit more than that! :smile:

Both of these are added by WordPress. The second I can remove without problem, but if I remove the third the site stops working. Below is the default block WordPress adds to your .htaccess if one chooses to have permalinks activated and something other than the default site.com/?p=123 shown for posts and pages.

Looks like this in a whole:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>    

Thing is I don’t have any .html files, never had, just a bunch of indexed urls with ranking and PR that end with .html. I had ages ago, then I moved to WordPress, and told WordPress to add .html to the end of the urls so I could have my pages and posts on the same urls as when having a static site.

Yeah, I guess I could add a line in the hosts file and a new virtual host for each new site, I’m just too lazy for that and happily have them as localhost/site.com, localhost/site2.com etc., as the only thing needed then is to create a new dir in XAMPP’s htdocs folder, and then the site.com/ is needed when working locally. It’s not there on the live server.

My original intent: to have all links and indexed urls that have .html embedded in them or end with .html permanently redirected to the same url but without .html, so I retain link juice from backlinks and keep PR and my rankings when I remove the .html from the WordPress permalink structure.

So these:

http://mysite.com/a-example-of-a-page-or-past.html
http://mysite.com/a-page-with.html/pagination/2

Permanently redirected to these:

http://mysite.com/a-example-of-a-page-or-past/
http://mysite.com/a-page-with/pagination/2/

And with a rule that is compatible with the rewrite rules in the default WordPress .htaccess (if permalinks are activated).

But for that the original you gave me seems to work just fine:
RewriteRule ^([^.]+).html([^.]*) $1$2 [R=301,L]

And then I started this thread to ask if my modification to add a trailing slash was correct…aaaand now we are here.

Best regards,
A


kk/A,

What? You were hoping for a bit of rage? :grimacing:

Well, you’ve opened another can of worms admitting to using WP. For years, I’ve been railing on about how bad their code was (and explaining why every time):

  1. NEVER waste production machine cycles by making a test repeatedly (for every request and within every request). That means that … oh, well, I’d created a rant for that years ago:

[rant #4]The definition of an idiot is someone who repeatedly does the same thing expecting a different result. Asking Apache to confirm the existence of ANY module with an … wrapper is the same thing in the webmaster world. DON’T BE AN IDIOT! If you don’t know whether a module is enabled, run the test ONCE then REMOVE the wrapper as it is EXTREMELY wasteful of Apache’s resources (and should NEVER be allowed on a shared server).[/rant #4]

Note: WP does that to prevent pseudo webmasters from “turning off” their website if mod_rewrite is not available, i.e., to make their code IDIOT-proof.

I hope that satisfies your first comment, too.

  2. RewriteBase is still superfluous …

  3. … as is the passthrough for index.php.

As for your “lazy” comment, may I suggest that you’d be far ahead in your localhost testing if you did use virtual hosts. IMHO, “lazy” here deserves the same type of rant as above … but that’s YOUR choice, of course. On second thought, being “lazy” may force you to reinstate the RewriteBase with site.com (but still not necessary in the index.php redirection).

I stand by my last post’s gratuitous statement that your order (get rid of the .html then redirect to index.php) WILL work for you and the redirections (with a 301 status) should retain your page ranking.

I still loathe the superfluous trailing slash (as previously explained) but, once again, that’s YOUR choice (and probably doesn’t matter to WP).

Regards,

DK

DK,

Reply much appreciated as usual. Since your other rant a few years back I always remove the test for mod_rewrite.c, and from now on I’ll also remove the passthrough for index.php. But if I remove RewriteBase I get redirected weirdly, although I believe this is only locally and because of my setup of the virtual host.

And of course I don’t wanna be an idiot, so I’ve set up a virtual host and this is what I have in my .htaccess now:

RewriteEngine On
RewriteBase /

RewriteRule ^(.+)\.html(/.+)$ $1$2 [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

But the same thing still happens, which is nothing for an ending .html; the above only removes the embedded .html

This:

site.com/page.html/page

becomes this:

site.com/page/page

but this:

site.com/page/page.html

…nothing happens other than a 404?

And still, if I replace the 1st rule with the one you gave me before, both embedded and ending .html get rewritten correctly, so I still believe there is something not correct with this simplified version.
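
A likely explanation, sketched on the assumption that the rest of the file stays as shown above: (/.+)$ demands a slash plus at least one more character after .html, so a url that simply ends in .html never matches the redirect and falls through to index.php. Making that group optional would let one rule cover both cases.

```apache
# "(/.+)?" is optional: it captures "/page" in page.html/page and is simply
# empty for page/page.html, so $2 expands to nothing in that case.
RewriteRule ^(.+)\.html(/.+)?$ $1$2 [R=301,L]
```

Under this pattern both site.com/page.html/page and site.com/page/page.html should redirect to site.com/page/page.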

And regarding RewriteBase, if I remove it, any redirects leads to this:

site.com/D:/xampp/htdocs/site.com/page/page

I assume I have not set up my virtual hosts correctly, right?
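
One common cause of a filesystem path showing up in a redirect, offered as a guess about this setup rather than a diagnosis: in .htaccess, a relative target combined with [R] gets the directory's on-disk path prepended unless RewriteBase supplies the URL prefix. A sketch that sidesteps this by writing the target as a root-relative URL path, assuming the site is served from the virtual host's DocumentRoot:

```apache
# The pattern is unchanged; only the target differs. The leading "/" marks it
# as a URL path from the site root, so no filesystem prefix is added even
# when RewriteBase is absent.
RewriteRule ^(.+)\.html(/.+)$ /$1$2 [R=301,L]
```

This form would not work unchanged for a site living in a subfolder such as localhost/site.com/, where the prefix would have to be written into the target or supplied through RewriteBase.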

Best regards and thanks again,
A

I had a problem with paths and using plus which was solved by using star

# Start NEW BIT THAT SEARCHES For THE .html extension
    RewriteCond %{DOCUMENT_ROOT}/_Cache/cached/%{REQUEST_URI}\.html -f
    RewriteRule ^(.*)$ /_Cache/cached/$1.html [L]

  # Problem
    # RewriteRule ^(.+)$ /_Cache/cached/$1 [L]
    # /+ id SINGLE FILE ONLY - IGNORES DIRECTORIES/PATHS

kk,

If the VirtualHost is not set up correctly, you SHOULD have a problem. Same with WP that requires you to let it know where it is (in your file structure). THEN, if your .htaccess code is in your VH’s DocumentRoot, RewriteBase is superfluous.

If you show me your code (httpd-vhosts.conf AND hosts files), I can let you know what the problem is with those. It’s been too long for me with WP so I’ll let you sort it out where you MUST tell WP where it is (no, it doesn’t have GPS :laughing: ).

When I’ve seen nonsense like your site.com/D:/xampp/htdocs/site.com/page/page, it’s always been immediately resolved with an Apache reboot (to clear accumulated errors, presumably).

@J_B,

Please use care with the metacharacters as the + means one or more and * means zero or more. BOTH metacharacters are “greedy” and that can quickly cause problems.

In kk’s case above, the intention was to match a trailing slash followed by everything which remained in the %{REQUEST_URI} string. In other words, kk did have a path divider (/) and more {crap} in the URI.

In your code, ^(.*)$ IS the same string as %{REQUEST_URI}. Of course, you already knew that because you used the %{REQUEST_URI} string in your RewriteCond statement.

In your gray code, ^(.+)$ will not match the DirectoryIndex of the DocumentRoot unless it’s specified as a file. Requesting http://www.example.com (with or without the trailing slash) is NOT the same thing and that has to be where you had your problem.
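
To make the + versus * difference concrete, a small sketch, assuming the rules sit in a .htaccess in the DocumentRoot. The variable names MATCHED_STAR and MATCHED_PLUS are made up for illustration. In that context the leading slash is stripped before matching, so a request for the bare domain is tested as an empty string.

```apache
RewriteEngine On

# ".*" means zero or more characters, so it also matches the empty string
# that a request for the site root produces.
RewriteRule ^(.*)$ - [E=MATCHED_STAR:1]

# ".+" means one or more characters, so the root request is skipped and the
# rule only fires when some path is present.
RewriteRule ^(.+)$ - [E=MATCHED_PLUS:1]
```

Neither rule rewrites anything (the "-" target leaves the url alone); they only set an environment variable so the difference can be observed.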

Same day, different problem. My problem with (.*) is that it is lazy and will capture NOTHING or EVERYTHING. Too many noobies do not understand how this causes problems for them so telling them it matches everything seems to solve their problem (not matching what they want) but causes so many more problems … so many that my first rant was for this!

[rant #1]The use of “lazy regex,” specifically the :kaioken: EVERYTHING :kaioken: atom, (.*), and its close relatives, is the NUMBER ONE coding error of newbies BECAUSE it is “greedy.” Unless you provide an “exit” from your redirection, you will ALWAYS end up in a loop![/rant #1]

Okay, the :kaioken: were the old :fire: symbols but you get the idea.

Regards,

DK

DK,

Have already rebooted Apache, a few times, still the same. The .htaccess is in the domain root dir, in XAMPP’s htdocs dir.

Sure thing, below is my httpd-vhosts.conf:

<VirtualHost *:80>
    DocumentRoot "D:/xampp/htdocs/site.dev"
    ServerName site.dev
    ServerAlias www.site.dev
    ErrorLog "logs/site.dev"
</VirtualHost>

<VirtualHost *:80>
    DocumentRoot "D:/xampp/htdocs"
    ServerName localhost
</VirtualHost>

and the hosts file:

    127.0.0.1       localhost
    127.0.0.1       www.site.dev

Thanks and best regards,
A