Modify rewrite rule that strips .html from URLs to add a trailing slash

Yeah, I know. I decided to stick with the .html ending after all with the last re-design as I kept the same url structure. But now it is time for another re-design, and I need/want to change the structure this time and would like for it to go.

“All” I wanna do is make sure to permanently redirect incoming links and indexed URLs that have an ending or embedded .html to the same URL without it.

So then in my case the trailing slash is superfluous? As these are URLs to WordPress pages and posts, not directories… or am I thinking wrong here?

Did you mean ([^.]*) as there is no ([^.]) ? If so I would end up with this:

RewriteRule ^([^.]+)\.html(/.+)$ $1$2 [R=301,L]

And this if I want a trailing slash:

RewriteRule ^([^.]+)\.html(/.+)$ $1$2/ [R=301,L]

Both seem to work fine on my test server. Is this correct?

No need for apologies, no offence taken, neither then nor now. Just happy you’re taking the time!

Best regards,
A

kk/A,

Aha! A specification.

Don’t worry about my “personal problem” with believing that webmasters should know (and honor) the difference between file and directory requests. If your script can handle it, then there is no problem (other than mine).

I specifically chose (/.+)$ because it appeared to me that .html (in your format) must be followed by a slash and then one or more characters. I’d guess [a-z/]+, but I took the lazy route and captured EVERYTHING up to the end anchor.

I did not mean “not dot” because there was no reason for me to exclude the dot character. As for the metacharacter following the atom: the * (in your question) means zero or more characters (I believe that you need at least one after the /), my + means one or more characters after the /, and your ? means zero or one “not dot”, which would stop any further characters from matching. Therefore, I would simplify your code to:

RewriteRule ^(.+)\.html(/.+)$ $1$2 (R=301,L]
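
For readers following along, the quantifier distinction described above can be tried out in a minimal sketch. It uses Python's re module as a stand-in for Apache's regex engine, which is an assumption of this illustration; the helper name matches is hypothetical and not part of the thread.

```python
import re

def matches(pattern, text):
    """True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# The "not dot" character class under each of the three quantifiers.
for text in ("", "a", "abc"):
    print(repr(text),
          matches(r"[^.]*", text),   # * : zero or more characters
          matches(r"[^.]+", text),   # + : one or more characters
          matches(r"[^.]?", text))   # ? : zero or one character
```

Only the + variant rejects the empty string, and only the ? variant rejects anything longer than one character.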

Use the trailing / if you HAVE to, but remember that, without the 301 redirection, your trailing / would alter the directory level for relative links within your script (that is the reason that some members’ optional trailing slash is self-defeating).

Thanks for easing my mind (about the smirk). When I am tired (it’s almost 20 to 1AM), I know that I get terse as well as pedantic (trying to be correct [best practice?] for ALL members), which may leave someone feeling that (s)he has been subject to a personal attack … which is NOT the case!

Regards,

DK

DK!

Back on this after a hectic week, and thanks for the explanation, much appreciated.

But it seems that your last RewriteRule is causing my test server to give a 500 server error. Any idea why?

And yet again, thanks for taking the time.

Best regards,
A

Sorry, that was me being a little fast with the copy’n’pasting. Found it: a ( instead of a [

But this rule seems to deal only with the embedded .html, leaving the .html at the end?

kungknas,

I’m glad that you found the error yourself. 500 errors (especially after a change to the .htaccess file) are indicative of a syntax error, which you found and corrected.

As for other problems, they’re almost always caused by the order of your mod_rewrite blocks. If you’ll post your .htaccess code, I’m sure I’ll be able to spot the ordering problem quickly … if you can’t find it first! :smirk:

Regards,

DK

Not much in my .htaccess at the moment; everything besides your rule is what WordPress puts there:

RewriteEngine On
RewriteBase /site.com/

RewriteRule ^(.+)\.html(/.+)$ $1$2 [R=301,L]   

RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /site.com/index.php [L]

This only strips the embedded .html, but if I use your original one instead:

 RewriteRule ^([^.]+).html([^.]*) $1$2 [R=301,L]

It strips both the embedded and the ending .html, so I am not sure it is the order?
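
The difference between the two rules can be reproduced outside Apache. Below is a minimal sketch, assuming Python's re module behaves like Apache's regex engine for these patterns; the names SIMPLIFIED, ORIGINAL and apply_rule are hypothetical, and the paths have no leading slash, as in a per-directory .htaccess.

```python
import re

# The two patterns from the thread, copied as written.
SIMPLIFIED = re.compile(r"^(.+)\.html(/.+)$")     # requires "/something" after .html
ORIGINAL = re.compile(r"^([^.]+).html([^.]*)")    # second group may be empty

def apply_rule(rule, path):
    """Return $1$2 for `path`, or None when the pattern does not match."""
    m = rule.search(path)
    return m.group(1) + m.group(2) if m else None

for path in ("page.html/page", "page/page.html"):
    print(path, apply_rule(SIMPLIFIED, path), apply_rule(ORIGINAL, path))

# Side note: the dot before "html" in ORIGINAL is unescaped, so it matches
# any character, not only a literal dot.
print(apply_rule(ORIGINAL, "pageXhtml/x"))
```

The simplified pattern fails on the ending .html because its second group insists on a slash plus at least one character, which points at the pattern rather than at rule order.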

Best regards,
A

kungknas,

Oh, my!

  1. I am not a fan of using RewriteBase as it will often change the directory inadvertently. If you put your .htaccess in the domain’s DocumentRoot, then all the code should be relative to that.

  2. The first RewriteRule does nothing but strip the .html from the URI.

  3. The second RewriteRule does nothing and can be deleted (if index.php exists as a file, it will not be impacted by the third RewriteRule because of RewriteCond #1).

  4. The third RewriteRule redirects all 404 requests to /site.com/index.php. This defeats the first RewriteRule because the extensionless URIs will NOT be a file and will NOT be a directory.

In short, the first and third RewriteRules are incompatible.

Okay, okay, if you do have .html files, you can reverse the order of the first and third (don’t forget to delete the second) BUT the result will still be an extensionless filename which WILL be redirected to index.php. (ARGH, please get rid of the site.com/. It should be registered as a Virtual Host’s DocumentRoot so it’s not needed … at all.)

If your intention was to strip the .html and have index.php serve the {whatever}.html, then your current code is fine (but will conceal the fact that index.php is the request handler for your .html files).

Back to square one, what was your original intent?

Regards,

DK

DK,

I was expecting a bit more than that! :smile:

Both of these are added by WordPress. The second I can remove without a problem, but if I remove the third the site stops working. Below is the default block WordPress adds to your .htaccess if one chooses to have permalinks activated and something shown besides the default site.com/?p=123 for posts and pages.

In full it looks like this:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>    

Thing is, I don’t have any .html files, just a bunch of indexed URLs with ranking and PR that end with .html. I had such files ages ago, then I moved to WordPress and told it to add .html to the end of the URLs so I could keep my pages and posts on the same URLs as when I had a static site.

Yeah, I guess I could add a line in the hosts file and a new virtual host for each new site. I’m just too lazy for that and happily have them as localhost/site.com, localhost/site2.com etc., as the only thing needed then is to create a new dir in XAMPP’s htdocs folder. That is why the site.com/ is needed when working locally. It is not there on the live server.

My intent: to have all links and indexed URLs that have .html embedded in them or end with .html permanently redirected to the same URL without .html, so that I retain link juice from backlinks and keep my PR and rankings when I remove the .html from the WordPress permalink structure.

So these:

http://mysite.com/a-example-of-a-page-or-past.html
http://mysite.com/a-page-with.html/pagination/2

Permanently redirected to these:

http://mysite.com/a-example-of-a-page-or-past/
http://mysite.com/a-page-with/pagination/2/
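
The mapping above can be written down as a small reference function to check any candidate rule against. This is only a sketch of the stated intent in Python, not a rewrite rule; the function name expected_target is hypothetical.

```python
def expected_target(path):
    """Drop a ".html" suffix from every path segment and end with a single slash."""
    segments = [s for s in path.split("/") if s]
    cleaned = [s[:-len(".html")] if s.endswith(".html") else s for s in segments]
    if not cleaned:
        return "/"  # the site root stays as it is
    return "/" + "/".join(cleaned) + "/"

print(expected_target("/a-example-of-a-page-or-past.html"))
print(expected_target("/a-page-with.html/pagination/2"))
```

Working on path segments rather than with one regex keeps the intent separate from any particular RewriteRule pattern.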

And with a rule that is compatible with the rewrite rules in the default WordPress .htaccess (if permalinks are activated).

But for that, the original you gave me seems to work just fine:
RewriteRule ^([^.]+).html([^.]*) $1$2 [R=301,L]

And then I started this thread to ask if my modification to add a trailing slash was correct… aaaand now we are here.

Best regards,
A

kk/A,

What? You were hoping for a bit of rage? :grimacing:

Well, you’ve opened another can of worms by admitting to using WP. For years, I’ve been railing on about how bad their code was (and explaining why every time):

  1. NEVER waste production machine cycles by making a test repeatedly (for every request and within every request). That means that … oh, well, I’d created a rant for that years ago:

[rant #4][indent]The definition of an idiot is someone who repeatedly does the same thing expecting a different result. Asking Apache to confirm the existence of ANY module with an … wrapper is the same thing in the webmaster world. DON’T BE AN IDIOT! If you don’t know whether a module is enabled, run the test ONCE then REMOVE the wrapper as it is EXTREMELY wasteful of Apache’s resources (and should NEVER be allowed on a shared server).[/indent][/rant 4]

Note: WP does that to prevent pseudo webmasters from “turning off” their website if mod_rewrite is not available, i.e., to make their code IDIOT-proof.

I hope that satisfies your first comment, too.

  2. RewriteBase is still superfluous …

  3. … as is the passthrough for index.php.

As for your “lazy” comment, may I suggest that you’d be far ahead in your localhost testing if you did use virtual hosts. IMHO, “lazy” here deserves the same type of rant as above … but that’s YOUR choice, of course. On second thought, being “lazy” may force you to reinstate the RewriteBase with site.com (but it is still not necessary in the index.php redirection).

I stand by my last post’s gratuitous statement that your order (get rid of the .html then redirect to index.php) WILL work for you and the redirections (with a 301 status) should retain your page ranking.

I still loathe the superfluous trailing slash (as previously explained) but, once again, that’s YOUR choice (and probably doesn’t matter to WP).

Regards,

DK

DK,

Reply much appreciated as usual. Since your other rant a few years back I always remove the test for mod_rewrite.c, and from now on I’ll also remove the passthrough for index.php. But if I remove RewriteBase I get redirected weirdly, although I believe this happens only locally and because of my virtual host setup.

And of course I don’t wanna be an idiot, so I’ve set up a virtual host and this is what I have in my .htaccess now:

RewriteEngine On
RewriteBase /

RewriteRule ^(.+)\.html(/.+)$ $1$2 [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

But the same thing still happens, which is nothing for an ending .html. The above only removes the embedded .html.

This:

site.com/page.html/page

becomes this:

site.com/page/page

but this:

site.com/page/page.html

… nothing happens other than a 404?

And still, if I replace the first rule with the one you gave me before, both the embedded and the ending .html get rewritten correctly, so I still believe something is not correct with this simplified version.

And regarding RewriteBase, if I remove it, any redirect leads to this:

site.com/D:/xampp/htdocs/site.com/page/page

I assume I have not set up my virtual hosts correctly, right?

Best regards and thanks again,
A

I had a problem with paths and using plus which was solved by using star

# Start NEW BIT THAT SEARCHES For THE .html extension
    RewriteCond %{DOCUMENT_ROOT}/_Cache/cached/%{REQUEST_URI}\.html -f
    RewriteRule ^(.*)$ /_Cache/cached/$1.html [L]

  # Problem
    # RewriteRule ^(.+)$ /_Cache/cached/$1 [L]
    # /+ id SINGLE FILE ONLY - IGNORES DIRECTORIES/PATHS

kk,

If the VirtualHost is not set up correctly, you SHOULD have a problem. Same with WP, which requires you to let it know where it is (in your file structure). THEN, if your .htaccess code is in your VH’s DocumentRoot, RewriteBase is superfluous.

If you show me your code (httpd-vhosts.conf AND hosts files), I can let you know what the problem is with those. It’s been too long for me with WP so I’ll let you sort out where you MUST tell WP where it is (no, it doesn’t have GPS :laughing: ).

When I’ve seen nonsense like your site.com/D:/xampp/htdocs/site.com/page/page, it’s always been immediately resolved with an Apache reboot (to clear accumulated errors, presumably).

@J_B,

Please use care with the metacharacters as the + means one or more and * means zero or more. BOTH metacharacters are “greedy” and that can quickly cause problems.
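
A minimal sketch of what “greedy” means in practice, again assuming Python's re module as a stand-in for Apache's regex engine. The sample path with two .html occurrences is hypothetical and chosen only for illustration.

```python
import re

path = "a.html/b.html"  # hypothetical path with two .html occurrences

# Greedy: (.+) grabs as much as it can, so the match ends at the LAST ".html".
greedy = re.match(r"^(.+)\.html", path)

# Lazy (+?): grabs as little as it can, so the match ends at the FIRST ".html".
lazy = re.match(r"^(.+?)\.html", path)

print(greedy.group(1))
print(lazy.group(1))
```

With the greedy form the first group still contains an embedded .html, which is the kind of surprise a too-generous atom can cause.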

In kk’s case above, the intention was to match a trailing slash followed by everything which remained in the %{REQUEST_URI} string. In other words, kk did have a path divider (/) and more {crap} in the URI.

In your code, ^(.*)$ IS the same string as %{REQUEST_URI}. Of course, you already knew that because you used the %{REQUEST_URI} string in your RewriteCond statement.

In your commented-out (gray) code, ^(.+)$ will not match the DirectoryIndex of the DocumentRoot unless it’s specified as a file. Requesting http://www.example.com (with or without the trailing slash) is NOT the same thing and that has to be where you had your problem.

Same day, different problem. My problem with (.*) is that it is lazy and will capture NOTHING or EVERYTHING. Too many noobies do not understand how this causes problems for them, so telling them it matches everything seems to solve their problem (not matching what they want) but causes so many more problems … so many that my first rant was for this!

[rant #1][indent]The use of “lazy regex,” specifically the :kaioken: EVERYTHING :kaioken: atom, (.*), and its close relatives, is the NUMBER ONE coding error of newbies BECAUSE it is “greedy.” Unless you provide an “exit” from your redirection, you will ALWAYS end up in a loop![/indent][/rant #1]

Okay, the :kaioken: were the old :fire: symbols but you get the idea.

Regards,

DK

DK,

I have already rebooted Apache a few times; still the same. The .htaccess is in the domain root dir, inside XAMPP’s htdocs dir.

Sure thing, below is my httpd-vhosts.conf:

<VirtualHost *:80>
    DocumentRoot "D:/xampp/htdocs/site.dev"
    ServerName site.dev
    ServerAlias www.site.dev
    ErrorLog "logs/site.dev"
</VirtualHost>

<VirtualHost *:80>
    DocumentRoot "D:/xampp/htdocs"
    ServerName localhost
</VirtualHost>

and the hosts files:

    127.0.0.1       localhost
    127.0.0.1       www.site.dev

Thanks and best regards,
A

kk,

Despite my aversion to using a TLD (even yours - easier for my testing to duplicate the domain name without WWW and without .com … like your .dev), I see nothing wrong with what you have.

Therefore, I still recommend that you look at where you’d told WP it was located (D:/xampp/htdocs/site.dev/ should be the DocumentRoot). I’ll bet it’s not it (yet).

We’re sneaking up on it!

Regards,

DK

DK,

Isn’t that why I need the RewriteBase in the .htaccess, to tell WP where it is located? Because this only happens when RewriteBase is removed and a redirect happens. Everything else with WP and the permalinks for posts/pages works just fine.

But still, RewriteBase or not, RewriteRule ^(.+).html(/.+)$ $1$2 [R=301,L] only removes the embedded .html and not the .html at the end. Under the exact same conditions, RewriteRule ^([^.]+).html([^.]*) $1$2 [R=301,L] rewrites both the embedded and the ending .html. Wouldn’t that imply that it is the RewriteRule rather than my virtual host config?

Best regards,
A

kk,

If memory serves (I may be losing it … :scream: ), you must configure WP to get it to work properly. The mod_rewrite, remember, is optional (with their wrappers) so WP relies on its configuration file. As I said earlier, though, it’s been quite a while since I’ve set up a WP website (for a client).

Okay, I thought you always had something (specifically a / followed by anything) after .html. If that’s not true, simply add a ? after the second atom, (/.+), to make it (/.+)?$ (I added the end anchor to help you find the atom).
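
The effect of making the second atom optional can be checked with a minimal sketch, assuming Python's re module behaves like Apache's regex engine for this pattern. The names FINAL and redirect_target are hypothetical; an unmatched optional group is treated as an empty string, which mirrors how the $2 backreference expands in the substitution.

```python
import re

# The simplified rule with the second group made optional.
FINAL = re.compile(r"^(.+)\.html(/.+)?$")

def redirect_target(path):
    """Return $1$2 for `path`, or None when the rule does not apply."""
    m = FINAL.match(path)
    if m is None:
        return None
    return m.group(1) + (m.group(2) or "")  # unmatched group -> empty string

for path in ("page.html/page", "page/page.html", "page/page"):
    print(path, redirect_target(path))
```

Paths without any .html are left alone, so the rule does not interfere with requests that are already in the new format.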

Regards,

DK

DK,

That would be adding the correct URLs for the domain root and the WP install dir, but that is done correctly and WP is working properly. It’s only when RewriteBase is removed and a rewrite occurs that this weird redirect happens.

Oh sorry, no, the embedded .html is always followed by a /, but then there is the ending .html and that is followed by nothing.

But yeah! That was it, now both the embedded and the ending .html get redirected properly! :smiley:

Best regards,
A

kk,

Hooray!

Well, if it works with the RewriteBase telling Apache to use the DocumentRoot (while in the DocumentRoot), who’s to argue? Sounds ridiculous but, “if it ain’t broke, don’t fix it.”

Regards,

DK

DK,

Once again, thanks for all the time, input and effort, and if I’m not back soon screaming in panic over all my lost rankings and PR… take care and until next time!

Best regards,
A
