
  1. #1
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)

    Htaccess Vocabulary

    SitePoint Members,
    What is the difference between
    RewriteCond %{SERVER_PORT} 443
    and
    RewriteCond %{SERVER_PORT} ^443$

    From what I can understand from
    http://httpd.apache.org/docs/2.2/rewrite/intro.html

    the second line means that when the server receives a request, the string must begin with 443 and end with 443.

    Is that right?

    Does the first line mean that the request string merely contains 443?

    Thanks,

    Chris

  2. #2
    Programming Team silver trophybronze trophy
    Mittineague's Avatar
    Join Date
    Jul 2005
    Location
    West Springfield, Massachusetts
    Posts
    17,255
    Mentioned
    196 Post(s)
    Tagged
    2 Thread(s)
    .htaccess uses Perl-flavor regex syntax, so yes, the ^ signifies "beginning" and the $ signifies "ending".

    I've seen it used with URLs before but never with SERVER_PORT, so I don't know whether it would be valid or even required for that line.
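    To illustrate the anchor semantics, here is a Python sketch (using the re module purely as an illustration; Apache's PCRE engine treats these simple patterns the same way):

    ```python
    import re

    # Unanchored: "443" matches anywhere in the tested value,
    # so ports like 8443 or 4431 would also satisfy the condition.
    assert re.search(r"443", "443")
    assert re.search(r"443", "8443")       # substring match
    assert re.search(r"443", "4431")       # substring match

    # Anchored: ^443$ must match from beginning to ending,
    # so only the exact value 443 passes.
    assert re.search(r"^443$", "443")
    assert not re.search(r"^443$", "8443")
    assert not re.search(r"^443$", "4431")
    ```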

  3. #3
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    The second line with ^ and $ I got from step 3 on this page
    http://www.seosandwitch.com/2012/08/...hat-to-do.html

    and also from the second section on this page
    http://www.seoworkers.com/seo-articl...and-https.html

    Do you mean you haven't seen that line at all or by itself?

    If that is sorted out, which .txt file is it applied to in the next line?
    RewriteRule ^robots.txt$ robots_ssl.txt

    I can't see what RewriteRule ^robots.txt$ robots_ssl.txt is saying.

    Thanks

  4. #4
    Programming Team silver trophybronze trophy
    Mittineague's Avatar
    Join Date
    Jul 2005
    Location
    West Springfield, Massachusetts
    Posts
    17,255
    Mentioned
    196 Post(s)
    Tagged
    2 Thread(s)
    Just because I've never seen it with SERVER_PORT doesn't mean it's incorrect; most likely it's fine.

    RewriteRule ^robots.txt$ robots_ssl.txt
    is saying
    (implied) if the previous condition(s) is/are met,
    rewrite the (URL) string "robots.txt" that begins with "r" and ends with "t" - that is, it won't match frobots.txt, robots.txte, obots.tx, etc. -
    to robots_ssl.txt

    Presumably Apache would then serve a different file for requests that met the condition(s).
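    A quick sketch of that matching behavior in Python (re is used purely as an illustration; note also that the unescaped dot in ^robots.txt$ matches any single character, which is why later examples in this thread write robots\.txt):

    ```python
    import re

    pattern = re.compile(r"^robots.txt$")   # unescaped . matches ANY character

    assert pattern.search("robots.txt")          # the intended match
    assert not pattern.search("frobots.txt")     # extra leading character
    assert not pattern.search("robots.txte")     # extra trailing character
    assert pattern.search("robotsXtxt")          # side effect of the unescaped dot

    escaped = re.compile(r"^robots\.txt$")       # escaping the dot fixes that
    assert not escaped.search("robotsXtxt")
    assert escaped.search("robots.txt")
    ```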

  5. #5
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots.txt$ robots_ssl.txt

    You wrote what it's saying, but it doesn't look like you finished your thought.
    It looks like it's saying: if a request string that has the sequence 443 (a.k.a. https) is received, then instead of sending the requestor to robots.txt, send the requestor to robots_ssl.txt. Is that right?

  6. #6
    Programming Team silver trophybronze trophy
    Mittineague's Avatar
    Join Date
    Jul 2005
    Location
    West Springfield, Massachusetts
    Posts
    17,255
    Mentioned
    196 Post(s)
    Tagged
    2 Thread(s)
    Yes, except with the ^$ it would be "is", not "has" - assuming the same regex syntax also works with SERVER_PORT.

  7. #7
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    If a request string is the sequence 443 (a.k.a. https), then instead of sending the requestor to robots.txt, send the requestor to robots_ssl.txt. Is that right?

    If that's correct, then instead of sending the requestor to robots.txt, can you somehow incur a 404 (something tells me changing robots_ssl.txt to 404.html won't work), in hopes of https requests getting a 404 response?

  8. #8
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    I found this page
    http://stackoverflow.com/questions/2...04-in-htaccess
    Would you code it like this?

    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots.txt$ [R=404,L]

  9. #9
    Certified Ethical Hacker silver trophybronze trophy dklynn's Avatar
    Join Date
    Feb 2002
    Location
    Auckland
    Posts
    14,672
    Mentioned
    19 Post(s)
    Tagged
    3 Thread(s)
    Alan,

    %{SERVER_PORT} is (IMHO) the preferred way to determine whether the connection is simply http (80) or https (443). The other way is to test the %{HTTPS} variable for "on" or NULL, but that can give strange results on some servers (meaning you need to be careful with the logic of matching "on" (without the quotes, of course) vs matching a NULL value OR a non-existent variable).

    C77,

    Mittineague is quite correct about the use of ^ and $, and I'd advise you to use them for your 443 check.

    As for your last post, it will generate a 500 error because the syntax does not provide the redirection.

    Regards,

    DK
    David K. Lynn - Data Koncepts is a long-time WebHostingBuzz (US/UK)
    Client and (unpaid) WHB Ambassador
    mod_rewrite Tutorial Article (setup, config, test & write
    mod_rewrite regex w/sample code) and Code Generator

  10. #10
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Thanks for the info. I certainly don't want a 500 error. I just can't figure out this last part that would respond to a 443 request with a 404 error instead of a robots file. What code do you use to incur a 404 error? A 404.html page is just a page, not an actual 404 error. Can you incur a 404 error in Apache? Maybe that's it - send the requestor to a non-existent page. How about nohttpsatall.html?

    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ nohttpsatall\.html

  11. #11
    Certified Ethical Hacker silver trophybronze trophy dklynn's Avatar
    Join Date
    Feb 2002
    Location
    Auckland
    Posts
    14,672
    Mentioned
    19 Post(s)
    Tagged
    3 Thread(s)
    C77,

    Don't you have a 404 script you use?

    Code:
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ 404.php [R=404,L]
    Regards,

    DK

  12. #12
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    The problem I'm having is google thinking I have https duplicates of all my http pages. A lot of programmers are saying

    step 3 on this page
    http://www.seosandwitch.com/2012/08/...hat-to-do.html

    and the second section on this page
    http://www.seoworkers.com/seo-articl...and-https.html

    is the solution. What it does, for my site, is turn the 70 https pages that don't exist on my site but exist in google's head into 18 https search results complaining about robots.txt use, the most ridiculous search result being this

    Home
    https://xyz.com/
    A description for this result is not available because of this site's robots.txt – learn more.

    where google still shows I have https duplicates and has changed the title of my home page to "Home", and for each of the https duplicates it has the same description about robots.txt use.

    I'm wondering if this

    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ nohttpsatall\.html

    will solve the problem since I have no page nohttpsatall.html and so would seem to lead all https requests to a 404 error.

    Thanks

  13. #13
    Certified Ethical Hacker silver trophybronze trophy dklynn's Avatar
    Join Date
    Feb 2002
    Location
    Auckland
    Posts
    14,672
    Mentioned
    19 Post(s)
    Tagged
    3 Thread(s)
    C77,

    The tutorial linked in my signature has example code for both secure and non-secure redirections. Personally, I'd use both (remember to ONLY redirect your scripts, not your support pages) but double check by using a PHP script in the header of the secure pages to redirect (via a header() statement) if not requested or redirected by mod_rewrite.

    As for your secure pages, I'd list them (using ^(secure1|secure2|secure3|...)\.php$ ) to ensure that you're not redirecting non-secure pages to %{HTTPS}, too.
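    That alternation pattern can be sketched like so (a Python illustration only; secure1/secure2/secure3 are the placeholder script names from the line above, not real files):

    ```python
    import re

    # Placeholder names standing in for an actual list of secure scripts.
    secure = re.compile(r"^(secure1|secure2|secure3)\.php$")

    assert secure.search("secure2.php")       # a listed secure script: redirect it
    assert not secure.search("contact.php")   # unlisted page: leave it alone
    assert not secure.search("secure2.phpx")  # the $ anchor prevents partial matches
    ```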

    Regards,

    DK

  14. #14
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Maybe I'm wrong, but you can't tell the server something is a 404 error, the server tells you something is a 404 error.

    If you have the Gettysburg Address on your 404.html error page and you redirect three different pages to the 404.html page, is google going to "think" those three pages don't exist any longer, or is it going to think those three pages lead to the same page that contains the Gettysburg Address? I think it's going to add the Gettysburg Address to its survey of your site - your site is about X and the Gettysburg Address.

    Once I manage to get the 404 error triggered, then it can be redirected to 404.html. If you skip triggering the 404 error and go directly to the 404.html page, google is not going to know the page no longer exists. Don't forget, 404.html (or .php) is not a special coding function. The file name of my 404 page on one of my sites is znewa.html and it works fine. It's necessary to trigger the 404 error in order to tell google the page no longer exists; sending google directly to 404.html does not tell google the page doesn't exist - what it does is create multiple addresses to your 404.html page.

    So it looks like

    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ doesnotexist\.html

    is the way, using the non-existent file/page doesnotexist.html to trigger the 404 error. Isn't this what normally happens? A page is removed, each visitor to the removed page triggers the 404 error, google is a visitor, google triggers the 404 error, google sees the 404 error, it removes the content and removes what the address of the content was from its servers' memory.

  15. #15
    SitePoint Wizard bronze trophy Jeff Mott's Avatar
    Join Date
    Jul 2009
    Posts
    1,314
    Mentioned
    19 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by Chris77 View Post
    Maybe I'm wrong, but you can't tell the server something is a 404 error, the server tells you something is a 404 error.
    You can totally tell the server something is a 404 error... Or, more accurately, you can tell the server to send a 404 response code. You already had that above in post #8 (R=404) and dklynn showed it again in post #11.
    "First make it work. Then make it better."

  16. #16
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    I was hoping I was on the right track with [R=404,L]

    but DK's post after said
    "As for your last post, it will generate a 500 error ........"

    So is
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots.txt$ [R=404,L]

    at least syntactically correct?

    And barring syntax errors does it say, ' If a request string is the sequence 443 (a.k.a. https) then instead of sending the requestor to robots.txt trigger a 404 error'?

    In 11 Dk wrote
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ 404.php [R=404,L]

    but what does it say with 404.php placed before [R=404,L]?

    Does it say if there's no 404.php trigger a 404 error?

    If there's no 404.php then there's no need for the [R=404,L] code right after, because if there's no 404.php then 404.php is no different from doesnotexist\.html - it will automatically trigger a 404 error without [R=404,L].

    So if that's true then

    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots.txt$ [R=404,L]

    would seem to be right - trigger a 404 error for all https requests.

    Thanks

  17. #17
    SitePoint Wizard bronze trophy Jeff Mott's Avatar
    Join Date
    Jul 2009
    Posts
    1,314
    Mentioned
    19 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by Chris77 View Post
    I was hoping I was on the right track with [R=404,L]

    but DKs post after said
    "As for your last post, it will generate a 500 error ........"
    DK was right about that, because rewrite rules require a substitution.

    Quote Originally Posted by Chris77 View Post
    So is
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots.txt$ [R=404,L]

    at least syntactically correct?
    Not yet. You still have to rewrite to somewhere.

    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots.txt$ 404.php [R=404,L]


    Quote Originally Posted by Chris77 View Post
    And barring syntax errors does it say, ' If a request string is the sequence 443 (a.k.a. https) then instead of sending the requestor to robots.txt trigger a 404 error'?
    Yes. However, the phrase "trigger a 404 error" is vague. We're at the point where we need to be clear about what's actually happening.

    A 404 error is nothing more than any HTTP response with a 404 status code. It's conceivable, for example, that you could send a 404 status code and still send the content of the resource that you're claiming is not found. When we use R=404, we achieve two important things: We set the response status to 404, and we prevent Apache from sending the content of robots.txt, because Apache won't send the resource's content if it thinks it's redirecting.

    Now, I admit, ideally that would be the end of it. A 404 response status and a blank response body is exactly what we want. But to get Apache to do those things, we had to trick it by telling it that we're redirecting. So now we have to give it somewhere to redirect to. So we pick 404.php. That file can exist or not exist, probably doesn't matter much.

    Quote Originally Posted by Chris77 View Post
    but what does it say with 404.php placed before [R=404,L]

    Does it say if there's no 404.php trigger a 404 error?
    It says to rewrite from robots.txt to 404.php. That, in conjunction with the R=404, should make the response headers look something like this:

    Status: HTTP/1.1 404 Not Found
    Location: 404.php

  18. #18
    SitePoint Wizard bronze trophy Jeff Mott's Avatar
    Join Date
    Jul 2009
    Posts
    1,314
    Mentioned
    19 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by Jeff Mott View Post
    A 404 response status and a blank response body is exactly what we want.
    As I was writing that, it occurred to me that a solution you proposed earlier would probably work better. Just rewrite to a non-existent page.

    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ nohttpsatall.html


    Assuming nohttpsatall.html doesn't actually exist, this should do exactly what you want.
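    As a rough model of what that directive pair does (a simulation only - the handle function and the pretend file list are made up for illustration, not Apache's actual behavior or API):

    ```python
    import re

    def handle(port, path):
        """Toy model of the RewriteCond/RewriteRule pair above."""
        # RewriteCond %{SERVER_PORT} ^443$  +  RewriteRule ^robots\.txt$ nohttpsatall.html
        if re.search(r"^443$", port) and re.search(r"^robots\.txt$", path):
            path = "nohttpsatall.html"   # internal rewrite; the client never sees this name
        existing_files = {"robots.txt", "index.html"}   # pretend document root
        if path in existing_files:
            return 200, path
        return 404, path                 # rewritten target doesn't exist -> 404

    # https request for robots.txt gets a 404
    assert handle("443", "robots.txt") == (404, "nohttpsatall.html")
    # plain http request still gets the real file
    assert handle("80", "robots.txt") == (200, "robots.txt")
    ```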

  19. #19
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    That's why I was saying that when 404.php doesn't exist, it's the same as when doesnotexist.html is put in the apache htaccess code, although this and the idea to redirect to a custom 404 page created confusion. It's not that I don't want or don't have a custom 404 page; it's that sending (redirecting) a robot (google robot) directly to a custom 404 page that exists doesn't trigger the actual 404 status, and so is useless in trying to get google to stop indexing https pages that don't exist. All it does is get google to read your custom 404 page.

    So with
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ 404.html [R=404,L]

    assuming 404.html exists, this code says 'If a request string that has the sequence 443 (https) is received, then instead of sending the requestor to robots.txt, send the requestor to the custom 404 page AND be sure to trigger a 404 error status ([R=404,L])'.

    And the same can be done with
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ nohttpsatall\.html

    When nohttpsatall.html does not exist.

    You say the second way works better? I'm worried that if I use the second way, google in its Webmaster Tools will be hounding me forever to fix nohttpsatall.html. What advantage do you see with the second way?

  20. #20
    SitePoint Wizard bronze trophy Jeff Mott's Avatar
    Join Date
    Jul 2009
    Posts
    1,314
    Mentioned
    19 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by Chris77 View Post
    So with
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ 404.html [R=404,L]

    assuming 404.html exists, this code says 'If a request string that has the sequence 443 (https) is received, then instead of sending the requestor to robots.txt, send the requestor to the custom 404 page AND be sure to trigger a 404 error status ([R=404,L])'.
    Correct.

    Quote Originally Posted by Chris77 View Post
    And the same can be done with
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ nohttpsatall\.html

    When nohttpsatall.html does not exist.

    You say the second way works better? I'm worried that if I use the second way, google in its Webmaster Tools will be hounding me forever to fix nohttpsatall.html. What advantage do you see with the second way?
    Google will never see nohttpsatall.html. That's the difference between an external redirect (using the [R] flag) and an internal rewrite. The only thing Google's bot will see is a request for robots.txt and a response with a 404 status.

  21. #21
    Programming Team silver trophybronze trophy
    Mittineague's Avatar
    Join Date
    Jul 2005
    Location
    West Springfield, Massachusetts
    Posts
    17,255
    Mentioned
    196 Post(s)
    Tagged
    2 Thread(s)
    A 404 page is a page and when it gets served it will return "found" headers - unless you send the "not found" headers with it.

    Having a 404 page is a good idea because instead of the visitor seeing a generic error screen you can have a TOC or search or something helpful for them on it and so increase the possibility that they'll stay at your site.

  22. #22
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    If 'The only thing Google's bot will see is a request for robots.txt and a response with 404 status' means 'The only thing Google's bot will see, when nohttpsatall.html is used, is a request for robots.txt and a response with 404 status', then that's great. That's like a new tool: using an address that doesn't exist when dealing with search engines that have hallucinations of pages that don't exist.
    Thanks a lot Mott, I'll carry the news.
    https://www.youtube.com/watch?v=VkqQj8Z_aVY

  23. #23
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Mr. Mott that is.

    I made the change, and it looks like my webhost prevents the 404 error and directs me to the non-https address. I'm not sure what google will think, especially since my webhost adds my main website address to the address, so now https pages are directed to http://xyz.mainsite.com. I'll try the other code; I bet it does the same thing.

    3 minutes later...
    It does the same thing.

    When my webhost was trying to get this to work, they put this in just before the code:
    # For security reasons, Option followsymlinks cannot be overridden.
    #Options +FollowSymLinks
    Options +SymLinksIfOwnerMatch

    I have no idea if it's needed for the code to work, or even if it disrupts the code in some way.

  24. #24
    SitePoint Guru
    Join Date
    Jan 2010
    Posts
    638
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    My webhost worked on the problem and said you can direct http to https but you can't direct https to http. I told them I don't want to go to an http address, but there seems to be no way to incur a 404 error without using an address in the code.

    Apparently this is not an uncommon problem. They referred me to this page
    http://www.webmasterworld.com/google/3411545.htm

  25. #25
    SitePoint Wizard bronze trophy Jeff Mott's Avatar
    Join Date
    Jul 2009
    Posts
    1,314
    Mentioned
    19 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by Chris77 View Post
    My webhost worked on the problem and said you can direct http to https but you can't direct https to http. I told them I don't want to go to an http address, but there seems to be no way to incur a 404 error without using an address in the code.
    This doesn't sound right to me. If you're doing an internal rewrite, then there's no http<=>https switch going on. It's just a single https request that returns a 404 response.

    Also, I'm not sure the linked thread backs them up. In that thread, people are discussing serving a different robots.txt depending on whether it was requested through http or https, which they accomplished with an internal rewrite, same as your own in this thread. There's nothing implying that a request for robots.txt couldn't return 404.

