Suggestions on robots.txt

What should I put in the robots.txt file?

Some say to leave it empty, others say to put things in there.

Does it even serve a purpose anymore?

https://www.google.com/search?q=what+to+use+robots.txt+for&ie=utf-8&oe=utf-8

The robots.txt file defines how a search engine spider like Googlebot should interact with the pages and files of your web site. If there are files and directories you do not want indexed by search engines, you can use a robots.txt file to define where the robots should not go.

A common example of this is test folders. Keep in mind, though, that search engines can choose to ignore robots.txt.

I use it.

It serves a purpose for legitimate search engines like Google. But there are search engines that ignore robots.txt.

I've got a robots.txt file that has the following:

User-agent: *
Disallow: /{folder to ignore}/

User-agent: SemrushBot
Disallow: /

User-agent: nlcrawler
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: FeedDemon
Disallow: /

User-agent: Awasu
Disallow: /
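
For anyone who wants to check how rules like these get interpreted, here is a minimal sketch using Python's standard urllib.robotparser module. The "/staging/" folder and the example.com URLs are placeholders I made up, and Python's parser is only one implementation, so real crawlers may match rules a little differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules modelled on the file above; "/staging/" stands in
# for the folder to ignore, and each blocked bot uses "Disallow: /".
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/

User-agent: SemrushBot
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A general crawler may fetch normal pages but not the staging folder.
print(parser.can_fetch("Googlebot", "https://example.com/index.php"))
print(parser.can_fetch("Googlebot", "https://example.com/staging/draft.html"))
# SemrushBot has its own group and is disallowed everywhere.
print(parser.can_fetch("SemrushBot", "https://example.com/index.php"))
```

This only tells you what a well-behaved crawler should do. It does nothing to stop a crawler that never reads the file.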

HTH,

^_^

@RyanReese and @WolfShade,

The point I was trying to make is that if I want spiders to crawl everything in my public_html folder then do I need anything in the robots.txt file?

Do I have to tell everyone, "Crawl everything in the document root"?

And looking at things from the other way, is there anything which I would NOT want people to crawl? (I have a directory outside of the web root where I store things like passwords and config files...)

If you want every bot to crawl everything, then you don't need a robots.txt file.
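
To illustrate, here is a small sketch with Python's standard urllib.robotparser showing that an empty file and an explicit "allow everything" file come out the same. The bot name "ExampleBot" and the example.com URL are made up for the demonstration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_text, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


# An empty file, and a file whose only rule is an empty Disallow.
EMPTY = ""
ALLOW_ALL = "User-agent: *\nDisallow:\n"

for text in (EMPTY, ALLOW_ALL):
    # Both print True: nothing is off limits.
    print(allowed(text, "ExampleBot", "https://example.com/any/page.html"))
```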

If you have a staging area (or developer sub-folder off the root) that shouldn't be available to the public (password protected folder, or a secure login), then you want to include that in a robots.txt file as a "Disallow". You don't have to list every folder under that folder: a Disallow rule matches by path prefix, so it covers everything beneath it.
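
Here is a rough sketch of that "covers everything underneath" behaviour, again using Python's urllib.robotparser. The "/dev/" folder, the file names, and "ExampleBot" are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: a single Disallow line for a developer folder.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /dev/"])

paths = [
    "/dev/",                     # the folder itself
    "/dev/site2/drafts/a.html",  # nested several levels down
    "/developer-blog.html",      # similar name, but not under "/dev/"
    "/index.php",
]

# Map each path to whether a compliant bot may fetch it.
results = {p: parser.can_fetch("ExampleBot", "https://example.com" + p) for p in paths}
for path, ok in results.items():
    print(path, "allowed" if ok else "blocked")
```

The trailing slash matters here: "/dev/" blocks the folder and its contents, but leaves "/developer-blog.html" alone.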

HTH,

^_^

No.

Still not getting what to allow and disallow.

When spiders crawl my website, they can only see the name of the file or folder, right? (Or can they see the contents?)

Should I block access to things like my "css" directory? Or my "images" directory?

To me, it seems the only thing you would really want indexed are finished pages (e.g. index.php, account.php, some-article.php, faq.php, etc), right?

Basically a spider starts at some kind of root home page (home.html; home.php; home.cfm, etc.) and follows every link, recursively, from that main page. Unless, of course, it's a legit search engine and robots.txt prevents certain links from being followed. And, no, they don't just get the filename, they get contents, too (else the meta tag would be pretty much useless.)
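
A toy version of that crawl, just to make the idea concrete. This is only a sketch: the site is a made-up dictionary of pages instead of a real web server, "ExampleBot" is a hypothetical bot name, and real spiders are far more involved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory site: path -> HTML the server would send.
SITE = {
    "/": '<a href="/faq.html">FAQ</a> <a href="/staging/new.html">New</a>',
    "/faq.html": '<a href="/">Home</a> <a href="/about.html">About</a>',
    "/about.html": "<p>About us</p>",
    "/staging/new.html": '<a href="/staging/secret.html">WIP</a>',
    "/staging/secret.html": "<p>Not ready</p>",
}
ROBOTS = ["User-agent: *", "Disallow: /staging/"]


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start="/", agent="ExampleBot"):
    """Breadth-first crawl that skips paths robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(ROBOTS)
    seen, queue, visited = {start}, deque([start]), []
    while queue:
        path = queue.popleft()
        if not rules.can_fetch(agent, path):
            continue  # a polite spider never requests this page
        visited.append(path)
        collector = LinkCollector()
        collector.feed(SITE.get(path, ""))
        for link in collector.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl())
```

Note that the spider works from the HTML it is sent (that is where the links come from), and the staging pages are never requested because the polite bot checks the rules first.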

No. Images can be spidered for Google's "images" section. CSS I'm not so sure about. But why bother trying to block that?

^_^

Sounds like you are saying spiders can see all of the HTML, but I was asking about the file contents (e.g. PHP code)...

I sure as hell would hope they can't read my PHP, otherwise I'd lose all security!

I'm big on security and trying to make sure I am not exposing PHP code or configuration settings or anything that would allow a hacker to do bad things to any of my websites.

I work in ColdFusion. AFAIK, the only way to get your PHP code would be to either A) hack the web server or FTP to the web server, or B) disable the PHP server portion so the web server (Apache, IIS) serves up the code. Spiders only see the on-the-fly generated HTML that the PHP (or CF) server sends.

^_^

So with that being said, then is there anything I wouldn't want spiders to see in my web root?

Also, just out of curiosity, if a person put all of their files in a directory outside of the web root - except for an index.php file - then would that prevent people from seeing the code if the web server ever screwed up as you mention?

As I stated in the fifth post:
If you have a staging area (or developer sub-folder off the root) that shouldn't be available to the public (password protected folder, or a secure login), then you want to include that in a robots.txt file as a "Disallow".

Anything else should be left alone.

Keep in mind: this only applies to search engines that pay attention to robots.txt. Other search engines ignore robots.txt, so for them it won't matter anyway.

As far as docs outside the web root, I don't know enough about how web servers work to answer that question. Hopefully, someone else who knows can answer that one.

^_^

Except that they may use it to decide where they want to look hoping to find something the site wants them to not see.

i.e. don't use robots.txt for security purposes.
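
To show why, here is a small sketch of how little effort it takes to read the "hidden" paths back out of a robots.txt file. The file contents and folder names are invented, and the parsing is deliberately simplistic.

```python
def disallowed_paths(robots_text):
    """List every path named in a Disallow line, as any visitor could."""
    paths = []
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file that tries to "hide" sensitive folders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin-backup/
Disallow: /old-site/config/
"""

print(disallowed_paths(ROBOTS_TXT))
```

Since robots.txt has to be publicly readable for crawlers to use it, anything listed there is effectively advertised. Anything that really needs protecting should be protected by the server itself.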

So if I leave my robots.txt file blank then it won't hurt my security, right?

That's correct. Adding to it won't help your security either. Following robots.txt is less a rule than a "guideline": any crawler can ignore your robots.txt if it wants to.

I have a tremendous amount of junk on a particular site and make extensive use of robots.txt to tell bots and crawlers to skip particular folders, in the hope of keeping them out of the GWT -> Crawl -> Crawl Errors report.

Robots.txt is also used to disallow pages which are not Google Mobile Friendly in the hope that this will increase the mobile search ranking.

Basically I use robots.txt as a way to keep some pages from being indexed, not for security but because they would be of no interest to anyone who is not a guest at our physical location, where the links are posted.

Why have a search engine index something that isn't useful? If it gets indexed anyway, no big deal.

One thing I do include is instructions to the Wayback Machine to ignore the site, because the results are sometimes horrible.
