Robots.txt

What is robots.txt and how does it work?

I use robots.txt to tell search engine crawlers which pages they should crawl and which pages they are forbidden to crawl. There is a lot more information about it that you can find by searching on google.com.

Note that not all search engines obey the robots.txt file(s). Don’t count on it as being a security measure, as the more questionable search engines will ignore it.

It generally takes the form:

User-agent: *
Disallow: /page-or-folder-you-dont-want-indexed.html

(The * means you’re directing the command to all search engines.)

But of course, using robots.txt to ‘hide’ files from search engines still means that other people can find the files you’re trying to hide, which obviously is not a good thing. Since robots.txt itself is publicly readable, listing ‘hidden’ paths in it actually advertises them. There are a number of ways around this of course, but you should never think robots.txt can be used for anything related to security; it’s purely there to tell search engines what to crawl and what not to crawl.
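For example, a minimal sketch (the folder name /private/ is just a placeholder) that asks every crawler to stay out of one folder while leaving the rest of the site crawlable:

User-agent: *
Disallow: /private/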

Not to sound snarky or rude here, but I don’t know why people bother asking these kinds of questions… unless Charlie is so incapable of browsing the Internet that going to Google and typing “robots.txt” is too much? Because the first result for that is the robots specification (explaining everything he needs to know) and the second is the Wikipedia article, which gives an almost complete guide in terms any dummy can understand. Did it really require a post in a forum? No. :slight_smile:

:nono: Let me Google that for you…

Yes you do. Yet despite the poster’s obvious motivation, their post remains.

Let’s say I have http://www.mysite.com/cute.html

User-agent: *
Allow: /cute.html

Is it acceptable?

bakers, the point of robots.txt is to disallow what you don’t want indexed. Everything is allowed unless otherwise specified, so what you posted is unnecessary.
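Allow only really earns its keep when you want to carve out an exception inside a folder you’ve disallowed. A sketch, assuming the /pictures/ folder name is just a placeholder (most major crawlers honour Allow, but not every bot does):

User-agent: *
Disallow: /pictures/
Allow: /pictures/cute.html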

PS: hooperman, well it does get tiring watching the same “lazy” questions reappear. :slight_smile:

I agree. If only there was some way to get rid of sig link fluff…

The robots.txt file tells search engines which pages or folders you do not want them to crawl, and can also suggest which pages you do want crawled.

There is - click the little alert icon under the post…

Regarding robots.txt… I’ve read that it is good practice to have a robots.txt file in your root folder even if it’s blank, supposedly because robots search for one and like to see it there. Does anyone have any views on that?
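If you want the file present without blocking anything, the usual minimal form (rather than a truly empty file) is:

User-agent: *
Disallow:

An empty Disallow value means nothing is disallowed, and having the file there at least means crawlers requesting /robots.txt get a real response instead of filling your error logs with 404s.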

robots.txt is just a plain text file. With its help we can stop search engine crawlers from crawling our web pages, or any specific area of our site.

To be honest, I can’t be bothered anymore…

The robots.txt file is also called the robots exclusion file. It is primarily used to tell spiders which pages you don’t want them to index.
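If you ever want to check how a compliant crawler would read your file, here is a minimal sketch using Python’s standard urllib.robotparser (the domain is just the example URL used earlier in the thread):

from urllib import robotparser

# Point the parser at the site's robots.txt and download it
rp = robotparser.RobotFileParser()
rp.set_url("http://www.mysite.com/robots.txt")
rp.read()

# Ask whether a given user-agent may fetch a given URL
print(rp.can_fetch("*", "http://www.mysite.com/cute.html"))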