Robots.txt Help

Okay, I want to make sure I have this right…

I’ve been analyzing a site I’ve worked on, and its robots.txt contains this:


User-agent: *
Disallow: /
Allow: /directory/subdirectory/

This dynamic site doesn’t have a spiderable search interface, but users link to their pages, which live at the subdirectory level, so those pages should be getting picked up by search engines, correct?

AFAIK some search bots only understand “Disallow”, not “Allow”.
So if that’s true, those bots will see the “disallow all”, never see the “allow”, and therefore won’t crawl those pages (assuming they bother with robots.txt in the first place; I’ve had bots that either don’t read it or ignore it, and I’ve heard some may even use it to find exactly what you don’t want them to see).
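
Just to show how much parsers can differ, here’s a quick sketch using Python’s standard-library robotparser, which applies rules in the order they appear rather than by most-specific match (the page names are made up; this illustrates how an order-based parser reads the file, not how Googlebot behaves):

from urllib import robotparser

# The exact rules posted above, fed to Python's built-in parser.
rules = """User-agent: *
Disallow: /
Allow: /directory/subdirectory/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# This parser checks rules top-down and stops at the first match,
# so "Disallow: /" wins and everything reports as blocked.
print(rp.can_fetch("SomeBot", "/directory/subdirectory/page.html"))  # False
print(rp.can_fetch("SomeBot", "/somewhere-else/"))                   # False

A strict, order-based reader like that treats the whole site as off limits, which is exactly the risk with bots that don’t honour “Allow”.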

So it may be better to write the file with only "Disallow"s.
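
If you did want to stick to the original spec, the only option is to list every path you want kept out and simply never mention the one you want crawled. A rough sketch, where every directory name other than /directory/ is invented for the example:

User-agent: *
# Block each area individually (hypothetical names); /directory/subdirectory/
# is never mentioned, so it stays crawlable by default.
Disallow: /admin/
Disallow: /search/
Disallow: /private/

The obvious downside is that anything you forget to list is crawlable by default, so the list has to be kept complete by hand.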

It seems obvious to me that something is wrong, because we’re only seeing bare links to the pages showing up in Google, not the pages themselves.

But I’m at a loss for documentation, because Google themselves use “Allow” in their own robots.txt.

Yes, the more I think about it, the more I think it’s the God-awful-long GET variables:

?e=526f4433-44cf-4b36-96e9-90c8de2b1105

I suspect the bots get as far as Summary.aspx but get confused by the rest of the URL.

I think your best use of time would be to implement “friendly” URLs; it should be better for the bots and better for users too.
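
For what it’s worth, what I mean by “friendly” is just moving that GUID out of the query string and into the path, roughly like this (the /summary/ segment is only an example, not something the site currently has):

/Summary.aspx?e=526f4433-44cf-4b36-96e9-90c8de2b1105
→
/summary/526f4433-44cf-4b36-96e9-90c8de2b1105

The bots then see a plain path with no query string to trip over, and the URL is easier for people to read and link to.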

Rebirth Studios: If you are having problems, check the robots.txt specification. Google and the others follow the guidelines fairly strictly, so you should be OK. :slight_smile:

http://www.robotstxt.org/ (the original robots.txt spec).
http://www.conman.org/people/spc/robots2.html (an updated, non-standard spec that includes Allow and other extensions supported by some spiders).
http://en.wikipedia.org/wiki/Robots.txt

I don’t see anything in the spec that addresses the way I have the file written, where a global disallow is followed by a single allow.

Does the allow only work where there isn’t a global disallow, or should it in theory work as I have it?

You may have missed this: http://www.robotstxt.org/robotstxt.html

This is currently a bit awkward, as there is no “Allow” field. The easy way is to put all files to be disallowed into a separate directory, say “stuff”, and leave the one file in the level above this directory:
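
From memory, the example that follows that sentence on the robotstxt.org page is along these lines (“~joe” and “stuff” are the spec’s own placeholder names):

User-agent: *
Disallow: /~joe/stuff/

In other words, everything you don’t want crawled goes inside “stuff”, and the one file you do want crawled sits in the directory above it.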

Yeah, I agree with you. If you don’t want Google to crawl any of your pages, use Disallow.

OK, let me clear up some of the misconceptions in this thread:

Yes, some search engines don’t support the “Allow:” directive; however, Google, Bing, Yahoo and most of the big names do, so you can be sure that 99% of people will be able to find pages that are opened up with it. The major players all support it, so it’s not worth getting petty about; it’s perfectly legitimate to use. The same goes for the other non-standard extensions listed on Wikipedia and in the second-generation spec I posted.

Even Google uses “Allow:” within their own robots.txt: http://www.google.com/robots.txt

As for your own file, Rebirth Studios, it should work as you posted it. The Allow directive explicitly tells spiders that the path you specified is open to them. :slight_smile:
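
To spell it out with the file above (the page names here are made up): Google-style parsers apply the most specific matching rule, i.e. the one with the longest path, so the evaluation goes like this:

/directory/subdirectory/page123.html → the Allow rule is the longest match, so it can be crawled
/some-other-page.html → only “Disallow: /” matches, so it stays blocked

That’s why your subdirectory pages should be fine with Googlebot even though the file starts with a blanket disallow.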