Robots.txt Help

Okay, I want to make sure I have this right…

I’ve been analyzing a site I’ve worked on, and its robots.txt contains this:


User-agent: *
Disallow: /
Allow: /directory/subdirectory/

This dynamic site doesn’t have a spiderable search interface, but users link to their pages, which live at the subdirectory level, so those pages should be getting picked up by search engines, correct?

AFAIK some search bots only understand “Disallow”, not “Allow”.
So if that’s true, those bots will see the “disallow all”, never see the “allow”, and therefore won’t crawl those pages (assuming they bother with robots.txt in the first place; I’ve had bots that either don’t read it or ignore it, and I’ve heard some may even use it to find exactly what you don’t want them to see).
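
Just to show how much parsers can differ, here’s a quick sketch using Python’s standard-library robotparser, which applies rules in the order they appear rather than by most-specific match (the page names are made up; this illustrates how an order-based parser reads the file, not how Googlebot behaves):

from urllib import robotparser

# The exact rules posted above, fed to Python's built-in parser.
rules = """User-agent: *
Disallow: /
Allow: /directory/subdirectory/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# This parser checks rules top-down and stops at the first match,
# so "Disallow: /" wins and everything reports as blocked.
print(rp.can_fetch("SomeBot", "/directory/subdirectory/page.html"))  # False
print(rp.can_fetch("SomeBot", "/somewhere-else/"))                   # False

A strict, order-based reader like that treats the whole site as off limits, which is exactly the risk with bots that don’t honour “Allow”.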

So it may be better to write the file with only "Disallow"s.
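
If you did want to stick to the original spec, the only option is to list every path you want kept out and simply never mention the one you want crawled. A rough sketch, where every directory name other than /directory/ is invented for the example:

User-agent: *
# Block each area individually (hypothetical names); /directory/subdirectory/
# is never mentioned, so it stays crawlable by default.
Disallow: /admin/
Disallow: /search/
Disallow: /private/

The obvious downside is that anything you forget to list is crawlable by default, so the list has to be kept complete by hand.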

It seems obvious to me that something is wrong, because we’re only seeing bare links to the pages showing up in Google, not the pages themselves.

But I’m at a loss for documentation, because Google themselves use “Allow” in their own robots.txt.

Yes, the more I think about it, the more I think it’s the God-awful-long GET variables:

?e=526f4433-44cf-4b36-96e9-90c8de2b1105

I suspect the bots get as far as Summary.aspx but get confused by the rest of the URL.

I think your best use of time would be to implement “friendly” URLs; it should be better for the bots and better for users too.
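
For what it’s worth, what I mean by “friendly” is just moving that GUID out of the query string and into the path, roughly like this (the /summary/ segment is only an example, not something the site currently has):

/Summary.aspx?e=526f4433-44cf-4b36-96e9-90c8de2b1105
→
/summary/526f4433-44cf-4b36-96e9-90c8de2b1105

The bots then see a plain path with no query string to trip over, and the URL is easier for people to read and link to.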

Rebirth Studios: If you are having problems, check the robots.txt specification. Google and the others follow the guidelines fairly strictly, so you should be OK. :slight_smile:

http://www.robotstxt.org/ (the original robots.txt spec).
http://www.conman.org/people/spc/robots2.html (an updated, non-standard spec that includes Allow and other extensions supported by some spiders).
http://en.wikipedia.org/wiki/Robots.txt

I don’t see anything in the spec that addresses the way I have the file written, where a global disallow is followed by a single allow.

Does the allow only work where there isn’t a global disallow, or should it in theory work as I have it?

You may have missed this: http://www.robotstxt.org/robotstxt.html

This is currently a bit awkward, as there is no “Allow” field. The easy way is to put all files to be disallowed into a separate directory, say “stuff”, and leave the one file in the level above this directory:
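
From memory, the example that follows that sentence on the robotstxt.org page is along these lines (“~joe” and “stuff” are the spec’s own placeholder names):

User-agent: *
Disallow: /~joe/stuff/

In other words, everything you don’t want crawled goes inside “stuff”, and the one file you do want crawled sits in the directory above it.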

Yeah, I agree with you. If you don’t want Google to crawl any of your pages, use Disallow.

OK, let me clear up some of the misconceptions in this thread:

Yes, some search engines don’t support the “Allow:” directive; however, Google, Bing, Yahoo and most of the big names do, so you can be sure that 99% of people will be able to find pages that are opened up with it. The major players all support it, so it’s not worth getting petty about; it’s perfectly legitimate to use. The same goes for the other non-standard extensions listed on Wikipedia and in the second-generation spec I posted.

Even Google uses “Allow:” within their own robots.txt: http://www.google.com/robots.txt

As for your own file, Rebirth Studios, it should work as you posted it. The Allow directive explicitly tells spiders that the path you specified is open to them. :slight_smile:
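
To spell it out with the file above (the page names here are made up): Google-style parsers apply the most specific matching rule, i.e. the one with the longest path, so the evaluation goes like this:

/directory/subdirectory/page123.html → the Allow rule is the longest match, so it can be crawled
/some-other-page.html → only “Disallow: /” matches, so it stays blocked

That’s why your subdirectory pages should be fine with Googlebot even though the file starts with a blanket disallow.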