Robots.txt Allow & Disallow Advanced Help

We are currently testing our content management system and how it works with Google Images and other sites that scrape your images for content. We noticed that none of our images were showing up in those results. My thought was that it's because our images are located in a folder at public_html/amass/images, but our robots.txt file is set to Disallow: /amass, which means spiders won't crawl anything in that folder! Unfortunately there is stuff we can't have crawled within the /amass folder, so we're trying to find a way to allow spiders to crawl only the images folder. Is this possible? Would either of the scenarios below work for us, or are we just screwed? lol

ROBOTS.TXT #1
Sitemap: http://www.arthurassociates.com/sitemap.xml
User-agent: *
Allow: /amass/images/*?$
Disallow: /amass
Disallow: /cgi-bin

ROBOTS.TXT #2
Sitemap: http://www.arthurassociates.com/sitemap.xml
User-agent: *
Allow: /amass/images/
Disallow: /amass
Disallow: /cgi-bin

I think you have to break down the file structure a little better to get the desired results. Please see below for an example:


Sitemap: http://www.arthurassociates.com/sitemap.xml
User-agent: *

Disallow: /amass/example1.aspx
Disallow: /amass/example2.aspx
Disallow: /amass/js/example1.js
Disallow: /amass/js/example2.js
Disallow: /cgi-bin

Allow: /amass/images/


Thanks for the reply austince! That could become a very tedious process, phew! I wish there was a definitive answer.

Simplifying your file structure at the beginning has always helped me in the development and SEO process because it takes away the guesswork. I don't want to reveal .js, .css, or any custom coding to the search engines, so I place them in a /css, /js, or /scripts folder, add a Disallow for /css, /js, or /scripts at the top of the robots.txt file, and allow the rest. Too simple.
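For example, that kind of setup might look something like this in robots.txt (the folder names are just the ones mentioned above; anything not matched by a Disallow is crawlable by default, so the trailing Allow line is optional):

```
User-agent: *
Disallow: /css
Disallow: /js
Disallow: /scripts
Allow: /
```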

After a lot of digging around, I thought I would see what Google does. Once I took a look at what Google is doing in their own robots.txt file, it all became clear to me: http://www.google.com/robots.txt

See this chunk?

Disallow: /safebrowsing
Allow: /safebrowsing/diagnostic
Allow: /safebrowsing/report_error/
Allow: /safebrowsing/report_phish/

Looks like what I needed to do to fix my current robots.txt was to change it to look something like this:

Disallow: /amass
Allow: /amass/images
Allow: /amass/skins/default/images

Yes, if you set Disallow on an entire folder but use Allow on sub-folders of the disallowed path, spiders will crawl and index those regions of the Disallow that have been overridden with the Allow. There are some other non-standard directives you could add which may be helpful (see this link): http://en.wikipedia.org/wiki/Robots_exclusion_standard :slight_smile:
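That override pattern can be sanity-checked with Python's standard-library robots.txt parser (the image and page URLs below are made up for illustration). One caveat worth knowing: `urllib.robotparser` applies rules in file order (first match wins), while Google uses the most specific (longest) match, so listing the Allow lines before the broader Disallow keeps the file unambiguous under both interpretations:

```python
import urllib.robotparser

# The robots.txt fix from above, with the Allow lines listed first so that
# both first-match parsers (like Python's) and longest-match parsers
# (like Googlebot's) read it the same way.
robots_txt = """\
User-agent: *
Allow: /amass/images
Allow: /amass/skins/default/images
Disallow: /amass
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

base = "http://www.arthurassociates.com"
print(rp.can_fetch("*", base + "/amass/images/logo.jpg"))              # True
print(rp.can_fetch("*", base + "/amass/skins/default/images/bg.png")) # True
print(rp.can_fetch("*", base + "/amass/private.aspx"))                # False
```

Note that this parser does not understand the `*` and `$` wildcards from ROBOTS.TXT #1 (those are a search-engine extension), so only plain path prefixes can be tested this way.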

Thanks for the Wikipedia link AlexDawson! I should have looked there first; they have an excellent explanation there.