Hi everybody.
I have a weather forecast site that offers weather widgets to other sites.
The widget is an iframe that pulls data from, let’s say, mysite.com/widget/yourforecast.php
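To be concrete, the embed code a partner site pastes in is just a plain iframe along these lines (the dimensions here are only an example):

<iframe src="https://mysite.com/widget/yourforecast.php" width="300" height="250"></iframe>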
Search engines follow the link and land on the yourforecast.php page (which is not cached), causing uncached requests and putting extra load on the server.
Can I disallow /widget/ in robots.txt without being penalized for it?
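Concretely, I mean adding something like this to my robots.txt (just a sketch of the rule I have in mind):

User-agent: *
Disallow: /widget/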
Doesn't Google frown on robots.txt rules that block access to scripts?
Google's spider doesn't care if your robots.txt blocks access to scripts, or images, or JavaScript… it simply cannot take into account content it can't read, and making that content unreadable is exactly what your robots.txt does.
Thanks rpkamp for your reply and thanks for the welcome!
I have no direct experience that would let me judge whether that's true or not.
I just thought that, because the widgets are code embedded on other sites, crawlers would want the ability to check that there is nothing suspicious or "malicious" in them.
Thanks m_hurtley for your reply.
I thought that crawlers need access to certain kinds of resources:
"Crawling CSS and JavaScript is absolutely critical as it allows Googlebot to properly render pages." (from SEJ, quoting Mueller's #askawebmaster video of Jul 20, 2020)
I think you're perhaps seeing this as a binary condition when it isn't one; it's more of a ternary.
Google's bot doesn't say, "Well, I can't see this content, so it must be bad; strike against the site."
It says, "I can't read this content. Treat it as null and move on."
It's not a good-or-bad switch; it's good, bad, or nothing.
Yes, a crawler wants to crawl everything; that's its raison d'être. But there are some things it won't be allowed to crawl.
If the page does not render in a way that properly reflects your site without the script, then the crawler will need access to that script in order to get the "picture" of your website that you want it to see.
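For your widget case, that cuts both ways: if the widget's JavaScript or CSS is also loaded by pages you do want indexed, keep those files crawlable while still blocking the iframe page itself. In Google's robots.txt handling, the more specific (longer) rule wins, so an Allow can carve an exception out of a broader Disallow. A sketch like this would do it (the assets path is just an example; use wherever your shared files actually live):

User-agent: *
Disallow: /widget/
Allow: /widget/assets/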