Totally prevent crawling of anything which isn't something

I have a MediaWiki website.
MediaWiki creates roughly 15 or more webpages for each webpage:

  • Talk webpage
  • History webpage
  • Revision webpages
  • Diff webpages
  • What-links-here webpage
  • Recent-changes-in-webpages-linked-from-here
  • Printable version webpage
  • Permalink version
  • Page-information webpage
  • Source code webpage / Edit webpage
  • Statistics webpages
  • And probably more

The total number of webpages might reach 150-1,500 or much more.

Having so many webpages per webpage has tremendously inhibited the crawling of my website, to the extent that SEO is damaged, even though software performance is decent and the content is abundant and rewarding, with good feedback from readers.


Although most of the webpages I’ve listed have a noindex attribute, I believe that I should still limit access to them on the back end somehow.
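
By “noindex” I mean the robots meta tag MediaWiki puts in the head of those pages, which as far as I can tell looks something like this:

<meta name="robots" content="noindex,nofollow">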

I thought of using robots.txt to allow access only to article and category pages:

Allow: /index.php/article/
Allow: /index.php/category/
Disallow: *

Is this syntax correct? Would you do it differently?
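
Written out in full with a User-agent line, I mean something like this (assuming those two prefixes really cover all of my article and category URLs, and relying on crawlers such as Google and Bing that support Allow and pick the longest matching rule):

User-agent: *
Allow: /index.php/article/
Allow: /index.php/category/
Disallow: /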

Looks like there is a big page already dedicated to this on the MediaWiki site:
https://www.mediawiki.org/wiki/Manual:Robots.txt


Any instruction to robots, either through a meta tag or robots.txt, is only ever advisory, it does not enforce anything.
This means that “good” robots should obey your rules, but there is nothing to stop “bad” robots from ignoring them.
The chances are, if robots are ignoring the rules in one place (meta tag) they will ignore them elsewhere too.

Have you established whether this is really the actual problem?
Have you seen that pages that should not be indexed are indexed?
It may just be you have unrealistic expectations about how quickly all your pages will be indexed.


Have you established whether this is really the actual problem?

I did see URLs of such pages in Google Search Console, which for me personally is a problem 🙂

Have you seen that pages that should not be indexed are indexed?

No, but my problem is one stage earlier: that they are even crawled.

It may just be you have unrealistic expectations about how quickly all your pages will be indexed.

It may indeed be the case, but given the special nature of MediaWiki with regard to meta-webpages, I would prefer not to take any chances with this.
If it were something minor, like WordPress creating a webpage for each uploaded image (I don’t know if WordPress even still does this), then I wouldn’t care, but since there can easily be 100-200 URLs per webpage, it’s a bit of an anxiety source for me.

One way to prevent pages being crawled (as opposed to just preventing indexing) is to make links to those pages rel="nofollow". Again, this is only advisory to the bots; nothing is enforced.
But I have a feeling it will do more harm than good from an SEO perspective (assuming that is your concern), as it may inhibit the flow of the crawl from page to page, which could lead to missing pages you want indexed.
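
For illustration (the link target here is just a made-up example), such a link would look like:

<a href="/index.php?title=Talk:Example" rel="nofollow">Discussion</a>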

Have you looked at Google’s guidance?

I think that nofollow isn’t good when the links are internal.
Even if it can help, I don’t mean to try to intervene in the MediaWiki PHP 😳

Not before you posted it 🥰 but I didn’t find anything in that specific table of contents that covers my particular problem.

Perhaps I should just try to prevent the links to all these webpages from being created on the back end.

So far, I’ve prevented human access to these pages by giving each link to them display: none in CSS, but something on the back end should prevent bot access as well, I guess.
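
Something along these lines, where the element ids are only illustrative since they depend on the skin in use:

/* Hide the tabs and toolbox links that point to meta pages; ids vary by skin */
#ca-talk,
#ca-history,
#t-whatlinkshere,
#t-recentchangeslinked,
#t-print,
#t-permalink,
#t-info {
    display: none;
}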

Would you agree, @SamA74?

Did you read the link I posted in #2?

It has a section that specifically tells you how to set your robots.txt file to not crawl non-article pages.


If you don’t want these pages to be viewed at all, by anyone, human or bot, that is a different matter.
I’m not familiar with MediaWiki, so I don’t know if it has a facility to disable these pages from even existing (or being served).
But it should be possible to set your server not to serve them to anyone.
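
As a rough, untested sketch: with Apache and mod_rewrite you could refuse to serve some of the meta-page URLs outright, for example by returning 403 for a few of the query-string patterns MediaWiki uses. The exact patterns depend on how your wiki’s URLs are configured, and note this blocks humans as well as bots:

<IfModule mod_rewrite.c>
    RewriteEngine On
    # Refuse history, page-information, edit and raw views reached via the query string
    RewriteCond %{QUERY_STRING} (^|&)action=(history|info|edit|raw) [OR]
    # Refuse diff and old-revision views
    RewriteCond %{QUERY_STRING} (^|&)(diff|oldid)=
    RewriteRule ^ - [F]
</IfModule>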

I have read it already, yes.

That section wasn’t very clear to me, especially why a distinction between PHP as CGI and PHP as an Apache module was made.

Moreover, I didn’t find there a directive in the pattern of “prevent crawling of anything which isn’t something”, only “prevent crawling of anything which is something”, which I desire less.

It’s right here in this section.

Pretty much exactly what you asked for. It has different examples based on the types of URLs you are running.


Thanks.
I actually want to serve them, just not to link to them from anywhere.
It’s a tough case.

I have read it already.

I didn’t find there a directive in the pattern of “prevent crawling of anything which isn’t something”, only “prevent crawling of anything which is something”.
I would prefer the first approach, because if future versions of MediaWiki include new URL patterns, those new patterns would be crawlable unless I had disallowed them in advance.
But that might be a limitation of the current robots.txt syntax which I have to adapt to.

Also, it’s unclear to me why the distinction between PHP as CGI and PHP as an Apache module was made, and especially whether it’s connected to the two code examples there, this:

User-agent: *
Disallow: /index.php?diff=
Disallow: /index.php?oldid=
Disallow: /index.php?title=Help
Disallow: /index.php?title=Image
Disallow: /index.php?title=MediaWiki
Disallow: /index.php?title=Special:
Disallow: /index.php?title=Template
Disallow: /skins/

and this:

User-agent: *
Disallow: /index.php?
Disallow: /index.php/Help
Disallow: /index.php/MediaWiki
Disallow: /index.php/Special:
Disallow: /index.php/Template
Disallow: /skins/

That whole section I linked to is entitled “Prevent crawling of non-article pages”. Or, to put it into your vernacular, “Prevent crawling of anything which isn’t an article page”.

Then it comes down to how your site is rendered. Are your links

www.example.com/wiki/Content 
www.example.com/index.php?title=Content
www.example.com/index.php/Content

Then put the appropriate robots.txt entry in yours and see what happens.
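
For example, if your articles are served as www.example.com/wiki/Content and everything else goes through /index.php, then a robots.txt along these lines (only a sketch, assuming that URL layout) keeps article pages crawlable while blocking the meta pages:

User-agent: *
Disallow: /index.php
Disallow: /skins/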

That’s how the section is titled, yes,
but the directive I seek isn’t included in that section.
As I later understood, that’s because of limitations in the current robots.txt syntax.
