Totally prevent crawling of anything which isn't something

I have a MediaWiki website.
MediaWiki creates roughly 15 or more webpages for each webpage:

  • Talk webpage
  • History webpage
  • Revision webpages
  • Diff webpages
  • What-links-here webpage
  • Recent-changes-in-webpages-linked-from-here
  • Printable version webpage
  • Permalink version
  • Page-information webpage
  • Source code webpage / Edit webpage
  • Statistics webpages
  • And probably more

The total number of webpages might reach 150-1,500 or much more.

Having so many webpages per webpage has tremendously inhibited the crawling of my website, to the extent that SEO is damaged, even though software performance is decent and the content is abundant and rewarding, with good feedback from readers.


Although most of the webpages I’ve listed have a noindex attribute, I believe that I should still limit access to them on the back end somehow.
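
By “noindex” I mean the robots meta tag MediaWiki puts in the head of those pages, which as far as I can tell looks something like this:

<meta name="robots" content="noindex,nofollow">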

I thought of using robots.txt to allow access only to article and category pages:

Allow: /index.php/article/
Allow: /index.php/category/
Disallow: *

Is this syntax correct? Would you do it differently?
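
Written out in full with a User-agent line, I mean something like this (assuming those two prefixes really cover all of my article and category URLs, and relying on crawlers such as Google and Bing that support Allow and pick the longest matching rule):

User-agent: *
Allow: /index.php/article/
Allow: /index.php/category/
Disallow: /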

Looks like there is a big page already dedicated to this on the MediaWiki site:
https://www.mediawiki.org/wiki/Manual:Robots.txt


Any instruction to robots, either through a meta tag or robots.txt, is only ever advisory, it does not enforce anything.
This means that “good” robots should obey your rules, but there is nothing to stop “bad” robots from ignoring them.
The chances are, if robots are ignoring the rules in one place (meta tag) they will ignore them elsewhere too.

Have you established whether this is really the actual problem?
Have you seen that pages that should not be indexed are indexed?
It may just be you have unrealistic expectations about how quickly all your pages will be indexed.


Have you established whether this is really the actual problem?

I did see URLs of such pages in Google Search Console, which for me personally is a problem 🙂

Have you seen that pages that should not be indexed are indexed?

No, but my problem is one stage earlier: that they are even crawled.

It may just be you have unrealistic expectations about how quickly all your pages will be indexed.

It may indeed be the case, but given the special nature of MediaWiki with regard to meta-webpages, I would prefer not to take any chances with this.
If it were something minor, like WordPress creating a webpage for each uploaded image (I don’t know if WordPress even still does this), then I wouldn’t care, but since there can easily be 100-200 URLs per webpage, it’s a bit of an anxiety source for me.

One way to prevent pages being crawled (as opposed to just preventing indexing) is to make links to those pages rel="nofollow". Again, this is only advisory to the bots; nothing is enforced.
But I have a feeling it will do more harm than good from an SEO perspective (assuming that is your concern), as it may inhibit the flow of the crawl from page to page, which could lead to missing pages you want indexed.
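
For illustration (the link target here is just a made-up example), such a link would look like:

<a href="/index.php?title=Talk:Example" rel="nofollow">Discussion</a>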

Have you looked at Google’s guidance?

I think that nofollow isn’t good when the links are internal.
Even if it can help, I don’t mean to try to intervene in the MediaWiki PHP 😳

Not before you posted it 🥰 but I didn’t find anything in that specific table of contents that covers my particular problem.

Perhaps I should just try to prevent the links to all these webpages from being created on the back end.

So far, I’ve prevented human access to these pages by giving each link to them display: none in CSS, but something on the back end should prevent bot access as well, I guess.
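
Something along these lines, where the element ids are only illustrative since they depend on the skin in use:

/* Hide the tabs and toolbox links that point to meta pages; ids vary by skin */
#ca-talk,
#ca-history,
#t-whatlinkshere,
#t-recentchangeslinked,
#t-print,
#t-permalink,
#t-info {
    display: none;
}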

Would you agree, @SamA74?

Did you read the link I posted in #2?

It has a section that specifically tells you how to set your robots.txt file to not crawl non-article pages.


If you don’t want these pages to be viewed at all, by anyone, human or bot, that is a different matter.
I’m not familiar with MediaWiki, so I don’t know if it has a facility to disable these pages from even existing (or being served).
But it should be possible to set your server not to serve them to anyone.
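
As a rough, untested sketch: with Apache and mod_rewrite you could refuse to serve some of the meta-page URLs outright, for example by returning 403 for a few of the query-string patterns MediaWiki uses. The exact patterns depend on how your wiki’s URLs are configured, and note this blocks humans as well as bots:

<IfModule mod_rewrite.c>
    RewriteEngine On
    # Refuse history, page-information, edit and raw views reached via the query string
    RewriteCond %{QUERY_STRING} (^|&)action=(history|info|edit|raw) [OR]
    # Refuse diff and old-revision views
    RewriteCond %{QUERY_STRING} (^|&)(diff|oldid)=
    RewriteRule ^ - [F]
</IfModule>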

I have read it already, yes.

That section wasn’t very clear to me, especially why a distinction between PHP as CGI and PHP as an Apache module was made.

Moreover, I didn’t find there a directive in the pattern of “prevent crawling of anything which isn’t something”, only “prevent crawling of anything which is something”, which I desire less.

It’s right here in this section.

Pretty much exactly what you asked for. It has different examples based on the types of URLs you are running.


Thanks.
I actually want to serve them, just not to link to them from anywhere.
It’s a tough case.

I have read it already.

I didn’t find there a directive in the pattern of “prevent crawling of anything which isn’t something”, only “prevent crawling of anything which is something”.
I would prefer the first approach, because if future versions of MediaWiki include new URL patterns, those new patterns would be crawlable unless I had disallowed them in advance.
But that might be a limitation of the current robots.txt syntax which I have to adapt to.

Also, it’s unclear to me why the distinction between PHP as CGI and PHP as an Apache module was made, and especially whether it’s connected to the two code examples there, this:

User-agent: *
Disallow: /index.php?diff=
Disallow: /index.php?oldid=
Disallow: /index.php?title=Help
Disallow: /index.php?title=Image
Disallow: /index.php?title=MediaWiki
Disallow: /index.php?title=Special:
Disallow: /index.php?title=Template
Disallow: /skins/

and this:

User-agent: *
Disallow: /index.php?
Disallow: /index.php/Help
Disallow: /index.php/MediaWiki
Disallow: /index.php/Special:
Disallow: /index.php/Template
Disallow: /skins/

That whole section I linked to is entitled “Prevent crawling of non-article pages”. Or, to put it into your vernacular, “Prevent crawling of anything which isn’t an article page”.

Then it comes down to how your site is rendered. Are your links

www.example.com/wiki/Content 
www.example.com/index.php?title=Content
www.example.com/index.php/Content

Then put the appropriate robots.txt entry in yours and see what happens.
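
For example, if your articles are served as www.example.com/wiki/Content and everything else goes through /index.php, then a robots.txt along these lines (only a sketch, assuming that URL layout) keeps article pages crawlable while blocking the meta pages:

User-agent: *
Disallow: /index.php
Disallow: /skins/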

That’s how the section is titled, yes,
but the directive I seek isn’t included in that section.
As I later understood, that’s because of limitations in the current robots.txt syntax.
