I have a MediaWiki website.
MediaWiki creates at least 15 additional webpages for each content webpage:

  • Talk webpage
  • History webpage
  • Revision webpages
  • Diff webpages
  • What-links-here webpage
  • Recent-changes-in-webpages-linked-from-here
  • Printable version webpage
  • Permalink version
  • "Information about this webpage" webpage
  • Source code webpage / Edit webpage
  • Statistics webpages
  • And probably more

The total number of webpages might reach 150-1500 or much more.

Having so many webpages per webpage has tremendously inhibited the crawling of my website, to the extent that SEO is damaged, even though software performance is decent and the content is abundant and rewarding, with good feedback from readers.

Although most of the webpages I’ve listed have a noindex attribute, I believe that I should still limit access to them on the backend somehow.

I thought of using robots.txt to allow access only to article and category pages.

Allow: /index.php/article/
Allow: /index.php/category/
Disallow: *

Is this syntax good? Would you do something otherwise?

It looks like there is a big page already dedicated to this on the MediaWiki site:
https://www.mediawiki.org/wiki/Manual:Robots.txt
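
On the syntax itself: robots.txt rules need to sit under a User-agent line, and the conventional way to block everything is Disallow: / rather than Disallow: *. As a rough sketch, not a definitive answer, the rules can be tried out locally with Python's standard urllib.robotparser before deploying them. The paths are modelled on the ones in the question, and the sample URLs and the "ExampleBot" name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules modelled on the paths in the question;
# adjust them to the wiki's real URL layout.
ROBOTS_TXT = """\
User-agent: *
Allow: /index.php/article/
Allow: /index.php/category/
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Made-up sample URLs: two content pages and two auxiliary pages.
SAMPLE_URLS = [
    "/index.php/article/Main_Page",
    "/index.php/category/Help",
    "/index.php/special/RecentChanges",
    "/index.php?title=Main_Page&action=history",
]

for url in SAMPLE_URLS:
    # "ExampleBot" is a placeholder user agent; it falls under "User-agent: *".
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "blocked"
    print(f"{verdict:8} {url}")
```

Python's parser applies the first rule that matches, in file order, while some search engines pick the most specific matching rule instead. With the rules above both approaches should give the same result, but it is worth confirming with the search engine's own robots.txt testing tool.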

Any instruction to robots, either through a meta tag or robots.txt, is only ever advisory; it does not enforce anything.
This means that “good” robots should obey your rules, but there is nothing to stop “bad” robots from ignoring them.
Chances are that if robots ignore the rules in one place (the meta tag), they will ignore them elsewhere too.

Have you established whether this is really the actual problem?
Have you seen that pages that should not be indexed are indexed?
It may just be you have unrealistic expectations about how quickly all your pages will be indexed.

Have you established whether this is really the actual problem?

I did see URLs of such pages in Google Search Console, which for me personally is a problem 🙂

Have you seen that pages that should not be indexed are indexed?

No, but my problem is one stage earlier: that they are crawled at all.

It may just be you have unrealistic expectations about how quickly all your pages will be indexed.

That may indeed be the case, but given how many meta-webpages MediaWiki generates, I would prefer not to take any chances with this.
If it were something minor, like WordPress creating a webpage for each uploaded image (I don’t know if WordPress even still does this), then I wouldn’t care, but since there can easily be 100-200 URLs per webpage… It’s a bit of an anxiety source for me.

One way to prevent pages being crawled (as opposed to just preventing indexing) is to make links to those pages rel="nofollow". Again, this is only advisory to the bots, nothing is enforced.
But I have a feeling it will do more harm than good from an SEO perspective (assuming that is your concern), as it may inhibit the flow of the crawl from page to page, which could lead to crawlers missing pages you want indexed.

Have you looked at Google’s guidance?

I think that nofollow isn’t good when the links are internal.
Even if it can help, I don’t intend to modify MediaWiki’s PHP 😳

Not before you posted 🥰, but I didn’t find anything in that specific table of contents that covers my particular problem.

Perhaps I should just try to prevent, on the backend, the links to all these webpages from being created.

So far, I’ve prevented human access to these pages by giving each link to them display: none in CSS, but something on the backend should prevent bot access as well, I guess.

Would you agree, @SamA74?