I have a MediaWiki website.
MediaWiki creates at least about 15 auxiliary webpages per content webpage:
Talk webpage
History webpage
Revision webpages
Diff webpages
What-links-here webpage
Recent-changes-in-webpages-linked-from-here
Printable version webpage
Permalink version
Page-information webpage
Source code webpage / Edit webpage
Statistics webpages
And probably more
The total number of webpages might reach 150-1,500 or far more.
Having so many webpages per content webpage has tremendously inhibited the crawling of my website, to the extent that SEO is damaged, even though software performance is decent and the content is abundant and rewarding, with good feedback from readers.
Although most of the webpages I’ve listed above have a noindex attribute, I believe that I should still limit access to them on the back end somehow.
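For reference, by “noindex attribute” I mean the standard robots meta tag that these pages carry in their HTML head, along these lines (the exact content value may vary):

    <!-- typical robots meta tag on such auxiliary pages -->
    <meta name="robots" content="noindex,nofollow">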
I thought of using robots.txt to allow access only to article and category pages.
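Something along these lines is what I had in mind; this is only a sketch, assuming articles (and category pages, which sit at /wiki/Category:…) are served under short URLs at /wiki/, and I’m not sure every crawler honors the Allow directive:

    # Sketch: block everything, then re-allow the article path
    User-agent: *
    Disallow: /
    Allow: /wiki/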
Any instruction to robots, whether through a meta tag or robots.txt, is only ever advisory; it does not enforce anything.
This means that “good” robots should obey your rules, but there is nothing to stop “bad” robots from ignoring them.
The chances are that if robots are ignoring the rules in one place (the meta tag), they will ignore them elsewhere too.
Have you established whether this is really the actual problem?
Have you seen that pages that should not be indexed are indexed?
It may just be you have unrealistic expectations about how quickly all your pages will be indexed.
Have you established whether this is really the actual problem?
I did see URLs of such pages in Google Search Console, which for me personally is a problem.
Have you seen that pages that should not be indexed are indexed?
No, but my problem is at an earlier stage: that they are even crawled.
It may just be you have unrealistic expectations about how quickly all your pages will be indexed.
That may indeed be the case, but given the special nature of MediaWiki with regard to these meta-webpages, I would prefer not to take any chances with this.
If it were something minor, like WordPress creating a webpage for each uploaded image (I don’t know if WordPress even still does this), then I wouldn’t care; but since there can easily be 100-200 URLs per single webpage, it’s a bit of a source of anxiety for me.
One way to prevent pages from being crawled (as opposed to just preventing indexing) is to make the links to those pages rel="nofollow". Again, this is only advisory to the bots; nothing is enforced.
But I have a feeling it will do more harm than good from an SEO perspective (assuming that is your concern), as it may inhibit the flow of the crawl from page to page, which could lead to missing pages you do want indexed.
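For illustration, a link rendered with that attribute would look something like this (the URL is just a made-up example):

    <a href="/index.php?title=Some_Page&amp;action=history" rel="nofollow">View history</a>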
Perhaps I should just try to prevent the links to all these webpages from being created in the first place, on the back end.
So far, I’ve prevented human access to these pages by giving each link to them display: none in CSS, but I guess something on the back end should prevent bot access as well.
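For example, rules roughly like these (a sketch assuming the default Vector skin’s element IDs; other skins use different IDs):

    /* Hide the history tab and the "What links here" sidebar link (Vector skin IDs) */
    #ca-history,
    #t-whatlinkshere {
        display: none;
    }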
If you don’t want these pages to be viewed at all, by anyone, human or bot, that is a different matter.
I’m not familiar with MediaWiki, so I don’t know if it has a facility to disable these pages from even existing (or being served).
But it should be possible to set your server not to serve them to anyone.
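For example, if those auxiliary pages are all served through index.php with an action parameter in the query string, something along these lines in Apache could refuse to serve them at all. This is only a rough sketch under that assumption; I don’t know MediaWiki’s URL layout, and a rule like this would also block legitimate editing:

    # Rough sketch for httpd.conf or .htaccess (requires mod_rewrite):
    # return 403 Forbidden for any request whose query string contains an action parameter
    RewriteEngine On
    RewriteCond %{QUERY_STRING} (^|&)action= [NC]
    RewriteRule ^ - [F]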
That chapter wasn’t very clear to me, especially the distinction it draws between PHP as CGI and PHP as an Apache module.
Moreover, I didn’t find there a directive in the pattern of “prevent crawling of anything which isn’t something”; I only found directives in the pattern of “prevent crawling of anything which is something”, which I desire less.
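In other words, the examples I found there are of this kind (my own sketch, not a quote from that manual page; the exact paths depend on the URL scheme):

    # Block specific, known non-article paths one by one
    User-agent: *
    Disallow: /index.php
    Disallow: /w/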
I would prefer the former pattern because, with the latter, any new URL patterns that future versions of MediaWiki introduce would be crawlable unless I had disallowed them in advance.
But that might be a limitation of the current robots.txt syntax, which I will have to adapt myself to.
Also, it’s unclear to me why the distinction between PHP as CGI and PHP as an Apache module was made there, and whether it’s connected to the two code examples in that section.
The whole section I linked to is entitled “Prevent crawling of non-article pages”, or, to put it in your vernacular, “Prevent crawling of anything which isn’t an article page”.
Then it comes down to how your site is rendered. Are your links in the /index.php?title=Page_name form, or do you use short URLs such as /wiki/Page_name?
That’s how the section is titled, yes, but the directive I’m looking for isn’t included in that section.
As I later understood, that is because of limitations in the current robots.txt syntax.