Googlebot is almost DDOSing my website

I noticed high CPU usage on my server, but not coming from regular traffic. Instead, the crawl stats from the Google Search Console look like this:

Is that a good thing, or should I worry about it (e.g. configuration issue on my site)?

How many pages does your site have?

Based on the Google Search Console’s Coverage report, around 690K: 43.1K are marked as Valid and 650K as Excluded, most of them for the reason “Alternate page with proper canonical tag”.

And according to yourself and/or your database? It doesn’t really help to compare google against google to try and figure out what’s going on.

Sorry. I used Google data because it’s a bit difficult to answer. It’s a WordPress installation. I have over 9000 posts, a dozen pages, about 10 authors, 190 categories, and around 1700 tags.

Right. Then those requests do seem a bit extreme. Do you have some sort of logs to see what was crawled? Maybe it is ending up in some kind of loop somewhere?

Yes, I can provide the access log for November: https://mega.nz/file/64AASRjR#hfQg7Xz0uIdnoMgCxv1IHDOR8PTn2RfD2-DlKZ-V00c
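
For anyone wanting to dig through a log like this, here is a minimal sketch of how the Googlebot requests could be tallied per day and per query parameter name. It assumes the log uses the common Apache/Nginx "combined" format, which may not match the actual file, and the function name `summarize_googlebot` is just made up for illustration. It also trusts the User-Agent string, which can be spoofed.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Assumed layout (combined log format):
# ip - - [day:time zone] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def summarize_googlebot(lines):
    """Count Googlebot requests per day and per query parameter name."""
    per_day = Counter()
    per_param = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        # Skip lines that do not parse or do not claim to be Googlebot.
        if not m or "Googlebot" not in m.group("agent"):
            continue
        per_day[m.group("day")] += 1
        query = urlsplit(m.group("path")).query
        for name, _ in parse_qsl(query, keep_blank_values=True):
            per_param[name] += 1
    return per_day, per_param
```

Feeding it the lines of the access log (for example `summarize_googlebot(open("access.log"))`, with a hypothetical file name) and looking at `per_param.most_common(10)` should show quickly whether one parameter accounts for most of the crawl.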

The Google Search Console also shows several links with string parameters and redirects.

?amp=1 is normal. I don’t quite understand the wptouch_switch parameter, since I disabled that plugin several months ago; maybe those links are just on some other websites.

It would seem Google already has all those URLs in its database, so it will continue to crawl them.

Moreover, your website just serves the actual page and ignores those query parameters now that the plugin is no longer installed. Luckily, there is a canonical tag pointing to the URL without those query parameters, so I don’t think you’ll get in trouble with duplicate content or anything like that.

It might be advisable though to let any URL that contains wptouch_ in the query string return a 404 to get Google to purge the URL from its search results. Though beware, I’m not an SEO expert and can’t 100% tell if that’ll work and/or if it’s better to leave it as is.

Wouldn’t “410 Gone” be a better choice?

From: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_client_errors

410 Gone

Indicates that the resource requested is no longer available and will not be available again. This should be used when a resource has been intentionally removed and the resource should be purged. Upon receiving a 410 status code, the client should not request the resource in the future. Clients such as search engines should remove the resource from their indices. Most use cases do not require clients and search engines to purge the resource, and a “404 Not Found” may be used instead.

That does sound better, yes.
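
To make the idea concrete, here is a rough sketch of the decision logic only, not of how WordPress or the web server would actually be configured. It assumes the leftover parameters can be recognised by a name starting with `wptouch_`; the function name `status_for` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

def status_for(url: str) -> int:
    """Return 410 if any query parameter name starts with 'wptouch_', else 200."""
    query = urlsplit(url).query
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    # Assumption: only the parameter *names* are checked, not their values.
    if any(name.startswith("wptouch_") for name in names):
        return 410
    return 200
```

With this rule, a URL such as `/post/?wptouch_switch=desktop` would get a 410, while `/post/?amp=1` and the plain `/post/` would be served as usual.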

Alternatively, I thought about using a 301 permanent redirect.
What surprised me is that Google’s crawler generated around 10k to 20k requests a day for months, if not years, and suddenly decided to send 300k-400k crawl requests my way for no apparent reason.
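
For the 301 idea, a minimal sketch of how the redirect target could be computed: drop every query parameter whose name starts with `wptouch_` and keep the rest (so `?amp=1` survives). The prefix rule and the function name `redirect_target` are assumptions for illustration; how this would be wired into the server is left open.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def redirect_target(url, prefix="wptouch_"):
    """Return the URL without parameters named prefix*, or None if nothing changes."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if not k.startswith(prefix)]
    if len(kept) == len(pairs):
        return None  # no matching parameter, so no redirect is needed
    # Rebuild the URL with the remaining parameters re-encoded.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

A request for `/post/?wptouch_switch=desktop&amp=1` would then be redirected to `/post/?amp=1`, and requests without such a parameter would be left alone.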