Googlebot is almost DDOSing my website

cnxsoft · November 13, 2021, 12:43pm

I noticed high CPU usage on my server, but not coming from regular traffic. Instead, the crawl stats from the Google Search Console look like this:

Is that a good thing, or should I worry about it (e.g. configuration issue on my site)?

rpkamp · November 14, 2021, 10:43am

How many pages does your site have?

cnxsoft · November 14, 2021, 12:24pm

Based on the Google Search Console’s Coverage report, around 690K: 43.1K are marked as Valid and 650K as Excluded, most of them for the reason “Alternate page with proper canonical tag”.

rpkamp · November 14, 2021, 1:24pm

And according to yourself and/or your database? It doesn’t really help to compare google against google to try and figure out what’s going on.

cnxsoft · November 14, 2021, 1:41pm

Sorry. I used Google data because it’s a bit difficult to answer. It’s a WordPress installation. I have over 9000 posts, a dozen pages, about 10 authors, 190 categories, and around 1700 tags.

rpkamp · November 14, 2021, 1:52pm

Right. Then those requests do seem a bit extreme. Do you have some sort of logs to see what was crawled? Maybe it is ending up in some kind of loop somewhere?
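To see what is being crawled, one option is to tally the query-parameter names in requests that identify themselves as Googlebot. The following is only a minimal sketch: it assumes the server writes the common "combined" log format, the function name and the sample lines are made up for illustration, and matching on the user-agent string alone does not prove a request really came from Google.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Assumption: Apache/nginx "combined" log format. Adjust the pattern otherwise.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_googlebot_params(lines):
    """Count query-parameter names in requests whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue  # unparseable line or a different client
        query = urlsplit(match.group("target")).query
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        # Requests without a query string are counted under their own label.
        counts.update(names or {"(no query string)"})
    return counts


# Hypothetical sample lines, not taken from the real access log.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE = [
    f'192.0.2.1 - - [15/Nov/2021:14:11:00 +0000] "GET /a-post/?amp=1 HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.1 - - [15/Nov/2021:14:11:01 +0000] "GET /a-post/?wptouch_switch=desktop HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    '192.0.2.7 - - [15/Nov/2021:14:11:02 +0000] "GET /a-post/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

for name, hits in count_googlebot_params(SAMPLE).most_common():
    print(name, hits)
```

On a real log, passing the open file object instead of `SAMPLE` should work, since the function only iterates over lines. A parameter name that dominates the tally would be a first hint at where the crawler is spending its requests.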

cnxsoft · November 15, 2021, 2:11pm

Yes, I can provide the access log for November: https://mega.nz/file/64AASRjR#hfQg7Xz0uIdnoMgCxv1IHDOR8PTn2RfD2-DlKZ-V00c

The Google Search Console also shows several URLs with query string parameters and redirects.

?amp=1 is normal. I don’t quite understand the wptouch_switch parameter, since I disabled that plugin several months ago. Maybe it’s just linked from some other websites.

rpkamp · November 15, 2021, 10:13pm

It would seem Google already has all those URLs in its database, so it will continue to crawl them.

Moreover, your website just serves the actual page and ignores those query parameters now that the plugin is no longer installed. Luckily, there is a canonical tag pointing to the URL without those query parameters, so I don’t think you’ll get in trouble with duplicate content or anything like that.

It might be advisable though to let any URL that contains wptouch_ in the query string return a 404 to get Google to purge the URL from its search results. Beware though that I’m not an SEO expert and can’t tell 100% whether that’ll work and/or whether it’s better to leave it as is.

SpacePhoenix · November 16, 2021, 9:27am

Wouldn’t “410 Gone” be a better choice?

From: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_client_errors

410 Gone

Indicates that the resource requested is no longer available and will not be available again. This should be used when a resource has been intentionally removed and the resource should be purged. Upon receiving a 410 status code, the client should not request the resource in the future. Clients such as search engines should remove the resource from their indices. Most use cases do not require clients and search engines to purge the resource, and a “404 Not Found” may be used instead.
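As a rough sketch of that rule, the decision could look like the snippet below. It is written in Python purely to illustrate the logic; on a WordPress site the same check would more likely live in the web server configuration or in PHP. The `wptouch_` prefix comes from this thread, while the function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

RETIRED_PREFIX = "wptouch_"  # prefix from the thread; everything else is illustrative


def status_for(url):
    """Return 410 if any query-parameter name starts with the retired prefix, else 200."""
    query = urlsplit(url).query
    for name, _ in parse_qsl(query, keep_blank_values=True):
        if name.startswith(RETIRED_PREFIX):
            return 410  # Gone: ask crawlers to drop this URL
    return 200  # serve the page as usual


# Hypothetical URLs on a placeholder domain.
for url in (
    "https://example.com/a-post/?wptouch_switch=desktop",
    "https://example.com/a-post/?amp=1",
):
    print(url, status_for(url))
```

Checking parameter names rather than searching the raw query string avoids false matches when `wptouch_` only appears inside a parameter value.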

rpkamp · November 16, 2021, 9:44am

That does sound better, yes.

cnxsoft · November 16, 2021, 9:49am

Alternatively, I thought about using a 301 permanent redirect.
What surprised me is that Google’s crawler generated around 10k to 20k requests a day for months, if not years, and then suddenly decided to send 300k-400k crawl requests my way for no apparent reason.
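For the 301 option, the redirect target would be the same URL with the leftover plugin parameters removed. Here is a minimal Python sketch of that computation, assuming that only parameters whose names start with `wptouch_` should be dropped and that others such as `amp` are kept. The function name and example URLs are hypothetical, and a real WordPress setup would do this in the server configuration or in PHP.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def redirect_target(url, retired_prefix="wptouch_"):
    """Return the URL to 301-redirect to, or None if no retired parameter is present."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs
            if not name.startswith(retired_prefix)]
    if len(kept) == len(pairs):
        return None  # nothing to strip, so no redirect is needed
    # Rebuild the URL with only the parameters that are still meaningful.
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs on a placeholder domain.
for url in (
    "https://example.com/a-post/?wptouch_switch=desktop",
    "https://example.com/a-post/?amp=1&wptouch_switch=desktop",
    "https://example.com/a-post/?amp=1",
):
    print(url, "->", redirect_target(url))
```

Returning `None` when nothing was stripped keeps the redirect from looping back onto a URL that is already clean.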
