Additional hacked pages still showing in Google two weeks after removal

Hi all, I haven’t come across this before, so I’m hoping someone can help.

Two weeks ago, I noticed a client’s PHP website had been hacked. I easily found the rogue JS file and removed it, which got rid of all of the Japanese pages (approximately 5,000 additional pages). All of those pages now return 404s.
However, searches in Google still return the results OVER TWO WEEKS LATER.

After one week, I edited the robots.txt file to block these pages. I can verify in Google Search Console that the ‘extra pages’ (which are now removed and do not exist) are blocked.
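For illustration, the disallow rule looks something like this (the /jp/ path is a made-up placeholder here; the real rule matches whatever prefix the hacked URLs share):

    User-agent: *
    Disallow: /jp/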

Does anyone have any ideas on what I’m missing? Why are these removed pages still showing in Google, and why is Google not respecting the disallow instructions in robots.txt?

When something similar happened to one of my sites, I ended up having to remove the URLs from Google’s index via the option in Search Console. They will vanish eventually, but I found it was taking an awfully long time. I didn’t have as many URLs to deal with; around 350, IIRC.

The URLs won’t automatically disappear until Google recrawls them enough times to “decide” they no longer exist. (I don’t know how many times that is, but I’m pretty sure it doesn’t do it the first time it encounters a 404, which may simply be a one-off glitch.) That could take a long time with so many URLs.

Unfortunately, robots.txt doesn’t guarantee that Google will drop your pages from its index; it only asks crawlers not to fetch them, and if there are links from other sites the URLs can still be indexed. https://support.google.com/webmasters/answer/6062608?hl=en


I had a similar problem. The site wasn’t actually hacked, thanks to the firewall, but somehow the would-be hackers convinced Google, Bing, Yahoo etc. that the pages existed and got them indexed, and despite serving 404s they are still indexed months after the event.


Hey, thanks for the reply.

Re the robots.txt, I’m ‘lucky’ in that the additional URLs are only reached via internal links - no external links at all. I do understand though that robots.txt is a ‘request’ and not a guarantee to a search engine.

Thanks for sharing your experience. It’s interesting that you had to remove the pages yourself - did you have to do each of the 350 manually? I don’t fancy doing that for this website (the site itself is not particularly popular or big).

Is that because of external links to your website? I wonder whether disavowing would help in your situation?

Yes. I did them in batches over a couple of days.

This was several years ago, in the days of GWT, rather than Search Console, and the actual mechanism for removing a URL has changed slightly, but as far as I know, you’d still need to do them one at a time.

I don’t think that’s the issue here. The issue is that the spurious URL on the site needs to be removed; disavow only tells Google to ignore the incoming link as a backlink (and they advise it should only be used if you have received a notice regarding low-quality links).


This may be relevant…

Quite some time ago I was using the canonical reference (rel="canonical") to redirect numerous pages, and read somewhere that Google gave more weight to pages with a 301 redirect.


I don’t think that’s the issue here. The issue is that the spurious URL on the site needs to be removed; disavow only tells Google to ignore the incoming link as a backlink (and they advise it should only be used if you have received a notice regarding low-quality links).

Yes, I meant that it might’ve helped with an issue like gandalf458’s. I agree that it would be no use for my issue.

Yes. I did them in batches over a couple of days.

Ugh, horrible!!

In my case, it is a dormant domain with no content, nothing other than a home page saying:

This website is dead.
It is an ex-website.
It has passed on.
This website is no more.
It has ceased to be.
It has expired and gone to meet its maker.

so it’s not something I have spent much time or energy on, but I am struggling to understand how the search engines can be persuaded that content exists when it blatantly doesn’t.


It can seem to take forever for 404 pages to get dropped from the index. I’ve never had a problem with hacking, but I have seen old, obsolete, removed pages appearing as 404 errors in GSC years after they were gone.
The only thing I have found that fixes these is to 301 them, but that’s only really a valid option if there is existing equivalent content, which in the case of “hack pages” there is not.
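In PHP terms, the 301 approach is just something like this sketch (both paths are placeholders; in practice you’d more likely handle it in the server config or a front controller):

    <?php
    // A sketch of a 301: permanently redirect an obsolete URL to its nearest
    // equivalent page. Both paths below are placeholders.
    if ($_SERVER['REQUEST_URI'] === '/old-page') {
        header('Location: /new-page', true, 301);
        exit;
    }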


You can also use 410 for pages that have ceased to be.
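On a PHP site that’s just a matter of sending the status code before any output; a minimal sketch, assuming the dead URLs share a recognisable prefix (the /removed/ path is made up here):

    <?php
    // A sketch: answer 410 Gone for URLs that have been permanently removed.
    // The '/removed/' prefix is a made-up placeholder.
    if (strpos($_SERVER['REQUEST_URI'], '/removed/') === 0) {
        http_response_code(410);
        exit('Gone');
    }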


I tried that too, thinking it would be a message to the spiders saying “Hey, forget about this, it’s gone, drop it from the index”, but the URLs still appear in “Crawl Errors” in GSC.


That’s odd. I understood that to be the main purpose of 410. Clearly these bots aren’t as clever as they are made out to be!


One thing that maybe I should have mentioned in my OP is that in GSC, crawl errors (for these hacked, now-deleted pages) appear every day. For example, in this pic you can see the error was detected on the 10th of this month, but the pages (and the links to the hacked pages) were removed back in July. I don’t understand how errors are being detected on a day when there is no error. It seems impossible.

Interesting, and do they still appear in actual Google search? When you do a search like site:example.com in Google, do you see your hacked pages, even after the 410?
