By Mihaela Lica

More Google SEO Myths Exposed

In 2008, Google’s Susan Moskwa wrote an article on the Google Webmaster Central Blog titled Demystifying the “duplicate content penalty”. One year later, many webmasters still look at “duplicate content” without really understanding what it is and what it does.

An older article on the same site, Deftly dealing with duplicate content, clarifies most of the issues related to this topic. Although the article is dated 2006, things haven’t changed that much in the “duplicate content” camp. Google still wants you to optimize your sites and block all on-site duplicate content appropriately. Google still wants you to keep your internal linking consistent, to handle country-specific content with appropriate TLDs, and to use the preferred domain feature in webmaster tools. Boilerplate repetition rules haven’t changed, and Google still doesn’t like publishing stubs (many sites still get away with such practices, but not for long).
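To make the "consistent internal linking" advice concrete, here is a minimal sketch in Python that maps common variants of an internal URL onto a single form before a link is written into a page. The host names and the `canonical_url` helper are hypothetical, chosen only for illustration; this is not a Google tool, and it deliberately does not guess whether `/page` and `/page/` are the same document.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only: a hypothetical site and its preferred host.
BARE_HOST = "example.com"
PREFERRED_HOST = "www.example.com"


def canonical_url(url: str) -> str:
    """Map common variants of an internal URL onto one consistent form."""
    scheme, host, path, query, _fragment = urlsplit(url)
    host = host.lower()
    # Treat the bare domain and the preferred domain as the same site.
    if host == BARE_HOST:
        host = PREFERRED_HOST
    # /page/index.htm and /page/index.html point at the same document as /page/.
    for index_name in ("index.htm", "index.html"):
        if path.endswith("/" + index_name):
            path = path[: -len(index_name)]
    if not path:
        path = "/"
    # Fragments never reach the server, so drop them from the canonical form.
    return urlunsplit((scheme.lower(), host, path, query, ""))


if __name__ == "__main__":
    for variant in ("http://example.com/page/index.htm",
                    "http://WWW.Example.com/page/"):
        print(variant, "->", canonical_url(variant))
```

Running every internal link through one function like this is only one way to stay consistent; server-side redirects and the preferred domain setting mentioned above address the same problem from the other end.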

It’s still astonishing that Google advises not to worry about scrapers, since these often rank above the original content in its SERPs (image below).


As a general rule, scraper sites don’t hurt as much as on-site duplicate content – and did you know that your website template counts in the duplicate content calculation too? Behind what the eye sees there is the HTML code that uses “words” to generate the visible layout. If those words in the code outnumber the actual text of an article by 70 % you might have duplicate content issues. So, if you have this possibility, don’t be too lazy to write longer texts.
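If you want a rough idea of how much of a page is template code and how much is article text, a small Python sketch like the one below can help. It is only an approximation under stated assumptions: it counts non-whitespace characters rather than "words", it ignores the contents of script and style blocks, and the 70 % figure above is the claim made in this article, not a threshold taken from Google documentation. The names `TextCollector` and `markup_share` are made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def markup_share(html: str) -> float:
    """Fraction of the page's non-whitespace characters that are not visible text."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    visible = "".join("".join(parser.chunks).split())
    total = "".join(html.split())
    if not total:
        return 0.0
    return 1 - len(visible) / len(total)


if __name__ == "__main__":
    page = "<html><body><div class='post'><p>A short post.</p></div></body></html>"
    print(f"Markup share: {markup_share(page):.0%}")
```

A short post inside a heavy template will show a high markup share, which is the situation the paragraph above warns about; writing longer texts lowers the number.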

Last but not least, you should know that on-site duplicate content can also influence your PageRank. You probably thought that when you publish a new site the PageRank is zero. Wrong. Every time you publish a new page it will have a PageRank greater than zero. Google assigns PageRank to sites just for existing – when they first appear the PageRank is based on internal linking structure and content, and not calculated based on external links. What influences the PageRank in such a situation is the content of the site and the number of non-duplicate pages (the more the merrier). I hope this gives at least one answer to those who saw their new sites get, for example, a PR4, and were then puzzled when their PR dropped. PageRank drops because Google’s generosity doesn’t last long. If new pages get a PR greater than usual, take advantage of the opportunity and try to get as many external links as possible to support them, or else Google’s next PR update will make you wonder WTF!
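The "greater than zero" part can be illustrated with the textbook PageRank formula, in which every page receives a small baseline share regardless of its inbound links. The Python sketch below is a simplified power iteration over a tiny, hypothetical three-page site; the page names and the 0.85 damping value are assumptions for the example, and nothing here claims to reproduce what Google actually computes or displays in the toolbar.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified textbook PageRank over a dict of page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same baseline share.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical site: two pages link to each other, one page has no links at all.
    site = {"home": ["about"], "about": ["home"], "orphan": []}
    for page, score in pagerank(site).items():
        print(f"{page}: {score:.3f}")
```

In this model the page with no inbound links still ends up with a score above zero, while the two pages that link to each other score higher, which is consistent with the idea that links are what lift a page beyond its starting value.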

  • Are you sure that website templates counted in the content duplication calculation? I would think Google does the calculation based on the ‘real words’ as they are the information that searchers read and respond to.

    Calculating duplicated HTML code seems unnecessary and would probably misjudge a site that provides lots of useful information but is built with one of those common templates, e.g. WordPress. What do you think?

  • What you’ve underlined in the SERP screenshot isn’t the original article; as far as I can tell, Google Blog Search doesn’t seem to know about the original article at all, so saying that others rank above it is a bit misleading.

  • You are right Filip, I didn’t even notice that the link was not right. However, this only proves that scrapers are the worst thing that can happen to a site, if Google blog search completely ignores the original article. Google organic search shows the article at the top of the results, but as you scroll down you will see many scrapers. (http://www.google.com/search?hl=en&q=how+google+really+wants+you+to+optimize+your+site&btnG=Search)
    I think Google is misleading us when they advise webmasters not to sweat over scrapers. Obviously they are not able to identify the original content.

  • I don’t think it proves anything about scrapers; but it does indicate that Google Blog Search might be buggy.

    Don’t get me wrong: scrapers are bad, but let’s at least dislike them for the right reasons.

  • As I said, look also at the organic results – are those “buggy” too? I think it is totally unfair that scrapers rank anyway. And what this proves is that Google is unable to recognize the original site – so the “duplicate content penalty” is not quite a myth after all, when the victim is the original content producer.

  • I still believe in the Google sandbox (which some say is a myth), and in other ways, I think their search results improperly rate sites (even other people’s sites).

    It’s hard to tell a new customer that they aren’t going to have good organic search results for a while, and there’s nothing they can do about it, short of purchasing online advertising.

  • I think the organic results look fine. I get one scraper at #5 (keeping in mind that we may get notably different sets of results). I don’t think that substantially affects the traffic to or perception of the original SitePoint article.

    Google is indeed unable to recognize which is the original content. That’s the problem. It’s an extremely difficult problem to solve, but I think Google handles it reasonably well in this case. I would love it if they did a lot better. But I also agree with Google’s advice not to fret too much about it. (Of course, reporting offenders you do stumble upon is still a very good idea.)

    Naturally, the best thing would be if nobody republished content without permission. It really is unfair.

  • @skunkbad – I believe in the sandbox too and I am with you on the rating/ranking part. Obviously Google blog search uses a similar algorithm to rank sites; Google will not develop new algorithms for “blog search” or other Google Search venues – it’s not profitable.

    I have the same issue with convincing customers that good SEO is a long-term commitment. But sometimes even online advertising fails to bring results (including Google AdWords) because people simply don’t like clicking on ads. It’s getting harder and harder to “make a mark” – the only way I see is to create powerful brands and information hubs. Google has the tendency to give brands priority over other sites, and information hubs (preferably multiple author sites) are also preferred.

  • @Filip, I guess we do get different results: I live in Germany, and I see three scrapers on the first page when searching on google.com (as I always do). In my view Google doesn’t handle this well, because they fail to deliver what they promise: quality, relevant content. Duplicates and scraper sites are not quality. Also, for people who depend on their content to monetize, Google being unable to recognize the original is really a hammer.

  • ricktheartist

    “Google assigns PageRank to sites just for existing – when they first appear the PageRank, and not calculated based on external links.”

    Is this a complete sentence? I am trying to make sure I fully understand the point here. It seems like you meant to explain what the initial page rank is based on, but that explanation is missing. Or maybe it is a simple typo and the “, and” should have been “is”, making the sentence “Google assigns PageRank to sites just for existing – when they first appear the PageRank is not calculated based on external links.”

    Clarification will be greatly appreciated.

  • orokusaki

    You guys always whine about nonsense that you don’t know very well (including SEO).

    Firstly, PageRank hasn’t had anything to do with the number of pages on your site for a very long time. A certain amount of PageRank was given out in such a manner in the beginning (years ago), but it is no longer done that way. PageRank is only given based on incoming links from other pages with PageRank. For instance, I own conficker.com and have 7 pages linking into it that are recognized by Google, and many more that have not been re-spidered since. I only have one page on the site; you can give it a try by visiting: google.com/search?q=site:conficker.com. The domain has only been registered for about 4.5 months, and already I have a PR2.

    Secondly, you’re confused about your other myth because of your search. You searched with quotes around your query. Google isn’t looking for a page themed about that phrase when you do that. It’s literally looking for that exact string. The scraper sites likely mentioned your title multiple times on the page, and therefore Google gave them temporary rank. Now your site ranks #1 for the query because of inbound links. I know this is tough stuff to understand. Do a little research before you post next time.

  • Hi Rick, you are right, the sentence was different – I cannot figure what happened and why it got cut like that. Probably when I edited the article in WP dashboard. The complete sentence was:
    “Google assigns PageRank to sites just for existing – when they first appear the PageRank is based on internal linking structure and content, and not calculated based on external links.”

    Thank you for pointing that out. I corrected the sentence now.

  • thegamecat

    RE scrapers – surely identifying the real content is the easiest thing ever – whichever is indexed first is the original – or extend the webmaster tools so that you have a key and every page you create pings webmastertools with your key and so you “book” the ownership of the article.

  • @thegamecat: Both methods would probably make the situation worse in most cases.

    Since republishing at a scraper site can be done with almost no delay at all, which page gets indexed first would be almost entirely random. (Also, different search engines would get different results.)

    With pinging – well, what stops a scraper from pinging with their own key, thus easily “owning” the content of everyone who doesn’t have automatic pinging in place? (And, again, should we ping all (whatever that means) search engines?)

    What Google already does – looking at inbound links and a number of other factors – is probably one of the better ways to handle this. It doesn’t stop scraping, but it does in most cases rank the “better” (which is hopefully the original) site higher. Couple it with reporting the worst offenders, and you’ll be fine.

  • @thegamecat, that sounds like a good idea, but I think a scraper might be designed more efficiently than most sites. A person that has a static site with original content can be scraped, and put on another site… and that other site can ping Google first. Only websites designed by advanced developers would have any advantage (well.. that could be good for business if you are an advanced developer!)

    It still comes down to competing with scrapers.

  • If you think the secret of success is doing well in Google…I wish you luck.

    All you need to do is follow Google’s simple guidelines. No more, no less. Don’t spend too much time on it. Spend that time getting products and services to customers!

  • bidhire

    WTF at the end???

    I thought this was a respectable website, WTF! :)

  • orokusaki

    If you have anything bad to say, don’t say it. If you say anything negative on here about a post, they simply delete it like a bunch of communists.

  • @orokusaki – I never deleted a comment, not even those that obviously criticized me, my style and my grammar errors, which are inherent since I am not an English native speaker. I am sure no one from SitePoint did either. We have a very strict comment policy. I find offensive the insinuation that we delete comments. Also, considering that I am a Romanian native, and that we lived over 50 years under communism, I am hurt by your somehow racist-political innuendo. Let’s stick to the topics of the article, shall we? We are not here to hurt each other, but to learn.

  • Sorry, scraper sites have helped build up too many of my blogs. I just make sure and include a footer link in each post, therefore getting credit(mostly) for the content that is scraped via RSS. Because of this I’ve got a lot of backlinks to specific articles and my PR has climbed to PR6 in recent roll outs.

    Although this is not specifically about duplicate content, it shows that scrapers do not always hurt you. There are a lot of tools in a black hat’s arsenal, and just looking at a few results in the SERPs is not enough to blanket blame or throw duplicate content under the bus. I’d be surprised if any scraper sites stay in the results for longer than a few weeks. You contradict yourself in regards to early PR and rank, but then don’t consider that when showing how powerful scraper sites are.

  • @cldnails – I do not think you are talking about the blog linked from your signature, which is PR2 as I see. I also seriously doubt that the PR6 you are talking about is a result of your site being scraped.

  • @Mihaela Lica of course I’m not talking about my personal blog. I’m making reference to my actual money making sites and blogs. The site linked is simply a ‘for me’ site and has nothing to do with my actual marketed sites that have been PR6.

    Furthermore, I’m not stating that the PR6 was JUST because of the scrapers, but they did help. That is based on link tracking and other marketing methods.

  • @cldnails – this is the most interesting statement I’ve ever heard, in 7 years of SEO, that scraper sites boost PR. I am really curious to learn what other “marketing methods” support this statement. Forgive me for not being a believer, but I really don’t see any good coming from scrapers (AKA site plagiarists).

  • @Mihaela Lica really the concept is simple, I encouraged people and automatic splogs to scrape my RSS feed to publish on their sites. The blog in question uses WordPress as a backend blogging script, so there are plenty of plugins available to help ensure that I get credit for what’s scraped.

    I use RSS Footer (http://yoast.com/wordpress/rss-footer/) to automatically add a link in the RSS feed for each post, with a link back to the original post and to my blog. Now, I understand that it can easily be removed, but more often than not I found that scraper sites won’t take the time to remove it from each individual post. Furthermore, my feed was scraped by many sites and continues to be and not just from a single host. Therefore, until deindexed in Google, those sites were giving me a backlink.

    Again, this is one small part of my marketing method. Yes, I did encourage people to steal my feed, since there is no way to keep them from doing it anyway. However, I did help ensure that a link was placed back to my site when automated, and that link held just as much weight as any other link from a site. Thus, this was a part of my overall link building strategy.

    You don’t have to be a believer to understand that every link counts, until deindexed. So past that, I’m not sure I can convince you.

  • @cldnails, well, from the perspective of “every link counts until deindexed” I guess the theory is viable. However, I would not encourage other people to use this technique. PageRanks are not as important as SERP ranks. Many times scrapers rank on top of original content, stealing traffic – which is vital for those who monetize based on number of visitors. Besides, once these links are “deindexed” the PageRanks drop – naturally. So the glory is only temporary.

  • @Mihaela lol. It’s just a tool, not the whole arsenal. And not encouraging is one thing, but the point is that you cannot stop scrapers. So, why not make the best of it?

    As for rankings, my sites have lasted and been on top for great keywords for more than 2 years. I have two sites that have used this method and they are still going strong and standing in the serps.

    I’m not interested in converting anyone, only pointing out that you can benefit from the jerk scrapers. Putting your head in the sand will undoubtedly do you no good.

  • @cldnails I never assumed it was all the arsenal. :) I guess using this only depends on what each of us is ready to do for greater Google PR. As far as I am concerned, allowing scrapers to steal my content is not an option, no matter how many links they might give me. I’d rather have an editorial link on any site, than 100 links from scrapers.

  • @orokusaki – I have a site that has almost no back links, but it has a PR4. How do you explain that? I’ll tell you how: it is a blog, with very rich content, optimized by me. I have been doing SEO since 2002, successfully for ALL my clients.

  • @wiseweb – I made that note especially for those who run WP blogs – it is actually very easy to prove if we take a WP site with very short entries – we can check how many pages are actually indexed and how many are listed as “supplementary results”

  • jemple

    I am still amazed that scraped content ranks as well as it does. My admin team over at ibl builder do a manual check for each new site submitted to the link network, and that involves pasting text into google to check for duplication, and the amount of scraped content ranking happily above the obviously original content is staggering.
