Avoid Duplicate Content with These Three Techniques

This article is part of an SEO series from WooRank. Thank you for supporting the partners who make SitePoint possible.

Since Google’s Panda update, webmasters have been trying to avoid a “duplicate content penalty.” You still need to take the issue of duplicate content seriously — it affects your whole site, not just the pages that host it. While Google has said they don’t penalize pages with duplicate content, if you’ve got a lot of it you can seriously hinder your ability to rank in search results.

If you’re not careful, you could be inadvertently publishing duplicate content in a few different ways:

  • Multiple URLs pointing to the same content
  • Multilingual versions of the same page
  • Paginated content

The good news here is that there are some on-page methods you can use to get rid of duplicate content on your site. They are known as rel="canonical", hreflang and rel="prev"/rel="next" (pagination).

It’s well worth your time and effort to implement these fixes to make your site more findable on search engines. Let’s get started!

hreflang: Find Your Targeted Audience

What is it?

Introduced by Google in 2011, the hreflang tag lets you tell a search engine that a page is related to other pages in different languages and/or regions. If your website is https://example.com, and you’ve got the same page in Spanish on https://example.com/es, use the hreflang tag to tell search engines to serve that page to Spanish-speaking searchers.

It’s important to note that hreflang is a factor, not a directive, in search results. So if you have pages that are too similar (like English pages targeting the US and Canada) you run the risk of the wrong version ranking for a search term. Multilingual sites need to be a part of your overall marketing strategy.

How do I do it?

The hreflang annotation is implemented in the <head> section of an HTML page. For non-HTML pages, the tag can be placed in the HTTP header. When done correctly, the hreflang tag should look like this:

  • HTML: <link rel="alternate" hreflang="en" href="https://www.example.com">
  • HTTP: Link: <https://www.example.com/>; rel="alternate"; hreflang="en"

You must include links to every version of your page. If you have English, Spanish and French copies, put links to all three in the <head> of each page.

If you have two or more pages in the same language but targeted to different geographies (say, the US, Canada and UK), you can extend the hreflang value to include the country code like this:

  • <link rel="alternate" hreflang="en-us" href="https://www.example.com">
  • <link rel="alternate" hreflang="en-ca" href="https://www.example.com/ca">
  • <link rel="alternate" hreflang="en-gb" href="https://www.example.com/uk">
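Because every version has to carry the complete set of links, it can help to generate the tags from one shared list rather than typing them by hand. The following is a minimal sketch, assuming a hypothetical helper named hreflang_links and the example URLs used above; it is not part of any SEO library.

```python
from html import escape


def hreflang_links(alternates: dict[str, str]) -> list[str]:
    """Return one <link rel="alternate"> tag per language/region version.

    `alternates` maps an hreflang value (e.g. "en-gb") to the URL of that version.
    """
    return [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}">'
        for code, url in alternates.items()
    ]


# Example data taken from the article; the helper name is an assumption.
versions = {
    "en-us": "https://www.example.com",
    "en-ca": "https://www.example.com/ca",
    "en-gb": "https://www.example.com/uk",
}

# The same full set is emitted on every version, so each page also links to itself.
for tag in hreflang_links(versions):
    print(tag)
```

Emitting the identical block on all three pages is what keeps the annotations reciprocal.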

If you’ve got a non-HTML page in multiple languages, list every version in a single Link header and separate the hreflang annotations with commas, like this:

  • Link: <https://www.example.com/>; rel="alternate"; hreflang="en-us", <https://www.example.com/ca/>; rel="alternate"; hreflang="en-ca", <https://www.example.com/uk/>; rel="alternate"; hreflang="en-gb"
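The comma-separated header can be assembled from the same kind of mapping. This is only a sketch: hreflang_link_header is a hypothetical helper, and how you attach the resulting header to a response depends on your server or framework.

```python
def hreflang_link_header(alternates: dict[str, str]) -> str:
    """Build one Link header line listing every language/region version."""
    parts = [
        f'<{url}>; rel="alternate"; hreflang="{code}"'
        for code, url in alternates.items()
    ]
    # Commas separate the annotations; there is no trailing comma after the last one.
    return "Link: " + ", ".join(parts)


# Example URLs from the article.
print(hreflang_link_header({
    "en-us": "https://www.example.com/",
    "en-ca": "https://www.example.com/ca/",
    "en-gb": "https://www.example.com/uk/",
}))
```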

There’s also a third option to implement hreflang tags: your XML sitemap. Instead of adding markup to your pages, include the foreign language versions of your URLs in your sitemap. Just like with the other annotations, include a URL for each language.
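For the sitemap route, each URL entry lists all of its language versions as xhtml:link children. The sketch below generates such a sitemap with Python's standard library; build_sitemap is a hypothetical helper and the URLs are the article's examples, so treat it as an illustration of the structure rather than a finished generator.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_sitemap(alternates: dict[str, str]) -> str:
    """Return sitemap XML where every URL entry lists every language version."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in alternates.values():
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        # Each entry repeats the full set of alternates, itself included.
        for code, alt_url in alternates.items():
            ET.SubElement(
                entry,
                f"{{{XHTML_NS}}}link",
                {"rel": "alternate", "hreflang": code, "href": alt_url},
            )
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap({
    "en": "https://example.com/",
    "es": "https://example.com/es",
}))
```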


What could go wrong?

A common problem when inserting hreflang annotations is the “Return Tag Error.” These errors come from hreflang annotations that don’t link to each other. Annotations are a two-way street; if your English page links to your German page, your German page must link back to your English page. Possibly the most common Return Tag Error is omitting the self-reference: your English page needs to link to itself.

To check for Return Tag Errors, look in Google Search Console’s International Targeting data under Search Traffic. This will tell you how many hreflang tags Google found and how many have errors.

[Image: Return Tag Errors in Google Search Console]
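You can also catch missing return links before search engines do. The sketch below assumes you have already collected, for each page, the set of URLs its hreflang annotations point to (the collection step is not shown); find_return_tag_errors is a hypothetical name.

```python
def find_return_tag_errors(pages: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (source, target) pairs where the annotation is not reciprocated.

    `pages` maps each page URL to the URLs its hreflang annotations link to.
    A (url, url) pair means the page is missing its self-reference.
    """
    errors = []
    for source, targets in pages.items():
        if source not in targets:
            errors.append((source, source))  # missing self-reference
        for target in sorted(targets):
            # The target page must link back to the source page.
            if target != source and source not in pages.get(target, set()):
                errors.append((source, target))
    return errors


# The English page links to German, but German does not link back.
site = {
    "https://example.com/": {"https://example.com/", "https://example.com/de"},
    "https://example.com/de": {"https://example.com/de"},
}
print(find_return_tag_errors(site))
```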

Another common problem implementing hreflang annotations is incorrect language or country codes. The hreflang value must be in ISO 639-1 format for language and ISO 3166-1 alpha-2 format for country. Using ‘uk’ for the United Kingdom is the most common culprit; in this system the value should be ‘gb’ for Great Britain.

Note that your hreflang value must start with the language code and that region targeting is limited to countries — you can’t target the European Union or North America, for example.
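A quick shape check can catch the mistakes described above. The sketch below only verifies the language-first, two-letter pattern and flags the ‘uk’ country code; it does not look codes up in the ISO lists, so a value such as en-eu would still pass, and it may reject other values a search engine accepts. The function name check_hreflang is an assumption.

```python
import re
from typing import Optional

# Two-letter language code, optionally followed by a two-letter country code.
HREFLANG_SHAPE = re.compile(r"^[a-z]{2}(-[a-z]{2})?$", re.IGNORECASE)


def check_hreflang(value: str) -> Optional[str]:
    """Return a description of the problem, or None if the shape looks valid."""
    if not HREFLANG_SHAPE.match(value):
        return "expected 'language' or 'language-country' using two-letter codes"
    _language, _, country = value.lower().partition("-")
    if country == "uk":
        # 'uk' as a language means Ukrainian; the United Kingdom's country code is 'gb'.
        return "use 'gb' for the United Kingdom"
    return None


for candidate in ["en", "en-gb", "en-uk", "english"]:
    print(candidate, "->", check_hreflang(candidate))
```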

rel="canonical": Which Page is the Original?

What is it?

If you use a content management system, syndicate content or have an e-commerce shopping site, it’s easy to wind up with multiple URLs or domains all pointing to the same content. To combat this, tell search engines where to find the original using the rel="canonical" tag. When a search engine sees this annotation, it knows the current page is a copy and where to find the canonical content.

How do I do it?

Start by deciding which URL you want to be canonical. In general, you should pick your best optimized URL as your canonical URL. Take it a step further and set your preferred domain in Google Search Console.

A nice benefit of setting a preferred domain is that search engines will take this into account when crawling links to your page; links to example.com will pass link juice to your preferred domain of www.example.com. The same goes for other indexing factors, such as trust and authority.

[Image: Set preferred domain in Google Search Console]

To properly tell a search engine that content is copied from your canonical URL, place the rel="canonical" annotation in the <head> of your page. It should look like this:

  • <link rel="canonical" href="https://www.example.com">

If you’ve got a non-HTML version of a document (like a PDF available for download) you can include the canonical reference in the HTTP header like this:

  • Link: <https://www.example.com/document.html>; rel="canonical"

What could go wrong?

While the rel="canonical" tag seems simple enough to implement, getting it wrong can have a major impact on your search performance. There are a few common misapplications of canonicalization that you need to be sure to avoid:

Paginated content all pointing to page one: When you add the canonical annotation to paginated content, match each page to its own canonical URL: page 1 to the canonical page 1 URL, page 2 to page 2, and so on. We’ll cover this in a bit more detail later.

Canonical URLs that are not 100% exact matches: If your site uses protocol-relative links, leaving off http/https still results in search engines seeing duplicate content at both the http and https addresses. Always make your preferred URLs 100% exact matches, protocol included.

Pointing to canonical URLs that return a 404 error: Search engines will ignore tags that point to a dead page.

Multiple canonical tags: Search engines only support one rel="canonical" annotation per page. You can end up with multiple when a webmaster copies a page template that already includes rel="canonical" or a plugin inserts a rel="canonical" automatically. In cases of multiple canonical tags, Google will simply ignore all of them.
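Two of these mistakes, multiple canonical tags and canonical URLs without a protocol, can be spotted by scanning a page's HTML. The sketch below uses Python's built-in HTML parser; the class and function names are hypothetical, and it does not make network requests, so it will not detect canonical URLs that return a 404.

```python
from html.parser import HTMLParser


class CanonicalCollector(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels:
            self.canonicals.append(attr.get("href"))


def canonical_problems(html: str) -> list[str]:
    """Return a list of human-readable problems found in the page's canonical tags."""
    collector = CanonicalCollector()
    collector.feed(html)
    problems = []
    if len(collector.canonicals) > 1:
        problems.append("multiple canonical tags")
    for href in collector.canonicals:
        # Protocol-relative or relative hrefs are not exact, absolute matches.
        if not (href or "").startswith(("http://", "https://")):
            problems.append(f"canonical is not an absolute URL: {href}")
    return problems


page = (
    '<head><link rel="canonical" href="https://www.example.com/a">'
    '<link rel="canonical" href="//www.example.com/a"></head>'
)
print(canonical_problems(page))
```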

rel="prev"/"next": Avoid Duplicate Title Tags & Meta Descriptions

What is it?

There are a few reasons you might want to break your content into multiple pages: you’ve got a long article or series of articles, your retail site has a long list of products within a category or you’re hosting a discussion forum with a lot of large comment threads. Paginated content generally won’t cause many problems with duplicate content in the body of a page, but will affect one very important aspect of your on-page SEO: title tags and meta descriptions. You can find any instances of duplicate titles and descriptions in Search Console in the HTML Improvements report under Search Appearance.

[Image: Find Duplicate Title Tags in Google Search Console]

To tell search engines that you’ve got paginated content, use the rel="prev" and rel="next" annotations. These tags tell Google that your pages make up a connected series, consolidating their index properties (links, authority, etc.) and sending search visitors to page one.

How do I do it?

As with hreflang and rel="canonical", the rel="prev"/"next" tags go in the <head> of a page. They work by indicating the preceding and succeeding pages in the series. For www.example.com/page2 the annotations look like this:

  • <link rel="prev" href="https://www.example.com/page1">
  • <link rel="next" href="https://www.example.com/page3">

There’s no need to include the rel="prev" for the first page in the series or rel="next" for the last. You should also note that Google will see rel="prev" and rel="previous" as the same thing, but it’s best practice to use "prev".

If you’ve got a canonical version of your URL series, use the rel="canonical" tag along with pagination like this:

  • <link rel="canonical" href="https://www.example.com/page2">
  • <link rel="prev" href="https://www.example.com/page1">
  • <link rel="next" href="https://www.example.com/page3">
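The rules above (self-referencing canonical, no "prev" on the first page, no "next" on the last) are easy to encode in a template helper. This is a minimal sketch that assumes the /pageN URL pattern from the examples; pagination_links is a hypothetical name.

```python
def pagination_links(base_url: str, page: int, last_page: int) -> list[str]:
    """Return the canonical, prev and next tags for one page of a series."""

    def url(n: int) -> str:
        # Assumes pages live at <base_url>/page<N>, as in the article's examples.
        return f"{base_url}/page{n}"

    # Each page is canonical to itself, not to page one.
    tags = [f'<link rel="canonical" href="{url(page)}">']
    if page > 1:
        tags.append(f'<link rel="prev" href="{url(page - 1)}">')
    if page < last_page:
        tags.append(f'<link rel="next" href="{url(page + 1)}">')
    return tags


for tag in pagination_links("https://www.example.com", page=2, last_page=3):
    print(tag)
```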

What could go wrong?

As with the rel="canonical" annotation, pagination tags are relatively straightforward. However, there are a few wrong turns you can take that will impact how search engines interact with your site.

Broken chains: Paginated links must maintain an unbroken chain from the first page to the last. The rel="prev" link for page 2 must point to page 1 and the rel="next" link for page 1 must point to page 2. If they don’t, the chain is broken and the search engines won’t be able to find the rest of the series. Pages can only be part of one pagination chain at a time and pages can only have one rel="prev" and one rel="next" attribute.
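A broken chain can be detected by checking that every "next" link is answered by a matching "prev" link, and vice versa. The sketch below assumes you have already extracted each page's prev and next hrefs into a dictionary; find_broken_links is a hypothetical helper.

```python
from typing import Optional

Links = tuple[Optional[str], Optional[str]]  # (prev href, next href)


def find_broken_links(chain: dict[str, Links]) -> list[str]:
    """Return a message for every prev/next link that is not reciprocated."""
    problems = []
    for url, (prev_url, next_url) in chain.items():
        # The page named by rel="next" must name this page in its rel="prev".
        if next_url is not None and chain.get(next_url, (None, None))[0] != url:
            problems.append(f"{url} -> next {next_url} is not reciprocated")
        # The page named by rel="prev" must name this page in its rel="next".
        if prev_url is not None and chain.get(prev_url, (None, None))[1] != url:
            problems.append(f"{url} -> prev {prev_url} is not reciprocated")
    return problems


p1, p2 = "https://www.example.com/page1", "https://www.example.com/page2"
# Page 2 is missing its rel="prev" link back to page 1.
print(find_broken_links({p1: (None, p2), p2: (None, None)}))
```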

URL Parameters: Pagination attributes can only link URLs with matching parameters. A pagination chain for www.example.com/page2?referrer=facebook of:

  • <link rel="prev" href="https://www.example.com/page1">
  • <link rel="next" href="https://www.example.com/page3">

… is a broken chain due to the missing referral parameters. To solve this, dynamically insert prev and next links based on fetched URLs.
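One way to insert the links dynamically is to derive them from the requested URL so the query string carries over unchanged. The sketch below assumes the path ends in /pageN, as in the example; sibling_url is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def sibling_url(current_url: str, page: int) -> str:
    """Return the URL of another page in the series, keeping the query string."""
    parts = urlsplit(current_url)
    # Assumes the last path segment is "page<N>"; replace only that segment.
    base = parts.path.rsplit("/", 1)[0]
    return urlunsplit(
        (parts.scheme, parts.netloc, f"{base}/page{page}", parts.query, "")
    )


current = "https://www.example.com/page2?referrer=facebook"
print(f'<link rel="prev" href="{sibling_url(current, 1)}">')
print(f'<link rel="next" href="{sibling_url(current, 3)}">')
```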

Erroneous rel="canonical" tags: A mistake people often make when indicating a canonical URL for paginated content is to include a rel="canonical" tag pointing to page one. This is an attempt to tell search engines to display page one in search results, which is unnecessary when using rel="prev"/"next" annotations. What actually ends up happening is that bots don’t bother looking at the pages, skipping the paginated content and any of their links.

It isn’t technically an error, but pagination gets quite complicated if your site uses a lot of sorted or filtered lists (like e-commerce listings or an online shopping engine). If this is the case, you must create a pagination chain for each filter and/or sort option if you want the filtered results to rank separately.

For example, if you’re a clothing shopping site, you’ve most likely got a paginated list of men’s shoes. Since athletic shoes differ quite a bit by use, you probably want filtered lists of basketball shoes to rank separately from running shoes. In this case, you need to create a paginated chain of pages for each filter option.

Wrapping It All Up

If you have a large, complicated site or regularly use dynamic URL parameters, you need to be especially on guard against duplicate content. While Google and other search engines aren’t going to actively penalize you for having a bunch of identical pages, you’ll run the risk of showing a less-than-optimal page to a searcher or being left out of the search rankings altogether.

Be sure to make the most of the on-page SEO tags mentioned above, rel="canonical", hreflang and rel="prev"/"next", to make sure you’re always pointing search engines to the right URLs. You’ll avoid being included in the dreaded omitted entries and help ensure that users are seeing the most relevant pages in search results.

Have you implemented the rel="canonical", hreflang or rel="prev"/"next" tags on your pages? What issues did you have with implementation? What fixes did you find?