Google Seeing Duplicate Titles/Pages Due to Links with Capital Letters (wtf?)

I was going through Google Webmasters and stumbled across the fact that I have tons of duplicate title tags, as shown in these links:


/The-Nightmare-Before-Christmas/blu-ray-3d-bonus-the-process
/the-nightmare-before-christmas/blu-ray-3d-bonus-the-process

/Nights-in-Rodanthe/letters-like-this
/nights-in-rodanthe/letters-like-this

/The-Twilight-Saga-Breaking-Dawn-Part-2/trailer
/the-twilight-saga-breaking-dawn-part-2/trailer

I’m running on nginx and doing rewrites, and have absolutely NO idea why these links are coming about and how Google is picking them up. I’m drawing a blank.

Any feedback you can suggest would be MUCH appreciated. I’m at a loss.

Primary domain: www.traileraddict.com

I should point out that it’s always the first keyphrase representing the film (which is being drawn from a lowercase string).

Cheers!
Ryan

I just tried the last two URLs and both led to slightly different pages with virtually identical content.

Do you have a CMS site? How are these titles and URLs generated?

One way to eliminate the duplicates is to select the preferred page and create a 301 redirect from each duplicate page to the preferred page.

A temporary method is to copy all the duplicate URLs to the robots.txt file and mark them as disallowed.

I had similar problems and it took Google a very long time before the following status now appears:

HTML Improvements
Last updated Feb 27, 2014

We didn’t detect any content issues with your site. As we crawl your site, we check it to detect any potential issues with content on your pages, including duplicate, missing, or problematic title tags or meta descriptions. These issues won’t prevent your site from appearing in Google search results, but paying attention to them can provide Google with more information and even help drive traffic to your site. For example, title and meta description text can appear in search results, and useful, descriptive text is more likely to be clicked on by users.

I’ve gone through all my code and I can’t find a single page that would do a ucfirst on the keyphrase of the film. But it’s always the film keyphrase that is capitalized, in every link. I can’t tell if Google made it up as a test or what. So is my best option now to have PHP check whether the URL is capitalized and 301 redirect if it is? I really wish I could find the source of whatever is creating the capitalized links.

Also, how did you find the pages to be slightly different? They should be exact.

// "http://www.traileraddict.com/The-Twilight-Saga-Breaking-Dawn-Part-2/trailer"
<meta property="og:url" content="http://www.traileraddict.com/The-Twilight-Saga-Breaking-Dawn-Part-2/trailer"/>

// "http://www.traileraddict.com/the-twilight-saga-breaking-dawn-part-2/trailer"
<meta property="og:url" content="http://www.traileraddict.com/the-twilight-saga-breaking-dawn-part-2/trailer"/>

Post #2
>>> Do you have a CMS site? How are these URLs generated?
Do you store the titles and data in a table?

If you use a database table have you set the character set to case-insensitive?

How is your page rendered?

The server is nginx and we do rewrites. The film keyword in /Nights-in-Rodanthe/letters-like-this (nights in rodanthe) is in the database as “nights in rodanthe” for selecting the film; we have the PHP strip the slashes and turn the hyphens into spaces.

However, nowhere in our code do we have anything telling the URL to become capitalized. I’ve done a sitewide code search with Macromedia and the variable from the database is never told to uppercase every word. We just re-launched the site and this is a new issue. Can’t figure it out, as it’s very odd.
Seems like the only fix is to tell nginx to run a 301 redirect and just redo the URL without capitals, but I wish I could find the source of the issue.

Cheers!
Ryan

I think this could be the cause of the problem:

http://stackoverflow.com/questions/7830846/ogurl-is-driving-me-crazy

The capitalized (proper-case) links are cached by Facebook and come from the meta og:url property.

og:url basically tells the FB scraper “ignore anything on this page, and scrape this url instead”. So it’s doing exactly what it’s supposed to do. If you want the like button to point to a different url, use the href parameter and have it point to a different url.

// "http://www.traileraddict.com/The-Twi…Part-2/trailer"
<meta property="og:url" content="http://www.traileraddict.com/The-Twilight-Saga-Breaking-Dawn-Part-2/trailer"/>

// "http://www.traileraddict.com/the-twilight-saga-breaking-dawn-part-2/trailer"
<meta property="og:url" content="http://www.traileraddict.com/the-twilight-saga-breaking-dawn-part-2/trailer"/>


That issue only arises when the page is loaded through a URL with capital letters, and those URLs are not created by the site/server itself.

<meta property="og:url" content="http://www.traileraddict.com/the-twilight-saga-breaking-dawn-part-2/trailer"/>

That’s what og:url looks like under the proper, lowercase, link. Without the first bug, this wouldn’t be an issue.

I’ve discovered something else weird. I’ve been having an email sent to me every time a requested URL has a capital letter in it. First off, not a single one has a referrer, and, second, 5 out of 6 times it is Googlebot or PageSpeed requesting the URL. It’s like Google itself created this issue.

I’m just telling PHP to drop the variable to lowercase before putting it into og:url and the canonical, to prevent the duplicate page for Google. But still weird.

Cheers!
Ryan
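
For anyone landing here later, the lowercase og:url/canonical fix described above might look roughly like the sketch below. It is only an illustration: the function names `canonical_url()` and `canonical_tags()` are made up for this example, the `http://` scheme is hard-coded, and it assumes every real URL on the site is lowercase.

```php
<?php
// Hypothetical helper (not from the site's code): build the URL to advertise
// in og:url and rel=canonical. Assumes all real URLs on the site are lowercase.
function canonical_url(string $host, string $requestUri): string
{
    // Keep only the path; drop any query string.
    $path = explode('?', $requestUri, 2)[0];
    if ($path === '') {
        $path = '/';
    }
    return 'http://' . $host . strtolower($path);
}

// Hypothetical helper: render both tags from the same lowercase URL,
// so the capitalized and lowercase requests advertise one address.
function canonical_tags(string $host, string $requestUri): string
{
    $url = htmlspecialchars(canonical_url($host, $requestUri), ENT_QUOTES, 'UTF-8');
    return '<link rel="canonical" href="' . $url . '"/>' . "\n"
         . '<meta property="og:url" content="' . $url . '"/>' . "\n";
}
```

In a template this would be called as something like `echo canonical_tags($_SERVER['HTTP_HOST'], $_SERVER['REQUEST_URI']);` inside the page head.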

So now, apart from the traffic coming from this post and about a dozen errors in our film keyphrase column (where we accidentally did have a capital letter or two), all requests for capitalized URLs are coming from Googlebot. I can’t figure out why Googlebot decided to capitalize the URLs.

Try looking in your raw site logs for a capitalised URL and it may give the referrer:



173.32.237.89 - - [02/Mar/2014:06:43:51 -0600] "GET /joke/of_the_day/Quotes_I_like/1039 HTTP/1.1" 301 223 "http://www.stumbleupon.com/refer.php?url=http%3A%2F%2Fjohns-jokes.com%2Fjoke%2Fof_the_day%2FQuotes_I_like%2F1039" "Mozilla/5.0 (iPad; CPU OS 7_0_4 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11B554a"
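
A rough sketch of how the raw log could be scanned for such requests, assuming the log uses the same combined format as the sample line above. The function names are invented for this example, and the regular expression is only meant to cover lines shaped like that sample.

```php
<?php
// Hypothetical helper: parse one line of a combined-format access log.
// Returns null when the line does not match the expected layout.
function parse_log_line(string $line): ?array
{
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+ "([^"]*)" "([^"]*)"/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    return [
        'ip'       => $m[1],
        'time'     => $m[2],
        'method'   => $m[3],
        'uri'      => $m[4],
        'status'   => (int) $m[5],
        'referrer' => $m[6],
        'agent'    => $m[7],
    ];
}

// True when the path part of the URI (query string ignored) has an A-Z letter.
function has_uppercase_path(string $uri): bool
{
    $path = explode('?', $uri, 2)[0];
    return preg_match('/[A-Z]/', $path) === 1;
}

// Collect every parsed entry whose requested path contains a capital letter.
function capitalized_requests(iterable $lines): array
{
    $hits = [];
    foreach ($lines as $line) {
        $entry = parse_log_line(rtrim($line));
        if ($entry !== null && has_uppercase_path($entry['uri'])) {
            $hits[] = $entry;
        }
    }
    return $hits;
}
```

Calling `capitalized_requests(file('/path/to/access.log'))` (the path is a placeholder) returns one array per matching request, so the `referrer` and `agent` fields can be compared to see who is asking for the capitalized URLs.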

Checking now. PHP reporting claims only Google was seeing/creating these capitalized URLs. So odd. Forcing lowercase in the canonical has slowed the alerts down a ton, though Googlebot still checks in a couple of times a minute with a capitalized URL. Should have a log update soon.

Yep, the only referrer is Googlebot, plus a couple of people sent from Google to a capitalized address. So weird.

cb,

I’ve been following this thread for a few days so I apologize for being a day late - I just hope I’m not also a dollar short!

Apache’s mod_rewrite has a RewriteMap directive that offers a built-in tolower function (int:tolower). I would use tolower to compare the requested URI with its lowercase version and redirect to the lowercase version if they don’t match. Of course, this is predicated on you using ONLY lowercase letters in your URIs.

If you don’t have access to the server or vhost configuration files (without it, you can’t use a RewriteMap), you can change individual capital letters to lowercase via a series of RewriteRules (abusive of the server, but it would resolve your issue). Alternatively, create a 404 script that looks at the request’s URI, converts it with strtolower() and uses header('Location: ...') to redirect from within the script. Be sure to make your redirections 301s (to avoid duplication penalties from search engines).

Regards,

DK
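
A minimal sketch of the strtolower() plus header() redirect DK describes, assuming it runs at the very top of the 404 script (or the main index file) before any output is sent. The function name `lowercase_redirect_target()` is made up for this example, and the query string is deliberately left untouched in case its values are case-sensitive.

```php
<?php
// Hypothetical helper: return the lowercase URI to redirect to,
// or null when the path is already lowercase and no redirect is needed.
function lowercase_redirect_target(string $requestUri): ?string
{
    $parts = explode('?', $requestUri, 2);
    $path  = $parts[0];
    $lower = strtolower($path);
    if ($lower === $path) {
        return null;
    }
    // Re-attach the original query string unchanged.
    return isset($parts[1]) ? $lower . '?' . $parts[1] : $lower;
}

// Only act when running under a web server with a request to inspect.
if (PHP_SAPI !== 'cli' && isset($_SERVER['REQUEST_URI'])) {
    $target = lowercase_redirect_target($_SERVER['REQUEST_URI']);
    if ($target !== null) {
        // 301 marks the move as permanent, so search engines fold the
        // capitalized URL into the lowercase one.
        header('Location: ' . $target, true, 301);
        exit;
    }
}
```

Returning null for an already-lowercase path is what keeps this from looping: the redirected request no longer matches and falls through to normal page handling.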

Thanks for reaching out, DK. I was looking into the immediate 301 redirect/rewrite and hit sort of a wall with nginx: I’d have to install a Perl module to make it happen. So my next option was to handle the problem in the PHP index file and have it check for uppercase long before it does anything else. However, I now make sure my canonicals report the page back in lowercase, which a publisher buddy of mine said should do the trick.

It was the easiest solution, though I’m still contemplating the PHP 301 redirect option.

Cheers!
Ryan