Deduping Duplicate Content

One interesting thing that came out of the Duplicate Content and Multiple Site Issues session at SES San Jose is the massive amount of duplicate content on the Internet. The panelists shed light on just how much there is, and the figure was eye-opening. Ivan Davtchev, Yahoo’s lead product manager for search relevance, said “more than 30 percent of the Web is made up of duplicate content.”

With that in mind, it is no wonder trust and linkage play such valuable parts in determining a domain’s overall authority and consequent relevancy in the search engines. Roughly three out of every ten pages on the Web are dupes. It boggles the mind.

There are three basic types of duplicate content on the web:

  1. Accidental content duplication: This occurs when Webmasters unintentionally allow content to be replicated through non-canonicalization, session IDs, soft 404s, and the like.
  2. Dodgy content duplication: This primarily consists of replicating content across multiple domains.
  3. Abusive content duplication: This includes scraper spammers, weaving or stitching (mixing and matching content to create “new” content), and bulk content replication.

Greg Grothaus from Google’s search quality team recently addressed the “penalty myth” regarding duplicate content, noting that Google “tries hard to index and show pages with distinct information.”

Google is generally understood to use a checksum-like method for initially filtering out replicated content. For example, many Web sites publish a regular and a print version of every article. Google prefers to serve up only one copy of the content, and it chooses that copy based on link strength. Because most print-ready pages are dead-end URLs without site navigation, it’s relatively simple to work out which page Google prefers to serve up in its search results.
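To make the checksum idea concrete, here is a minimal sketch in Python. It is not Google’s actual method, which isn’t public. It simply hashes each page’s text after collapsing whitespace and case, then groups URLs that share a hash. The function names and the example.com URLs are made up for illustration.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Checksum of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: Dict[str, str]) -> List[List[str]]:
    """Return groups of URLs whose body text shares a fingerprint."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    # Only groups with more than one URL are duplicates.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical pages: a regular article, its print version, and an unrelated page.
pages = {
    "http://www.example.com/article/42": "Deduping  Duplicate Content\nBody text.",
    "http://www.example.com/print/42": "deduping duplicate content body text.",
    "http://www.example.com/about": "About us.",
}

for group in group_duplicates(pages):
    print("Same content:", group)
```

An exact checksum like this only catches copies that are identical after normalization. Catching near-duplicates, such as a print page with a different footer, would take a fuzzier comparison.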

Once in a while the powers-that-be at Google notice a particularly egregious attempt at gaming the system, such as manipulating rankings or deceiving users. In these cases, Google will “make appropriate adjustments” to the indexation and rankings of the sites involved, according to Grothaus. So be careful.

Test and Tune

How do you know if duplicate content is popping up all over your site? There are a few ways to be sure:

* If you have multiple URLs for your home page, you have duplicate content.
* If you go to any page on your site and remove the “www” from the URL and the same content is still served up, you may have duplicate content.
* If you create an error by appending gibberish to a URL string or removing a directory path and the site still serves up the same content without triggering a 404 error page, you probably have duplicate content.
* If you can isolate a URL construct from your print pages, run an advanced indexation check such as “site:example.com inurl:/print/” in Google, and get results back, then you definitely have duplicate content indexed.
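The “www” and gibberish checks above are easy to script. The Python sketch below is one way to do it, under a few assumptions: the helper names are made up, the gibberish string is arbitrary, and `fetch` is any function you supply that returns a status code and body for a URL without following redirects (a fetcher that follows a 301 would hide a correct setup).

```python
from typing import Callable, List, Tuple
from urllib.parse import urlsplit, urlunsplit

# A fetcher returns (HTTP status code, response body) for a URL.
# It must NOT follow redirects, or a proper 301 will look like a duplicate.
Fetcher = Callable[[str], Tuple[int, str]]


def without_www(url: str) -> str:
    """Return the same URL with a leading 'www.' removed from the host."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def with_gibberish(url: str, junk: str = "zzqx-no-such-page") -> str:
    """Return the URL with a nonsense path segment appended."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") + "/" + junk
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def duplicate_warnings(url: str, fetch: Fetcher) -> List[str]:
    """Run the www and gibberish checks against one URL."""
    warnings = []
    _, original_body = fetch(url)

    # Check 1: the non-www URL should redirect, not serve the same page.
    status, body = fetch(without_www(url))
    if status == 200 and body == original_body:
        warnings.append("non-www URL serves the same content")

    # Check 2: a nonsense URL should produce a real 404.
    status, _ = fetch(with_gibberish(url))
    if status != 404:
        warnings.append("gibberish URL did not return a 404")

    return warnings
```

Passing the fetcher in as a parameter keeps the checks separate from whatever HTTP client you prefer, and lets you try the logic against canned responses before pointing it at a live site.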

Duplicate content is usually an accident and not part of some nefarious scheme. Informing the search engines and getting things squared away is relatively easy. Simply read and employ the best practices laid out on the Webmaster blogs and forums of the big three search engines.

You also have to make certain that you properly canonicalize your site and 301 (permanent) redirect any duplicate pages to their canonical versions. Use robots.txt and robots meta tags to keep search engines away from the rest of them.
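As a rough illustration of those steps, the Python sketch below picks one canonical form for a URL, reports where a duplicate should 301 redirect, and checks that a robots.txt rule covers the print pages. Everything specific in it is an assumption for the example: the preferred host, the session parameter names, the index file names, and the /print/ rule.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

CANONICAL_HOST = "www.example.com"                  # assumed preferred host
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}  # assumed session ID names
INDEX_FILES = ("index.html", "index.php")           # assumed default documents


def canonical_url(url: str) -> str:
    """Map a URL to its single preferred form."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # /index.html and / are the same page, so keep only the directory form.
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
    # Drop session IDs so each visitor does not mint a "new" URL.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, CANONICAL_HOST, path, urlencode(kept), ""))


def redirect_target(url: str) -> Optional[str]:
    """Return where to 301 redirect this URL, or None if already canonical."""
    target = canonical_url(url)
    return None if target == url else target


# Hypothetical robots.txt that keeps crawlers out of the print pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def blocked_by_robots(url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """True if the robots.txt rules disallow crawling this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", url)


print(redirect_target("http://example.com/index.html?sid=abc&page=2"))
print(blocked_by_robots("http://www.example.com/print/42"))
```

In practice the redirect itself would live in your Web server or application configuration. The point of the sketch is that every duplicate URL should map to exactly one destination.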

If this sounds simple, it’s because it is. With a few easy steps and minimal time investment, you can have your site duplicate-free in no time. Remember, test and tune your results often. If you have any problems beyond that, it may be time to call in the professionals.


