This month saw a Google patent titled "Duplicate document detection in a web crawler system". The patent explains how a search engine's content filter can work with a duplicate content server.
What is duplicate content?
The patent contains a definition of duplicate content:
"Duplicate documents are documents that have substantially identical content, and in some embodiments completely identical content, but different document addresses."
The patent describes three scenarios in which a web crawler encounters duplicate documents:
1. Two pages, in any combination of regular web pages and temporary redirect pages, are duplicate documents if they share the same page content but have different URLs.
2. Two temporary redirect pages are duplicate documents if they share the same target URL but have different source URLs.
3. A regular web page and a temporary redirect page are duplicate documents if the URL of the regular web page is the target URL of the temporary redirect page, or if the content of the regular web page is the same as that of the temporary redirect page.
A permanent redirect page is not directly involved in duplicate document detection because crawlers are configured not to download the content of the redirect page.
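To make the three scenarios concrete, here is a minimal Python sketch. The `Page` record and the `are_duplicates` function are hypothetical names invented for this illustration, not anything taken from the patent. The sketch also compares content for exact equality, a simplification of the patent's "substantially identical" wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    """A crawled page (hypothetical record, not the patent's data model)."""
    url: str                               # address the crawler requested
    content: Optional[str] = None          # page content, if any was downloaded
    redirect_target: Optional[str] = None  # set only for temporary redirects

    @property
    def is_temp_redirect(self) -> bool:
        return self.redirect_target is not None


def are_duplicates(a: Page, b: Page) -> bool:
    """Apply the three scenarios to two pages with different URLs."""
    if a.url == b.url:
        return False  # same address: one document, not a duplicate pair

    # Scenario 1 (and the content half of scenario 3): any mix of regular
    # and temporary redirect pages that share the same content.
    if a.content is not None and a.content == b.content:
        return True

    # Scenario 2: two temporary redirects that point at the same target URL.
    if a.is_temp_redirect and b.is_temp_redirect:
        return a.redirect_target == b.redirect_target

    # Scenario 3: a regular page whose URL is the target of a temporary redirect.
    if a.is_temp_redirect != b.is_temp_redirect:
        redirect, regular = (a, b) if a.is_temp_redirect else (b, a)
        return redirect.redirect_target == regular.url

    return False
```

For example, two pages built as `Page("http://a.example/go", redirect_target="http://example.com/")` and `Page("http://example.com/", content="hello")` count as duplicates under scenario 3, because the first redirects to the URL of the second.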
How does Google detect duplicate content?
According to the patent description, Google’s crawler consults the duplicate content server to check if a retrieved page is a copy of another document. The algorithm then decides which version is the main version.
Google may use different methods to detect duplicate content. For example, Google can take content fingerprints and compare them when a new web page is found.
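As a rough illustration of the fingerprint idea, the sketch below hashes the normalized text of each page and looks the hash up in a small in-memory table. Everything here is an assumption made for the example: the `fingerprint` function, the `DuplicateContentServer` class, the choice of SHA-256, and the whitespace and lowercase normalization are not described in the patent. The toy server simply reports the first URL it saw with a given content; it does not attempt to decide which version should be the main one.

```python
import hashlib
import re
from typing import Dict, Optional


def fingerprint(text: str) -> str:
    """Hash the normalized text; identical text yields identical fingerprints."""
    # Assumed normalization: collapse whitespace and ignore letter case.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateContentServer:
    """Toy in-memory stand-in for a duplicate content server (hypothetical)."""

    def __init__(self) -> None:
        self._first_url_by_fingerprint: Dict[str, str] = {}

    def check(self, url: str, text: str) -> Optional[str]:
        """Return the URL first seen with this content, or None if it is new."""
        key = fingerprint(text)
        # setdefault stores the URL only when the fingerprint is not yet known.
        known = self._first_url_by_fingerprint.setdefault(key, url)
        return None if known == url else known
```

A crawler loop would call `check(url, text)` for each retrieved page and treat any non-`None` result as "this content already exists at that address". An exact hash like this only catches identical text after normalization; detecting near-duplicates would need a different technique.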
Interestingly, it is not always the page with the highest PageRank that is selected as the main URL for the content.
How does this affect your website?
If you want to get high rankings, it's easier to do so with unique content. Try to use as much original content as possible on your web pages.
If your site has unique content, you do not have to worry about potential duplicate content penalties. Optimize your content for search engines and make sure that your site has good inbound links. It's difficult to outrank a site with well-optimized content and good inbound links.