Google’s John Mueller recently responded to a question about whether the search engine uses a minimum percentage threshold of duplicate content to decide which results to filter out.
The discussion actually started on Facebook, where Duane Forrester (@DuaneForrester) asked whether any search engine had published a threshold percentage of content overlap beyond which content is considered duplicate.
Bill Hartzer (@bhartzer) then put the question to John Mueller on Twitter, and Mueller replied quickly.
“Hey @johnmu is there a percentage that represents duplicate content?
For example, should we be trying to make sure pages are at least 72.6 percent unique than other pages on our site?
Does Google even measure it?”
At which point, Google’s John Mueller responded:
“There is no number (also how do you measure it anyway?)”
How Google Determines Duplicate Content
For many years, Google’s approach to identifying and handling duplicate content has remained fairly consistent.
In an older Google video on the topic, Matt Cutts opened by noting that a large share of content on the Internet is duplicated.
“It’s important to realize that if you look at content on the web, something like 25% or 30% of all the web’s content is duplicate content.
…People will quote a paragraph of a blog and then link to the blog, that sort of thing.”
He continued by saying that Google doesn’t penalize duplication itself, because many copies are created accidentally or without malicious intent.
In his view, penalizing sites for having some duplicate content would hurt the relevance and quality of search results.
Matt Cutts went on to explain:
“…[we] try to group it all together and treat it as if it’s just one piece of content. It’s just treated as something that we need to cluster appropriately. And we need to make sure that it ranks correctly.”
He added that Google uses this grouping to improve the user experience, surfacing the most relevant result and suppressing the duplicates.
What’s Google’s Procedure On Tackling Duplicate Content?
In 2020, Google released an episode of its Search Off the Record podcast that covered the same topic in almost identical terms. At around the six-minute mark, this exchange took place:
“Gary Illyes: And now we ended up with the next step, which is actually canonicalization and dupe detection.
Martin Splitt: Isn’t that the same, dupe detection and canonicalization, kind of?
Gary Illyes: [00:06:56] Well, it’s not, right? Because first you have to detect the dupes, basically cluster them together, saying that all of these pages are dupes of each other,
and then you have to basically find a leader page for all of them.
…And that is canonicalization.
So, you have the duplication, which is the whole term, but within that you have cluster building, like dupe cluster building, and canonicalization.”
Gary then went into technical detail about how Google accomplishes this. Google isn’t measuring percentages; instead, it compares checksums.
Simply put, a checksum is a short string of numbers and letters that represents a piece of content, so identical content produces an identical checksum.
“Gary Illyes: So, for dupe detection what we do is, well, we try to detect dupes.
And how we do that is perhaps how most people at other search engines do it, which is, basically, reducing the content into a hash or checksum and then comparing the checksums.”
It is therefore unlikely that a fixed percentage defines when content counts as duplicate.
Instead, checksums summarize the content and are then compared to identify duplicates.
An additional nugget: Google appears to distinguish between partial and complete duplication of content.
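To make the idea concrete, here is a minimal sketch of checksum-based dupe clustering. This is purely illustrative and not Google’s actual implementation; the hash choice (SHA-256), the whitespace normalization, and the sample URLs are all assumptions for the example.

```python
import hashlib
from collections import defaultdict

def checksum(text: str) -> str:
    # Normalize whitespace and case so trivial formatting
    # differences don't change the hash (an assumption for
    # this sketch, not a documented Google behavior).
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hypothetical pages: two are duplicates of each other.
pages = {
    "/post-a": "Google explained how duplicate detection works.",
    "/post-a-copy": "Google   explained how duplicate detection works.",
    "/post-b": "An entirely different article.",
}

# Cluster pages by checksum: identical content -> identical hash -> same cluster.
clusters = defaultdict(list)
for url, content in pages.items():
    clusters[checksum(content)].append(url)

for urls in clusters.values():
    if len(urls) > 1:
        print("Duplicate cluster:", urls)
```

In a real system, one URL in each cluster would then be chosen as the canonical (“leader”) page, as Gary describes.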