Search Engine Patents On Duplicated Content and Re-Ranking Methods
Abstract: Search engines use special filtering techniques for detecting duplicates/near-duplicates and for re-ranking results using, for example, resemblance measures (shingles), local cosine similarities (snippet term vectors), edit distances and similar methods. Although well known in IR circles, these filtering techniques and the snippet optimization processes (SOP) involved are known to few search engine marketers. This presentation focuses on several patents published by Google and Altavista in which these filters are described. Duplicated or near-duplicated content can look like a spam method and can keep a web site from ranking well on search engine result pages.
Keywords: search engine patents, duplicated content, re-ranking methods, snippet optimization processes, snippet extraction, resemblance measures, shingles method
Even if you have the best content for visitors, repeating your keywords too many times (keyword stuffing) negatively affects your web site's search engine ranking and positions.
This is a summary of my presentation at the Search Engine Strategies Conference & Expo (San Jose, California; August 8 - 11, 2005). If you attended the conference, you probably enjoyed this SES as much as I did. Before continuing any further, I would like to thank my dear friend Mike Grehan for inviting me to the private party at the Marriott vice presidential suite and for the private dinner at the luxurious Valencia Hotel on Santana Row.
Having the privilege of meeting and conversing during dinner with some friends of Mike, and now of mine, was such a wonderful experience. After pouring some wine, I found myself dining at an exclusive table headed by Mike. Sitting on my left was a senior six sigma statistician, while on my right was Dr. Pavel Berkhin, Yahoo! senior director of data mining. We had such a good time talking about almost everything, from science and technology to clustering analysis and fractal patterns in databases.
Then during the conference, which is all about achieving the best web site search engine ranking, I had the opportunity of listening to several top-notch speakers and search engineers, of visiting Google for the GoogleDance2005, and of mingling with so many known faces during the AskJeeves and Yahoo! parties. Life is good, after all.
For those of you who missed this SES or were not able to attend my presentation, I hope this summary helps you to grasp what you missed. A complimentary PDF version has been sent in advance to those who provided me with their business cards (400+ and counting). Now, without further introduction, here is my summary. Enjoy it.
Before I start this presentation I would like to make some observations. First, the term "filters" is used during this presentation in reference to a set of procedures aimed at filtering search results. Filtering matters to legitimate web sites because it affects how well they rank on search engine result pages. Second, patent publication does not imply implementation. Just because a search engine has published a patent does not mean that the patent has been implemented or will be implemented. However, understanding how these patents work at the basic level can protect search engine optimizers and marketers from unexpected surprises.
The Google 2003 Patent
A few years ago, in September of 2003, the USPTO.gov site published the patent Detecting query-specific duplicate documents, herein referred to as the "2003 patent". In that patent, Google describes a procedure for the identification of duplicates from a set of pre-ranked candidate results. This patent is slightly different from the patent Detecting duplicate and near-duplicate files, which was published a few months later, in December of 2003. Files identified as duplicates or near duplicates can be dropped from indexes because they crowd out other legitimate web sites in the rankings.
In a nutshell, the 2003 patent describes the following. First a user submits a query to a search engine and the system returns a set of candidate results. This set is ranked according to a relevance scheme. Then, a query-dependent filter is applied. The user is then presented with a final answer set that does not display duplicates or near duplicates. In a sense, this could be viewed as a re-ranking method, not because relevance scores are recomputed, but because suppressing the dupes has the net effect of pushing the remaining results up.
The 2003 patent fulfills two important requirements:
Uniqueness: A query-specific filter is used to identify and remove duplicates and near duplicates. This is very important since for users the notion of duplicated information is subjective and depends on what they are searching for. What some users may consider duplicated content may not be so for others searching for a different piece of information contained in similar documents. Since prior art on duplicate detection is not based on queries or on users' intentions while searching, this makes the invention unique.
Usefulness: The detection of duplicates is not accomplished by comparing entire documents but rather by extracting and comparing specific query relevant snippets (QRs) from documents. Thus, high threshold values can be used and large collections of documents can be compared in a cost-efficient manner. This makes the invention useful.
As we will see during this presentation, understanding this snippet extraction technology and the snippet optimization processes (SOP) involved is important for SEOs, SEMs, copywriters and developers.
According to this patent, Google's snippet extraction technology can be summarized as follows:
Each of the candidate results (CRs) undergoes a process known as document linearization. The goal is to represent each CR as a linearized stream of terms (a text stream).
Next a sliding window of a given length is superimposed over this text stream and shifted. The patent makes provision for using a sliding window consisting of about 15 terms or of about 100 characters.
If a term window is used, then the idea is to shift this window one term at a time, counting either the term frequencies (occurrences) of the queried terms or the number of unique queried terms inside the current window.
The window is shifted until one reaches the end of the text stream. Then all the windows are sorted according to either term frequencies or number of unique queried terms. After sorting, the two windows with the highest counts are used to define a query relevant (QR) snippet. Thus, each snippet should consist of roughly 30 terms or 200 characters.
Once all snippets are generated, these are compared against each other. Rather than comparing a current snippet from the CR set with all previous snippets from this set, a snippet is compared only against snippets already tested and placed in the final answer set. This is done to avoid a phenomenon known as transitivity, which may work as an unfair penalty for results that exhibit in-transit similarity. For example, if B resembles A and C resembles B but not A, then C should not be dropped merely because it is similar to the already-removed B.
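The windowing steps above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions, not the patent's implementation: linearization is reduced to lowercasing and tokenizing, query terms are assumed to be lowercase already, ties are broken by position, and the two chosen windows are required not to overlap so that the snippet comes out at roughly twice the window size. The function names (linearize, score_windows, extract_snippet) are hypothetical.

```python
import re

def linearize(text):
    """Crude stand-in for document linearization: a lowercase stream of terms."""
    return re.findall(r"[a-z0-9']+", text.lower())

def score_windows(terms, query_terms, size=15, unique=False):
    """Shift a window of `size` terms one term at a time and score each position.

    The score is either the number of query-term occurrences in the window
    or, if `unique` is True, the number of distinct query terms in it.
    """
    query = set(query_terms)
    last_start = max(len(terms) - size, 0)
    scored = []
    for start in range(last_start + 1):
        hits = [t for t in terms[start:start + size] if t in query]
        score = len(set(hits)) if unique else len(hits)
        scored.append((score, start))
    return scored

def extract_snippet(text, query_terms, size=15, unique=False):
    """Return the two best-scoring windows, in document order, as lists of terms."""
    terms = linearize(text)
    scored = score_windows(terms, query_terms, size, unique)
    # Highest score first; ties go to the earliest window (an assumption).
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    chosen = []
    for _score, start in ranked:
        # Assumption: skip windows that overlap one already chosen.
        if all(abs(start - c) >= size for c in chosen):
            chosen.append(start)
        if len(chosen) == 2:
            break
    chosen.sort()
    return [terms[s:s + size] for s in chosen]

if __name__ == "__main__":
    doc = "red fox one two three four blue fox fox five six"
    print(extract_snippet(doc, ["fox"], size=3))
```

With the default size of 15 terms, the two windows together give a snippet of about 30 terms, which matches the length mentioned above.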
To actually test for duplicated content, the 2003 patent suggests that one could use different measures, even measures from other patents. For instance, one could use
Cosine similarity measures; e.g., local snippet vectors.
Resemblance measures; i.e., Altavista's "shingles method" (Broder's Method).
Edit distances and other measures.
I would like to discuss at least the first two measures: (a) cosine similarity and (b) resemblance measures.
Cosine Similarity Measures
To use cosine similarity, we create a term space from all QRs. This is done by removing stop words and by creating a local index of terms from the QRs. A term-snippet matrix is required for this purpose. In this term space each term represents a dimension and term weights are defined using local information. The idea is to treat snippets as local vectors "living" in this term space, as follows.
We first represent any two snippets as points in this multidimensional term space. From the coordinates of each point we define a dot product, and from the displacement of each point from the origin we define the magnitude of the associated vector. With this information the cosine similarity is computed.
What is important about the cosine similarity is that, for term weights that are never negative, it runs from 0 to 1. As the measure approaches 1 the angle between the two vectors shrinks, meaning that the two vectors are very close and that the two snippets are very similar or roughly the same. Thus, we can set a threshold value, and if the cosine similarity between any two local vectors is above this threshold value we can make a decision and say "well, these two snippets are near duplicates", or we can conduct additional tests.
When using this method, however, one must be aware that snippets containing the same terms but in a different order may produce false positives and, thus, alternate dupe tests should be conducted. An alternate method worth mentioning is based on the resemblance between any two text streams.
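A minimal Python sketch of the cosine test follows. It assumes raw term counts as the local weights and a tiny illustrative stop-word list; neither choice comes from the patent, and the function names are hypothetical. It also shows the word-order caveat: two snippets with the same terms in a different order score as identical.

```python
import math
from collections import Counter

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def term_vector(snippet):
    """Map a snippet to term counts, ignoring stop words (assumed weighting)."""
    return Counter(t for t in snippet.lower().split() if t not in STOP_WORDS)

def cosine_similarity(snippet_a, snippet_b):
    """Cosine of the angle between the two snippet vectors."""
    va, vb = term_vector(snippet_a), term_vector(snippet_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    # Same terms, different order: the vectors coincide.
    print(cosine_similarity("dog bites man", "man bites dog"))
    # Two of three terms shared.
    print(cosine_similarity("search engine patents", "search engine marketing"))
```

Because the vectors discard term order, a score near 1 by itself says nothing about whether the two snippets read the same, which is why a second, order-sensitive test is useful.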
The Altavista 2001 Patent
The 2003 patent suggests that we could also use resemblance measures to test any two snippets. The resemblance measure is described in two patents from the late 90's, Method for determining the resemblance of documents (3) and Method for determining the resemblance of documents (4).
Resemblance analysis (shingling) was suspected to be part of one of the infamous Altavista anti-spam filters from the 90's. Back then, resemblance and containment measures were well known by IR folks but, perhaps luckily, unheard of by SEOs and spammers. In those days, the few of us who knew about this patent had a lot of fun playing with and testing resemblance-based anti-spam filters and devising some "toy filters".
The resemblance measure consists of the following. First, two text streams are selected and compared, again by superimposing a q-gram or q-shingle, where q is the number of terms used to define the shingle or sliding window. So let us say we use a 4-term shingle to compare the following two snippets, A and B:
A = a rose is a rose is a rose
B = a rose is a flower which is a rose
Computing Resemblance Measures
To compute the resemblance of the A B pair the shingle is shifted, one term at a time. The idea is to compute the total number of individual shingles, Sa and Sb, that correspond to each text stream and the number of common shingles, Sab. The patent mentions that we could also define Sab as the number of unique common shingles.
This shingles method (also known as Broder's Method) is more complex than this and involves Rabin's fingerprinting, shingle randomization and a minwise algorithm. To keep the discussion simple, I will neither discuss nor accept questions on these subjects. However, if anyone in the audience is interested in learning more about these topics, feel free to contact me and I'll be happy to discuss these subjects in detail.
Once the Sa, Sb and Sab quantities are computed, we need to estimate the resemblance measure, r(Sa, Sb). This is expressed in terms of a Jaccard's Coefficient, which is computed by taking the intersection/union ratios from the corresponding Venn Diagram. Since the resemblance measure runs from 0 to 1, we can set a threshold value, for instance, r(Sa, Sb) > 0.80. Thus, in this case if any two snippets share more than 80% of the generated shingles we would assume that the snippets are near duplicates.
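For the A and B snippets given earlier, the computation can be sketched as follows. This is a simplified illustration that counts unique shingles and uses plain Python sets; it leaves out the fingerprinting, randomization and minwise steps of the full method. With unique counts, the Jaccard ratio equals Sab / (Sa + Sb - Sab). The function names are hypothetical.

```python
def shingles(text, q=4):
    """Return the set of unique q-term shingles in a text stream."""
    terms = text.lower().split()
    return {tuple(terms[i:i + q]) for i in range(len(terms) - q + 1)}

def resemblance(text_a, text_b, q=4):
    """Jaccard coefficient of the two shingle sets: intersection over union."""
    sa, sb = shingles(text_a, q), shingles(text_b, q)
    union = sa | sb
    if not union:
        return 0.0  # both streams are shorter than q terms
    return len(sa & sb) / len(union)

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"

if __name__ == "__main__":
    print(len(shingles(A)), len(shingles(B)))  # unique shingles: Sa = 3, Sb = 6
    print(resemblance(A, B))                   # Sab = 1, so 1 / (3 + 6 - 1) = 0.125
```

Here A yields three unique 4-term shingles, B yields six, and only "a rose is a" is common to both, so the pair falls far below a 0.80 threshold and would not be treated as near duplicates.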
I must point out that the chunking methods described in the Google and Altavista patents are sensitive to the scale of observation utilized. This means that the methods can lead to false positives or false negatives. For instance, small windows can result in false positives, meaning that unrelated text streams could appear to be similar. On the other hand, large windows can lead to false negatives, meaning that small edits can have a large impact.
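The effect of the window size is easy to see with the same A and B snippets. The sketch below recomputes the unique-shingle resemblance for q = 1 to 4; it is a toy illustration on two short strings, not a measurement of how any search engine behaves, and the helper names are hypothetical.

```python
def shingle_set(text, q):
    """Unique q-term shingles of a text stream."""
    terms = text.lower().split()
    return {tuple(terms[i:i + q]) for i in range(len(terms) - q + 1)}

def jaccard(text_a, text_b, q):
    """Intersection over union of the two shingle sets."""
    sa, sb = shingle_set(text_a, q), shingle_set(text_b, q)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"

# Resemblance of the same pair at different scales of observation.
by_window = {q: jaccard(A, B, q) for q in range(1, 5)}

if __name__ == "__main__":
    for q, r in by_window.items():
        print(q, round(r, 3))
```

For this pair the measure falls from 0.6 with single-term shingles to 0.125 with 4-term shingles, so whether the two snippets pass or fail a fixed threshold depends heavily on q.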
A good question is whether Google is implementing this patent. Well, only Google knows the answer to this question and whether or not they look for the best-optimized snippets in documents. I was curious as to what would be an optimum snippet length so I conducted several tests.
I queried Google using the default FINDALL mode for several common queries such as email marketing, search engines, real estate, fractal patterns and the like. Then I counted the length of the snippets from the top 10 results and computed the average length of these, counting both terms and characters (with and without spaces). Since the 2003 patent suggests the inclusion of titles when counting windows but mentions nothing about counting URLs, I did not include any URL in the analysis.
Interestingly the average snippet length was a bit below 30 terms (around 26 to 28) and below 200 characters (with spaces). Surprisingly this is in good agreement with the suggested length described in the 2003 patent in which two 15-term or two 100-char windows are recommended.
Some may claim that Google is not using this patent at all or that the experimental 30 and 200 marks are not proof of implementation. Others may argue that after removing duplicates, Google may be post-refining the snippets in order to improve their readability and flow. Still others may claim that Google is using a modified version of the 2003 patent. The bottom line is that all these claims are pure speculation and only Google knows whether or not they are using the 2003 patent. However, if Google is not implementing this patent at all, then why should we care about snippet optimization processes (SOP)? Good question.
Benefits Derived from SOP
Well, it turns out that understanding the technology behind snippet optimization processes might be important to SEOs, SEMs and copywriters. With the proper knowledge, members of this industry can understand the how, the what, and the why of snippets displayed by search engines, so that they can design highly optimized snippets that would improve click-throughs in the organic results. This is important since users are often more compelled to click on a result based on how relevant or readable these snippets look.
Understanding SOP permits the optimizer to conduct some tests just in case he or she suspects that a given search engine is using snippet-based filtering techniques. The optimizer or copywriter not only may be able to avoid unexpected surprises but can actually identify specific portions of text and optimize these according to the local contextual information or surrounding text.
From the development side, understanding SOP enables developers and marketers to design hierarchical clustering interfaces for content categorization. Currently, there are several web properties displaying such interfaces next to the SERPs. For instance, Mooter, Kartoo, IBoogie, SnakeT, Copernic, Vivisimo and other web properties have interfaces that produce categorized information in the form of clustered terms.
These interfaces are indeed based on the extraction of relevant phrases and terms from SERP snippets. Owners of search-driven sites can mimic similar interfaces by implementing SOP rather than by merely extracting terms from meta description or meta keyword tags. In this way, the hierarchical interface is query-driven rather than metadata-driven. This provides users with a more contextual and relevant search experience.
Developers familiar with SOP can provide KWIC-based programs and services. KWIC, short for keywords-in-context, is a legacy idea from the old IR days. Here the goal is to match paid results or services with snippets triggered by a query. One would expect to improve click-throughs when snippets displayed in paid and organic results are well matched.
Last but not least, developers can use SOP to design keyword suggestion tools, meta description generators and other types of text summarization tools. In each case the tools can be used to generate terms that have been extracted from the query-triggered snippets.
Q Can my sites be recognized as duplicates if I host these using the same IP address?
A The answer to this question can be found in a different patent published by Google. Dubbed by some SEOs as "LocalRank", that patent describes an algorithm for re-ranking results. Unfortunately we do not have enough time to discuss LocalRank. However, let me say this. According to the LocalRank patent, sites belonging to the same IP or host and shown in the candidate set of results are not included in the final answer set presented to the user. Just in case, you may want to consider this and use different IPs and hosts for your domains.
Q You mentioned that once a pair of documents is identified as dupes one is not displayed in the final answer set. My question is which of the two documents is not included in the final answer set?
A Good question. According to the 2003 patent, the least relevant one is not included in the final answer set. This is very important since the 2003 patent says nothing about PageRank or the age of the documents displayed in the set of candidate results but not displayed in the final answer set. The determining factor here is whether the documents are considered duplicates or near duplicates.
To illustrate, suppose that for a given query a set of candidate results is ranked as follows:
A > B > C > D > E
Now let us say that A ~ B and C ~ D, where the "~" symbol means that these are duplicates or near duplicates. Let us also assume that B has a higher PageRank than C and that D is older than E; i.e.
B PageRank > C PageRank
D is older than E
Since B and D are duplicates of the higher-ranked A and C, they will not be included in the final answer set, regardless of their PageRank or age. Thus, after submitting the query the user should see something like this:
A > C > E
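This selection can be sketched in Python as a single pass over the ranked candidates, where each candidate is compared only against results already kept. The is_duplicate test here is a hypothetical stand-in that simply looks up the pairs declared in the example; in practice it would be one of the similarity measures discussed earlier combined with a threshold.

```python
def filter_duplicates(ranked_candidates, is_duplicate):
    """Keep each candidate, in rank order, unless it duplicates a kept result."""
    final_answer_set = []
    for candidate in ranked_candidates:
        # Compare only against results already placed in the final answer set.
        if not any(is_duplicate(candidate, kept) for kept in final_answer_set):
            final_answer_set.append(candidate)
    return final_answer_set

# Duplicate pairs declared in the example: A ~ B and C ~ D.
DUPLICATE_PAIRS = {frozenset(("A", "B")), frozenset(("C", "D"))}

def is_duplicate(x, y):
    """Hypothetical stand-in for a snippet similarity test above a threshold."""
    return frozenset((x, y)) in DUPLICATE_PAIRS

if __name__ == "__main__":
    print(filter_duplicates(["A", "B", "C", "D", "E"], is_duplicate))
```

PageRank and document age never enter the loop; only the rank order of the candidates and the duplicate test decide which results survive.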
I have read in discussion forums many theories about PageRank or the so-called Google "Sandbox Effect". The very same promoters don't even know how to explain the nature of these theories. This "sandbox" is just a preliminary waiting time and scrutiny period in which most (not all) new sites are placed until they are properly evaluated. It is not a dupe filter or a penalty. There is no such thing as placing old, ranked sites in a mythical sandbox. Personally, I would stay away from sites that promote these theories to explain the sudden disappearance of documents from the SERPs.
Q I found all this very fascinating. How do I explain to a client or to my SEOs how to use this information when optimizing? Can you give me some optimization advice?
A Sure. This is something that has worked for me and you may want to give it a try. I hope I don't spill the beans; yet, I'm not going to give away my family jewels. First, I write my intended keywords at the beginning of the title tag and meta description tag. I then include my <h1> and <h2> header tags right after the body tag.
If necessary these are styled with CSS. I define the <h1> identical to the <title> tag and the <h2> identical to the meta description tag. Then I count the total number of terms in the <h1> plus in the <h2>, making sure that the total number of terms is within the 30-term mark.
After optimizing the rest of the document, I submit this to Google and once indexed and ranked I search Google for the intended query. Often, Google displays the <h1> and <h2> as my intended snippet in its entirety; i.e., without breaking it or adding any ellipsis. When writing these, I make sure that the entire snippet is optimized and user-friendly from the readability or flow standpoint. Again, this has worked just fine for me. Not only this, but often the documents rank very high. You may want to try this suggestion or a variant of it.
Of course, for some alternate queries, Google will either include ellipses in the intended snippet or will break it. In this case I go back to the document source code, find the portion of text associated with the sliding window or snippet, and edit the text. I then either wait for Google to reindex the document or resubmit the document to Google (not necessary but optional). Then I re-query Google to see if the intended snippet is displayed in its entirety. In my experience, two or three cycles of re-submissions/edits are required to have the target snippet displayed as intended. I call this trial-and-error technique snippet targeting. I found this approach quite useful for snippet optimization and for testing search engine retrieval behaviors.
Q Is there any correlation between ranking results and snippets?
A Not necessarily. These snippets are used to identify duplicates, not to score relevancy. In addition, these are generated by applying sliding windows over text streams that represent documents and by sorting these windows. The top two windows are then used to define a snippet. This is done after ranking the documents. Thus, we cannot go by the content of these snippets; e.g., term distances or keyword density values inside the snippets or the like.
Q Do you know if there are other online resources I can read about these subjects?
A I'm not aware of online resources that discuss in layman's terms the material herein presented. I'm planning to put online some material to familiarize SEOs, SEMs and copywriters with SOP.
Detecting query-specific duplicate documents; Google, Inc.; Patent No. 6,615,209 (2003).
Detecting duplicate and near-duplicate files; Google, Inc.; Patent No. 6,658,423 (2003).
Method for determining the resemblance of documents; Digital Equipment Corporation, Inc.; Patent No. 5,909,677 (1999).
Method for determining the resemblance of documents; Altavista, Inc.; Patent No. 6,230,155 (2001).
Search Engine Strategies 2005 Conference & Expo, San Jose
Dr. E. Garcia