“Hi Gord, Google updated our newly developed Website’s Pagerank. On the toolbar, we’re now a 5, but we’re not seeing good rankings or Google referrals. I can’t figure Google out. What do they want?”
It takes a long time for popular opinion to adjust to a new reality. In the world of search engine optimization, Google’s Pagerank calculation has assumed mythical qualities, but popular understanding lags behind what Google is really doing now. The search engine has changed how it calculates rankings. Whether you’re a developer/writer/SEO/Webmaster or marketer, an understanding of topic-sensitive Pagerank is essential to ensure your Web pages are considered relevant. If you’re looking for a good boost from links from other Web sites, you’d better understand the concept of “link reputation.”
This article describes in simplified terms an information retrieval process that is very complex. However, that complexity isn’t going to stop us from attempting to gain some kind of understanding. No one outside Google really knows the intimate details of each of Google’s algorithm components. There’s a lot of speculation. You don’t need a degree in information science or artificial intelligence to understand what search engines like Google are trying to achieve. They want to study the relationships between documents to distinguish legitimate links from the billions of spam links they crawl every day. What we want to know: how does Google analyze links?
Studying Underlying Meanings in Links
Google uses deep analytical techniques to look into documents and the links between them, as it works to understand the topics they discuss. Yes, Google reads the pages, and counts keywords in title tags, body copy and links. It’s reading between the lines to understand the context of the links it finds in the documents it collects, and analyze them for topical relevance. This is where the term “semantics” comes in: “latent semantics” simply refers to the underlying meaning of content, and this is exactly what Google’s after.
This means that all those paid text links you see people buying today may not have any effect on rankings, simply because the links may ultimately appear in the wrong context on the wrong site: they’re off topic, perhaps, or the advertiser’s site has other issues that prevent those particular links from boosting the target page’s ranking.
Currently, Google is overly dependent on outbound anchor link text for its ranking system. Spammers and SEOs have taken full advantage of this fact, so the search engine keeps digging deeper by assessing documents for “topical relevance”. Yes, Pagerank is still relevant, but only in how it relates to a keyword topic.
Google uses semantic analysis, among other methods, to read deeper into the meaning of links, and is applying Pagerank to keywords themselves. In fact, Google is even applying it, to some degree, to synonyms and related words. This allows the search engine to be more certain that your site is legitimately focused on a given topic, and that the links leading to it are legitimate.
So, although your homepage Pagerank may be seven, for example, it may only rank a two when it comes to a particular keyword phrase. To create and maintain your rankings, you’ll need to understand how Google evaluates links, and assesses Pagerank with respect to individual keywords.
Although most SEOs know about link reputation, few have thought any further than to put keywords into link anchor text, and hope for the best. Rather than worrying about precision work, they just go out and buy a ton of keyword-loaded links. Sometimes this “muscle” approach actually works, but more often the results are disappointing. It seems the search engine isn’t fooled by this glut of links that all seem to suggest that your Web page/site is the top resource on that keyword topic. But how does Google filter out the effect of these links, many of which are from high Pagerank pages on other sites?
This is a pioneering area of SEO and information science. Do a search and you’ll find precious few resources that discuss link reputation analysis (or link citation analysis) or anything about how search engines collect and process link information. Few people know how the information is collected and stored by Googlebot as it spiders, or how it’s processed afterwards. To make matters worse, some research papers use fancy IS language that basically says nothing. Is it a trick?
Google Pagerank: from Simple to Multi-vector
The original, simple Pagerank-dominated algorithm gave equal weight to links between topically unrelated sites. Then, owners of high Pagerank sites began to use their sites’ high Pageranks to boost rankings of other Websites, which made search results less relevant.
To mitigate that problem, Google (and other search engines) went beyond a “popularity” system, and began to study the underlying meaning of links and the context in which each link appeared. They realized, for example, that if a link from a marketing site pointed to another site on marketing, it was probably more relevant than a link from a non-marketing site.
In 2002, Berkeley graduate Taher Haveliwala proposed replacing the single Pagerank vector with a set of Pagerank vectors. These vectors assign Pagerank relative to keywords, so that query-specific importance scores for pages can be generated at query time. It was the first advance beyond a simple popularity-based search engine.
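To make the idea of query-time scoring concrete, here is a minimal sketch in Python. It assumes each topic already has its own Pagerank vector, estimates how strongly a query belongs to each topic, and blends the vectors using those weights. The function names, topics, sites and every number below are hypothetical and chosen only for illustration; this is not Google's implementation.

```python
def topic_weights(query_terms, topic_term_probs):
    """Estimate how strongly a query belongs to each topic.

    topic_term_probs[topic][term] is an ASSUMED probability of seeing
    `term` on pages about `topic`. Uniform topic priors are assumed.
    """
    raw = {}
    for topic, term_probs in topic_term_probs.items():
        score = 1.0
        for term in query_terms:
            # Small floor so an unseen term does not zero out the topic.
            score *= term_probs.get(term, 1e-6)
        raw[topic] = score
    total = sum(raw.values())
    return {topic: score / total for topic, score in raw.items()}


def query_score(page, weights, topic_ranks):
    """Blend the per-topic rank vectors using the query's topic weights."""
    return sum(weights[t] * topic_ranks[t].get(page, 0.0) for t in weights)


# Hypothetical term statistics and per-topic rank vectors.
topic_term_probs = {
    "finance": {"debt": 0.05, "collection": 0.02},
    "art": {"debt": 0.0001, "collection": 0.03},
}
topic_ranks = {
    "finance": {"bank.example": 0.6, "gallery.example": 0.1},
    "art": {"bank.example": 0.05, "gallery.example": 0.7},
}

weights = topic_weights(["debt", "collection"], topic_term_probs)
bank = query_score("bank.example", weights, topic_ranks)
gallery = query_score("gallery.example", weights, topic_ranks)
```

With these made-up figures the query “debt collection” is weighted almost entirely toward the finance topic, so the finance-oriented page outscores the art-oriented one even though the latter has the higher rank within its own topic.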
The analysis of hyperlink meaning has progressed tremendously over the past five years. Latent semantic indexing (LSI) was introduced into the algorithm to look beyond the obvious, to “filter in” relevant link information, so that the most relevant links to valuable Web pages could be discovered. This means content is being analyzed more deeply. LSI was introduced to counter Website owners’ attempts to misrepresent what a link was saying about the page it pointed to. As LSI becomes more strongly implemented, site owners will have to pay attention to using related words, synonyms and stemmed variations in their sites’ copy and links.
Yes, Google does analyze links. It’s the core of the business, and affects organic search results as well as AdSense ad content. Every link contains a meaning that goes far beyond what we see on the two pages involved. The most important calculation is the reputation or authority of the site that contains the link. If it is a trusted site, with a lot of links pointing to it that are on the same topic, then it may have a high authority rating. Its outbound links will be considered worthy and, therefore, will be influential in rankings.
Here’s the thing, though. Even if the topic-relevance element is increasing, site content can still be manipulated to increase topic relevancy, and links from other topic-unrelated sites can still play a role in boosting rankings. Thus, the search engine must look beyond the two Web pages and the link that exists between them. It has to consider more than the raw Pagerank that flows between links: it must also draw conclusions about what those links say. Your site should be well-linked from other sites within your topic “community”, in particular, those sites that are considered to be at the center of that community: “authority sites”. Coincidentally, there is a strong emerging trend toward authority site spam, which is something else that Google has to contend with. So, when you establish or buy links on popular Websites, you should study the topical reputation of that site to know whether a link to your site is actually going to improve your rankings. The Pagerank of the site won’t help.
Pagerank: a Digital Currency
If you’re not familiar with the Pagerank ranking system, this analogy should make things clear: Pagerank (PR) is like a currency. It is liquid and can be granted to any site. It’s a simple number derived from a calculation, so it’s easy to store in a database and credit to a particular Web page.
Google applies this point-based system to all the Web pages within its collective index of pages. A link is considered a ‘vote’ for the site to which it leads, sort of like giving that site some money. And those sites that have the most inbound links get the most ‘money’, so to speak. This currency flows into the site’s Pagerank bank account. The site owners could use that ‘money’ or Pagerank in any way they choose, for instance, to boost the rankings of another Website they own. You can see how this might not be good for Google’s search results quality.
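The currency analogy can be sketched in a few lines of Python. The following is a simplified version of the classic Pagerank calculation, in which every page repeatedly splits its current score among the pages it links to. The damping value of 0.85, the iteration count and the four-page link graph are assumptions made for illustration, not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified Pagerank by repeated redistribution (power iteration).

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline amount.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page "pays out" its rank equally across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: "a", "b" and "d" all link to "c".
web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": ["c"],
}
scores = pagerank(web)
```

In this toy graph, page “c” collects the most ‘money’ because three pages vote for it, and page “d” benefits in turn because “c” passes its whole balance along a single link.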
Now, with topic-sensitive or topic-related Pagerank, we have multiple forms of currency; however, one of these forms of currency is not easily converted into another. A high Pagerank for one keyword phrase does not mean you’ll have a high rank on another, different phrase.
Popularity Systems are Dead! Quantity is no Longer Relevant
In the late nineties, Pagerank was a popularity-based system based solely on the sheer number of links, and which sites were linked to the most. Little consideration was given to what the link from one site or another was truly “saying”. Links do speak: they communicate something about the page to which they lead. But there are billions of misleading, irrelevant links on billions of spam pages created only to divert the search engine. The problem wasn’t just getting rid of worthless links, either. The question of finding the best was the real issue for Google. We’re accustomed to thinking of search engines as focusing on the best content, but Google is now focused on finding the best links.
Topic-sensitive filtering systems attempt to focus on a topical community of Web pages. Because they focus on a topic — looking at the use and presence of that topic on a Website — these systems can filter out the irrelevant pages that pretend to be something they’re not. In this way, search engines can read between the lines. They don’t even have to rely on the keywords used in the pages’ links. They can read the material around the links, assess how the link is written, and consider whether it uses words that are related to the topic itself.
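As a rough illustration of reading the material around a link, the Python sketch below measures what share of the words near a link's anchor text belong to a topic vocabulary. The function name, the eight-word window, the vocabulary and the sample sentences are all hypothetical; a real search engine would use far richer signals.

```python
import re


def context_relevance(text, anchor, topic_terms, window=8):
    """Share of words near the anchor text that belong to a topic vocabulary."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    anchor_words = re.findall(r"[a-z0-9]+", anchor.lower())
    if not anchor_words:
        return 0.0
    n = len(anchor_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == anchor_words:
            # Collect the words just before and just after the anchor.
            start = max(0, i - window)
            end = min(len(words), i + n + window)
            nearby = words[start:i] + words[i + n:end]
            if not nearby:
                return 0.0
            hits = sum(1 for w in nearby if w in topic_terms)
            return hits / len(nearby)
    return 0.0  # anchor text not found in the page copy


# Hypothetical topic vocabulary and page copy.
topic = {"php", "web", "developers", "developer", "code", "scripting"}
on_topic = ("Our PHP developers write web code daily, "
            "see the PHP manual for scripting tips")
off_topic = ("Cheap holiday deals and casino bonuses, "
             "plus the PHP manual and more offers")

on_score = context_relevance(on_topic, "PHP manual", topic)
off_score = context_relevance(off_topic, "PHP manual", topic)
```

Both sample pages use identical anchor text, yet only the first surrounds the link with on-topic words, so it receives the higher score.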
So, when you see a Web page’s Pagerank in the Google toolbar today, it’s an estimate that’s independent of link reputation. In reality, it measures the Pagerank of the pages that are allowed through the filters. This could be the sandbox filter, or other filters that block out what Google refers to as irrelevant links.
You may have seen sites with a PR of seven that have relatively few links pointing to them. In such cases, those links are considered relevant and right on topic, and Google values them highly. Other sites might have a PR of four, yet have thousands of links pointing to them. Why the low PR, then? Those links have obviously been stripped of most or all of their influence through a topic-sensitive filter. If the site that publishes the link has no “reputation” for the keyword in question, little Pagerank is passed to the linked site.
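One way to picture such a filter is to scale the Pagerank a link would normally pass by the linking site's reputation for the keyword. The Python sketch below does exactly that; the multiplication rule, the reputation values and the link counts are assumptions for illustration only, since nobody outside Google knows the real formula.

```python
def passed_pagerank(source_rank, outbound_links, keyword, reputation,
                    damping=0.85):
    """Pagerank one link passes for one keyword, under an ASSUMED filter.

    `reputation` maps a keyword to a value between 0 and 1 describing how
    strongly the linking site is associated with that keyword.
    """
    if outbound_links <= 0:
        raise ValueError("outbound_links must be positive")
    raw_share = damping * source_rank / outbound_links
    # A site with no reputation for the keyword passes nothing for it.
    return raw_share * reputation.get(keyword, 0.0)


keyword = "debt collection"

# Five links from on-topic sites versus 2,000 links from off-topic sites.
few_total = 5 * passed_pagerank(0.02, 10, keyword, {keyword: 0.9})
many_total = 2000 * passed_pagerank(0.02, 10, keyword, {keyword: 0.001})
```

Under these made-up numbers, five well-targeted links deliver more keyword-specific Pagerank than two thousand off-topic ones, which is the pattern described above.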
Topic-sensitive filters are an ingenious and highly effective tool for the search engines. Filters are among the most highly resistant methods of controlling link spam. Yet you might be wondering if a filter can be beaten, and if so, whether your competitors are using this knowledge to their advantage. Chances are that your competitors are taking advantage of the loophole, but probably more through accident than design.
Topic Sensitive Pagerank is a system in which each little portion of your site’s Pagerank (PR) is associated with certain keywords. If your site has a high keyword reputation (e.g. for the keyword “debt collection”), a separate PR has to be applied to those keywords and recorded in Google’s database. This approach makes for a lot of computing and storage. Google can only store so much information as it spiders sites, and during its post-spidering processing activities.
How much Link Information is Stored?
We know that PR passes through a thousand links, but how far can topical information be carried along with links? If Google has enough processing power, it could store link reputation information for every link it finds. This means that the link reputation of the first Web page would be transferred, along with the Pagerank, through a thousand Websites to the very last site in the link chain. In this system, the real meaning of a link in its entirety is evaluated and credited to that last site in the chain, along with the associated Pagerank.
In reality, though, Google and other search engines don’t have the computing power to undertake such analyses. Therefore, even though Pagerank is topic-sensitive, only a small amount of information about that page’s theme or topic can be collected and used in the algorithm. That weakness can be exploited by companies that have huge networks of Websites. A big media company may have 150 Websites, and might link them together to maximize their rankings. In this way, Pagerank can be pooled and delivered strategically to a variety of targeted sites.
If you type “PHP Web developers” into Google search, you’ll note that php.net ranks first. Why? The word “developers” only appears four times on the home page, so something is obviously afoot. If you check the backlinks, you can see that 331,000 links point to php.net. This might lead you to think that it is Pagerank alone that determines the site’s ranking. Take a look at Zend.com, an important site that links to php.net, and you’ll note that this site is all about PHP development. The word “developer” appears on 60,000 pages in the Zend site, the word “PHP” appears on 884,000 pages, and the word “Web” appears on 77,000 pages in the site. The word “company” is semantically associated with “developers” and the word “company” appears in 216,000 pages on zend.com. So the links from Zend.com provide a good boost to php.net’s rankings on the phrase “PHP Web developers”.
You can see how the keyword reputation of a site can give it the power to boost another site’s ranking on a similar topic. Zend.com’s backlinks show links from powerful “authority type” sites such as www.infoworld.com and www.phpindex.com. Check out the links on those sites that point to Zend and you’ll see frequent use of the keywords “PHP” and/or “Web” and/or “developers”, along with words that are semantically related to those words.
At some point, the Pagerank you see in the Google toolbar will be almost irrelevant, as the actual determinants of rank will be hidden within a massive mathematical equation that includes topic-sensitive Pagerank.
Collecting Link Reputation
Link reputation is collected and stored as the Googlebot crawls the Web. Since the amount of accumulated link reputation data is so vast, Google needs a way to synthesize it all. It does so by passing a simplified version of the keyword phrase, and that “link reputation data bit” is passed on through perhaps hundreds of sites consecutively, along with thousands of other link reputation data bits. That’s why getting a link from a site with a high reputation for a one- or two-keyword phrase (e.g. computers) is better than a link from a site with a high reputation for a three- or four-keyword phrase (e.g., cheap laptop computers). The one- or two-keyword link reputation value is more prevalent on the Web, and more of that two-keyword specific value will be passed onto your site.
Conclusion
Topic-sensitive Pagerank calculation is very complicated, involving link anchor text and the keyword content of the linking page itself. The system uses keywords as well as synonyms and related words. It may also involve a Trustrank factor that Google assigns sites. Although the Google toolbar Pagerank display is given credit for high rankings, the truth is that Google analyzes the links that lead to a particular page. When they’re on-topic, these links pass high levels of topic-sensitive Pagerank to the target page. The Web page will then top the search results for that particular phrase.
Further Reading
A PDF report, by Taher H. Haveliwala, formerly of Stanford University, now employed by Google. The first page in the paper states:
In the original PageRank algorithm for improving the ranking of search-query results, a single PageRank vector is computed, using the link structure of the Web, to capture the relative “importance” of Web pages, independent of any particular search query.
[Indicates that the rankings were not based on keywords, but rather the popularity or quantity of links pointing to a particular page.]
In Google’s technology overview, the third paragraph notes:
Google uses PageRank to examine the entire link structure of the web and determine which pages are most important. It then conducts hypertext-matching analysis to determine which pages are relevant to the specific search being conducted. By combining overall importance and query-specific relevance, Google is able to put the most relevant and reliable results first.
Google’s AdSense information refers to filtering content, copy and links to determine the appropriate ads to display on an advertiser’s site.
SEOBook.com suggests Google is using LSI, although Google has never stated it does. This Google query for “zoo” shows that the search engine has LSI features that it can draw upon to better qualify searches (zoo is related to wildlife, animals or aquarium using this query feature).
This PDF from CiteSeer describes the algorithm component known as “expert documents.” This is thought to have been integrated into Google’s algorithm mix.
Google’s discussion of Pagerank for laypersons.
A report on Topic Sensitive PageRank, which mentions that Google had hired Taher H. Haveliwala, formerly of Stanford University, who had written a paper on the topic (here in PDF).
An article on link reputation, by Michael Marshall.
Gord Collins has been an SEO Specialist since 1998 and has written two books on the topic of search engine optimization. His firm, Bay Street SEO, provides consulting and Web development services for companies across North America.