Google PageRank – Democracy or Corporate Muscle?

Google has become the world’s favorite search engine, and on average it probably brings Websites over 50% of their new visitors (when you take into account visitors from Yahoo Web page searches that are also provided by Google). For many Websites, mine included, Google brings nearer to 90% of all new traffic.

Recently, Google PageRank has attracted some controversy. Now that the dust has settled a little, this article attempts to take a more rational look at PageRank and its strengths and weaknesses, and to consider where Google could go from here.

What Is PageRank?

Google make big claims for PageRank. They explain the concept of PageRank as follows:

PageRank relies on the uniquely democratic nature of the Web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search...

The original algorithm for calculating PageRank was published by the founders of Google, Sergey Brin and Lawrence Page, in the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine". Although Google may well have refined the algorithm since then, we know from this paper that the PageRank of a Web page is a number calculated using a recursive algorithm in which the page receives a share of the PageRank of each page that links to it. The share that page A receives from page B depends on the number of outgoing links on Page B (as the number of links increases, the value of each link decreases).

In other words, PageRank is a mathematical calculation that takes into account only the number of pages and the number of links on those pages in the whole Web of hyperlinks that lead to the page in question. Content is not taken into account when PageRank is calculated. Content is taken into account when you actually perform a search for specific search terms.
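
As a rough illustration, the recursive calculation described above can be sketched in a few lines of Python. The formula follows the one given in the Brin and Page paper, PR(A) = (1 - d) + d × (sum of PR(T) / C(T) over every page T linking to A), where C(T) is the number of outgoing links on T. The damping factor d = 0.85 is the value suggested in that paper. The function name, the fixed number of iterations and the tiny four-page Web are my own assumptions for illustration, and whatever Google actually runs today is no doubt more elaborate.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank.

    links maps each page to the list of pages it links to.
    Sketch only: pages with no outgoing links simply pass nothing on.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Start every page at 1.0 and repeatedly redistribute the shares.
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {page: 1.0 - damping for page in pages}
        for source, targets in links.items():
            if not targets:
                continue
            # Each outgoing link carries an equal share of the source's rank,
            # so more links on a page means less value per link.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: C is linked to by A, B and D.
example_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy example page C comes out on top because three pages link to it, and page D, which nobody links to, is left with only the baseline value of 1 - d. Notice that nothing in the calculation looks at what any of the pages actually say.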

Who Benefits?

So how do Google make the leap from this relatively simple concept, to claiming that "Important, high-quality sites receive a higher PageRank"? Well, as they say, they interpret a link from page A to page B as an indication of the importance and quality of page B. But of course, there are many other reasons why page A might link to page B:

• The owner of page A wants to promote page B because it is part of his own Website
• The owner of page A wants to promote page B because it is another Website that he owns
• The owner of page B pays for an ad on page A
• The owners exchange reciprocal links specifically to boost PageRank
• The owner of page A is an affiliate of page B and receives commission on sales
• Page A is a news story (good or bad) about page B’s Website

In most of these cases, the importance or quality of page B has little to do with its link being placed on page A. Worse still, in many cases it is simply commercial interest that drives the number of links to page B.

The result is that PageRank favors business, and particularly big business. A business that sells a product or service on its Website will naturally receive PageRank because of affiliate links, advertising and resources devoted to Web promotion. A Website that offers information or free services will find it much more difficult to attract incoming links, and therefore, to achieve a good PageRank. It does seem that corporate muscle is useful when it comes to winning PageRank.

But That’s Not All…

When you actually perform a search on Google, PageRank is only one of the factors that are taken into account in deciding which results are presented, and in what order. Google’s own explanation continues as follows:

Of course, important pages mean nothing to you if they don’t match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page’s content (and the content of the pages linking to it) to determine if it’s a good match for your query.

What this means is that it’s a combination of content and PageRank that determines the sequence or the ranking of the search results that Google returns. The ranking of search results is very important, as most users won’t bother to look beyond the first 20 results or so. It’s important to the user, because if the search engine doesn’t return the most relevant results in the first 20, the user gives up on the search and loses faith in that search engine. It’s obviously important to a Website to be listed in the first 20 results for relevant search terms, otherwise that Website will receive very little traffic from the search engines.

For most searches, Google’s ranking algorithm works very well for the user, and indeed I personally use Google for almost all my searches. Google usually returns relevant results, and often returns what I would consider to be the most important Website first. It is the PageRank factor that ensures that when you search for "Amazon", it’s Amazon.com’s home page that is returned first (although I’m not sure why anyone would need a search engine to find Amazon!).

Unexpected Effects

However, PageRank is such an important factor in the ranking of search results that it can have some very significant effects.

Restricted Competition

One effect is illustrated by a search on Google for the words apparel store. Within two months (and maybe sooner) of Amazon opening its apparel store, this search returned Amazon’s apparel store first in the results. The reason that Amazon’s page is ranked first is of course that as well as being relevant, it has gained massive PageRank from being on Amazon’s Website. The Amazon apparel store may well be an important, high-quality site, but that is not the reason it has acquired its PageRank. It has gained its PageRank by being part of the huge Amazon.com site for books, CDs, etc., and all those affiliate links to the books section in particular.

Does this matter to the user? In the short term, probably not. The user has received a relevant set of search results, and may even be pleased to have found Amazon’s apparel store. In the longer term, however, it may matter more. Small companies and small Websites find it hard to gain PageRank, and therefore, top rankings in the search results, no matter how relevant their sites may be. This represents a barrier to new entrants in the market, which in the longer term restricts competition and damages consumer choice. With Google’s increasingly dominant position, that side-effect is something to be concerned about.

Decreased Relevancy

In some cases, the effect of PageRank does actually damage the relevancy of results. If you search for free Web page in Google’s standard search, the top ranking result is for digits.com, offering free page counters, and only 5 out of the first 10 results offer free Web pages. The others offer free search engine submission, free translation, and free font downloads (from Microsoft.com). In the case of digits.com, it is the back links required on all sites using the digits.com page counter that have given the site a huge PageRank, bringing it to the top of the results.

The search results for this search can be improved if you use an exact phrase search on Google’s advanced search page or put the search phrase in quotes, but I suspect only a tiny proportion of users ever use these options. In an exact phrase search for "free Web page", digits.com drops to number 4 in the results. However, this number 4 ranking is still a little surprising, given that the phrase "free Web page" does not appear on the page at all, and appears only in links pointing to that page. This illustrates the importance of link text in search engine optimization for Google.

Suppose you are looking for office space in New York. You might search for New York Office. In this case the top ranking page, whether you use an exact phrase search or not, is the page "New York Governor George E. Pataki", which again does not contain the exact search phrase on the page, but only in links pointing to the page. The page does however have a Google Toolbar PageRank of 9 to account for its position. In fact if you use an exact phrase search for "New York Office", I don’t think any of the top 20 pages contain the exact search phrase other than in links pointing to them!

Of course if you try hard enough, you can get all sorts of odd search results! How about a search on Google for Biggest Garden On Earth! Guess what Google returns first? Yes — the Amazon.com home page! Why? Because the page title is "Amazon.com–Earth’s Biggest Selection" containing two of the keywords, another keyword "Garden" is on the page, and of course it has massive PageRank.

What this shows is that if you have a Website with Google ToolBar PageRank of 9 or 10, like Amazon, Microsoft, Adobe, etc. then you are virtually guaranteed a Google top ranking for the keywords of your choice on a new Web page, if you put those keywords in the title of your page and link to it from the rest of your site. The content of the page would not matter at all. It does make you wonder why Amazon bothers to use Google Adwords so much!

These examples are of course the exceptions. As I’ve said, Google is the search engine many people prefer to use, and for the vast majority of searches, Google returns relevant search results that satisfy the user. And this is exactly what Google needs to do to continue to win market share.

Fortunately for most Webmasters there are plenty of search terms where you are not competing with the likes of Amazon and Microsoft. With some careful attention to page titles, page content and link text, it is possible to achieve reasonable rankings within the search results. For instance, one of my Websites has the top ranking for its two most relevant traffic-generating search phrases, and another has a number 3 ranking for my preferred search phrase. Yes, you do need to make sure you obtain links from pages with reasonable PageRank, but it is not usually necessary to go to the extremes of search engine optimization. In fact you need to be careful not to go beyond the bounds of what Google consider to be ethical search engine optimization techniques, otherwise you will receive the dreaded PageRank Zero penalty!

However, with Google’s increasingly dominant position, the search giant will come in for more and more criticism if its search results are seen to work in favor of big business and against free market competition.

Google are of course working hard all the time to improve their algorithms and it will be interesting to see whether these sorts of concerns are taken on board and addressed.

In the short term, Google may need to consider these points:

• Increase the weighting of proximity of keywords, which would increase the rankings of exact phrase matches, even if an exact phrase search was not specified.
• Increase the weighting of keywords in visible text on the page in order to reduce the number of times pages are included in the results only with keywords in links pointing to the page.
• Consider capping the weighting of PageRank at some value so that pages with a very high PageRank are less overwhelming. Alternatively, vary the scale so that as you move up the PageRank ladder, the increase in weighting does not increase proportionally. The scale is probably already logarithmic, but it doesn’t seem to have the desired effect.
• Default searches to look for both the singular and plural forms of search words. This is a controversial suggestion, as some searches work better this way, while for others it’s a detrimental step. However, I believe more searches will be more successful if this approach is taken. It could perhaps be introduced as a selectable option in the advanced search.
• Reduce the weighting of keywords in the page title. This is one part of the Web page that users hardly look at, and is therefore easy to manipulate. This suggestion will therefore be unpopular with Webmasters!
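
To make the capping and scaling suggestion in the list above a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: Google does not publish how it combines PageRank with content matching, so the function name, the simple multiplication, the cap value and the scores are all assumptions chosen only to show the effect.

```python
import math


def combined_score(content_score, pagerank, cap=None, log_scale=False):
    """Hypothetical ranking score: content match multiplied by a PageRank weight.

    cap       -- optional upper limit on the PageRank weight
    log_scale -- if True, compress the weight so that it grows ever more
                 slowly as PageRank rises
    """
    weight = pagerank
    if log_scale:
        weight = math.log1p(weight)  # log(1 + pagerank)
    if cap is not None:
        weight = min(weight, cap)
    return content_score * weight


# Assumed numbers: a giant site with a weak content match versus
# a small site with a strong content match.
giant = {"content_score": 0.2, "pagerank": 1000.0}
small = {"content_score": 0.9, "pagerank": 10.0}

for label, options in [
    ("raw PageRank", {}),
    ("capped at 20", {"cap": 20.0}),
    ("log scale", {"log_scale": True}),
]:
    giant_score = combined_score(**giant, **options)
    small_score = combined_score(**small, **options)
    winner = "giant" if giant_score > small_score else "small"
    print(f"{label}: giant={giant_score:.2f} small={small_score:.2f} -> {winner}")
```

With the raw weighting the giant site wins easily despite its poor match. With either the cap or the compressed scale, the better-matching small site comes out ahead. Where exactly the cap should sit is of course the hard part, and the sketch says nothing about that.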

Now, for the big one!

What is really needed is content- or topic-sensitive PageRank. In other words, PageRank should be calculated for each search term used, so that PageRank is only accumulated from links from relevant pages all the way back through the whole Web of links. The problem is that the content factors of the search ranking algorithm are only evaluated at the time of the search, and to calculate PageRank at search time would be impossibly slow, especially as it is a recursive algorithm.

However, there have been research papers published on proposals for calculating content-sensitive or topic-sensitive PageRank at crawl-time. One such paper is "Topic-Sensitive PageRank" by Taher H. Haveliwala (be prepared for some mathematics if you read that paper!). Haveliwala proposes that for each Web page, a separate PageRank is calculated for each relevant topic represented by the categories of the Open Directory Project. Because the number of topics is limited to Open Directory categories, and because most Web pages will not have content relevant to many of those topics, the amount of computing power required is not impossible.
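
The general idea of a topic-biased PageRank can be sketched as follows. This is a much simplified illustration, not the method from the paper in full: it assumes that the only change is to restrict the "random jump" part of the calculation to pages known to belong to a topic, so that rank can only flow outwards from on-topic pages. The function name, the page names and the link structure are all made up for the example.

```python
def topic_pagerank(links, topic_pages, damping=0.85, iterations=50):
    """PageRank variant in which the random jump lands only on topic_pages.

    links maps each page to the list of pages it links to.
    Sketch only: pages with no outgoing links simply pass nothing on.
    """
    topic_pages = set(topic_pages)
    pages = set(links) | topic_pages
    for targets in links.values():
        pages.update(targets)

    # Jump probability is shared equally among the topic pages, zero elsewhere.
    jump = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
            for p in pages}

    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * jump[p] for p in pages}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical Web: a small gardening cluster, plus a megastore that
# collects links from several book pages.
web = {
    "garden_blog": ["seed_shop"],
    "seed_shop": ["garden_blog"],
    "book_blog_1": ["megastore"],
    "book_blog_2": ["megastore"],
    "book_blog_3": ["megastore"],
    "megastore": ["book_blog_1"],
}

all_pages = list(web)
general = topic_pagerank(web, topic_pages=all_pages)        # no topic bias
gardening = topic_pagerank(web, topic_pages=["garden_blog"])

print("general:  ", round(general["megastore"], 3), round(general["seed_shop"], 3))
print("gardening:", round(gardening["megastore"], 3), round(gardening["seed_shop"], 3))
```

Without any topic bias the megastore outranks the seed shop, simply because more pages link to it. With the jump restricted to the gardening page, the megastore receives no gardening rank at all, since none of its incoming links can be traced back to a gardening page. A separate set of values like this would have to be calculated and stored for every topic, which is where the extra computing cost comes from.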

Another paper is "The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank" by Matthew Richardson and Pedro Domingos, who propose pre-calculating separate PageRanks for all search terms. Their experiments suggest that even for millions of search words, the computing power and storage required are (only!) between 100 and 200 times those needed for calculation of a single PageRank.

The problem for Google is that the last thing they want to do is to increase the time it currently takes for the crawl and the update. At the moment their efforts are spent finding ways to update the index more frequently so that their search results reflect what’s in the Web today, and not what was there last month.

Google is going to find it hard to balance all the demands and pressures it faces, but I’m pretty sure they are better equipped to succeed than most. Time will tell…