Search Engine Indexing Limits: Where Do the Bots Stop?

The SEO community holds many different opinions about how much text the search engines index on a single Web page. The question is, how large should an optimized page be? Where is the balance between a page so short that SEs disregard it as "non-informative" and one so long that it leaves potentially important content beyond the spiders’ attention?

As far as I know, no one has yet tried to answer this question through their own experimentation. The participants of SEO forums typically confine themselves to quoting guidelines published by the engines themselves. Today, the belief that the leading search engines stop indexing page text at the notorious "100 KB" limit is still widely held within the SEO community, leaving SEOs’ customers scratching their heads as they try to figure out what to do with the text that extends beyond this limit.

Running the Experiment

When I decided to set up an experiment to answer this question practically, my goals were:

  • determine the volume of Web page text actually indexed and cached by the search engines
  • find out if the volume of text indexed depends on the overall size of the HTML page

Here’s how this experiment was actually conducted. I took 25 pages of different sizes (from 45 KB to 4151 KB) and inserted unique, non-existent keywords into each page at 10 KB intervals (that is, a unique keyword was included after each 10 KB of text). These keywords were auto-generated exclusively for this experiment and served as "indexation depth marks". The pages were then published, and I went to make myself some coffee because waiting for the robots to come promised to be a slow process! Finally I saw the bots of the Big Three (Google, Yahoo!, and MSN) in my server logs. The site access logs provided me with the information I needed to proceed with the experiment and finish it successfully.

It’s appropriate to note that I used special, experimental pages for this test. These pages reside on a domain that I have reserved for such experiments, and contain only text with the keywords that I needed for the experiment. Such pages, filled with senseless text and stuffed with abracadabra words every now and then, would certainly raise eyebrows if a human happened to see them. But human visitors were definitely not the expected audience here.
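A minimal sketch of how test pages like these could be generated. This is not the script used in the experiment: the function names, the "lorem" filler, and the 12-letter marker length are all assumptions made for illustration, and the page is plain ASCII text rather than full HTML.

```python
import random
import string

INTERVAL = 10 * 1024  # one marker per 10 KB of text, as in the experiment


def make_marker(rng, length=12):
    """Return a random lowercase string that is very unlikely to be a real word."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_test_page(total_kb, seed=0):
    """Return (text, markers) where markers is a list of (depth_kb, keyword).

    Each 10 KB block is ASCII filler ending in one unique keyword, so the
    keyword for depth_kb sits just before that offset. Any remainder
    smaller than 10 KB is dropped in this sketch.
    """
    rng = random.Random(seed)
    blocks, markers, used = [], [], set()
    for block_number in range(1, total_kb // 10 + 1):
        marker = make_marker(rng)
        while marker in used:  # regenerate on the (unlikely) collision
            marker = make_marker(rng)
        used.add(marker)
        # Leave room for the marker plus one space on each side of it.
        pad_len = INTERVAL - len(marker) - 2
        filler = ("lorem " * (pad_len // 6 + 1))[:pad_len]
        blocks.append(filler + " " + marker + " ")
        markers.append((block_number * 10, marker))
    return "".join(blocks), markers


text, markers = build_test_page(45)
print(len(text) // 1024, "KB of text,", len(markers), "markers")
```

Keeping the list of (depth, keyword) pairs is the important part: it is what later lets a keyword that ranks be translated back into a depth on the page.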

After I reviewed the log files and made sure the bots had dropped in, the only thing left was to check the rankings of each experimental page for each unique keyword I’d used (I used Web CEO Ranking Checker for this). As you’ve probably guessed, if the search engines index only a certain part of the page, they will return this page in search results for the keywords that appear before the scanning limit, but will fail to return it for the keywords that appear after the limit.
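The reasoning above can be sketched in a few lines. The function name and the sample keywords are hypothetical, and the ranking data is assumed to have been collected already (by a rank checker or by hand); the code only turns "which markers were found" into an indexing depth.

```python
def indexed_depth_kb(markers, found):
    """Return the deepest marker depth (in KB) the engine demonstrably indexed.

    markers: list of (depth_kb, keyword) pairs for one test page.
    found:   set of keywords for which the engine returned that page.
    The scan stops at the first missing marker, so a stray hit deeper in
    the page does not inflate the result.
    """
    depth = 0
    for depth_kb, keyword in sorted(markers):
        if keyword not in found:
            break
        depth = depth_kb
    return depth


# Made-up keywords for illustration only.
page_markers = [(10, "qwzrtplmxa"), (20, "bnvhgytrew"), (30, "plokmijnuh")]
print(indexed_depth_kb(page_markers, {"qwzrtplmxa", "bnvhgytrew"}))  # prints 20
```

With markers placed every 10 KB, the real limit lies somewhere between the returned depth and the next marker, so the result is accurate to within 10 KB.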

Test Results

This chart shows where the Big Three stopped returning my test pages.

1525_performance

Now that I had the data about the amount of page text downloaded by the SE bots, I could determine the length of page text indexed by the search engines. Believe me, the results are unexpected, to say the least! But this makes it even more pleasant to share them with everyone interested in the burning questions of search engine optimization.

As you can see from the table below, the bronze medal is awarded to Yahoo! with the result of 210 KB. Any page content beyond this limit won’t be indexed.

1525_yahoo

Second place belongs to the Great (by the quality of search) and Dreadful (by its attitude to SEO) Google. Googlebot is able to carry away to its innumerable servers more than 600 KB of information. At the same time, Google’s SERPs (search engine result pages) only list pages on which the searched keywords were located no further than 520 KB from the start of the page. Presumably, this is the page size that, in Google’s opinion, is the most informative and provides maximum useful information to visitors without making them dive into overly lengthy text.

This chart shows how much text has been scraped by Google on the test pages.

1525_google

The absolute champion of indexing depth is MSN. MSNBot is capable of downloading up to 1.1 MB of text from one page. Most importantly, it is able to index all this text and show it in the results pages. If the page is larger than 1.1 MB, the content that appears after this limit is left unindexed.

Here’s how MSN copes with large volumes of text.

1525_msn

MSN showed remarkable behavior during its first visit to the experimental pages. If a page was smaller than 170 KB, it was well-represented in the SERPs. Pages above this threshold were not presented in the SERPs for my queries, although the robot had downloaded the full 1.1 MB of text. It seems that a page above 170 KB barely had a chance to appear in MSN’s results at first. However, over a period of 4-5 weeks, the larger pages I’d created started to appear in MSN’s index, revealing the engine’s capacity to index large amounts of text over time. This makes me think that MSN’s indexing speed depends on the page size. Hence, if you want part of your site’s information to be seen by MSN’s audience as soon as possible, place it on a page that’s smaller than 170 KB.

This summary chart shows how much information the search engines download, and how much is then stored in their indexes.

1525_ranking

Thus, this experiment showed that the leading search engines differ considerably in the amount of page text they index. For Yahoo!, the limit is 210 KB; for Google, 520 KB; and for MSN, 1030 KB. Pages smaller than these sizes are indexed fully, while any text that extends beyond those limits will not be indexed.
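To see what these figures mean for a particular page, the arithmetic can be sketched as follows. The limits are simply the numbers reported in this experiment, hard-coded as an assumption (the engines may change them at any time), and the function name is hypothetical.

```python
# Indexing limits in KB as reported by this experiment, not official values.
INDEX_LIMIT_KB = {"Yahoo!": 210, "Google": 520, "MSN": 1030}


def indexed_split(page_kb):
    """Map each engine to (indexed_kb, unindexed_kb) for a page of page_kb."""
    result = {}
    for engine, limit in INDEX_LIMIT_KB.items():
        indexed = min(page_kb, limit)
        result[engine] = (indexed, page_kb - indexed)
    return result


for engine, (indexed, skipped) in indexed_split(600).items():
    print(f"{engine}: {indexed} KB indexed, {skipped} KB skipped")
```

For a 600 KB page, this reports 390 KB skipped by Yahoo!, 80 KB skipped by Google, and nothing skipped by MSN.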

Exceeding the Limits

Is it bad to have text that exceeds the indexing limit?

Definitely not! Having more text than the search engine is able to index will not harm your rankings. What you should be aware of is that such text doesn’t necessarily help your search engine rankings. If the content is needed by your visitors, and provides them with essential information, don’t hesitate to leave it on the page. However, there’s a widespread opinion that the search engines pay more attention to the words situated at the beginning and end of a Web page. In other words, if you have the phrase "tennis ball" in the first and last paragraphs of your copy, the belief is that your page will rank higher for "tennis ball" than if you typed it twice in the middle of the page text.

If you intend to take advantage of this recommendation, but your page is above the indexation limits, the important point to remember is that the "last paragraph" is not where you stopped typing, but where the SE bot stopped reading.
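A rough way to find that effective "last paragraph" is to cut the page at the limit and look at what remains. The sketch below makes several assumptions that this experiment did not verify: that the limit applies to the raw bytes of the page, that the text is UTF-8, and that paragraphs are separated by blank lines. The function name is hypothetical.

```python
def last_indexed_paragraph(text, limit_kb):
    """Return the last paragraph (possibly cut short) that fits within limit_kb."""
    data = text.encode("utf-8")
    limit = limit_kb * 1024
    if len(data) <= limit:
        visible = text
    else:
        # Drop any multi-byte character that was split by the cut.
        visible = data[:limit].decode("utf-8", errors="ignore")
    paragraphs = [p for p in visible.split("\n\n") if p.strip()]
    return paragraphs[-1] if paragraphs else ""


page = "tennis ball intro\n\n" + "filler " * 40000 + "\n\ntennis ball outro"
print(last_indexed_paragraph(page, 210)[:20])
```

Here the page is about 273 KB, so with a 210 KB limit the closing "tennis ball outro" paragraph is never reached, and the function returns a slice of the filler instead.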

  • http://twitter.com/the_mikepayne Mike Payne

    Interesting test. from what I’ve seen Google will index the first 101K of the page. Unfortunate for footer links i suppose.