Brion Vibber on Wikipedia and Mediawiki

Looking at the top 20 of Alexa’s global 500 popular sites, one thing that stands out is that the majority are primarily “read only” sites – news, search or otherwise, where updates to content are primarily managed by those running the site.

The big three exceptions here though are MySpace (running .NET now I believe – was ColdFusion), eBay (have they migrated fully to J2EE yet or is some of the home-grown C++ still around?) and Wikipedia (LAMP). All of these are, in some way, collaborative sites where content is created primarily by users. In other words, they have to be able to support a significant volume of writes as well as reads. That’s interesting because, in terms of scaling, the more volatile the data you’re providing, the harder it gets to scale – it raises questions like “how do you cache?”, “how do you handle transactions / locking?”, “how do you distribute updates?” etc.

Anyway, that Wikipedia runs LAMP makes it something of a poster-child and, as you may know, the software used on Wikipedia is MediaWiki, written in PHP. Given the scale of the technical problem the Wikimedia Foundation has had to solve, it’s been a little frustrating in the past trying to find detail from those involved on how they do it. Thanks to Brion Vibber we now have more information…

First up is his talk to Google, delivered at the end of last month. Some fascinating details and trivia in there (e.g. they’re currently averaging about 1 update / sec) and, considering they “only” have about 100 application servers (running the MediaWiki code), the overall impression is almost “is that all it takes? How small the Internet is” – Brion plays down the effort that has gone into making it possible with remarks like “It takes a little work”. He also mentions some of the problems they’re having with their wiki syntax parser, which are similar to those we’ve seen before elsewhere – they seem to be attempting to replace it with a C-based parser exposed as a PHP extension but, given the date of the last change, has that effort stalled? Also, wryly noted, was the number of questions related to how Wikimedia is financed – given it was a technical talk and the location, it makes you go “Hmmmm…”.

Following that, more detail (with a stronger PHP slant) comes from php|architect’s webcast Interview with Brion Vibber. Marcus does a great job of asking pertinent questions – perhaps the biggest item was that the Wikipedia servers are already running PHP 5, even if the code isn’t yet taking advantage of the fact. Side note: imagine if Wikipedia was running on something like .NET – can you imagine how much marketing noise there’d be following a successful move to the latest version? Funny how the LAMP world moves differently. Anyway – lots more detail in there that you’ll have to listen to.

Great stuff and thanks to Brion for doing it.

  • Ren

    I know there is some work being done on WYSIWYG editor for mediawiki. I guess this could be a possibility for replacing parsing.

  • rexruff

    Thanks to Brion, he rocks!

  • http://www.phppatterns.com HarryF

    I know there is some work being done on WYSIWYG editor for mediawiki. I guess this could be a possibility for replacing parsing.

    How so? You mean eliminate use of wiki markup completely and use (X)HTML purely?

  • Ren

    From the small conversation I had with someone implementing it it seems the wiki markup has been XML-ified.

  • http://www.phppatterns.com HarryF

    From the small conversation I had with someone implementing it it seems the wiki markup has been XML-ified.

    OK – now that makes more sense of some of the things I’ve seen on their mailing list. They already seem to have some kind of mediawiki to xml parsing going on in here and the changes look recent. Interesting.

  • Etnu

    Yahoo has more writes / sec than Wikipedia, and it’s mostly PHP (although not completely, of course). I’d use that as the “posterchild” for PHP before Wikipedia.