Looking at the top 20 of Alexa’s global 500 popular sites, one thing that stands out is that the majority are primarily “read only” sites – news, search or otherwise, where updates to content are primarily managed by those running the site.
The big three exceptions here, though, are MySpace (running .NET now I believe – was ColdFusion), eBay (have they migrated fully to J2EE yet, or is some of the home-grown C++ still around?) and Wikipedia (LAMP). All of these are, in some way, collaborative sites where content is created primarily by users. In other words, they have to be able to support a significant volume of writes as well as reads. That’s interesting because, in terms of scaling, the more volatile the data you’re providing, the harder it gets to scale – it raises questions like “how do you cache?”, “how do you handle transactions / locking?”, “how do you distribute updates?” etc.
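To make the “how do you cache?” question a bit more concrete, here is a minimal Python sketch of a read-through cache that is invalidated on every write. It is a toy illustration only, not how any of the sites above actually work: `CachedStore`, `hit_rate` and the `reads_per_write` parameter are hypothetical names made up for this example, and two dictionaries stand in for the database and the cache.

```python
class CachedStore:
    """Toy read-through cache in front of a backing store (hypothetical)."""

    def __init__(self):
        self.db = {}      # stands in for the database
        self.cache = {}   # stands in for an in-memory cache
        self.hits = 0
        self.misses = 0

    def read(self, key):
        # Serve from the cache when possible, otherwise fall back to the db.
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.db.get(key)
        self.cache[key] = value
        return value

    def write(self, key, value):
        self.db[key] = value
        # Invalidate so the next read cannot serve stale content.
        self.cache.pop(key, None)


def hit_rate(reads_per_write, total_reads=1000):
    """Read one page repeatedly, writing to it after every N reads."""
    store = CachedStore()
    store.write("page", "v0")
    for i in range(1, total_reads + 1):
        store.read("page")
        if i % reads_per_write == 0:
            store.write("page", f"v{i}")
    return store.hits / (store.hits + store.misses)


if __name__ == "__main__":
    for n in (100, 2, 1):
        print(f"{n} reads per write: hit rate {hit_rate(n):.2f}")
```

With one write per hundred reads nearly every read is served from the cache; with one write per read the cache never gets a hit at all, which is roughly why volatile, user-edited content is harder to scale than a “read only” site.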
Anyway, that Wikipedia runs LAMP makes it somewhat of a poster child and, as you may know, the software used on Wikipedia is MediaWiki, written in PHP. Given the scale of the technical problem the Wikimedia Foundation has had to solve, it’s been a little frustrating in the past trying to find detail from those involved on how they do it. Thanks to Brion Vibber we now have more information…
First up is his talk to Google, delivered at the end of last month. Some fascinating details and trivia in there (e.g. they’re currently averaging about 1 update / sec) and, considering they “only” have about 100 application servers (running the MediaWiki code), the overall impression is almost “is that all it takes? How small the Internet is” – Brion plays down the effort that has gone into making it possible with remarks like “It takes a little work”. He also mentions some of the problems they’re having with their wiki syntax parser, which are similar to those we’ve seen before elsewhere – they seem to be attempting to replace it with a C-based parser exposed as a PHP extension but, given the date of the last change, has that effort stalled? Also wryly noted was the number of questions related to how Wikimedia is financed – given it was a technical talk, and given the location, it makes you go “Hmmmm…”.
Following that, more detail (with a stronger PHP slant) comes from php|architect’s webcast interview with Brion Vibber – Marcus does a great job of asking pertinent questions – and perhaps the biggest item was that the Wikipedia servers are already running PHP 5, even if the code isn’t yet taking advantage of the fact. Side note: imagine if Wikipedia was running on something like .NET – can you imagine how much marketing noise there’d be following a successful move to the latest version? Funny how the LAMP world moves differently. Anyway – there’s lots more detail in there that you’ll have to listen to for yourself.
Great stuff and thanks to Brion for doing it.
Harry Fuecks is the Engineering Project Lead at Tamedia and formerly the Head of Engineering at Squirro. He is a data-driven facilitator, leader, coach and specializes in line management, hiring software engineers, analytics, mobile, and marketing. Harry also enjoys writing and you can read his articles on SitePoint and Medium.